logs14: Normalize Reward



1: What specific output am I working on right now?

According to "Why do we normalize the discounted rewards when doing policy gradient reinforcement learning?" on Data Science Stack Exchange, we should standardize the discounted rewards so that roughly half of them come out positive and the other half negative: actions with above-average returns get reinforced, and the rest get discouraged.

neg_log_prob = tf.nn.softmax_cross_entropy_with_logits(logits=logits, labels=labels)
loss = tf.reduce_mean(neg_log_prob * self.discounted_episode_rewards_norm)  # reward-guided loss
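A quick note on why this is the right objective (standard REINFORCE reasoning, added here as a reminder rather than taken from the original notes): with one-hot action labels, neg_log_prob is −log πθ(a_t|s_t), so the gradient of the loss is

$$\nabla_\theta L = \mathbb{E}\big[-R_t \, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\big]$$

and gradient descent therefore raises the probability of actions whose normalized return R_t is positive and lowers it for the rest.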

We need tf.multiply(neg_log_prob, rewards) here; note that tf.mul was removed in TensorFlow 1.0 in favor of tf.multiply, and the * operator above performs the same elementwise multiplication.
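Below is a minimal sketch of the normalization step itself, assuming per-episode raw rewards and a discount factor gamma; the function name and the 1e-8 epsilon are illustrative, not taken from the project's code:

```python
import numpy as np

def discount_and_normalize_rewards(episode_rewards, gamma=0.95):
    """Discounted returns, standardized to zero mean and unit variance."""
    discounted = np.zeros(len(episode_rewards))
    running = 0.0
    for t in reversed(range(len(episode_rewards))):
        running = episode_rewards[t] + gamma * running  # R_t = r_t + gamma * R_{t+1}
        discounted[t] = running
    # Standardizing centers the returns, so roughly half the actions receive a
    # positive learning signal and half a negative one.
    return (discounted - discounted.mean()) / (discounted.std() + 1e-8)
```

The epsilon keeps the division safe when an episode's returns are all identical; the result is what would be fed in as discounted_episode_rewards_norm above.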

2: Thinking out loud - e.g. hypotheses about the current problem, what to work on next, and how I can verify it

3: A record of currently ongoing runs along with a short reminder of what question each run is supposed to answer

  • run1: title

4: Results of runs (TensorBoard graphs, any other significant observations), separated by type of run (e.g. by the environment the agent is being trained in)

run1
  • hparams
  • mega.nz directory: 20180430rl_test_medium7
