logs8: Understand policy entropy TO REVISIT
- Log 1: what specific output am I working on right now?
- Understand policy entropy
- Actionable steps: log and implement policy entropy
- Log 2: thinking out loud - e.g. hypotheses about the current problem, what to work on next
- Read Lessons Learned Reproducing a Deep Reinforcement Learning Paper
-
I’ve found policy entropy in particular to be a good indicator of whether training is going anywhere - much more sensitive than per-episode rewards.
-
Examples of unhealthy and healthy policy entropy graphs. Failure mode 1 (left): convergence to constant entropy (random choice among a subset of actions). Failure mode 2 (centre): convergence to zero entropy (choosing the same action every time). Right: policy entropy from a successful Pong training run.
-
- Read williamFalcon/DeepRLHacks: Hacks for training RL systems from John Schulman's lecture at Deep RL Bootcamp (Aug 2017)
-
Entropy in ACTION space. Care more about entropy in state space, but we don't have good methods for calculating that. If entropy is going down too fast, the policy is becoming deterministic and will not explore. If it is NOT going down, the policy will not be good because it is essentially random. Can fix by: adding a KL penalty to keep entropy from decreasing too quickly, or adding an entropy bonus.
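The "add entropy bonus" hack above can be sketched as an extra term in a policy-gradient loss. This is a minimal NumPy sketch, not Schulman's actual implementation; the function name, arguments, and the 0.01 coefficient are all assumptions (0.01 is just a commonly used default):

```python
import numpy as np

def policy_gradient_loss(log_probs, advantages, action_probs, entropy_coef=0.01):
    """Hypothetical policy-gradient loss with an entropy bonus.

    log_probs:    log pi(a_t | s_t) for the actions actually taken, shape (T,)
    advantages:   advantage estimates for those actions, shape (T,)
    action_probs: full action distribution at each step, shape (T, n_actions)
    entropy_coef: assumed coefficient controlling the strength of the bonus
    """
    # Standard policy-gradient objective (negated, since we minimize a loss).
    pg_loss = -np.mean(log_probs * advantages)
    # Per-step entropy of the action distribution; subtracting it from the
    # loss rewards keeping the policy stochastic, so it keeps exploring.
    entropy = -np.sum(action_probs * np.log(action_probs + 1e-8), axis=1)
    return pg_loss - entropy_coef * np.mean(entropy)
```

With the same policy-gradient term, a near-deterministic action distribution yields a higher loss than a uniform one, which is exactly the pressure against premature convergence that the quote describes.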
-
How to measure entropy: for most policies, entropy can be computed analytically. If the action space is continuous, the policy is usually a Gaussian, so the differential entropy can be computed in closed form. High entropy means the policy is super random and garbage.
-
- Understand what entropy is in general.
- Understand what entropy is in this context.
- entropy in action space. Here the action space = reply space: if the model can produce more diverse replies, entropy is higher.
- Start by thinking of the most basic example, the game Pong, where the action space is to move either left or right.
- if p(left) = 0.4 and p(right) = 0.6, then the entropy (in bits) is
-
-0.4 * log2(0.4) - 0.6 * log2(0.6) = 0.9709505944546686
-
- if p(left) = 0.1 and p(right) = 0.9, then the entropy is lower
-
-0.1 * log2(0.1) - 0.9 * log2(0.9) = 0.4689955935892812
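The two hand calculations above can be checked with a couple of lines of Python; math.log2 gives the entropy in bits:

```python
import math

def binary_entropy_bits(p):
    # Entropy in bits of a two-action policy with P(left) = p, P(right) = 1 - p.
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

print(binary_entropy_bits(0.4))  # ~0.9710, matching the 0.4/0.6 value above
print(binary_entropy_bits(0.1))  # ~0.4690, matching the 0.1/0.9 value above
```

As expected, the more lopsided 0.1/0.9 policy has much lower entropy, i.e. it is closer to deterministic.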
- Asking in Quora: http://qr.ae/TU13dz
- One answer by Sam: https://twitter.com/samuelmaskell/status/987594631943684097
- Find a way to calculate it.
- Read Lessons Learned Reproducing a Deep Reinforcement Learning Paper
- One more thought: apparently the graphs in logs7 (Test RL with small model) indicate that entropy is going down quickly, i.e. the model is becoming deterministic.
- Maybe worth redoing with a medium-size model.
- Log 3: a record of currently ongoing runs along with a short reminder of what question each run is supposed to answer
- Log 4: results of runs (TensorBoard graphs, any other significant observations), separated by type of run (e.g. by the environment the agent is being trained in)