Commit d835155: Working on advantages
1 parent b040d09

1 file changed: src/ppo/main.clj
Lines changed: 13 additions & 0 deletions
@@ -626,6 +626,19 @@
;; Here for example we are sampling 3 consecutive states of the pendulum.
(sample-environment pendulum-factory (indeterministic-act actor) 3)

;; ### Advantages
;;
;; If we are in state $s_t$ and take an action $a_t$ at timestep $t$, we end up in state $s_{t+1}$ and receive reward $r_t$.
;; The cumulative reward for state $s_t$ is
;;
;; $r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \ldots$
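As a quick aside (a hypothetical Python sketch, not part of the Clojure source; the rewards and `gamma` below are made-up numbers), this discounted sum can be computed by folding from the end of a finite reward sequence:

```python
def discounted_return(rewards, gamma):
    """Compute r_t + gamma*r_{t+1} + gamma^2*r_{t+2} + ... by folding from the end."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g

# Made-up rewards and discount factor, purely for illustration.
rewards = [1.0, 2.0, 3.0]
gamma = 0.5
print(discounted_return(rewards, gamma))  # 1 + 0.5*2 + 0.25*3 = 2.75
```

Folding from the end avoids recomputing powers of `gamma` and is the usual way returns are accumulated over a finite rollout.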
;;
;; The critic $V$ estimates the expected cumulative reward when starting from the given state:
;;
;; $V(s_t) = \mathop{\hat{\mathbb{E}}} [ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \ldots ]$
;;
;; In particular, combining the observed rewards with the critic's value estimates yields an advantage estimate at each timestep:
;;
;; $\hat{A}_{T-1} = -V(S_{T-1}) + r_{T-1} + \gamma V(S_T)$
;;
;; $\hat{A}_{T-2} = -V(S_{T-2}) + r_{T-2} + \gamma r_{T-1} + \gamma^2 V(S_T)$
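These estimates can be checked numerically. Below is a hypothetical Python sketch (again separate from the Clojure source, with made-up rewards and critic values): each $\hat{A}_t$ subtracts the critic's estimate $V(S_t)$, adds the rewards observed from $t$ onward, and bootstraps the tail with $V(S_T)$.

```python
def advantage_estimate(rewards, values, t, gamma):
    """A_hat[t] = -V(S_t) + r_t + gamma*r_{t+1} + ... + gamma^(T-t) * V(S_T).

    rewards[t..T-1] are the observed rewards; values[T] bootstraps the tail.
    """
    T = len(rewards)
    acc = -values[t]
    for k in range(t, T):
        acc += gamma ** (k - t) * rewards[k]
    acc += gamma ** (T - t) * values[T]
    return acc

# Made-up numbers: rewards r_0..r_2 and critic values V(S_0)..V(S_3).
rewards = [1.0, 0.0, 2.0]
values = [0.5, 1.0, 1.5, 2.0]
gamma = 0.9

# A_hat[T-1] = -V(S_2) + r_2 + gamma*V(S_3) = -1.5 + 2.0 + 0.9*2.0
print(advantage_estimate(rewards, values, 2, gamma))  # 2.3
# A_hat[T-2] = -V(S_1) + r_1 + gamma*r_2 + gamma^2*V(S_3)
print(advantage_estimate(rewards, values, 1, gamma))  # 2.42
```

Note how the two printed values instantiate exactly the $\hat{A}_{T-1}$ and $\hat{A}_{T-2}$ formulas above, with $T = 3$.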
