; ; Here, for example, we sample 3 consecutive states of the pendulum.
(sample-environment pendulum-factory (indeterministic-act actor) 3)

; ; ### Advantages
; ;
; ; If we are in state $s_t$ and take an action $a_t$ at timestep $t$, we end up in state $s_{t+1}$ and receive reward $r_t$.
; ; The cumulative reward for state $s_t$ is
; ;
; ; $r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \ldots$
; ;
; ; The critic $V$ estimates the expected cumulative reward when starting from a given state:
; ;
; ; $V(s_t) = \mathop{\hat{\mathbb{E}}} [ r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \gamma^3 r_{t+3} + \ldots ]$
; ;
; ; In particular, subtracting the critic's estimate $V(S_t)$ from the discounted rewards observed along a sampled trajectory (with $V(S_T)$ standing in for the tail of the sum) yields an advantage estimate for each action:
; ;
; ; $\hat{A}_{T-1} = -V(S_{T-1}) + r_{T-1} + \gamma V(S_T)$
; ;
; ; $\hat{A}_{T-2} = -V(S_{T-2}) + r_{T-2} + \gamma r_{T-1} + \gamma^2 V(S_T)$