
Commit 677d070

Review and improve text a bit
1 parent e6a3fc5 commit 677d070

1 file changed

Lines changed: 32 additions & 28 deletions

src/ppo/main.clj

@@ -30,23 +30,23 @@
 ;; ## Motivation
 ;;
 ;; Recently I started to look into the problem of reentry trajectory planning in the context of developing the [sfsim](https://store.steampowered.com/app/3687560/sfsim/) space flight simulator.
-;; I had looked into reinforcement learning before and tried out Q-learning using the [lunar lander reference environment of OpenAI's gym library](https://gymnasium.farama.org/environments/box2d/lunar_lander/) (now maintained by the Farama Foundation).
-;; However I had stability issues.
-;; The algorithm would learn a strategy and then suddenly diverge again.
+;; I had looked into reinforcement learning before and even tried out Q-learning using the [lunar lander reference environment of OpenAI's gym library](https://gymnasium.farama.org/environments/box2d/lunar_lander/) (now maintained by the Farama Foundation).
+;; However it had stability issues.
+;; The algorithm would converge on a strategy and then suddenly diverge again.
 ;;
 ;; More recently (2017) the [Proximal Policy Optimization (PPO) algorithm was published](https://arxiv.org/abs/1707.06347) and it has gained in popularity.
 ;; PPO is inspired by Trust Region Policy Optimization (TRPO) but is much easier to implement.
-;; Most importantly PPO can handle continuous observation and action spaces.
+;; Also PPO handles continuous observation and action spaces, which is important for control problems.
 ;; The [Stable Baselines3](https://github.com/DLR-RM/stable-baselines3) Python library has an implementation of PPO, TRPO, and other reinforcement learning algorithms.
 ;; However I found [XinJingHao's PPO implementation](https://github.com/XinJingHao/PPO-Continuous-Pytorch/), which I find easier to follow.
 ;;
-;; In order to use PPO with a simulation environment in Clojure and also in order to get a better understanding of PPO, I decided to do an implementation of PPO in Clojure.
+;; In order to use PPO with a simulation environment implemented in Clojure and also in order to get a better understanding of PPO, I decided to do an implementation of PPO in Clojure.
 ;;
 ;; ## Pendulum Environment
 ;;
 ;; ![screenshot of pendulum environment](pendulum.png)
 ;;
-;; First we implement a simple pendulum environment to test the PPO algorithm.
+;; To validate the implementation, we will implement the classical [pendulum](https://gymnasium.farama.org/environments/classic_control/pendulum/) environment in Clojure.
 ;; In order to be able to switch environments, we define a protocol according to the environment abstract class used in OpenAI's gym.
 (defprotocol Environment
   (environment-update [this action])
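The hunk ends before the rest of the protocol. As a rough, hypothetical sketch of what a gym-style environment protocol could look like (only `environment-update` actually appears in this commit; the other method names are illustrative assumptions, not the file's real API):

(defprotocol Environment
  (environment-update [this action])   ; advance the simulation by one time step (shown above)
  (environment-observation [this])     ; hypothetical: return the current observation vector
  (environment-reward [this])          ; hypothetical: return the reward of the last step
  (environment-done? [this]))          ; hypothetical: whether the episode has terminated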
@@ -81,8 +81,8 @@
    :t 0.0})

 ;; Same as in OpenAI's gym the angle is zero when the pendulum is pointing up.
-;; Here a pendulum is initialised to be pointing down and with an angular velocity of 0.5.
-(setup (/ PI 2) 0.5)
+;; Here a pendulum is initialised to be pointing down and have an angular velocity of 0.5 radians per second.
+(setup PI 0.5)

 ;; ### State Updates
 ;;
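The definition of `setup` itself is not part of this hunk. Judging from the `:t 0.0})` fragment above and the `:angle`/`:velocity` keys used further down, it is presumably something close to this sketch:

(defn setup
  "Create initial pendulum state (sketch)"
  [angle velocity]
  {:angle    angle      ; 0.0 means pointing up, PI means pointing down
   :velocity velocity   ; angular velocity in radians per second
   :t        0.0})      ; elapsed simulation time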
@@ -98,7 +98,7 @@
 (pendulum-gravity 9.81 2.0 (/ PI 2))

 ;; The motor is controlled using an input value between -1 and 1.
-;; This value is simply multiplied with the maximum acceleration provided by the motor.
+;; This value is simply multiplied with the maximum angular acceleration provided by the motor.
 (defn motor-acceleration
   "Angular acceleration from motor"
   [control motor-acceleration]
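The function body is cut off by the hunk, but given the description ("simply multiplied") it is presumably just the product of the two arguments:

(defn motor-acceleration
  "Angular acceleration from motor (sketch of the truncated body)"
  [control motor-acceleration]
  (* control motor-acceleration))  ; control in [-1, 1] scales the maximum angular acceleration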
@@ -127,7 +127,8 @@
 ;; ### Observation
 ;;
 ;; The observation of the pendulum state uses the cosine and sine of the angle to resolve the wrap-around problem of angles.
-;; The angular speed is normalized to be between -1 and 1.
+;; The angular speed is normalized to be between -1 and 1 as well.
+;; This so-called [feature scaling](https://en.wikipedia.org/wiki/Feature_scaling) is done in order to improve convergence.
 (defn observation
   "Get observation from state"
   [{:keys [angle velocity]} {:keys [max-speed]}]
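The body of `observation` lies outside the hunk. Based on the description (cosine, sine, and the speed scaled by `max-speed`), a plausible sketch follows; the exact form is an assumption, and `cos`/`sin` are assumed to be referred from the same math namespace as `PI` and `exp`:

(defn observation
  "Get observation from state (sketch)"
  [{:keys [angle velocity]} {:keys [max-speed]}]
  [(cos angle) (sin angle) (/ velocity max-speed)])  ; three values, matching (Critic 3 64) and (Actor 3 64 1) below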
@@ -138,12 +139,12 @@
 (observation {:angle 0.0 :velocity 0.5} config)
 (observation {:angle (/ PI 2) :velocity 0.0} config)

-;; Note that the observation needs to capture all information required for achieving the objective, because it the only information available to the policy for deciding on the next action.
+;; Note that the observation needs to capture all information required for achieving the objective, because it is the only information available to the actor for deciding on the next action.

 ;; ### Action
 ;;
 ;; The action of a pendulum is a vector with one element between 0 and 1.
-;; The following method converts it to a action hashmap used by the pendulum environment.
+;; The following method clips it and converts it to an action hashmap used by the pendulum environment.
 (defn action
   "Convert array to action"
   [array]
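The body of `action` is not shown. Assuming the single value in [0, 1] is clipped and then rescaled to the [-1, 1] motor control range described earlier, it could look roughly like this (the rescaling is an assumption; only the `:control` key is confirmed by the commit):

(defn action
  "Convert array to action (sketch)"
  [array]
  (let [clipped (-> (first array) (max 0.0) (min 1.0))]  ; clip network output to [0, 1]
    {:control (- (* 2.0 clipped) 1.0)}))                 ; assumed rescaling to the [-1, 1] control range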
@@ -249,7 +250,7 @@

 ;; ### Animation
 ;;
-;; The following method animates the pendulum and facilitates mouse control.
+;; With Quil we can create an animation of the pendulum and react to mouse input.
 (defn run []
   (let [done-chan (async/chan)
         last-action (atom {:control 0.0})]
@@ -280,10 +281,10 @@
 ;;
 ;; ### Import Pytorch
 ;;
-;; For implementing the neural networks and backpropagation, I am using the Python-Clojure bridge [libpython-clj2](https://github.com/clj-python/libpython-clj) and [Pytorch](https://pytorch.org/).
+;; For implementing the neural networks and backpropagation, we can use the Python-Clojure bridge [libpython-clj2](https://github.com/clj-python/libpython-clj) and the [Pytorch](https://pytorch.org/) machine learning library.
 ;; The Pytorch library is quite comprehensive, is free software, and you can find a lot of documentation on how to use it.
 ;; The default version of [Pytorch on pypi.org](https://pypi.org/project/torch/) comes with CUDA (Nvidia) GPU support.
-;; There is also a [Pytorch wheel on AMD's website](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/pytorch-install.html#use-a-wheels-package) which comes with [ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html) support.
+;; There are also [Pytorch wheels provided by AMD](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/pytorch-install.html#use-a-wheels-package) which come with [ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html) support.
 ;; Here we are going to use a CPU version of Pytorch which is a much smaller install.
 ;;
 ;; You need to install [Python 3.10](https://www.python.org/) or later.
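The actual import forms are outside this hunk. With libpython-clj2 they typically look something like the following sketch; the exact aliases are assumptions, although the `py/` and `nn/` prefixes used elsewhere in the diff suggest similar ones:

(require '[libpython-clj2.require :refer [require-python]]
         '[libpython-clj2.python :as py])

(require-python '[torch :as torch]
                '[torch.nn :as nn])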
@@ -387,8 +388,8 @@

 ;; ### Critic Network
 ;;
-;; The critic network is a fully connected neural network with an input layer of size `observation-size` and two hidden layers of size `hidden-units` with `tanh` activation functions.
-;; The critic output is a single value (an estimate for the expected cumulative return achievable by the given observed state.
+;; The critic network is a neural network with an input layer of size `observation-size` and two fully connected hidden layers of size `hidden-units` with `tanh` activation functions.
+;; The critic output is a single value (an estimate for the expected cumulative return achievable by the given observed state).
 (def Critic
   (py/create-class
     "Critic" [nn/Module]
@@ -432,7 +433,7 @@
      (py. no-grad# ~'__exit__ nil nil nil)))))

 ;; Now we can create a network and try it out.
-;; Note that the network creates non-zero outputs because Pytorch performs random initialisation of ther weights for us.
+;; Note that the network creates non-zero outputs because Pytorch performs random initialisation of the weights for us.
 (def critic (Critic 3 64))
 (without-gradient
   (toitem (critic (tensor [-1 0 0]))))
@@ -452,7 +453,7 @@
 ;; Training a neural network is done by defining a loss function.
 ;; The loss of the network then is calculated for a mini-batch of training data.
 ;; One can then use Pytorch's backpropagation to compute the gradient of the loss value with respect to every single parameter of the network.
-;; The gradient then is used to perform gradient descent steps.
+;; The gradient then is used to perform a gradient descent step.
 ;; A popular gradient descent method is the [Adam optimizer](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Adam).

 ;; Here is a wrapper for the Adam optimizer.
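The wrapper itself is outside the hunk. Assuming `optim` aliases `torch.optim`, it is probably little more than the following sketch (the learning rate in the usage comment is just an example value):

(require-python '[torch.optim :as optim])

(defn adam
  "Construct an Adam optimizer for the parameters of a model (sketch)"
  [model learning-rate]
  (optim/Adam (py. model parameters) :lr learning-rate))

;; e.g. (def optimizer (adam critic 1e-3))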
@@ -473,7 +474,8 @@
 (def criterion (mse-loss))
 (def mini-batch [(tensor [[-1 0 0]]) (tensor [1.0])])
 (let [prediction (critic (first mini-batch))
-      loss (criterion prediction (second mini-batch))]
+      expected (second mini-batch)
+      loss (criterion prediction expected)]
   (py. optimizer zero_grad)
   (py. loss backward)
   (py. optimizer step))
@@ -522,7 +524,7 @@

 ;; Furthermore the actor network has a method `get_dist` to return a [Torch distribution](https://docs.pytorch.org/docs/stable/distributions.html) object which can be used to sample a random action or query the current log-probability of an action.
 ;; Here (as the default in XinJingHao's PPO implementation) we use the [Beta distribution](https://en.wikipedia.org/wiki/Beta_distribution) with parameters `alpha` and `beta` both greater than 1.0.
-;; See [here](https://mathlets.org/mathlets/beta-distribution/) for an interactive visualization.
+;; See [here](https://mathlets.org/mathlets/beta-distribution/) for an interactive visualization of the Beta distribution.
 (defn indeterministic-act
   "Sample action using actor network returning distribution"
   [actor]
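As a standalone illustration of the Torch distribution API mentioned above (the `alpha` and `beta` values here are arbitrary example numbers, not actor outputs; `tensor` and `tolist` are the file's own wrappers seen elsewhere in the diff):

(require-python '[torch.distributions :as distributions])

(let [dist   (distributions/Beta (tensor [2.0]) (tensor [3.0]))   ; Beta distribution with alpha = 2, beta = 3
      sample (py. dist sample)]                                   ; draw a random action in [0, 1]
  {:sample  (tolist sample)
   :logprob (tolist (py. dist log_prob sample))})                 ; log-probability of the sampled action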
@@ -537,13 +539,13 @@
 (def actor (Actor 3 64 1))
 ;; One can then use the network to:
 ;;
-;; a) get the parameters of the distribution for a given observation.
+;; a. get the parameters of the distribution for a given observation.
 (without-gradient (actor (tensor [-1 0 0])))

-;; b) choose the expectation value of the distribution as an action.
+;; b. choose the expectation value of the distribution as an action.
 (without-gradient (py. actor deterministic_act (tensor [-1 0 0])))

-;; c) sample a random action from the distribution and get the associated log-probability.
+;; c. sample a random action from the distribution and get the associated log-probability.
 ((indeterministic-act actor) [-1 0 0])

 ;; We can also query the current log-probability of a previously sampled action.
@@ -557,10 +559,12 @@
 ;; Here is a plot of the probability density function (PDF) of the actor output for a single observation.
 (without-gradient
   (let [actions (range 0.0 1.01 0.01)
-        scatter (tc/dataset {:x actions
-                             :y (map (fn [action]
-                                       (exp (first (tolist ((logprob-of-action actor) (tensor [-1 0 0]) (tensor [action]))))))
-                                     actions)})]
+        logprob (fn [action]
+                  (tolist
+                    ((logprob-of-action actor) (tensor [-1 0 0]) (tensor action))))
+        scatter (tc/dataset
+                  {:x actions
+                   :y (map (fn [action] (exp (first (logprob [action])))) actions)})]
     (-> scatter
         (plotly/base {:=title "Actor output for a single observation" :=mode :lines})
         (plotly/layer-point {:=x :x :=y :y}))))
