;; ## Motivation
;;
;; Recently I started to look into the problem of reentry trajectory planning in the context of developing the [sfsim](https://store.steampowered.com/app/3687560/sfsim/) space flight simulator.
;; I had looked into reinforcement learning before and even tried out Q-learning using the [lunar lander reference environment of OpenAI's gym library](https://gymnasium.farama.org/environments/box2d/lunar_lander/) (now maintained by the Farama Foundation).
;; However this approach had stability issues.
;; The algorithm would converge on a strategy and then suddenly diverge again.
;;
;; More recently (2017) the [Proximal Policy Optimization (PPO) algorithm was published](https://arxiv.org/abs/1707.06347) and it has since gained in popularity.
;; PPO is inspired by Trust Region Policy Optimization (TRPO) but is much easier to implement.
;; PPO also handles continuous observation and action spaces, which is important for control problems.
;; The [Stable Baselines3](https://github.com/DLR-RM/stable-baselines3) Python library has implementations of PPO, TRPO, and other reinforcement learning algorithms.
;; However I came across [XinJingHao's PPO implementation](https://github.com/XinJingHao/PPO-Continuous-Pytorch/), which I found easier to follow.
;;
;; In order to use PPO with a simulation environment implemented in Clojure, and also to get a better understanding of the algorithm, I decided to implement PPO in Clojure.
;;
;; ## Pendulum Environment
;;
;; 
;;
;; To validate the implementation, we will implement the classic [pendulum](https://gymnasium.farama.org/environments/classic_control/pendulum/) environment in Clojure.
;; In order to be able to switch environments later, we define a protocol modelled on the environment abstract class used in OpenAI's gym.
(defprotocol Environment
  (environment-update [this action])
   :t 0.0})

;; As in OpenAI's gym, the angle is zero when the pendulum is pointing up.
;; Here a pendulum is initialised pointing down with an angular velocity of 0.5 radians per second.
(setup PI 0.5)

;; ### State Updates
;;
(pendulum-gravity 9.81 2.0 (/ PI 2))
;; The motor is controlled using an input value between -1 and 1.
;; This value is simply multiplied by the maximum angular acceleration provided by the motor.
(defn motor-acceleration
  "Angular acceleration from motor"
  [control motor-acceleration]
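;; For example, half throttle with an assumed maximum angular acceleration of 2.0 radians per second squared should simply give 0.5 * 2.0 = 1.0:
(motor-acceleration 0.5 2.0)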
;; ### Observation
;;
;; The observation of the pendulum state uses the cosine and sine of the angle to resolve the wrap-around problem of angles.
;; The angular speed is normalized to be between -1 and 1 as well.
;; This so-called [feature scaling](https://en.wikipedia.org/wiki/Feature_scaling) is done in order to improve convergence.
(defn observation
  "Get observation from state"
  [{:keys [angle velocity]} {:keys [max-speed]}]
(observation {:angle 0.0 :velocity 0.5} config)
(observation {:angle (/ PI 2) :velocity 0.0} config)
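;; Because the angle only enters through its cosine and sine, an angle of `(* 2 PI)` yields the same observation (up to floating-point rounding) as an angle of 0.0, which is how the wrap-around problem is resolved:
(observation {:angle (* 2 PI) :velocity 0.5} config)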
;; Note that the observation needs to capture all information required for achieving the objective, because it is the only information available to the actor for deciding on the next action.
;; ### Action
;;
;; The action of a pendulum is a vector with one element between 0 and 1.
;; The following method clips it and converts it to an action hashmap used by the pendulum environment.
(defn action
  "Convert array to action"
  [array]
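;; For example (the second value lies outside the valid range and gets clipped before the conversion):
(action [0.75])
(action [1.2])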
;; ### Animation
;;
;; With Quil we can create an animation of the pendulum and react to mouse input.
(defn run []
  (let [done-chan (async/chan)
        last-action (atom {:control 0.0})]
;;
;; ### Import Pytorch
;;
;; For implementing the neural networks and backpropagation, we can use the Python-Clojure bridge [libpython-clj2](https://github.com/clj-python/libpython-clj) and the [Pytorch](https://pytorch.org/) machine learning library.
;; The Pytorch library is quite comprehensive, is free software, and you can find a lot of documentation on how to use it.
;; The default version of [Pytorch on pypi.org](https://pypi.org/project/torch/) comes with CUDA (Nvidia) GPU support.
;; There are also [Pytorch wheels provided by AMD](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/3rd-party/pytorch-install.html#use-a-wheels-package) which come with [ROCm](https://rocm.docs.amd.com/projects/install-on-linux/en/latest/install/quick-start.html) support.
;; Here we are going to use a CPU version of Pytorch, which is a much smaller install.
;;
;; You need to install [Python 3.10](https://www.python.org/) or later.
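;; As a rough sketch (the exact requires used in this notebook may differ), loading Pytorch through libpython-clj2 looks something like this:
(comment
  (require '[libpython-clj2.python :as py])
  ;; start the embedded Python interpreter
  (py/initialize!)
  (require '[libpython-clj2.require :refer [require-python]])
  ;; import the Pytorch modules used in the following sections
  (require-python '[torch :as torch]
                  '[torch.nn :as nn]))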
;; ### Critic Network
;;
;; The critic network is a neural network with an input layer of size `observation-size` and two fully connected hidden layers of size `hidden-units` with `tanh` activation functions.
;; The critic output is a single value (an estimate of the expected cumulative return achievable from the given observed state).
(def Critic
  (py/create-class
   "Critic" [nn/Module]
     (py. no-grad# ~'__exit__ nil nil nil)))))

;; Now we can create a network and try it out.
;; Note that the network creates non-zero outputs because Pytorch performs random initialisation of the weights for us.
(def critic (Critic 3 64))
(without-gradient
  (toitem (critic (tensor [-1 0 0]))))
;; Training a neural network is done by defining a loss function.
;; The loss of the network is then calculated for a mini-batch of training data.
;; One can then use Pytorch's backpropagation to compute the gradient of the loss value with respect to every single parameter of the network.
;; The gradient is then used to perform a gradient descent step.
;; A popular gradient descent method is the [Adam optimizer](https://en.wikipedia.org/wiki/Stochastic_gradient_descent#Adam).

;; Here is a wrapper for the Adam optimizer.

(def criterion (mse-loss))
(def mini-batch [(tensor [[-1 0 0]]) (tensor [1.0])])
(let [prediction (critic (first mini-batch))
      expected (second mini-batch)
      loss (criterion prediction expected)]
  (py. optimizer zero_grad)
  (py. loss backward)
  (py. optimizer step))
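;; After this single optimisation step, the critic's prediction for the same observation should have moved slightly towards the target value of 1.0:
(without-gradient
  (toitem (critic (tensor [-1 0 0]))))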
;; Furthermore, the actor network has a method `get_dist` that returns a [Torch distribution](https://docs.pytorch.org/docs/stable/distributions.html) object, which can be used to sample a random action or query the current log-probability of an action.
;; Here (as the default in XinJingHao's PPO implementation) we use the [Beta distribution](https://en.wikipedia.org/wiki/Beta_distribution) with parameters `alpha` and `beta` both greater than 1.0, which keeps the distribution unimodal.
;; See [here](https://mathlets.org/mathlets/beta-distribution/) for an interactive visualization of the Beta distribution.
(defn indeterministic-act
  "Sample action using actor network returning distribution"
  [actor]
(def actor (Actor 3 64 1))
;; One can then use the network to:
;;
;; a. get the parameters of the distribution for a given observation.
(without-gradient (actor (tensor [-1 0 0])))

;; b. choose the expectation value of the distribution (alpha / (alpha + beta) for the Beta distribution) as an action.
(without-gradient (py. actor deterministic_act (tensor [-1 0 0])))

;; c. sample a random action from the distribution and get the associated log-probability.
((indeterministic-act actor) [-1 0 0])

;; We can also query the current log-probability of a previously sampled action.
;; Here is a plot of the probability density function (PDF) of the actor output for a single observation.
(without-gradient
  (let [actions (range 0.0 1.01 0.01)
        logprob (fn [action]
                  (tolist
                    ((logprob-of-action actor) (tensor [-1 0 0]) (tensor action))))
        scatter (tc/dataset
                  {:x actions
                   :y (map (fn [action] (exp (first (logprob [action])))) actions)})]
    (-> scatter
        (plotly/base {:=title "Actor output for a single observation" :=mode :lines})
        (plotly/layer-point {:=x :x :=y :y}))))