Commit af38629

Add more fixes
1 parent 925482b commit af38629

1 file changed


src/ppo/main.clj

Lines changed: 41 additions & 18 deletions
@@ -6,7 +6,7 @@
 :description "A Clojure port of XinJingHao's PPO implementation using libpython-clj2, PyTorch, and Quil"
 :image "pendulum.png"
 :type :post
-:date "2026-04-18"
+:date "2026-04-22"
 :category :ml
 :tags [:physics :machine-learning :optimization :ppo :control]}}}

@@ -20,13 +20,6 @@
 [libpython-clj2.require :refer (require-python)]
 [libpython-clj2.python :refer (py.) :as py]))

-(require-python '[builtins :as python]
-                '[torch :as torch]
-                '[torch.nn :as nn]
-                '[torch.nn.functional :as F]
-                '[torch.optim :as optim]
-                '[torch.distributions :refer (Beta)]
-                '[torch.nn.utils :as utils])
 ;; ## Motivation
 ;;
 ;; Recently I started to look into the problem of reentry trajectory planning in the context of developing the [sfsim](https://store.steampowered.com/app/3687560/sfsim/) space flight simulator.
@@ -38,10 +31,36 @@
 ;; PPO is inspired by Trust Region Policy Optimization (TRPO) but is much easier to implement.
 ;; Also PPO handles continuous observation and action spaces, which is important for control problems.
 ;; The [Stable Baselines3](https://github.com/DLR-RM/stable-baselines3) Python library has an implementation of PPO, TRPO, and other reinforcement learning algorithms.
-;; However I found [XinJingHao's PPO implementation](https://github.com/XinJingHao/PPO-Continuous-Pytorch/) which I found easier to follow.
+;; However I found [XinJingHao's PPO implementation](https://github.com/XinJingHao/PPO-Continuous-Pytorch/), which is easier to follow.
 ;;
 ;; In order to use PPO with a simulation environment implemented in Clojure, and also to get a better understanding of PPO, I decided to implement PPO in Clojure.
 ;;
+;; ## Dependencies
+;;
+;; For this project we are using the following `deps.edn` file.
+;; The Python setup is shown further down in this article.
+;;
+;; ```Clojure
+;; {:deps
+;;  {org.clojure/clojure {:mvn/version "1.12.4"}
+;;   clj-python/libpython-clj {:mvn/version "2.026"}
+;;   quil/quil {:mvn/version "4.3.1563"}
+;;   org.clojure/core.async {:mvn/version "1.9.865"}}
+;; }
+;; ```
+;;
+;; The dependencies can be pulled in using the following statement.
+;;
+;; ```Clojure
+;; (require '[clojure.math :refer (PI cos sin exp to-radians)]
+;;          '[clojure.core.async :as async]
+;;          '[tablecloth.api :as tc]
+;;          '[scicloj.tableplot.v1.plotly :as plotly]
+;;          '[quil.core :as q]
+;;          '[quil.middleware :as m]
+;;          '[libpython-clj2.require :refer (require-python)]
+;;          '[libpython-clj2.python :refer (py.) :as py])
+;; ```
 ;; ## Pendulum Environment
 ;;
 ;; ![screenshot of pendulum environment](pendulum.png)
@@ -104,7 +123,7 @@
 [control motor-acceleration]
 (* control motor-acceleration))

-;; A simulation step of the pendulum is implemented as follows.
+;; A simulation step of the pendulum is implemented using Euler integration.
 (defn update-state
   "Perform simulation step of pendulum"
   ([{:keys [angle velocity t]}
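As a rough sketch of the idea (with hypothetical names and a simplified signature; the article's `update-state` also takes the control value and a config map), an explicit Euler step advances the angle and angular velocity like this:

```Clojure
(defn euler-step
  "Advance pendulum state by a single explicit Euler step of size dt"
  [{:keys [angle velocity t]} acceleration dt]
  {:angle    (+ angle (* velocity dt))        ; new angle from current angular velocity
   :velocity (+ velocity (* acceleration dt)) ; new velocity from angular acceleration
   :t        (+ t dt)})                       ; advance simulation time
```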
@@ -357,7 +376,8 @@
 '[torch.nn :as nn]
 '[torch.nn.functional :as F]
 '[torch.optim :as optim]
-'[torch.distributions :refer (Beta)])
+'[torch.distributions :refer (Beta)]
+'[torch.nn.utils :as utils])

 ;; ### Tensor Conversion
 ;;
@@ -539,7 +559,7 @@
 ;; Here (as the default in [XinJingHao's PPO implementation](https://github.com/XinJingHao/PPO-Continuous-Pytorch/)) we use the [Beta distribution](https://en.wikipedia.org/wiki/Beta_distribution) with parameters `alpha` and `beta` both greater than 1.0.
 ;; See [here](https://mathlets.org/mathlets/beta-distribution/) for an interactive visualization of the Beta distribution.
 (defn indeterministic-act
-  "Sample action using actor network returning distribution"
+  "Sample action using actor network returning random action and log-probability"
   [actor]
   (fn indeterministic-act-with-actor [observation]
     (without-gradient
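For illustration, sampling an action and its log-probability from a Beta distribution with the PyTorch bindings required above looks roughly like this; the parameters 2.0 and 3.0 are arbitrary example values rather than the actor's output, and the sample lies in the interval [0, 1]:

```Clojure
(let [dist   (Beta (torch/tensor 2.0) (torch/tensor 3.0))  ; example Beta(alpha, beta)
      action (py. dist sample)]                             ; random sample in [0, 1]
  {:action   action
   :log-prob (py. dist log_prob action)})                   ; log-probability of the sample
```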
@@ -582,7 +602,7 @@
   (plotly/base {:=title "Actor output for a single observation" :=mode :lines})
   (plotly/layer-point {:=x :x :=y :y}))))

-;; Finally we also can also query the entropy of the distribution.
+;; Finally we can also query the entropy of the distribution.
 ;; By incorporating the entropy into the loss function later on, we can encourage exploration and prevent the probability density function from collapsing.
 (defn entropy-of-distribution
   "Get entropy of distribution"
@@ -738,7 +758,7 @@
     0.0
     (reverse (map vector deltas dones truncates)))))))

-;; For example if using an discount factor of 0.5, the advantages approach 2.0 assymptotically when going backwards in time.
+;; For example, when all rewards are 1.0 and a discount factor of 0.5 is used, the advantages approach 2.0 asymptotically when going backwards in time.
 (advantages {:dones [false false false] :truncates [false false false]}
             [1.0 1.0 1.0]
             0.5
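Stripped of the done/truncate handling, the backward accumulation in this example is a geometric series: each step adds a reward of 1.0 to half of the running value, so it approaches 1 / (1 - 0.5) = 2.0. A minimal sketch of just that accumulation:

```Clojure
;; backward accumulation with discount factor 0.5 and rewards of 1.0
(reductions (fn [acc reward] (+ reward (* 0.5 acc))) 0.0 [1.0 1.0 1.0])
;; => (0.0 1.0 1.5 1.75)
```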
@@ -786,7 +806,7 @@

 ;; ### Actor Loss Function
 ;;
-;; The core of the actor loss function relies on the probability ratio of the actions using the current and the updated policy.
+;; The core of the actor loss function relies on the ratio of action probabilities under the updated and the old policy (actor network output).
 ;; The ratio is defined as $r_t(\theta)=\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\operatorname{old}}}(a_t|s_t)}$.
 ;; Note that $r_t(\theta)$ here refers to the probability ratio as opposed to the reward of the previous section.
 ;;
@@ -802,7 +822,7 @@
 ;;
 ;; $L^{CPI}(\theta) = \mathop{\hat{\mathbb{E}}}_t [\frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\operatorname{old}}}(a_t|s_t)} \hat{A}_t] = \mathop{\hat{\mathbb{E}}}_t [r_t(\theta) \hat{A}_t]$
 ;;
-;; In order to increase stability, the loss function uses clipped probability ratios.
+;; The core idea of PPO is to use clipped probability ratios in the loss function in order to increase stability.
 ;; The probability ratio is clipped to stay below $1+\epsilon$ for positive advantages and to stay above $1-\epsilon$ for negative advantages.
 ;;
 ;; $L^{CLIP}(\theta) = \mathop{\hat{\mathbb{E}}}_t [\min(r_t(\theta) \hat{A}_t, \mathop{\operatorname{clip}}(r_t(\theta), 1-\epsilon, 1+\epsilon) \hat{A}_t)]$
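As a plain-number sketch of the clipped term (the article computes the same expression on PyTorch tensors), the objective for a single time step can be written as:

```Clojure
(defn clipped-surrogate
  "PPO clipped objective for one sample given probability ratio, advantage, and epsilon"
  [ratio advantage epsilon]
  (min (* ratio advantage)
       (* (-> ratio (max (- 1.0 epsilon)) (min (+ 1.0 epsilon))) advantage)))

(clipped-surrogate 1.5 2.0 0.2)  ; => 2.4 because the ratio is clipped to 1.2
```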
@@ -1040,8 +1060,8 @@
 (max -1.0
      (- 1.0 (/ (q/mouse-x)
                (/ (q/width) 2.0)))))})
-state (update-state state action)]
-(when (done? state) (async/close! done-chan))
+state (update-state state action config)]
+(when (done? state config) (async/close! done-chan))
 (reset! last-action action)
 state))
 :draw #(draw-state % @last-action)
@@ -1051,5 +1071,8 @@
   (System/exit 0))

 ;; Here is a small demo video of the pendulum being controlled using the actor network.
+;; You can find a repository with the code of this article as well as unit tests at [github.com/wedesoft/ppo](https://github.com/wedesoft/ppo).
 ;;
 ;; ![automatically controlled pendulum](automatic.gif)
+;;
+;; Enjoy!
