Proximal Policy Optimization · Lesson 1 of 10

Policy Gradient Methods

Policy gradient methods optimize the policy directly by ascending the gradient of expected return.

REINFORCE Algorithm

The policy gradient theorem states:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \cdot G_t\right]$$

Where $G_t = \sum_{k=t}^T \gamma^{k-t} r_k$ is the discounted return.
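
Below is a minimal sketch of one REINFORCE update in PyTorch. The episode here is synthetic (random states and rewards) purely to exercise the update; in practice the states, actions, and rewards would come from rolling out the policy in an environment. The network sizes and learning rate are illustrative assumptions, not prescribed values.

```python
import torch

torch.manual_seed(0)
gamma = 0.99
policy = torch.nn.Sequential(
    torch.nn.Linear(4, 64), torch.nn.Tanh(), torch.nn.Linear(64, 2)
)
optimizer = torch.optim.Adam(policy.parameters(), lr=1e-3)

# Synthetic episode of length T: states, sampled actions, and rewards.
# (Placeholder data; a real agent collects these from an environment.)
T = 20
states = torch.randn(T, 4)
dist = torch.distributions.Categorical(logits=policy(states))
actions = dist.sample()
rewards = torch.randn(T)

# Discounted returns G_t = sum_{k=t}^T gamma^{k-t} r_k, computed backwards.
returns = torch.zeros(T)
running = 0.0
for t in reversed(range(T)):
    running = rewards[t] + gamma * running
    returns[t] = running

# Ascend E[log pi(a_t|s_t) * G_t] by descending its negative.
loss = -(dist.log_prob(actions) * returns).sum()
optimizer.zero_grad()
loss.backward()
optimizer.step()
print(f"REINFORCE loss: {loss.item():.3f}")
```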

PPO Clip Objective

PPO constrains updates using a clipped surrogate objective:

$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$$

Where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{\text{old}}}(a_t|s_t)}$ is the probability ratio.
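
The clipped objective translates almost line for line into code. The sketch below assumes PyTorch; the log-probabilities and advantage estimates are placeholder tensors standing in for values that would come from a rollout buffer and an advantage estimator, and $\epsilon = 0.2$ is a common default rather than a required setting.

```python
import torch

torch.manual_seed(0)
epsilon = 0.2
log_probs_new = torch.randn(64, requires_grad=True)  # log pi_theta(a_t|s_t)
log_probs_old = torch.randn(64)                      # log pi_theta_old(a_t|s_t)
advantages = torch.randn(64)                         # advantage estimates A_hat_t

# r_t(theta) = pi_theta / pi_theta_old, computed in log space for stability.
ratio = torch.exp(log_probs_new - log_probs_old)

# Pessimistic minimum of the unclipped and clipped surrogate terms.
unclipped = ratio * advantages
clipped = torch.clamp(ratio, 1 - epsilon, 1 + epsilon) * advantages
loss = -torch.min(unclipped, clipped).mean()  # negated: optimizers minimize

loss.backward()
print(f"L^CLIP as a loss: {loss.item():.3f}")
```

Note the min with the clipped term only removes the incentive to move $r_t(\theta)$ outside $[1-\epsilon, 1+\epsilon]$; it still allows updates that make the objective worse to be corrected.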

