Policy Gradient Methods
Policy gradient methods optimize the policy directly by ascending the gradient of expected return.
REINFORCE Algorithm
The policy gradient theorem states:
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right]$$
Where $G_t = \sum_{k=t}^T \gamma^{k-t} r_k$ is the discounted return.
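The REINFORCE update above can be sketched as follows. This is a minimal illustration, not a library implementation: it assumes a tabular softmax policy (logits stored per state), and all names and the toy trajectory are hypothetical.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, trajectory, gamma=0.99):
    """Single-trajectory Monte Carlo estimate of the policy gradient.

    theta: (n_states, n_actions) logits for a tabular softmax policy.
    trajectory: list of (state, action, reward) tuples.
    """
    T = len(trajectory)
    # Discounted returns G_t = sum_{k=t}^T gamma^(k-t) r_k, computed backward.
    returns = np.zeros(T)
    G = 0.0
    for t in reversed(range(T)):
        G = trajectory[t][2] + gamma * G
        returns[t] = G
    grad = np.zeros_like(theta)
    for t, (s, a, _) in enumerate(trajectory):
        probs = softmax(theta[s])
        # grad of log pi(a|s) for a softmax policy: one_hot(a) - probs.
        grad_log = -probs
        grad_log[a] += 1.0
        grad[s] += grad_log * returns[t]
    return grad

# Toy 2-state, 2-action trajectory (hypothetical data).
theta = np.zeros((2, 2))
traj = [(0, 1, 1.0), (1, 0, 0.5), (0, 1, 1.0)]
g = reinforce_gradient(theta, traj)
```

The policy parameters would then be updated by gradient ascent, `theta += lr * g`. Note that each row of the gradient sums to zero, a property of the softmax score function.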
PPO Clip Objective
PPO constrains updates using a clipped surrogate objective:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$$
Where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the probability ratio.
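The clipped objective can be sketched as a loss function over a batch of log-probabilities and advantage estimates. This is a minimal NumPy sketch for illustration; the function name and the negation convention (returning a loss to minimize rather than an objective to maximize) are choices made here, not part of the PPO paper.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, epsilon=0.2):
    """Negated clipped surrogate objective L^CLIP, averaged over a batch.

    logp_new, logp_old: log pi_theta(a_t|s_t) and log pi_theta_old(a_t|s_t).
    advantages: advantage estimates A_hat_t.
    """
    # Probability ratio r_t(theta), computed in log space for stability.
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Pessimistic (element-wise minimum) of unclipped and clipped terms.
    surrogate = np.minimum(ratio * advantages, clipped * advantages)
    return -surrogate.mean()
```

When the new and old policies agree, the ratio is 1 and the loss reduces to the negated mean advantage; when the ratio drifts outside $[1-\epsilon, 1+\epsilon]$, clipping removes the incentive to move further in that direction.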