Policy Gradient Methods
Policy gradient methods optimize the policy directly by ascending the gradient of expected return.
REINFORCE Algorithm
The policy gradient theorem states:
$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^T \nabla_\theta \log \pi_\theta(a_t|s_t) \cdot G_t\right]$$
Where $G_t = \sum_{k=t}^T \gamma^{k-t} r_k$ is the discounted return.
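The REINFORCE update above can be sketched as follows. This is a minimal illustration, not a library implementation: it assumes a tabular softmax policy (logits stored per state), and all names and the toy trajectory are hypothetical.

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()
    e = np.exp(z)
    return e / e.sum()

def reinforce_gradient(theta, trajectory, gamma=0.99):
    """Single-trajectory Monte Carlo estimate of the policy gradient.

    theta: (n_states, n_actions) logits for a tabular softmax policy.
    trajectory: list of (state, action, reward) tuples.
    """
    T = len(trajectory)
    # Discounted returns G_t = sum_{k=t}^T gamma^(k-t) r_k, computed backward.
    returns = np.zeros(T)
    G = 0.0
    for t in reversed(range(T)):
        G = trajectory[t][2] + gamma * G
        returns[t] = G
    grad = np.zeros_like(theta)
    for t, (s, a, _) in enumerate(trajectory):
        probs = softmax(theta[s])
        # grad of log pi(a|s) for a softmax policy: one_hot(a) - probs.
        grad_log = -probs
        grad_log[a] += 1.0
        grad[s] += grad_log * returns[t]
    return grad

# Toy 2-state, 2-action trajectory (hypothetical data).
theta = np.zeros((2, 2))
traj = [(0, 1, 1.0), (1, 0, 0.5), (0, 1, 1.0)]
g = reinforce_gradient(theta, traj)
```

The policy parameters would then be updated by gradient ascent, `theta += lr * g`. Note that each row of the gradient sums to zero, a property of the softmax score function.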
PPO Clip Objective
PPO constrains updates using a clipped surrogate objective:
$$L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[\min\left(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t\right)\right]$$
Where $r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}$ is the probability ratio.
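The clipped objective can be sketched as a loss function over a batch of log-probabilities and advantage estimates. This is a minimal NumPy sketch for illustration; the function name and the negation convention (returning a loss to minimize rather than an objective to maximize) are choices made here, not part of the PPO paper.

```python
import numpy as np

def ppo_clip_loss(logp_new, logp_old, advantages, epsilon=0.2):
    """Negated clipped surrogate objective L^CLIP, averaged over a batch.

    logp_new, logp_old: log pi_theta(a_t|s_t) and log pi_theta_old(a_t|s_t).
    advantages: advantage estimates A_hat_t.
    """
    # Probability ratio r_t(theta), computed in log space for stability.
    ratio = np.exp(logp_new - logp_old)
    clipped = np.clip(ratio, 1.0 - epsilon, 1.0 + epsilon)
    # Pessimistic (element-wise minimum) of unclipped and clipped terms.
    surrogate = np.minimum(ratio * advantages, clipped * advantages)
    return -surrogate.mean()
```

When the new and old policies agree, the ratio is 1 and the loss reduces to the negated mean advantage; when the ratio drifts outside $[1-\epsilon, 1+\epsilon]$, clipping removes the incentive to move further in that direction.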