Policy Gradient Methods

Policy gradient methods optimize the policy directly by ascending the gradient of the expected return.

REINFORCE Algorithm

The policy gradient theorem states:

$$\nabla_\theta J(\pi_\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, G_t\right]$$

where $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$ is the discounted return from timestep $t$.
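As an illustration, here is a minimal REINFORCE sketch in PyTorch. The policy network, data collection loop, and hyperparameters (discount factor, optimizer) are assumed for the example and are not specified in the lesson.

```python
import torch

def discounted_returns(rewards, gamma=0.99):
    """Compute G_t = sum_{k=t}^{T} gamma^(k-t) * r_k for every timestep t."""
    returns = []
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
        returns.insert(0, g)
    return torch.tensor(returns)

def reinforce_loss(log_probs, rewards, gamma=0.99):
    """REINFORCE surrogate loss; minimizing it ascends the policy gradient.

    log_probs: tensor of log pi_theta(a_t | s_t) for one episode (requires grad)
    rewards:   list of scalar rewards r_t for the same episode
    """
    returns = discounted_returns(rewards, gamma)
    # Negative sign because optimizers minimize: the gradient of this loss
    # is the negative of the policy gradient estimator above.
    return -(log_probs * returns).sum()

# Hypothetical usage: loss = reinforce_loss(log_probs, rewards); loss.backward(); optimizer.step()
```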

PPO Clip Objective

PPO constrains updates using a clipped surrogate objective:

$$L^{\text{CLIP}}(\theta) = \hat{\mathbb{E}}_t\left[\min\!\left(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}\!\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

where $r_t(\theta) = \dfrac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}$ is the probability ratio and $\hat{A}_t$ is the advantage estimate at timestep $t$.
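A minimal sketch of the clipped surrogate loss in PyTorch, assuming advantage estimates and old-policy log-probabilities are computed elsewhere (e.g. by the data-collection phase); the function and variable names are illustrative, not from the lesson.

```python
import torch

def ppo_clip_loss(new_log_probs, old_log_probs, advantages, epsilon=0.2):
    """Clipped surrogate objective L^CLIP, negated so an optimizer can minimize it.

    new_log_probs: log pi_theta(a_t | s_t) under the current policy (requires grad)
    old_log_probs: log pi_theta_old(a_t | s_t) from the policy that collected the data
    advantages:    advantage estimates A_hat_t
    """
    # Probability ratio r_t(theta) = pi_theta / pi_theta_old, computed in log space.
    ratio = torch.exp(new_log_probs - old_log_probs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - epsilon, 1.0 + epsilon) * advantages
    # Elementwise minimum of the two terms, averaged over the batch.
    return -torch.min(unclipped, clipped).mean()
```

Clipping the ratio to $[1-\epsilon, 1+\epsilon]$ removes the incentive to move the new policy far from the old one in a single update, which is what makes multiple epochs of minibatch updates on the same data stable.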