Policy Gradients¶

Here is a puzzle: a large language model emits a 1,000-token mathematical proof, and only after the last period does a reward arrive: “correct” or “wrong.” Every one of those 1,000 token choices played a part in the outcome, but the binary signal says nothing about which tokens deserved credit. Model-based and value-based methods attack this problem by tracing where the reward came from, state by state; policy gradients say you do not have to. Instead of modelling the whole environment, you simply ask: “If I shift the policy parameters slightly, how would the expected reward change?” This page walks from that question to a full implementation of Group Relative Policy Optimization on a small transformer, so that by the end you know not only how policy gradients sidestep explicit credit assignment, but also how to shape token-level behavior with gradients you can compute on a Colab T4.

The territory¶

Reinforcement learning splits roughly into two tribes: value-based methods that learn a proxy for the expected return and plug it into a greedy policy, and policy-based methods that adjust the policy’s parameters directly. Policy gradients belong to the second tribe, and their raison d’être is bypassing the need to model either the transition dynamics or the stationary state distribution. When the action space is discrete but large, or when the actions are entire sequences like sentences, estimating value functions becomes brittle; policy gradients instead exploit the log-derivative trick to differentiate through the sampling process. This makes them the preferred tool when the action space is high-dimensional or combinatorial—common in robotics, dialogue, and LLM alignment.

The policy-gradient family ranges from basic REINFORCE to actor-critic hybrids, but all share one structure: they pose the expected return as a function of the policy parameters and then ascend that landscape. The challenges that follow (high variance, non-stationary rewards, and exploration) then guide the auxiliary techniques we attach, such as baselines, entropy bonuses, or structured rollouts. The next section walks through the exact mechanics of that ascent, explains why the derivative depends only on quantities we can simulate, and shows how GRPO drops the learned critic so that it never becomes a bottleneck. How does it actually work?

How it works¶

The first step is the objective. A policy \(\pi_\theta\) parametrized by \(\theta\) induces a distribution over trajectories \(\tau = (s_0,a_0,s_1,a_1,\dots)\) in an environment. The cumulative return of a trajectory is \(R(\tau) = \sum_{t=0}^{\infty} \gamma^t r_t\), where \(r_t\) is the scalar reward at timestep \(t\) and \(\gamma \in [0,1)\) is the discount factor. The expected return under the policy is

\[ J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}[R(\tau)]. \]

Here \(J(\theta)\) measures the average long-term reward, \(\tau \sim \pi_\theta\) means the trajectory is sampled by rolling out the policy, and \(R(\tau)\) sums the rewards that follow the policy’s decisions. The optimization goal is to adjust \(\theta\) so that \(J(\theta)\) increases.
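As a concrete reading of the objective, the sketch below estimates \(J(\theta)\) by averaging discounted returns over a handful of finite rollouts. The function names and the toy reward lists are illustrative assumptions, not part of any library.

```python
def discounted_return(rewards, gamma=0.99):
    """R(tau) = sum_t gamma^t * r_t for one finite rollout."""
    total = 0.0
    # Walk backwards so each step applies one more factor of gamma.
    for r in reversed(rewards):
        total = r + gamma * total
    return total


def estimate_objective(rollout_rewards, gamma=0.99):
    """Monte Carlo estimate of J(theta): the mean return over sampled rollouts."""
    returns = [discounted_return(rs, gamma) for rs in rollout_rewards]
    return sum(returns) / len(returns)


# Three toy rollouts (hypothetical data); each list holds r_0, r_1, ...
rollouts = [[0.0, 0.0, 1.0], [0.0, 1.0], [0.0, 0.0, 0.0]]
j_hat = estimate_objective(rollouts, gamma=0.5)
print(j_hat)  # returns are 0.25, 0.5 and 0.0, so the mean is 0.25
```

The estimate only needs sampled rewards; nothing about the environment's dynamics enters the computation.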

The crucial step is rewriting the gradient \(\nabla_\theta J(\theta)\) without differentiating through the environment. The Policy Gradient Theorem derived by Sutton et al. (2000) [https://www.cis.upenn.edu/~mkearns/finread/Sutton.pdf] states that

\[ \nabla_\theta J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\left[\sum_{t=0}^{\infty} \nabla_\theta \log \pi_\theta(a_t|s_t) Q^{\pi}(s_t, a_t)\right], \]

where \(Q^{\pi}(s_t, a_t)\) is the expected return starting from state \(s_t\) after taking action \(a_t\), \(\log \pi_\theta(a_t|s_t)\) is the log-probability of sampling that action, and the expectation is over trajectories generated by the current policy. This equation shows that the gradient depends only on quantities we can sample: we can roll out a trajectory, compute the cumulative reward (or a suitable estimate of \(Q^{\pi}\)), and plug it into the sum. The gradient does not require differentiating the transition dynamics or the state visitation distribution \(\rho^{\pi}(s)\): under mild regularity conditions those derivative terms drop out of the final expression. That insight is why policy gradients are feasible in environments where the state distribution drifts unpredictably, including the self-generated states of autoregressive transformers.
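To make the sampling claim tangible, here is a minimal sketch on a one-step problem (a three-armed bandit with a softmax policy), where the exact gradient is available in closed form and can be compared with the sampled score-function estimate. The bandit, its reward values, and all function names are assumptions chosen for illustration.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def exact_gradient(logits, rewards):
    """Closed form for a softmax bandit: dJ/dtheta_j = pi_j * (r_j - J)."""
    probs = softmax(logits)
    j = sum(p * r for p, r in zip(probs, rewards))
    return [p * (r - j) for p, r in zip(probs, rewards)]


def score_function_gradient(logits, rewards, n_samples, rng):
    """Monte Carlo estimate of E[grad log pi(a) * r(a)] using samples only."""
    probs = softmax(logits)
    grad = [0.0] * len(logits)
    for _ in range(n_samples):
        a = rng.choices(range(len(logits)), weights=probs)[0]
        for j in range(len(logits)):
            # For a softmax policy, d log pi(a) / d theta_j = 1[j == a] - pi_j.
            score = (1.0 if j == a else 0.0) - probs[j]
            grad[j] += score * rewards[a] / n_samples
    return grad


rng = random.Random(0)
logits = [0.0, 0.5, -0.5]
rewards = [1.0, 0.0, 2.0]  # hypothetical per-action rewards
exact = exact_gradient(logits, rewards)
estimate = score_function_gradient(logits, rewards, 20000, rng)
print(exact)
print(estimate)
```

The estimator never differentiates the reward; it only needs the log-probability gradient of the action that was actually sampled.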

REINFORCE and variance reduction¶

At its simplest, the estimator of the gradient is REINFORCE, where \(Q^{\pi}(s_t, a_t)\) is replaced by the return from timestep \(t\), \(G_t = \sum_{k=t}^{\infty} \gamma^{k-t} r_k\). The REINFORCE update is

\[ \Delta \theta \propto \sum_{t=0}^{T} \nabla_\theta \log \pi_\theta(a_t|s_t) G_t, \]

where the sum runs over \(T\) steps in a finite rollout and \(G_t\) is the empirical return after \(t\). Each term nudges the parameters toward actions that preceded high returns. However, this estimator suffers from high variance because each gradient term is multiplied by the noisy return \(G_t\). Sutton et al. proposed subtracting a baseline function \(b(s_t)\), resulting in

\[ \nabla_\theta J(\theta) = \mathbb{E}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t|s_t) [Q^{\pi}(s_t,a_t) - b(s_t)]\right], \]

which retains unbiasedness while reducing variance if \(b(s_t)\) approximates the value of the state. Common choices are learned value functions or even the running average of returns within a batch. The difference \(Q^{\pi}(s_t,a_t) - b(s_t)\) centers the gradient estimates, letting the optimizer focus on relative improvements rather than absolute scale.
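The effect of a baseline can be checked numerically. The sketch below, under the assumption of a uniform softmax policy over three actions with hypothetical rewards 10, 11 and 12, draws the same actions twice and compares the gradient samples for one logit with and without subtracting the state value.

```python
import random


def per_sample_grads(probs, rewards, baseline, n_samples, rng):
    """Gradient samples for logit 0: (1[a == 0] - pi_0) * (r_a - b)."""
    samples = []
    for _ in range(n_samples):
        a = rng.choices(range(len(probs)), weights=probs)[0]
        score0 = (1.0 if a == 0 else 0.0) - probs[0]
        samples.append(score0 * (rewards[a] - baseline))
    return samples


def mean(xs):
    return sum(xs) / len(xs)


def variance(xs):
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)


probs = [1.0 / 3.0] * 3
rewards = [10.0, 11.0, 12.0]  # hypothetical rewards with a large common offset
value = sum(p * r for p, r in zip(probs, rewards))  # state value, here 11

# The same seed gives the same sampled actions for both estimators.
plain = per_sample_grads(probs, rewards, 0.0, 5000, random.Random(1))
centered = per_sample_grads(probs, rewards, value, 5000, random.Random(1))

print(mean(plain), variance(plain))
print(mean(centered), variance(centered))
```

Both sample means target the same gradient, while the centered samples have a much smaller spread because the common offset of 11 no longer multiplies the score.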

Sutton’s derivation assumed trajectories could terminate or go on indefinitely; Baxter and Bartlett (2001) [http://www.cs.cmu.edu/afs/cs/project/jair/pub/volume15/baxter01a.pdf] extended the analysis to the infinite-horizon setting with discounted rewards, showing that the policy-gradient estimator converges even when the horizon is unbounded. They constructed a consistent estimator for \(\nabla_\theta J(\theta)\) by defining eligibility traces and demonstrating that under ergodicity assumptions the expected gradient remains finite. This theoretical assurance is useful for language models, whose generations have no fixed length: they run until a stop token, a length cap, or a user interruption, so it is convenient to treat generation as an infinite-horizon process in which discounting keeps the return finite.

Actor-critic and natural gradient for robotics¶

While REINFORCE works on toy problems, the variance still can be crippling for deep policies. Actor-critic algorithms keep the gradient form but replace \(Q^{\pi}(s_t,a_t)\) with the critic’s prediction \(Q_w(s_t,a_t)\), where \(w\) are the critic’s parameters. The policy (actor) still updates as \(\nabla_\theta \log \pi_\theta(a_t|s_t) A_w(s_t,a_t)\), where \(A_w\) is an estimate of the advantage. Thus the policy update remains a policy gradient, but the critic provides a richer, lower-variance signal by estimating the long-term value function.
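The text leaves the form of \(A_w\) open. One common choice, used here purely as an assumption, is the one-step temporal-difference error built from a critic's state-value predictions. The sketch below uses hand-written numbers in place of a trained critic.

```python
def td_advantages(rewards, values, gamma=0.99):
    """One-step TD advantage: A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` holds one more entry than `rewards`; the last entry is the
    value after the final step (0.0 when the episode has terminated).
    """
    return [r + gamma * values[t + 1] - values[t] for t, r in enumerate(rewards)]


rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.9, 0.0]  # hypothetical critic outputs, not learned here
advantages = td_advantages(rewards, values, gamma=1.0)
print(advantages)  # approximately [0.1, 0.3, 0.1]
```

Each advantage then multiplies \(\nabla_\theta \log \pi_\theta(a_t|s_t)\) in place of the raw return, which is where the variance reduction comes from.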

Policy Gradient Methods for Robotics by Peters and Schaal (2006) [https://people.eecs.berkeley.edu/~pabbeel/cs287-fa09/readings/PetersSchaal-policy-gradient-for-robotics_IROS2006.pdf] introduced two practical innovations. First, they advocated the natural gradient, which preconditions the update with the inverse Fisher information matrix so that steps respect the geometry of the policy manifold. The Fisher matrix \(F(\theta) = \mathbb{E}_{s,a}[\nabla_\theta \log \pi_\theta(a|s) \nabla_\theta \log \pi_\theta(a|s)^\top]\) captures how much the policy distribution changes when \(\theta\) moves, and the natural gradient direction is \(F(\theta)^{-1} \nabla_\theta J(\theta)\). Second, the paper emphasized structured policy representations (e.g., Gaussian policies with learnable covariance) that are smooth and differentiable, which suits robotic control where the action space is continuous. These insights paved the way for stable, low-variance updates in high-dimensional action spaces like robotic arm movement; the same geometry-aware mindset later resurfaced in the LLM setting when people realized that a naive gradient update can easily destroy a model’s pre-training.
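A one-parameter example shows what the Fisher preconditioning buys. The sketch assumes a two-action policy with \(\pi_\theta(a=1) = \sigma(\theta)\) and hypothetical rewards; in that case \(F(\theta)\) is a scalar and the natural gradient can be written out directly.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def vanilla_and_natural_gradient(theta, r0, r1):
    """Return (grad J, F^{-1} grad J) for a Bernoulli policy with logit theta."""
    p = sigmoid(theta)
    score1 = 1.0 - p  # d/dtheta log pi(a=1)
    score0 = -p       # d/dtheta log pi(a=0)
    # Fisher information: expectation of the squared score under the policy.
    fisher = p * score1 ** 2 + (1.0 - p) * score0 ** 2
    # Vanilla policy gradient: expectation of score times reward.
    grad = p * score1 * r1 + (1.0 - p) * score0 * r0
    return grad, grad / fisher


# A nearly deterministic policy that almost always picks the worse action.
grad, natural = vanilla_and_natural_gradient(theta=-4.0, r0=0.0, r1=1.0)
print(grad)     # small: the sigmoid is saturated
print(natural)  # close to r1 - r0 = 1.0
```

The vanilla gradient shrinks as the policy saturates, whereas the natural gradient stays at the reward gap, which is the sense in which the step respects the geometry of the policy rather than the parameterization.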

GRPO and critic-free scaling¶

The classic actor-critic loop still requires training a critic network, which is expensive when the policy itself is already huge—think of an LLM with billions of parameters. Shao et al. (2024) [https://ar5iv.labs.arxiv.org/html/2401.13662] introduced Group Relative Policy Optimization (GRPO) to sidestep the critic entirely. GRPO keeps the policy-gradient structure but normalizes rewards across groups of rollouts to approximate a baseline. Specifically, when collecting \(B\) rollouts in parallel, GRPO standardizes each rollout’s total reward by subtracting the group mean \(\mu_B\) and dividing by the group standard deviation \(\sigma_B\), so that the update becomes

\[ \Delta \theta \propto \sum_{i=1}^{B} \left(\frac{R_i - \mu_B}{\sigma_B}\right) \sum_{t=0}^{T_i} \nabla_\theta \log \pi_\theta(a_t^{(i)}|s_t^{(i)}), \]

where \(R_i\) is the cumulative reward of rollout \(i\), \(T_i\) is its length, and \((s_t^{(i)}, a_t^{(i)})\) are the transitions. The group mean plays the role of the baseline and the group standard deviation rescales the result, so the normalized reward is a baseline-subtracted advantage estimate; it is simpler to compute than a critic's prediction because it uses only statistics from a small batch rather than a separately trained value network. GRPO also leverages the fact that in autoregressive generation, all rollouts share the same starting state (the prompt), so the group statistics are especially stable. This is what makes GRPO practical for LLM fine-tuning: the compute overhead is limited to a few extra reductions over the batch instead of retraining a massive critic.

Entropy regularization is another stabilization knob. Keeping an entropy bonus \(H(\pi_\theta) = -\mathbb{E}[\log \pi_\theta(a|s)]\) in the objective encourages exploration and prevents premature collapse to deterministic policies. Shao et al. also propose using reward-to-entropy ratios as a heuristic for "cognitive effort": the idea is that high-entropy policies on hard reasoning prompts suggest the model is still exploring, while low entropy may signal early hallucination. By coupling the entropy bonus with dynamic weighting (e.g., scaling the bonus inversely with the reward variance across the group), the algorithm balances exploitation and exploration without an explicit estimate of \(Q^{\pi}\).
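A minimal sketch of the two quantities in this paragraph follows: the entropy of a next-token distribution, and a bonus weight scaled inversely with the group's reward variance. The base weight, the cap, and the `eps` guard are hypothetical choices made for illustration; the text does not specify them.

```python
import math


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def entropy(probs):
    """H(pi) = -sum_a pi(a) * log pi(a) for one categorical distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def entropy_weight(group_rewards, base_weight=0.01, max_weight=0.1, eps=1e-8):
    """Bonus coefficient scaled inversely with reward variance (assumed rule).

    The cap keeps the weight bounded when every rollout gets the same reward.
    """
    mu = sum(group_rewards) / len(group_rewards)
    var = sum((r - mu) ** 2 for r in group_rewards) / len(group_rewards)
    return min(max_weight, base_weight / (var + eps))


uniform_h = entropy(softmax([0.0, 0.0, 0.0, 0.0]))  # log(4), the maximum
peaked_h = entropy(softmax([10.0, 0.0, 0.0, 0.0]))  # close to zero
weight = entropy_weight([1.0, 0.0, 0.0, 1.0])       # variance 0.25
print(uniform_h, peaked_h, weight)
```

The weighted entropy would be added to the objective, so a collapsed distribution such as the peaked one contributes almost no bonus.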

Token-level credit assignment in LLMs¶

Applying policy gradients to short-horizon tasks like robotic control is one thing; applying them to token generation with sparse, delayed rewards is another. One option is to reshape the reward into a per-token signal by distributing the trajectory reward back over the steps: you define per-token rewards \(r_t\) such that \(\sum_t \gamma^{t} r_t = R(\tau)\). The other is to keep the reward terminal and let the return carry it back through the log-probabilities. If the reward is zero until the final step \(T\), the return-to-go is \(G_t = \gamma^{T-t} r_T\), which is nonzero at every step, so every term \(\nabla_\theta \log \pi_\theta(a_t|s_t)\) in the sum receives the terminal reward (discounted by \(\gamma^{T-t}\)). Basic REINFORCE therefore does credit earlier tokens, but it credits them all by the same outcome and cannot tell which ones mattered; only a variant that weights each step by its immediate reward \(r_t\) would update the final token alone. Sharper per-token credit requires intermediate rewards or a learned value estimate, and both require engineering.
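A small numeric check helps when reasoning about terminal-only rewards. The sketch below computes the return-to-go \(G_t\) from the REINFORCE definition given earlier for a four-token rollout whose only reward arrives at the last step; the reward list is a made-up example.

```python
def returns_to_go(rewards, gamma):
    """G_t = sum_{k >= t} gamma^(k - t) * r_k for every step of one rollout."""
    out = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        out[t] = running
    return out


terminal_only = [0.0, 0.0, 0.0, 1.0]  # reward arrives after the last token

undiscounted = returns_to_go(terminal_only, gamma=1.0)
discounted = returns_to_go(terminal_only, gamma=0.5)

print(undiscounted)  # [1.0, 1.0, 1.0, 1.0]
print(discounted)    # [0.125, 0.25, 0.5, 1.0]
```

Every step's weight is nonzero, so each token's log-probability gradient is scaled by the terminal outcome; with \(\gamma = 1\) all tokens receive the same weight, which is why the signal is uniform rather than selective.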

Another lever is to compute the gradient of the entropy of the policy, \(\nabla_\theta H(\pi_\theta)\), and add it to the gradient estimate. The entropy gradient encourages exploration even when rewards are sparse, which keeps the model from pinning to the most probable next token. Finally, importance sampling corrections make it possible to reuse a batch of rollouts that was sampled from a slightly older policy, so that you can take several gradient steps per batch without generating new completions. When the reward function reflects some rule (e.g., "the output must have 12 tokens" or "the last token must be a prime number"), the policy gradient update still acts as if the rule were differentiable: the log-probability of actions that satisfy the rule increases, even though the rule itself is a black box.
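The importance-sampling correction amounts to reweighting each stored rollout by how much more (or less) likely the current policy is to produce it. The sketch below uses a sequence-level ratio, which is one possible granularity and an assumption here; the log-probabilities are illustrative numbers.

```python
import math


def sequence_importance_ratio(new_logprobs, old_logprobs):
    """pi_new(sequence) / pi_old(sequence), computed from per-token log-probs."""
    return math.exp(sum(new_logprobs) - sum(old_logprobs))


def reweighted_advantage(new_logprobs, old_logprobs, advantage):
    """Advantage of a stored rollout, corrected for the policy having moved."""
    return sequence_importance_ratio(new_logprobs, old_logprobs) * advantage


old = [-1.0, -2.0]  # log-probs recorded when the rollout was sampled
new = [-0.9, -2.0]  # log-probs of the same tokens under the updated policy

print(sequence_importance_ratio(old, old))  # 1.0: no correction when on-policy
ratio = sequence_importance_ratio(new, old)
print(ratio)                                # exp(0.1), about 1.105
print(reweighted_advantage(new, old, advantage=-1.0))
```

The ratio can grow large when the policies diverge, which is the usual motivation for bounding it or limiting how many steps are taken per batch.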

Mechanism summary¶

Putting it together, policy gradients operate in five steps: (1) define the scalar reward for each trajectory, (2) roll out the policy to collect sequences \((s_t,a_t)\) and cumulative returns \(R_i\), (3) optionally standardize or baseline the returns to reduce variance, (4) compute \(\nabla_\theta \log \pi_\theta(a_t|s_t)\) for every action, and (5) take the weighted sum to adjust \(\theta\). GRPO simplifies step (3), entropy regularization supports exploration, and natural gradients (or trust-region constraints) can keep the update stable even when \(\theta\) is a billion-dimensional parameter vector. Because nothing in this pipeline requires the transition dynamics or the state visitation distribution explicitly, policy gradients are well suited to LLM fine-tuning tasks in which rewards are sparse, sequences are long, and the action is generating an entire sentence.
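The five steps can be sketched end to end on a deliberately tiny problem: a softmax policy over three "tokens" where only token 2 is rewarded. Everything here (the toy reward, group size, learning rate, and function names) is an assumption for illustration, and the gradient of the softmax log-probability is written by hand instead of using an autodiff library.

```python
import math
import random


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def group_advantages(rewards, eps=1e-8):
    mu = sum(rewards) / len(rewards)
    sigma = math.sqrt(sum((r - mu) ** 2 for r in rewards) / len(rewards))
    return [(r - mu) / (sigma + eps) for r in rewards]


def train(steps=200, group_size=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0, 0.0]
    for _ in range(steps):
        probs = softmax(logits)
        # (2) roll out the policy: sample a group of one-step "completions".
        actions = rng.choices(range(3), weights=probs, k=group_size)
        # (1) black-box scalar reward: only token 2 satisfies the rule.
        rewards = [1.0 if a == 2 else 0.0 for a in actions]
        # (3) standardize the rewards within the group.
        advantages = group_advantages(rewards)
        # (4) score of each sampled action, (5) advantage-weighted sum.
        grad = [0.0, 0.0, 0.0]
        for a, adv in zip(actions, advantages):
            for j in range(3):
                score = (1.0 if j == a else 0.0) - probs[j]
                grad[j] += adv * score / group_size
        logits = [x + lr * g for x, g in zip(logits, grad)]  # gradient ascent
    return softmax(logits)


final_probs = train()
print(final_probs)  # most of the probability mass ends up on token 2
```

Groups in which every sample earns the same reward produce zero advantages and therefore no update, which is one practical consequence of normalizing within the group.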

Where the field is now¶

The modern policy-gradient stack blends the original theorem with gradient clipping, normalized rewards, and scalable optimizers. GRPO (Shao et al. 2024) is the clearest recent addition: by normalizing rewards across a group of rollouts, it approximates the baseline without training a critic, enabling consumer-grade compute to adjust billion-parameter policies. Empirically, GRPO matches PPO baselines on text classification and simple reasoning tasks while reducing GPU hours by up to 40% because it eliminates the extra forward/backward passes for the value network (Shao et al. 2024 describe those experiments in detail). This work also introduces dynamic entropy weighting, which lowers the entropy bonus when the reward distribution is sharp and raises it when rewards are noisy, guiding the policy into smooth yet exploratory behaviors.

On the research frontier, a direct descendant of GRPO is the growing interest in leveraging policy entropy as a proxy for “cognitive effort.” In large reasoning tasks, the shape of the entropy curve tracks whether the model is still searching for a good completion or has latched onto the first plausible sequence. Papers in 2024 and early 2025 (see Shao et al. 2024 for context and follow-up preprints on scalars controlling entropy) explore coupling gradient updates with entropy-derived coefficients to regularize reasoning: high entropy pushes the policy to explore more, while low entropy only persists if the reward justifies it. These experiments are extending the classical robotics insight from Peters and Schaal—that regularized gradients follow the contours of the policy manifold in a beneficial way—into the generative modeling domain.

From an engineering standpoint, policy gradients power much of RLHF. OpenAI’s overview of learning from human feedback [https://openai.com/research/learning-from-human-feedback] describes the general recipe: a base LLM is first fine-tuned with supervised data, a reward model is then trained on human preference comparisons, and finally the policy is updated via policy gradients with the reward model scoring sampled completions. The gradient computation in that last stage is a classic policy-gradient update, typically PPO with a KL penalty against the supervised model to keep the policy from collapsing; it predates GRPO, whose group-based reward normalization is a later simplification of the same loop. Such systems run on large GPU clusters, so the engineering challenge focuses on efficient rollout sampling and reward normalization rather than the core mathematics.

Overall, the current SotA moves the field toward critic-free, entropy-aware gradients that target reasoning while keeping the policy’s pre-training intact. Both the research papers and the large-scale deployments emphasize two tensions: keeping variance low without adding a critic, and nudging the policy toward higher-quality generations without forgetting the base knowledge. The next section spells out the open questions left in resolving those tensions.

What's still open¶

How can policy gradient updates be regularized to amplify reasoning capabilities without degrading the model’s pre-existing factual knowledge base? Current entropy regularizers and reward normalization techniques either push the policy toward novelty (risking hallucination) or toward conservatism (missing creative solutions). An open experiment is to treat the pre-trained distribution as a “soft prior” and constrain the KL divergence of the updated policy to be functionally dependent on the gradient magnitude and reward variance. Does a reward-aware KL schedule allow the policy to take larger steps when the reward is confident and shrink them when it is not?

Can group reward statistics generalize to the multi-turn, multi-agent prompts found in interactive assistants? GRPO assumes that a batch of rollouts shares the same prompt, but real assistants must handle diverse prompts simultaneously. Does standardizing rewards across prompt clusters retain the variance reduction benefits, or does it introduce bias because some prompts are inherently easier? Answering this requires measuring the bias-variance trade-off when mixing prompts with different reward distributions and designing grouping heuristics that minimize the resulting bias.

Finally, is there a principled way to integrate policy gradients with retrieval-augmented policies without doubling compute? Retrieval enriches the state but also lengthens rollouts, so rewards become sparse and expensive to evaluate. Investigating whether the policy gradient update can be decomposed into retrieval-dependent and retrieval-independent components—so that the retrieval module sees only the gradients that cross the reward signal—could unlock efficient RLHF for retrieval-augmented LLMs.

Where to read next¶

If you want the value-function counterpart to this treatment, → Actor-Critic explains how separate critics and actors interact and why those critics are often hard to scale for LLMs. The engineering counterpart is → Proximal Policy Optimization which remains the most widely deployed reinforcement learner thanks to trust-region clipping and stable entropy bonuses. For the probabilistic foundation that policy gradients exploit, → Score matching and its derivations show how gradients of log-densities turn sampling problems into tractable objectives.

Build it¶

This build aims to show that a critic-free policy gradient variant can steer a small causal transformer toward rule-governed text output using only scalar rewards from a deterministic checker, and that log-probability gradients weighted by a normalized reward are enough to assign credit across the sequence.

What you're building: Group Relative Policy Optimization (GRPO) fine-tuned on Qwen2.5-0.5B-Instruct to produce outputs that match a target character count while satisfying a rule-based arithmetic check.

Why this is valuable: The build exercises the policy-gradient objective end-to-end (reward computation, group normalization, entropy scaling, and gradient ascent) so the learner sees how critic-free updates propagate a scalar reward back to every one of the model's roughly half a billion parameters.

Stack:
- Model: Qwen2.5-0.5B-Instruct (HuggingFace, open-source, ~10M downloads)
- Dataset: openwebtext (HuggingFace dataset, ~40GB; sample 100k prompts)
- Framework: PyTorch 2.2 + accelerate (from Hugging Face) + trlx==0.5.0
- Compute: Single Colab T4 (16 GB VRAM); ~3 hours for 5 epochs over the sampled portion

The recipe:
1. Install PyTorch 2.2, transformers, accelerate, trlx, and datasets via pip install torch==2.2.0 transformers accelerate trlx datasets.
2. Load Qwen2.5-0.5B-Instruct with AutoModelForCausalLM and tokenize openwebtext prompts chopped to 128 tokens, keeping prompt+completion pairs for the reward check.
3. Define the reward function: reward = 1 if the generated completion has exactly the target character count (e.g., 64 characters) and passes a digit-sum check encoding a simple arithmetic constraint; otherwise 0. Collect rollouts in batches of 16 prompts, sampling a group of completions for each prompt, compute the returns, and normalize each group by subtracting its mean and dividing by its standard deviation.
4. Train with GRPO: for each token, compute \(\nabla_\theta \log \pi_\theta(a_t|s_t)\), multiply it by the normalized reward of its rollout, add an entropy bonus scaled by the inverse of the group reward variance, and update the policy with AdamW (learning rate 1e-5, weight decay 0.01). Clip gradients at norm 1.0 and use gradient accumulation to fit the batch on the T4.
5. Evaluate by sampling 100 new prompts, measuring the fraction that meet both the length and arithmetic constraints, and logging the average reward; expect accuracy > 70% after 5 epochs.
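The reward in step 3 could be sketched as below. The recipe does not spell out the arithmetic constraint, so the rule used here (the digits in the completion must sum to a multiple of 3, with at least one digit present) is a hypothetical stand-in, as are the function name and its parameters.

```python
def rule_reward(completion, target_chars=64, modulus=3):
    """Return 1.0 if the completion meets both rules, else 0.0.

    Rule 1: the completion has exactly `target_chars` characters.
    Rule 2 (hypothetical): it contains at least one ASCII digit and the
    digits sum to a multiple of `modulus`.
    """
    has_target_length = len(completion) == target_chars
    digits = [int(ch) for ch in completion if ch in "0123456789"]
    passes_digit_sum = len(digits) > 0 and sum(digits) % modulus == 0
    return 1.0 if has_target_length and passes_digit_sum else 0.0


good = "12" + "x" * 62          # 64 characters, digit sum 3
wrong_sum = "13" + "x" * 62     # 64 characters, digit sum 4
wrong_length = "12" + "x" * 10  # digit sum 3 but only 12 characters

print(rule_reward(good))          # 1.0
print(rule_reward(wrong_sum))     # 0.0
print(rule_reward(wrong_length))  # 0.0
```

Because the checker is deterministic and cheap, it can score every sampled completion in a group before the rewards are normalized.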

Expected outcome: A fine-tuned Qwen2.5-0.5B-Instruct checkpoint that reliably produces rule-compliant generations, along with logs showing normalized rewards, entropy weights, and pass rates.

CS student: Run the same recipe with gpt2 on a single RTX 4070, reduce batch size to 8, and lower the target character count to 32 to fit the smaller model while still observing normalized GRPO updates.
Applied engineer: After the Colab run, quantize the checkpoint with bitsandbytes 8-bit quantization, deploy via vLLM on an A10 instance, and measure p50 completion latency with caching; re-run the rule checker on the served model to confirm that the pass rate survives quantization.
Applied researcher: Ablate the entropy-scaling rule by fixing the bonus and compare convergence rates; the hypothesis is that dynamically scaled entropy outperforms the constant bonus in sparse-reward scenarios, as measured by a larger reduction in reward variance over training.
Frontier researcher: Probe the open question in §What's still open by tying the entropy regularizer to a KL constraint against the pre-trained policy; treat the falsifier as any prompt where the KL-constrained policy yields a lower reward than GRPO, and report whether the constraint preserves factual outputs while improving reasoning.
