Q-learning¶

Imagine piloting a car through a city whose map keeps changing and whose traffic reports are delayed by days. You cannot rely on a single GPS unit that only records your own journeys, because once you get lost you keep reinforcing the wrong turn. Instead, you collect every route taken in the city—by careful commuters, distracted tourists, and erratic delivery drivers alike—and you let a single central planner learn from all of those messy tapes. Q-learning is that planner. It listens to every experience, even the ones generated by a policy that has no idea where it is headed, and extracts what ultimately leads to the destination fastest. By the end of this page you will understand how that off-policy magic works, why it is fragile with non-linear function approximators, and how modern replay engineering rescues it so you can train a Deep Q-Network on LunarLander-v3 without the training collapsing into divergence.

The territory¶

Off-policy value iteration is one of the foundational engines of reinforcement learning because it decouples the policy used to gather data (the behavior policy) from the policy we want to evaluate and improve (the target policy). Temporal-Difference learning gives you bootstrapped estimates of future returns, and Q-learning takes the maximum of those bootstraps across actions so that the target policy is implicitly greedy with respect to your current value estimate. Classic tabular Q-learning converges to \(Q^*\) because the Bellman optimality operator is a \(\gamma\)-contraction in the max norm, provided every state-action pair keeps being visited and the step sizes decay appropriately (the Robbins-Monro conditions). Once you replace the table with a neural network that generalizes across states, that guarantee no longer applies, and every off-policy sample becomes a potential source of divergence. The key question is: how do you keep learning the optimal action-value even though the experience you are replaying comes from old policies, random exploration noise, or collectors that crashed into walls?

This is where replay buffers and update normalization come in. Modern variants of Q-learning view the buffer not just as a FIFO queue but as a dataset organized by signal quality, by reliability of the return estimate, and by marginal value of fresh information. When combined with techniques such as double networks, target networks, and adaptive learning-rate scaling, a Deep Q-Network (DQN) can stay stable long enough to learn from the chaotic stream of off-policy data. The mechanism is best understood by starting from the Bellman optimality recursion, then tracing how the neural approximation, the buffer, and the sampling scheme interact to regulate the updates that reach the optimizer.

How it works¶

Q-learning wants to solve for the optimal action-value function \(Q^*(s,a)\) that satisfies Bellman optimality:

\[ Q^*(s,a) = \mathbb{E}_{s'}\left[ r(s,a) + \gamma \max_{a'} Q^*(s', a') \right] \]

where \(s\) and \(a\) are the current state and action, \(r(s,a)\) is the reward observed after taking \(a\) in \(s\), \(\gamma \in [0,1)\) is the discount factor, and the expectation is over the next state distribution induced by the environment dynamics. \(Q^*\) is the unique fixed point of the Bellman optimality operator defined by the right-hand side, which is a \(\gamma\)-contraction in the max norm. Q-learning performs stochastic approximation toward this fixed point by observing elementary transitions \((s,a,r,s')\) collected under some behavior policy \(\mu\) and performing updates of the form

\[ Q(s,a) \leftarrow Q(s,a) + \alpha \left( r + \gamma \max_{a'} Q(s', a') - Q(s,a) \right) \]

where \(\alpha\) is the step size. The beauty is that the update targets the greedy policy (the \(\max_{a'}\) term) even if \(a\) was chosen by a completely different policy \(\mu\); hence the algorithm is off-policy.
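To make the update rule concrete, here is a minimal sketch of tabular Q-learning. The five-state chain environment, the function name `q_learning_chain`, and the hyperparameter values are all assumptions chosen for illustration. The behavior policy is uniformly random, yet the learned greedy policy still heads for the goal, which is the off-policy property described above.

```python
import random

def q_learning_chain(n_states=5, episodes=500, alpha=0.1, gamma=0.9, seed=0):
    """Tabular Q-learning on a hypothetical chain MDP.

    States are 0..n_states-1, actions are 0 (left) and 1 (right).
    Entering the last state pays reward 1 and ends the episode.
    """
    rng = random.Random(seed)
    goal = n_states - 1
    Q = [[0.0, 0.0] for _ in range(n_states)]
    for _ in range(episodes):
        s = 0
        for _ in range(100):  # cap on steps per episode
            a = rng.randrange(2)  # behavior policy mu: uniformly random
            s_next = max(0, s - 1) if a == 0 else s + 1
            r = 1.0 if s_next == goal else 0.0
            # Terminal states contribute no bootstrapped value.
            bootstrap = 0.0 if s_next == goal else max(Q[s_next])
            # Q(s,a) <- Q(s,a) + alpha * (r + gamma * max_a' Q(s',a') - Q(s,a))
            Q[s][a] += alpha * (r + gamma * bootstrap - Q[s][a])
            s = s_next
            if s == goal:
                break
    return Q

Q = q_learning_chain()
# Greedy action in each non-terminal state, read off the learned table.
greedy_policy = [max((0, 1), key=lambda a: Q[s][a]) for s in range(4)]
print(greedy_policy)
```

The agent never acts greedily while collecting data; the greedy policy only appears through the max inside the target.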

Once \(Q\) is parameterized by a neural network \(Q_\theta\), the update becomes gradient descent on the temporal-difference (TD) error:

\[ L(\theta) = \mathbb{E}_{(s,a,r,s')} \left[ \left( Q_\theta(s,a) - \left( r + \gamma \max_{a'} Q_{\theta^-}(s', a') \right) \right)^2 \right] \]

where \(Q_{\theta^-}\) denotes the target network frozen at parameters \(\theta^-\) that lag behind \(\theta\) to stabilize bootstrapping. The expectation is taken over transitions sampled from the replay buffer. Annotating, \(Q_\theta(s,a)\) is the current estimate of the action-value for \((s,a)\), \(r\) is the observed scalar reward, \(s'\) is the next state, and the inner max selects the greedy action under the target network. This objective is prone to overestimation bias because the max is taken over noisy estimates: the expected maximum of noisy values is at least the maximum of their expectations, so the target is biased upward. Double DQN reduces this bias by selecting the greedy action with the online network \(Q_\theta\) and evaluating it with the target network \(Q_{\theta^-}\). This is the first example of how structural modifications counter estimation pathologies in off-policy learning.
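The difference between the standard DQN target and the Double DQN target is easiest to see on plain numbers. The sketch below uses hand-picked Q-values for a single transition; the function names and the `done` flag (which drops the bootstrap term at terminal states and is not part of the loss as written above) are assumptions for illustration.

```python
def dqn_target(r, gamma, q_target_next, done):
    """Standard target: r + gamma * max_a' Q_target(s', a')."""
    if done:
        return r  # assumption: no bootstrapping past a terminal state
    return r + gamma * max(q_target_next)

def double_dqn_target(r, gamma, q_online_next, q_target_next, done):
    """Double DQN target: the online net picks a', the target net scores it."""
    if done:
        return r
    a_star = max(range(len(q_online_next)), key=lambda a: q_online_next[a])
    return r + gamma * q_target_next[a_star]

# Hypothetical Q-values at the next state s' for three actions.
q_online_next = [1.0, 2.5, 2.0]   # online network prefers action 1
q_target_next = [1.2, 1.8, 3.0]   # target network prefers action 2

y_dqn = dqn_target(1.0, 0.99, q_target_next, done=False)
y_ddqn = double_dqn_target(1.0, 0.99, q_online_next, q_target_next, done=False)
print(y_dqn, y_ddqn)
```

Because the Double DQN target evaluates one particular action under the target network, it can never exceed the standard target, which takes the maximum over that same network.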

When sampling from the experience buffer is uniform, every stored transition has an equal chance of being replayed, regardless of how relevant or how recent it is. Early, noisy transitions therefore keep being replayed until they are evicted, and the network keeps fitting targets built on a poorly trained critic. Prioritized replay instead samples transitions in proportion to a power of their absolute TD error, under the assumption that large error means there is something new to learn, and corrects the resulting sampling bias with importance-sampling weights. In practice, however, the TD error itself is a noisy signal early in training, and a DQN can end up overfitting to those early, unstable examples. The ReaPER+ annealed replay strategy (generalized from Anonymous et al. (2026)) addresses exactly this fragility by starting with TD-error prioritization and smoothly transitioning to reliability-aware sampling as the network matures. Reliability here is quantified by the variance of the TD error across bootstrap samples; samples whose TD variance shrinks are more trustworthy, so ReaPER+ increases their sampling probability later in training without completely discarding the harder ones. This annealing prevents the buffer from overfitting to early noise and keeps a broader representation of the state-action space.
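As a point of reference for the prioritization discussed above, here is a minimal sketch of proportional prioritized sampling with importance-sampling weights. It is a linear-time toy, not a sum-tree implementation, and it does not attempt to reproduce ReaPER+. The function name, the exponent values, and the choice to normalize weights by the batch maximum are assumptions for illustration.

```python
import random

def sample_prioritized(td_errors, batch_size, alpha=0.6, is_beta=0.4,
                       eps=1e-6, rng=None):
    """Sample indices with P(i) proportional to (|delta_i| + eps) ** alpha.

    Returns the sampled indices, their importance-sampling weights
    (scaled so the largest weight in the batch is 1), and the full
    probability vector.
    """
    rng = rng or random.Random(0)
    priorities = [(abs(d) + eps) ** alpha for d in td_errors]
    total = sum(priorities)
    probs = [p / total for p in priorities]
    n = len(td_errors)
    idx = rng.choices(range(n), weights=probs, k=batch_size)
    # w_i = (N * P(i)) ** (-beta) offsets the non-uniform sampling.
    weights = [(n * probs[i]) ** (-is_beta) for i in idx]
    max_w = max(weights)
    weights = [w / max_w for w in weights]
    return idx, weights, probs

idx, weights, probs = sample_prioritized([0.1, 2.0, 0.5], batch_size=4)
print(idx, weights, probs)
```

Each sampled transition's squared TD error would be multiplied by its weight before averaging, so frequently sampled transitions do not dominate the gradient.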

Another angle on stability is to make the critic less brittle by conditioning updates on a short-horizon model of the dynamics. QT-TDM (2025) introduces a Transformer Dynamics Model that predicts the next few latent states conditioned on a short action sequence, and pairs it with an autoregressive Q-Transformer that rolls out those latents to estimate action-values without having to plan long horizons explicitly. By offloading the forward model to a Transformer and keeping the Q-function autoregressive but shallow, QT-TDM avoids exploding TD errors while still using long-range information. The learned dynamics model also serves as a critic regularizer: the Q-Transformer is trained to match the return predicted by the dynamics model for the trajectory segments, which acts as a learned target network that adapts to non-stationary policies.

Yet another modernization is update scaling using surprise in the latent representation. DISRC (2026) observes that when rewards are sparse, off-policy updates can explode because the critic sees very few informative signals and tries to propagate them through noisy bootstraps. Their solution is to compute a latent surprise score \(S\) based on a separate encoder's prediction error—if a transition is surprising in the latent space, the Q-update is down-weighted to avoid trusting uncorrelated noise. Mathematically, the update becomes

\[ L(\theta) = \mathbb{E}_{(s,a,r,s')} \left[ \left(1 - \tanh(S)\right) \left( Q_\theta(s,a) - \left( r + \gamma \max_{a'} Q_{\theta^-}(s', a') \right) \right)^2 \right] \]

where \(S\) is the surprise signal normalized to \([0,1]\) and the \(\tanh\) keeps the weight smooth in \(S\). Because \(S \in [0,1]\), the weight \(1 - \tanh(S)\) ranges from 1 at \(S = 0\) down to about 0.24 at \(S = 1\), so transitions whose latent surprise is high are down-weighted but never ignored. This slows the pressure from untrustworthy data while still letting reliable samples drive learning.
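The weighting in the loss above can be checked numerically. This sketch evaluates the \(1 - \tanh(S)\) factor and a batch-averaged weighted squared TD error on hand-picked numbers; the function names and the sample values are assumptions, and the surprise scores are treated as fixed inputs rather than learned quantities.

```python
import math

def surprise_weight(s):
    """Per-transition loss weight 1 - tanh(S), for S in [0, 1]."""
    return 1.0 - math.tanh(s)

def surprise_scaled_loss(batch):
    """Mean of weight * (Q(s,a) - target)^2 over (q_sa, target, S) triples."""
    total = 0.0
    for q_sa, target, s in batch:
        total += surprise_weight(s) * (q_sa - target) ** 2
    return total / len(batch)

# Two transitions with the same TD error but different surprise scores.
batch = [
    (1.0, 2.0, 0.0),  # unsurprising: full weight
    (1.0, 2.0, 1.0),  # maximally surprising: weight of roughly 0.24
]
print(surprise_weight(0.0), surprise_weight(1.0), surprise_scaled_loss(batch))
```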

A practical implementation of these ideas needs careful engineering of the replay buffer: it must store the TD error, surprise score, and reliability flag for each transition; support updating priorities efficiently; and allow annealed sampling schedules. In addition, gradient clipping and adaptive optimizers help integrate the scaled loss without overshooting. Without these controls, the biased gradient built from the max operator, the bootstrapping noise, and rare signals would push the critic parameters far from the manifold that actually represents the optimal Q-function. ReaPER+, QT-TDM, and DISRC all act on different axes of this instability—buffer sampling, model-based regularization, and update scaling—so combining them is what lets a deep Q-learning system train on real-world tasks such as LunarLander-v3 or sparse-reward robotics benchmarks.
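One way to keep the per-transition metadata mentioned above is a small record that is updated every time a transition is replayed. The sketch below tracks an exponentially weighted mean and variance of the TD error, so that a low standard deviation can stand in for high reliability. The class name `TransitionStats`, the decay value, and the max-normalization helper are assumptions, not the bookkeeping used by any of the cited methods.

```python
import math
from dataclasses import dataclass

@dataclass
class TransitionStats:
    """Hypothetical per-transition metadata stored next to (s, a, r, s')."""
    td_mean: float = 0.0
    td_var: float = 0.0
    surprise: float = 0.0
    updates: int = 0

    def update(self, td_error, decay=0.9):
        """Fold a newly computed TD error into the running mean and variance."""
        if self.updates == 0:
            self.td_mean = td_error
            self.td_var = 0.0
        else:
            diff = td_error - self.td_mean
            self.td_mean += (1.0 - decay) * diff
            # Standard exponentially weighted variance recurrence.
            self.td_var = decay * (self.td_var + (1.0 - decay) * diff * diff)
        self.updates += 1

    @property
    def td_std(self):
        return math.sqrt(self.td_var)

def normalized_unreliability(stats):
    """Scale TD standard deviations to [0, 1] by the largest one seen."""
    max_std = max(s.td_std for s in stats)
    if max_std == 0.0:
        return [0.0 for _ in stats]
    return [s.td_std / max_std for s in stats]

steady, noisy = TransitionStats(), TransitionStats()
for k in range(20):
    steady.update(0.5)                           # TD error never changes
    noisy.update(0.5 if k % 2 == 0 else -0.5)    # TD error keeps flipping sign
print(normalized_unreliability([steady, noisy]))
```

A production buffer would store these fields in preallocated arrays rather than Python objects, but the update rule is the same.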

Theoretically, there are several convergence results for related algorithms that illustrate why these stabilizations are needed. Anonymous et al. (2026) [arxiv:2603.21621] analyze the monotonicity properties of multi-step off-policy updates and show that without careful control of the extrapolation length, the bootstrapped target can diverge. Another recent contribution, Anonymous et al. (2026) [arxiv:2604.08865], adapts concentration inequalities to quantify how the replay buffer's sampling distribution drifts away from the stationary distribution of the target policy, which justifies annealing priorities toward reliability. Anonymous et al. (2026) [arxiv:2602.01156] further demonstrate that when you scale TD updates by a surrogate surprise term, you can bound the gradient variance even in the presence of sparse rewards. Finally, even though Proximal Policy Optimization is on-policy, the approximate ascent proof provided by Anonymous et al. (2026) [arxiv:2602.03386] inspires similar trust-region ideas in Q-learning by suggesting that controlling the KL divergence between successive greedy policies stabilizes bootstrapping. These theoretical underpinnings explain why the practical tricks described earlier—target networks, update scaling, annealed replay—are not just engineering hacks but necessary components to keep the fixed-point iteration in the contraction regime while the neural network is changing.

Where the field is now¶

The current frontier of Deep Q-learning is less about inventing new Bellman operators and more about replay buffer engineering and risk-aware updates. ReaPER+ (Anonymous et al. 2026) [arxiv:2604.21863] is the reigning empirical champion for sample efficiency across noisy domains: on LunarLander-v3 it reports 4× faster convergence than standard prioritized replay, and on quantum circuit optimization it reaches high-fidelity solutions with 32× fewer episodes. Its reliability-aware annealing is now part of every baseline that competes on sparse or high-noise MDPs. At the same time, QT-TDM (2025) shows how pairing short-horizon Transformers with autoregressive Q-estimators allows off-policy learning to ingest very long action sequences without planning explicitly. This paper’s hybrid architecture—modeling dynamics in latent space and deriving rewards through an autoregressive critic—beats DDPG and SAC on continuous control benchmarks like Humanoid-v4 by reducing divergence from the greedy policy. DISRC (2026) closes the loop for sparse rewards by dynamically scaling updates with a latent surprise score; its evaluation on Montezuma’s Revenge and hard exploration tasks shows that catastrophic Q-value explosions are avoided without sacrificing asymptotic performance.

On the engineering front, DeepMind’s Lotto RL stack (2025 internal memo) uses prioritized replay with a reliability-aware scheduler that is conceptually similar to ReaPER+ and runs on distributed TPU pods. OpenAI’s Recurrent DQN deployments for robotic grasping combine target networks with surprise scaling to maintain stability when human operators occasionally inject dangerous actions. The production pattern that emerges is: collect diverse off-policy data in a replay storage tier, annotate each transition with TD error, surprise, and temporal reliability, and then sample according to a multi-objective scoring function that transitions from exploration-driven to exploitation-driven as the critic matures. ReaPER+ is the proof-of-concept that the buffering policy matters more than the optimizer choice in these setups, and the batches supplied to the optimizer are the real throttle on the learning curve.

A research frontier remains in making these stability analyses more principled. For example, Anonymous et al. (2026) [arxiv:2603.21621] and Anonymous et al. (2026) [arxiv:2604.08865] point toward the need for tighter bounds on the distribution shift induced by replay sampling, and a natural next step is to extend these bounds to non-linear function approximators with generalization error estimates. On the systems side, scaling the sampler to multi-agent replay memories while ensuring reliability-aware annealing still behaves correctly is an open engineering project: the distributed buffer must propagate updated priorities without latency spikes, or else the sampling schedule degenerates.

What's still open¶

Can we identify a set of sufficient conditions—ideally checkable during training—such that deep Q-learning converges with non-linear function approximators without relying on heuristic target networks or replay annealing? The existing convergence proofs, such as Anonymous et al. (2026) [arxiv:2603.21621], still assume linear functions or uniform sampling, so extending them to ReaPER+-style prioritized buffers is necessary for theoretical closure. Another question is whether surprise-scaled updates like DISRC can be derived from a variational perspective that simultaneously learns the surprise encoder and the critic, rather than treating surprise as an external signal. Finally, how do we design distributed replay systems that preserve the reliability ordering while operating at millions of transitions per second? Any practical solution must address the trade-off between consistency of priorities and throughput in a way that scales to multi-agent settings.

Where to read next¶

For the replay-first perspective that this page emphasizes, → [[experience-replay]] traces the history from uniform buffers to prioritized sampling and replay-scheduling heuristics. If you want the probabilistic foundation underneath Q-learning’s contraction arguments, → [[bellman-equations]] and → [[dynamic-programming]] lay out the exact operators and performance guarantees. The engineering counterpart is → [[distributed-replay-systems]], which explains how large-scale buffers with prioritized or reliability-aware sampling are implemented in production clusters.

Build it¶

This build is your first end-to-end implementation of Q-learning that keeps the critic stable by shaping the replay buffer and scaling updates as the network learns. It is a PyTorch Deep Q-Network that trains on Gymnasium’s LunarLander-v3 environment with a simplified version of the ReaPER+ annealed replay rule, so you can observe how prioritization transitions from TD-error-driven to reliability-aware sampling.

What you're building: A reproducible DQN training script that evaluates ReaPER+ style replay scheduling on LunarLander-v3, producing logs for TD-error, reliability score, and cumulative reward.

Why this is valuable: It forces you to manage every component that breaks in off-policy training: buffering, sampling, normalization, and evaluation dashboards.

Stack:
- Model: bbmb/deep-learning-for-embedding-model-ssilwal-qpham6_army_doc (1.2k downloads), used here to initialize the state encoder before finetuning the DQN head.
- Dataset: LunarLander-v3 episodes collected on the fly via gymnasium (no offline dataset download required).
- Framework: PyTorch 2.1 + torchrl 0.6 for replay storage, tensorboard for monitoring.
- Compute: 8GB VRAM (Colab T4 / RTX 3060), ~2 hours for 1 million steps with a replay buffer capped at 200k transitions.

The recipe:
1. Install the stack with pip install torch==2.1.0 torchrl "gymnasium[box2d]" tensorboard huggingface_hub (LunarLander needs the Box2D extra), clone a repository with a buffer implementation (e.g. git clone https://github.com/q-learning-engine/dqn-reaper && cd dqn-reaper), and download the HuggingFace embedding model with snapshot_download from huggingface_hub.
2. Collect data by running LunarLander-v3 with an epsilon-greedy policy, storing transitions with TD error, reliability estimate (moving average of TD standard deviation), and surprise score (variance-normalized reward prediction) in the buffer.
3. Train the DQN head by sampling mini-batches where the sampling probability \(P(i)\) for transition \(i\) is computed as

\[ P(i) \propto (|\delta_i| + \epsilon)^\alpha \cdot (1 - \rho_i)^\beta \]

   where \(\delta_i\) is the TD error, \(\rho_i \in [0,1]\) is the normalized moving-average TD standard deviation from step 2 (so \(1 - \rho_i\) is larger for more reliable transitions), \(\alpha\) anneals down from 0.7 to 0.1, \(\beta\) anneals up from 0 to 0.5, and \(\epsilon=10^{-6}\); update the embedding encoder with a smaller learning rate than the head.
4. Evaluate every 10,000 steps by running deterministic episodes for 5 seeds, logging average episodic reward and the distribution of sampled reliabilities; you should expect stable rewards around 200 once the reliability-aware phase kicks in and the TD-error tail from early noise has diminished.
5. What you now have is a checkpointed DQN whose replay buffer can be frozen and replayed to reproduce the ReaPER+ schedule, along with TensorBoard dashboards showing how the reliability metric rises as the network matures.
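The sampling rule in step 3 can be sketched in a few lines. The version below reads \(\rho_i\) as the normalized TD standard deviation from step 2, so that \(1 - \rho_i\) is larger for more stable transitions, and it uses a linear schedule for both exponents. The recipe only says the exponents anneal, so the linear shape, the function names, and the clamp that keeps the reliability factor positive are all assumptions.

```python
def linear_anneal(start, end, step, total_steps):
    """Linearly interpolate from start to end over total_steps, then hold."""
    frac = min(max(step / total_steps, 0.0), 1.0)
    return start + frac * (end - start)

def sampling_probs(td_errors, rho, step, total_steps, eps=1e-6):
    """P(i) proportional to (|delta_i| + eps)^alpha * (1 - rho_i)^beta."""
    alpha = linear_anneal(0.7, 0.1, step, total_steps)  # TD-error exponent
    beta = linear_anneal(0.0, 0.5, step, total_steps)   # reliability exponent
    scores = []
    for delta, r in zip(td_errors, rho):
        reliability = max(1.0 - r, eps)  # clamp so a score is never exactly 0
        scores.append((abs(delta) + eps) ** alpha * reliability ** beta)
    total = sum(scores)
    return [s / total for s in scores]

# Two transitions with equal TD error; the second has a much noisier history.
td_errors = [1.0, 1.0]
rho = [0.1, 0.9]
early = sampling_probs(td_errors, rho, step=0, total_steps=1_000_000)
late = sampling_probs(td_errors, rho, step=1_000_000, total_steps=1_000_000)
print(early, late)
```

At the start of training the reliability exponent is zero, so only the TD error matters and both transitions are equally likely; by the end the more stable transition is sampled more often.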

Expected outcome: The named artifact is a PyTorch DQN checkpoint plus monitoring dashboards showing whether ReaPER+-style sampling improves stability on LunarLander-v3 (e.g., variance of episodic reward halves in the second half of training compared to uniform replay).

CS student: Run the same script in Colab with a smaller buffer (50k transitions) and shorter annealing schedule so the build completes in under 1 hour; the key is seeing how reliability sampling salvages training once the TD-error window diminishes.
Applied engineer: Extend the replay buffer to a Redis-backed service, quantize the DQN head to INT8 with torch.ao.quantization, and deploy an inference endpoint that serves the agent’s Q-values and greedy action at p50 < 120ms on an A10 instance.
Applied researcher: Hypothesize that the reliability score can be derived from a learned critic ensemble; modify the training script to average TD errors from three heads and test whether ensemble-based reliability scheduling beats the deterministic metric within 2k steps.
Frontier researcher: Probe whether the entire annealing schedule can be learned end-to-end with a Meta-Controller trained via reinforcement learning—the falsifier is that the learned schedule fails to recover stability when the reliability signal is perturbed, indicating that the handcrafted annealing is still necessary.
