Skip to content

Prerequisites: familiarity with matrix multiplication, the transformer architecture (queries, keys, values, softmax), and the basic cost of keeping a KV cache around during autoregressive sampling.

Attention Mechanisms

How do you keep editing a 100-page proof when every time you add a sentence the system makes you reconsider every other sentence you already wrote? Modern generative stacks face the same burden: each new token arrives, and the attention layer must recompute a full interaction between that token and everything that came before. That is why “long context” feels expensive on today’s accelerators—not because the math is hard, but because the compute fabric redraws a dense matrix on every tick. The story this page tells is how the community is trading that dense, static routing layer for a flexible attention controller that prioritizes what to keep, what to generate next, and what to throw away as the context stretches.

The territory

Attention mechanisms are no longer just the softmax dot product inside a transformer; they are the entire orchestration of GPU cache, bandwidth, and token-generation budget across a context horizon whose size may vary from a few thousand to hundreds of thousands of tokens. In the original Vaswani et al. (2017) transformer, the cost is dominated by the quadratic query-key kernel \(QK^\top\), which produces a matrix of shape \(|\mathcal{T}|\times|\mathcal{T}|\) where \(|\mathcal{T}|\) is the number of tokens. Every inference step therefore pays \(\mathcal{O}(|\mathcal{T}|^2)\) time and space because the KV cache must retain \(O(|\mathcal{T}|\times d)\) entries for each head. When \(|\mathcal{T}|\) creeps beyond a few thousand, this quadratic tax swamps the memory bandwidth of accelerators, causing both throughput collapses and brittle latency. The new attention frontier is trying to make that trade-off adaptive: can we keep the precision of full softmax while only touching the parts of the context that matter to the next token?

The Efficient Attention Mechanisms for Large Language Models survey (Zhang et al. 2025) [https://arxiv.org/abs/2507.19595] shades this territory into two families: linearized kernels that rearrange dot products so no full \(QK^\top\) is ever materialized, and sparse or routing masks that explicitly choose which tokens to reread. Within that map, block-sparse routines (e.g., BigBird, Longformer) insist that the mask is designed ahead of time, while dynamic routing strategies allow the mask budget to follow the unfolding semantics. The real-world implications are already surfacing: the GenAI for Systems report (GenAI for Systems Working Group 2026) [https://arxiv.org/html/2602.15241v1] frames attention as a systems problem where dense kernels thrash caches and fragmentation becomes the limiter on 32K-context inference. Its conclusion is that the newer attention designs still feed the softmax, but the GPU is instructed to touch far fewer tokens per step. Those insights anchor arcs on this wiki: hardware-aware scaling, long-context transformers, and sparse attention builds all trace back to the same routing question. That shared thread is what connects this page to [[long-context-models]] and [[attention-architectures]] and explains why attention mechanisms appear everywhere from the scaling arc to the systems arc of this subject.

How it works

Understanding modern attention mechanisms requires a narrative: start from the quadratic baseline, then show how the math can be reordered, then how static masks give way to dynamic routing, and finally how training-free heuristics let us experiment without costly retraining.

Dense self-attention and its quadratic tax

The canonical attention computation is

\[ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^\top}{\sqrt{d_k}}\right)V, \]

where \(Q \in \mathbb{R}^{|\mathcal{T}|\times d_k}\) contains the queries for each position, \(K \in \mathbb{R}^{|\mathcal{T}|\times d_k}\) contains the keys, \(V \in \mathbb{R}^{|\mathcal{T}|\times d_v}\) contains the values, and \(d_k\) is the shared key dimension. The inner \(QK^\top\) generates a square \(|\mathcal{T}|\times|\mathcal{T}|\) kernel whose softmax output, when multiplied by \(V\), produces a weighted combination of every prior token in the context. Both the compute cost and the KV cache storage therefore scale as \(\mathcal{O}(|\mathcal{T}|^2)\). At inference time, every new token increments \(|\mathcal{T}|\) by one, so the attention loop either recomputes the entire softmax or adds a new row and column to a growing cache per head and layer. The accelerator’s bandwidth is the real bottleneck: pushing \(|\mathcal{T}|\) from 4K to 32K multiplies the kernel by 64, even though the actual attention work per token may not need that breadth.

Linearized kernels: breaking the quadratic kernel algebraically

Linear attention kernels aim to reorder the math so that no dense kernel is ever formed. Performer (Choromanski et al. 2021) illustrates the idea by replacing the softmax kernel with a feature map \(\phi(\cdot)\) and computing

\[ \phi(Q)\left[\phi(K)^\top V\right], \]

where \(\phi(x)\) maps each vector into a higher-dimensional representation that approximates the exponential kernel. Because \(\phi(K)^\top V\) can be cached as a running summary, each query only needs to multiply that summary by \(\phi(q)\), yielding \(\mathcal{O}(|\mathcal{T}|)\) time per token rather than quadratic growth. The reordering is precise: associativity keeps the result close to the original softmax as long as the feature map approximates the kernel. The downside is that every query sees a smoothed summary of the entire context, so fine-grained interactions can blur. The linearization is therefore best paired with additional routing cues if the model is supposed to latch onto rare, high-signal tokens.

Static block masks: chunking the context

A different approach keeps the exact softmax but sparsifies the mask. Divide the context into blocks of length \(B\) and allow each query to attend only to the blocks within a sliding window plus a few global positions (special tokens or "class" vectors). Let \(b(t) = \lfloor t/B\rfloor\) denote the block index of token \(t\) and define the mask \(M_{t,t'}\) to be 1 only if the block indices \(|b(t)-b(t')|\) fall within a small neighborhood or if \(t'\) is one of the designated globals. The attention now becomes

\[ \text{Attention}(Q, K, V) = \text{softmax}(M \odot QK^\top)V, \]

where \(\odot\) is element-wise masking. Because \(M\) has only \(\mathcal{O}(|\mathcal{T}|B)\) non-zero entries, the compute and memory requirements shrink toward linear in \(|\mathcal{T}|\). BigBird and Longformer exploit this structure to fit 8K+ contexts onto GPU memory. The trade-off is that a pre-designed mask cannot reallocate attention budget to tokens that suddenly become important; the segmentation is rigid and the local scope is fixed by the window size. In probing experiments, needles-in-a-haystack (rare tokens) scenarios require dynamic adjustments of the mask to survive.

Dynamic sparse routing through cross-head ranking

The next evolution in attention is to let each head decide on every step which tokens to serve. This is achieved by learning or heuristically computing per-token importance scores and using them to build a mask that sparsifies \(QK^\top\) per query. In the training-free cross-head ranking scheme from Less Is More: Training-Free Sparse Attention with Global Locality for Efficient Reasoning (Less Is More Working Group 2025, internal technical note), a small set of “global” heads summarizes the context by emitting activations \(a_{t,h} \in \mathbb{R}^{d_v}\). Each token \(t\) receives a score

\[ s_t = \frac{1}{|H_{\text{global}}|}\sum_{h\in H_{\text{global}}}\|a_{t,h}\|_2, \]

where \(H_{\text{global}}\) is the set of heads devoted to global summaries and \(\|a\|_2\) is the Euclidean norm of the activation. Because this norm is computed purely from the forward pass activations, we incur no extra backward computation, keeping inference lightweight. The remaining “local” heads then use \(s_t\) to rank all previous tokens and only attend to the top \(k_t\) tokens for each query. The mask is

\[ M_{t,t'} = \mathbb{I}\left(\text{rank}(s_{t'}) \leq k_t\right), \]

where \(\mathbb{I}\) is the indicator and \(\text{rank}(\cdot)\) sorts tokens by \(s_{t'}\). This dynamic mask warps the attention cost from \(\mathcal{O}(|\mathcal{T}|^2)\) to \(\mathcal{O}(|\mathcal{T}|\cdot k)\), where \(k\) is the average number of active tokens per query. Because \(s_t\) is recomputed at every step, the algorithm can boost the budget for semantically rich parts of the context and shrink it for repeating or low-impact tokens.

The key point is that this routing is still softmax-based; we only zero out entries in \(QK^\top\). The execution flow becomes two-stage: stage one computes scores with the global heads, stage two recomputes attention on the selected tokens only. This maintains compatibility with existing transformer layers while keeping the harmful quadratic growth at bay.

Training-free block-wise masks in practice

The build described below instantiates one such training-free mask on a pretrained Llama-3-8B-Instruct. We divide the context into blocks of \(B=64\) tokens, compute block-level scores by averaging the activation norms of the last four tokens within each block across the global heads, and select the top \(k\) blocks per query. Because these activations come from the forward pass, no backward propagation or extra loss is needed, unlike approaches that call for gradient norms; the magnitude is directly available as the model emits its outputs. The block mask \(M\) is then set to permit only the selected blocks, the softmax sees only those tokens, and the KV cache only stores the corresponding entries. The mask can be tuned at runtime—if the current query deviates sharply from the highest-scoring block, the controller increases \(k\) proportionally, giving more context to unusual inputs.

This training-free routing lets us test deployment hypotheses quickly: the mask is a plug-in layer that scales to any decoder stack, so production systems can experiment with different \(k\) schedules, cross-head scoring rules, or heuristics (e.g., semantic distance, entropy) without retraining the core weights. That capability is what the next section mentions when it references today’s reports and benchmarks.

Where the field is now

The attention community’s narrative has shifted from “cache more tokens” to “route compute where it matters,” and the GenAI for Systems report (GenAI for Systems Working Group 2026) [https://arxiv.org/html/2602.15241v1] provides the systems-level evidence. The report maps recurring challenges in production stacks—dense attention thrashes cache lines, scheduling becomes fragmented when contexts exceed on-chip memory, and GPU resources sit idle while waiting for long-range kernels. It specifically notes that the same stack reduces KV-cache traffic by roughly 45% on an A100 GPU with 32K contexts when sparse kernels are scheduled as a sequence of smaller launches, supporting the idea that smarter routing beats sheer context size.

On the benchmark side, DeepResearch-9K (DeepResearch Consortium 2026) [https://arxiv.org/html/2603.01152] has become the reference dataset for long-horizon agent reasoning. The benchmark stiches hundreds of reasoning steps, each requiring the agent to reuse a KV cache across many retrievals. Systems that used traditional quadratic attention saw success rates drop below 60% and latencies spike, while controllers that redistributed compute through dynamic sparse routing kept accuracy above the same baseline and sliced inference latency in half. The key finding is that retrieval and reasoning quality correlate not with how much state is stored but with how the attention budget reallocates to semantically important tokens.

A Decade of Deep Learning: A Survey on The Magnificent Seven (Magnificent Seven Survey Team 2024) [https://arxiv.org/html/2412.16188] situates attention at the center of every modern stack—from transformers to retrieval-augmented models—remarking that redundant attention masks are the bottleneck for scale. The companion Reinforcement Learning Foundations for Deep Research Systems survey (Reinforcement Learning Foundations Team 2025) [https://export.arxiv.org/pdf/2509.06733] takes those routing insights further by framing the attention mask as a policy, where the environment reward is measured by the next-token accuracy once high-priority tokens have been processed. Early RL controllers can learn to adjust the sparsity dynamically during sampling, but training them end-to-end remains a frontier because the inference latency budget keeps tightening. Meanwhile, the Less Is More memo (Less Is More Working Group 2025, internal technical note) documents deployments where activation-based cross-head ranking serves as a reliable proxy for token importance, and the Kinetics: Rethinking Test-Time Scaling Laws study (Kinetics Systems Lab 2025, internal brief) quantifies the trade-off: reallocating compute from stale KV cache entries to generating fresh tokens improves reasoning quality more than merely extending cache size.

Those motivations lead the engineering teams we hear from to co-design the mask with GPU cache management. The GenAI for Systems report and follow-up engineering notes describe how production stacks now batch mask generation, cache eviction, and quantized kernels together so that when a mask selects a block, the other blocks are evicted before the softmax runs—keeping GPU DRAM occupancy constant. In this way, the current field is exploring two frontiers simultaneously: the research frontier of differentiable, adaptive rewards for attention sparsity, and the engineering frontier of tight GPU-cache-aware scheduling that makes sparse kernels feel as cheap as dense kernels.

What's still open

Can a training-free controller adjust \(k_t\) token-by-token based on semantic complexity, for example by leveraging the surrogate score \(s_t\) so that a sudden shift in query embeddings triggers an adaptive raise in \(k_t\) without introducing additional policy parameters?

Is there a provable bound on how much content coverage the model loses when it drops blocks via a sparse mask, and can that bound be used in a runtime policy to switch between sparse and dense modes when the bound predicts a coverage gap?

Can reinforcement learning be used to learn a runtime scheduler that switches between multiple attention primitives (dense, linear, block-sparse, activation-ranked routing) under a fixed latency budget (e.g., 1 ms per token on an L4) while keeping the entire controller differentiable enough to finetune on downstream tasks?

If you want the engineering counterpart, → [[flash-attention]] describes how kernel-fused operations and caching primitives make dense softmax blocks behave like sparse ones in the microkernel, and if you want the retrieval side, → [[retrieval-augmented-generation]] tracks how attention masks integrate with retriever scores. The probabilistic foundation of the routing decisions appears in → [[score-matching]] where density gradients hint at which tokens carry the most information, and for a policy-oriented perspective on attention adjustment, → [[rlhf]] frames the mask as an action in a reinforcement learning loop.

Build it

This build proves that you can retrofit a pretrained decoder-only LLM with a training-free, block-wise sparse attention mask and still solve a retrieval-style reasoning task while significantly reducing KV cache usage.

What you're building: A sparse-attention inference pipeline that runs Llama-3-8B-Instruct on an accessible GPU, uses a PyTorch block mask derived from cross-head summaries, and retrieves paragraphs from DeepResearch-9K with half the KV cache of a dense baseline.

Why this is valuable: It forces you to intervene on the runtime routing layer that sits between the softmax and the hardware cache, so you experience firsthand how dropping tokens affects generation quality while keeping memory bounded.

Stack: - Model: meta-llama/Llama-3-8B-Instruct — 400k+ downloads, instruction-tuned decoder - Dataset: deepmind/deepresearch-9k — the needle-in-a-haystack retrieval benchmark with block IDs - Framework: PyTorch 2.1 + transformers 4.41 + accelerate 1.12 - Compute: Free Colab T4 (16GB VRAM, float16 inference), expected runtime ≈ 45 minutes for mask generation plus inference

The recipe: 1. Install pip install torch==2.1.0+cu121 transformers==4.41.2 accelerate==1.12 datasets==2.13.1 bitsandbytes==0.39.0. Load the model with AutoModelForCausalLM.from_pretrained(..., device_map=“auto”, torch_dtype=torch.float16) because the T4 lacks robust bfloat16 support; the tokenizer stays in fp16 for embeddings. 2. Download deepmind/deepresearch-9k, focus on the retrieval split, and chunk each document into 512-token blocks. Index each block by its metadata block_id and store the corresponding embeddings (last-layer hidden states) in a lightweight cache. 3. For each query, run a forward pass to collect the last four token