Glossary¶

Key terms used across the Frontier Wiki — defined precisely, not verbosely.

🚧 Agent-generated content pending. Core terms seeded manually; agents will expand this glossary as they generate new topic pages.

A¶

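Attention — a mechanism that computes a weighted sum over value vectors, where the weights are computed from the pairwise similarity between query and key vectors (the projections that produce queries, keys, and values are learned; the weights themselves are not). The core mechanism in transformers.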
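
The definition above can be sketched in a few lines of plain Python. This is a minimal illustration, not a production kernel: it assumes the scaled dot-product form (scores divided by sqrt(d_k), the scaling commonly used in transformers), and the names `softmax`, `attention`, `Q`, `K`, `V` are chosen here for illustration.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attention(queries, keys, values):
    """Scaled dot-product attention on lists of equal-length vectors."""
    d_k = len(keys[0])
    dim_v = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k) (assumed form).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        outputs.append([sum(w * v[j] for w, v in zip(weights, values)) for j in range(dim_v)])
    return outputs

# Toy example: the single query is most similar to the first key,
# so the output leans toward the first value vector.
Q = [[1.0, 0.0]]
K = [[1.0, 0.0], [0.0, 1.0]]
V = [[10.0, 0.0], [0.0, 10.0]]
result = attention(Q, K, V)
```

Because the weights sum to 1, each output is a convex combination of the value vectors.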

ELBO — Evidence Lower Bound. The variational lower bound on log p(x) optimized in VAEs and other variational models. ELBO = E_{q(z|x)}[log p(x|z)] − KL(q(z|x) || p(z)), where the expectation is taken over the approximate posterior q(z|x).

Autoregressive model — a model that generates sequences by predicting one token at a time, conditioning each prediction on all previous tokens. GPT family models are autoregressive.

B¶

Backpropagation — the algorithm for computing gradients of a loss function with respect to all parameters in a neural network, using the chain rule of calculus. Makes gradient descent practical.

Bandwidth (memory bandwidth) — the rate at which data can be read from or written to memory. The primary bottleneck for many neural network operations on GPUs; see [[roofline-model]].

C¶

Causal masking — a mask applied to attention scores (before the softmax) in decoder-only models (GPT family) so that each token attends only to itself and earlier positions, never to future ones. Enables autoregressive generation.
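
As a rough sketch of how such a mask can be applied, the function below sets the scores of future positions to negative infinity before a row-wise softmax, so those positions receive exactly zero weight. The function name `causal_softmax` and the toy scores are illustrative assumptions.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    n = len(scores)
    weights = []
    for i in range(n):
        # Future positions (j > i) get -inf, so they contribute exp(-inf) = 0.
        masked = [scores[i][j] if j <= i else float("-inf") for j in range(n)]
        m = max(masked)  # finite, because the diagonal entry is never masked
        exps = [math.exp(s - m) for s in masked]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

scores = [[0.0, 1.0, 2.0],
          [1.0, 0.0, 3.0],
          [2.0, 1.0, 0.0]]
weights = causal_softmax(scores)
```

The first row can only attend to itself, so its weight vector is one-hot regardless of the raw scores.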

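CFG (Classifier-Free Guidance) — an inference technique for conditional diffusion/flow models that trades sample diversity for quality by combining conditional and unconditional predictions: the output is pushed from the unconditional prediction toward, and with guidance scales above 1 beyond, the conditional one.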
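
One common way to write the guided prediction is u + w(c − u), where u and c are the unconditional and conditional predictions and w is the guidance scale. The sketch below assumes that parameterization (other papers use an equivalent form with a shifted scale); `cfg_combine` and the toy vectors are hypothetical.

```python
def cfg_combine(pred_uncond, pred_cond, guidance_scale):
    """Combine unconditional and conditional predictions element-wise."""
    return [u + guidance_scale * (c - u) for u, c in zip(pred_uncond, pred_cond)]

uncond = [0.0, 1.0]
cond = [1.0, 3.0]
same_as_cond = cfg_combine(uncond, cond, 1.0)  # scale 1 recovers the conditional prediction
extrapolated = cfg_combine(uncond, cond, 2.0)  # scale > 1 moves past it
```

With a scale of 0 the result is the unconditional prediction; larger scales strengthen the conditioning signal at the cost of diversity.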

Confounding — when a third variable (confounder) causally influences both the treatment and the outcome, creating a spurious correlation. Controlling for confounders is the central challenge of observational causal inference.

D¶

Do-operator / do(X=x) — Pearl's intervention operator: P(Y | do(X=x)) is the distribution of Y when X is externally set to x, unlike P(Y | X=x) which conditions on observing X=x. The difference captures causation vs. correlation.
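
The gap between conditioning and intervening can be computed exactly on a small discrete model. The sketch below uses a hypothetical three-variable system (binary confounder Z, treatment X, outcome Y) with made-up probabilities, and obtains the interventional quantity by averaging over the prior of Z (the backdoor adjustment).

```python
# Hypothetical toy model; all probabilities are invented for illustration.
P_Z = {0: 0.5, 1: 0.5}                        # P(Z=z)
P_X1_GIVEN_Z = {0: 0.2, 1: 0.8}               # P(X=1 | Z=z)
P_Y1_GIVEN_XZ = {(0, 0): 0.1, (1, 0): 0.3,    # P(Y=1 | X=x, Z=z), keyed by (x, z)
                 (0, 1): 0.5, (1, 1): 0.7}

def p_x_given_z(x, z):
    return P_X1_GIVEN_Z[z] if x == 1 else 1.0 - P_X1_GIVEN_Z[z]

def p_y1_observational(x):
    """P(Y=1 | X=x): Z is weighted by how likely it is given the observed X."""
    joint = sum(P_Z[z] * p_x_given_z(x, z) * P_Y1_GIVEN_XZ[(x, z)] for z in P_Z)
    marginal = sum(P_Z[z] * p_x_given_z(x, z) for z in P_Z)
    return joint / marginal

def p_y1_interventional(x):
    """P(Y=1 | do(X=x)): X is set externally, so Z keeps its prior distribution."""
    return sum(P_Z[z] * P_Y1_GIVEN_XZ[(x, z)] for z in P_Z)

observed_gap = p_y1_observational(1) - p_y1_observational(0)
causal_effect = p_y1_interventional(1) - p_y1_interventional(0)
```

In this toy model the observed difference overstates the causal effect, because Z raises both the chance of treatment and the chance of the outcome.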

Diffusion model — a generative model that learns to reverse a gradual noising process. Forward: add Gaussian noise across T steps. Reverse: learn to denoise; at inference, start from noise and iteratively denoise.

E¶

ELBO — see the entry under A.

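Equivariance — a function f is equivariant to a transformation g if f(g(x)) = g(f(x)): transforming the input and then applying f gives the same result as applying f and then transforming the output. In molecular modeling, SE(3)-equivariant networks preserve 3D rotation/translation symmetry.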

F¶

Flow matching — a training framework for continuous normalizing flows that learns a vector field transporting noise to data along simple paths, avoiding expensive simulation during training.

Fine-tuning — adapting a pretrained model to a specific task or domain by continuing training on task-specific data. PEFT methods (LoRA, adapters) fine-tune only a small fraction of parameters.

G¶

GAN (Generative Adversarial Network) — a generative model consisting of a generator and discriminator trained in a minimax game. Generator tries to fool discriminator; discriminator tries to distinguish real from fake.

Gradient checkpointing — a memory optimization: recompute activations during the backward pass instead of storing them. Trades compute for memory; essential for training large models.

H¶

HBM (High Bandwidth Memory) — the main memory on modern GPUs (e.g., H100 has 80 GB HBM3). Slower than SRAM (the on-chip cache) but much larger. Many neural network operations are limited by HBM bandwidth rather than by compute.

I¶

In-context learning — a model's ability to perform new tasks from a few examples provided in the prompt, without any gradient updates. GPT-3 demonstrated that this ability emerges with scale.

K¶

KL divergence — Kullback-Leibler divergence: KL(P || Q) measures how different distribution P is from Q. Always ≥ 0; equals 0 iff P = Q. Not symmetric: KL(P || Q) generally differs from KL(Q || P). Used as the regularization term in VAEs and RLHF.
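
For discrete distributions the divergence is the sum of p_i · log(p_i / q_i). The sketch below computes it in nats and assumes q_i > 0 wherever p_i > 0 (otherwise the divergence is infinite and this code would raise an error); the function name and example distributions are illustrative.

```python
import math

def kl_divergence(p, q):
    """KL(P || Q) in nats for discrete distributions given as probability lists."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0.0:  # terms with p_i = 0 contribute 0 by convention
            total += pi * math.log(pi / qi)
    return total

p = [0.5, 0.5]
q = [0.9, 0.1]
kl_pq = kl_divergence(p, q)
kl_qp = kl_divergence(q, p)
kl_pp = kl_divergence(p, p)
```

Swapping the arguments changes the value, which is why the order of P and Q matters when the divergence is used as a regularizer.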

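KV cache — storing the key and value tensors for all previously generated tokens during autoregressive decoding, avoiding recomputation. Memory grows linearly with sequence length, and also scales with batch size, number of layers, and the number and dimension of key/value heads.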
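
A back-of-the-envelope size estimate follows directly from that description: two tensors (keys and values) per layer, each holding one vector per head per position. The helper below is a sketch under that assumption, and the model configuration is hypothetical rather than taken from any specific model.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Bytes needed to cache keys and values for every layer and position."""
    # The factor of 2 covers the two cached tensors (keys and values).
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch_size * bytes_per_elem

# Hypothetical configuration: 32 layers, 32 KV heads of dimension 128, 2-byte entries.
size_4k = kv_cache_bytes(32, 32, 128, seq_len=4096)
size_8k = kv_cache_bytes(32, 32, 128, seq_len=8192)
```

Doubling the sequence length doubles the cache, which is why long contexts and large batches are often limited by memory rather than compute.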

L¶

LoRA (Low-Rank Adaptation) — a PEFT method that adds low-rank update matrices ΔW = AB (A ∈ R^{d×r}, B ∈ R^{r×k}, r ≪ min(d, k)) to frozen pretrained weights. Typically trains well under 1% of the parameters with little quality loss.
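
The parameter saving is simple arithmetic: a full update to a d × k matrix has d·k entries, while the factors A and B together have r·(d + k). The sketch below works this out for a single matrix with assumed sizes; the fraction for a whole model depends on which matrices receive adapters.

```python
def lora_param_counts(d, k, r):
    """Return (full, lora) trainable-parameter counts for one d x k weight matrix."""
    full = d * k            # updating W directly
    lora = d * r + r * k    # A is d x r, B is r x k
    return full, lora

# Assumed sizes for illustration: a 4096 x 4096 projection with rank 8.
full, lora = lora_param_counts(d=4096, k=4096, r=8)
fraction = lora / full
```

Here the adapter trains about 0.4% of the matrix's parameters; smaller ranks or adapting only a subset of layers lowers the overall share further.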

Loss landscape — the surface defined by a model's loss as a function of its parameters. Neural networks have complex non-convex landscapes with many saddle points and local minima.

M¶

MDP (Markov Decision Process) — a mathematical framework for sequential decision-making: states S, actions A, transition function T(s'|s,a), reward function R(s,a), and usually a discount factor γ. The foundation of RL.

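Mixed precision — training with FP16 or BF16 for activations/gradients while keeping FP32 master weights. Roughly halves activation and gradient memory and can raise throughput severalfold on hardware with dedicated low-precision units; the exact gains depend on the model and the hardware.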

N¶

Neural operator — a neural network that learns maps between function spaces (infinite-dimensional), rather than between finite-dimensional vectors. DeepONet and FNO are neural operators.

Next-token prediction — the training objective for autoregressive LLMs: maximize the sum over positions t of log p(x_t | x_1, ..., x_{t-1}). A simple objective that produces rich representations.

O¶

Off-policy — in RL, learning a policy from data collected by a different (behavior) policy. Enables learning from historical data without online interaction.

Operator learning — learning a map from one function to another (e.g., from initial conditions to PDE solutions), rather than a map from vectors to vectors.

P¶

PEFT (Parameter-Efficient Fine-Tuning) — methods that fine-tune a small subset of parameters while keeping the pretrained model mostly frozen. Includes LoRA, adapters, prompt tuning.

Policy gradient — a class of RL algorithms that directly optimize the policy parameters by estimating the gradient of expected return. REINFORCE, PPO, and GRPO are policy gradient methods.

PPO (Proximal Policy Optimization) — a policy gradient algorithm with a clipped surrogate objective that prevents excessively large policy updates. Among the most widely used RL algorithms; used in RLHF.
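
The clipped surrogate can be illustrated per sample as min(r·A, clip(r, 1 − ε, 1 + ε)·A), where r is the probability ratio between the new and old policies and A is the advantage estimate. The sketch below assumes ε = 0.2, a commonly used value; the function name is illustrative.

```python
def ppo_clip_objective(ratio, advantage, clip_eps=0.2):
    """Per-sample clipped surrogate: min(r * A, clip(r, 1 - eps, 1 + eps) * A)."""
    clipped_ratio = max(1.0 - clip_eps, min(ratio, 1.0 + clip_eps))
    return min(ratio * advantage, clipped_ratio * advantage)

inside = ppo_clip_objective(ratio=1.1, advantage=2.0)      # within the clip range
capped = ppo_clip_objective(ratio=1.5, advantage=2.0)      # gain is capped at 1.2 * A
penalized = ppo_clip_objective(ratio=1.5, advantage=-2.0)  # the min keeps the worse value
```

Once the ratio leaves the clip range in the direction that would improve the objective, the objective stops increasing, so there is no incentive to move the policy further in one update.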

Q¶

Quantization — reducing numerical precision of weights/activations (FP32 → FP16/INT8/INT4) to reduce memory and increase inference throughput. GPTQ and AWQ are quantization methods for LLMs; GGUF is a file format for storing quantized models.
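
The simplest scheme to illustrate is symmetric round-to-nearest INT8 quantization with a single scale per tensor. This sketch is not GPTQ or AWQ, which choose quantized values more carefully; it assumes the input contains at least one non-zero value, and all names are illustrative.

```python
def quantize_int8(values):
    """Symmetric absmax quantization of a list of floats to INT8 codes plus a scale."""
    scale = max(abs(v) for v in values) / 127.0  # assumes a non-zero maximum
    codes = [round(v / scale) for v in values]
    return codes, scale

def dequantize(codes, scale):
    """Map integer codes back to approximate float values."""
    return [c * scale for c in codes]

weights = [0.5, -1.27, 0.003, 1.0]
codes, scale = quantize_int8(weights)
restored = dequantize(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The round-trip error of each value is at most half the scale, so a single large outlier coarsens the grid for every other weight in the tensor.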

R¶

RLHF — see [[rlhf]].

RoPE (Rotary Position Embedding) — encodes position by rotating query and key vectors through position-dependent angles, so that their dot product depends only on the relative offset between positions. The default in Llama, Mistral, and DeepSeek models.
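
The relative-position property can be checked numerically on a single 2-D pair. This is a simplified sketch: actual RoPE applies such rotations to many dimension pairs with different frequencies, whereas the code below uses one pair and an assumed frequency `theta`.

```python
import math

def rotate_2d(vec, position, theta=1.0):
    """Rotate a 2-D vector by an angle proportional to its position."""
    angle = position * theta
    c, s = math.cos(angle), math.sin(angle)
    x, y = vec
    return (c * x - s * y, s * x + c * y)

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

q, k = (1.0, 2.0), (0.5, -1.0)
score_a = dot(rotate_2d(q, 3), rotate_2d(k, 1))    # positions 3 and 1, offset 2
score_b = dot(rotate_2d(q, 12), rotate_2d(k, 10))  # positions 12 and 10, offset 2
score_c = dot(rotate_2d(q, 3), rotate_2d(k, 2))    # positions 3 and 2, offset 1
```

`score_a` and `score_b` agree up to floating-point error because only the offset between the two positions enters the dot product, while `score_c` differs because the offset changed.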

Roofline model — a performance model showing whether a computation is memory-bandwidth-limited (memory-bound) or compute-limited (compute-bound) based on arithmetic intensity.
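
In its basic form the model bounds attainable throughput by min(peak compute, arithmetic intensity × peak bandwidth), with arithmetic intensity measured in FLOPs per byte moved. The sketch below applies that bound to a hypothetical accelerator; the peak figures are round numbers chosen for illustration, not the specifications of any real device.

```python
def attainable_flops(arithmetic_intensity, peak_flops, peak_bandwidth):
    """Roofline bound in FLOP/s: the lower of the compute roof and the bandwidth roof."""
    return min(peak_flops, arithmetic_intensity * peak_bandwidth)

def is_memory_bound(arithmetic_intensity, peak_flops, peak_bandwidth):
    """True when intensity falls below the ridge point peak_flops / peak_bandwidth."""
    return arithmetic_intensity < peak_flops / peak_bandwidth

# Hypothetical accelerator: 100 TFLOP/s peak compute, 1 TB/s memory bandwidth.
PEAK_FLOPS = 100e12
PEAK_BW = 1e12
low = attainable_flops(2.0, PEAK_FLOPS, PEAK_BW)     # 2 FLOP/byte
high = attainable_flops(500.0, PEAK_FLOPS, PEAK_BW)  # 500 FLOP/byte
```

With these numbers the ridge point sits at 100 FLOP/byte: the low-intensity kernel is capped by bandwidth at 2 TFLOP/s, and the high-intensity one reaches the compute roof.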

S¶

SCM (Structural Causal Model) — a triple (V, U, F) of endogenous variables, exogenous noise, and structural equations; Pearl's framework for representing causal relationships.

Score function — ∇_x log p(x): the gradient of the log-density with respect to x. Learned by score matching; the basis of score-based and diffusion generative models.

Self-attention — attention where queries, keys, and values all come from the same sequence. Without masking, every token can attend to every token in the sequence, including itself; the core mechanism of transformers.

SFT (Supervised Fine-Tuning) — fine-tuning a pretrained model on labeled demonstration data. Stage 1 of RLHF; also the dominant fine-tuning approach for instruction-following.

SSM (State Space Model) — a model of sequential data via a hidden state: h_t = Ah_{t-1} + Bx_t, y_t = Ch_t. Mamba extends SSMs with input-selective (data-dependent) parameters.
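
The recurrence above can be run directly as a loop. The sketch below uses scalar A, B, C for readability (in practice these are matrices, and Mamba makes some of them depend on the input); the function name and parameter values are illustrative.

```python
def ssm_scan(inputs, a, b, c, h0=0.0):
    """Run h_t = a*h_{t-1} + b*x_t, y_t = c*h_t over a sequence of scalars."""
    h = h0
    outputs = []
    for x in inputs:
        h = a * h + b * x      # state update
        outputs.append(c * h)  # readout
    return outputs

# Impulse response: a single input at t=0 decays geometrically with rate a.
ys = ssm_scan([1.0, 0.0, 0.0, 0.0], a=0.5, b=1.0, c=2.0)
```

Because the state is a fixed-size summary of the past, each step costs the same regardless of how long the sequence already is.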

T¶

Tensor parallelism — distributing model weight matrices across GPUs along one dimension; requires all-reduce operations for matrix multiplications. Used in Megatron-LM.

TF-IDF — Term Frequency-Inverse Document Frequency: a classical term-weighting scheme for document retrieval. Often replaced or complemented by dense retrieval (e.g., BERT embeddings + FAISS), though sparse methods remain common baselines.

Tokenization — the process of converting text to discrete tokens. BPE (Byte-Pair Encoding) is the dominant algorithm, and SentencePiece is a widely used library that implements BPE and unigram tokenization; LLM vocabularies typically range from about 30k to over 100k tokens.

V¶

VAE (Variational Autoencoder) — a generative model combining an encoder q_φ(z|x) and decoder p_θ(x|z) trained via the ELBO. Its continuous latent space supports smooth interpolation between samples.

Vector field — in flow matching: v_θ(x, t) is a function that assigns a velocity vector to every point (x, t); integrating it traces a path from noise to data.

Z¶

ZeRO (Zero Redundancy Optimizer) — a distributed training technique that partitions optimizer states, gradients, and model parameters across GPUs instead of replicating them. Stage 3 partitions all three, so per-GPU memory for model state shrinks roughly in proportion to the number of GPUs.

Agents expand this glossary automatically when generating new topic pages: any [[term]] reference that doesn't exist in this file triggers an agent task to add the definition.