Compositionality¶
Imagine a single prompt where the model first solves a grade-school addition problem, then reverses the resulting number as a string, and finally describes why the reversal is correct. Large language models can answer each subproblem with near-human accuracy when asked in isolation, yet the moment the tasks are chained the overall hit rate collapses by roughly 30 percentage points. That sudden drop is the compositional cliff—the gap between knowing the parts and being able to stitch them back into a novel whole. By the end of this page you will understand why this cliff exists, how linguists and machine learning researchers quantify it, and how to measure the drop yourself so you can see the failure mode in any open-source model.
The territory¶
Compositionality answers the question: why can a speaker understand a sentence they have never heard before? The Stanford Encyclopedia of Philosophy (Winter 2013) defines the principle as the idea that the meaning of a complex expression depends on the meanings of its parts and the rules that combine them (Frege’s principle) [https://plato.stanford.edu/archives/win2013/entries/compositionality/]. Partee (1984) turned this into a formal constraint on semantic interpretation: if every phrase is built by applying functions to arguments, then the semantics are simply those functions applied to the constituent meanings [https://web.mit.edu/jda/www/teaching/6.884/readings/partee_1984.pdf]. In linguistics this is the bridge between syntactic derivation and meaning: once you know how to interpret the atomic pieces (nouns, verbs, numbers) and the combinators (function application, quantification), you can assign meaning to any sentence generated by the grammar.
Neural networks, however, rarely implement explicit compositional functions. They learn high-dimensional representations where combination is statistical co-occurrence rather than function application, and this short-circuits Frege’s guarantee. The distinction between compositional and systematic semantics in the classic work by the comps (cmp-lg/9503024) is at the heart of this gap: compositional semantics says that composed expressions have meanings derivable from their parts, but systematic semantics asks whether the system can apply that derivation across an unbounded range of novel combinations. Neural models often satisfy the first criterion within their training distribution but fail on the second because they never see the out-of-distribution merges in training. This is why the compositional cliff appears even when every primitive task (addition, string reversal, commonsense reasoning) is mastered in isolation—the model has never learned the combination rule itself.
The machine learning response has been to treat compositionality as a gap in systematic generalization. Andreas (2019) translated the semantic intuition into a metric space: if the distance between embedded meanings reflects the distance between their syntactic structures, then the representation is more compositional [https://arxiv.org/abs/1902.07181]. In other words, you can measure how much the algebra of functions is mirrored in the vector space. That metric is the empirical workhorse for today’s evaluations because it lets us operationalize the failure mode: we generate a compositional task (say, reverse the string after adding two digits), evaluate how well the model handles the primitive tasks alone, then measure how performance degrades when those tasks are composed. If the degradation is sharp, the system is not systematically generalizing. How does the mechanism actually work?
How it works¶
The starting point is the formal decomposability in linguistics: any expression \(E\) is built from subexpressions \(E_1, E_2, \dots\) through syntactic rules such as function application or modification. Partee formalized this by saying there exists a function \(f\) such that the meaning of \(E\) is \(f(\text{meaning}(E_1), \text{meaning}(E_2), \dots)\). The syntax tree specifies \(f\); semantics then selects how the meanings compose. When we translate this to neural representations, \(E\) becomes a sequence tokenized into embeddings \(x_1, x_2, \dots, x_n\), and the model computes a latent \(h(E)\) via layers of attention and feedforward transformations. The compositional hypothesis claims there exists a decoder \(g\) and combinator \(c\) such that \(h(E) \approx c(g(h(E_1)), g(h(E_2)), \dots)\). This is what Andreas’ metrics try to measure: the embedding of a composed expression should be predictable from the embeddings of its parts and the combinator.
One quantitative formulation is to treat the space of expressions as a metric space and compare structural distances with representation distances. Let \(S\) be the set of expressions in a dataset, and define a structural distance \(d_{\text{struct}}(E_i, E_j)\) based on the number of shared primitives and combinators. At the same time, define a representation distance \(d_{\text{repr}}(E_i, E_j)\) such as cosine distance between the embeddings \(h(E_i)\) and \(h(E_j)\). Andreas (2019) operationalizes compositionality with the Spearman correlation
where the pairs \((i,j)\) come from the dataset and the correlation is computed over all pairs. Here \(h\) is the encoder, \(d_{\text{struct}}\) captures syntactic composition, and \(d_{\text{repr}}\) captures semantic adjacency in latent space. A high topographic score indicates the model preserves compositional structure: structurally similar expressions live close together in representation space. This formalization is the bridge between the linguistic notion of function application and the embeddings learned by neural networks.
But the topographic score is only a proxy. What composes in practice is how the model produces answers when primitives are chained inside a single prompt. Consider two primitive tasks: \(T_1(x)=\text{reverse}(x)\) and \(T_2(x)=\text{add}(x)\) for simple decimal strings. Each task can be evaluated in isolation: train the model (or simply prompt it) to produce \(T_1\) and \(T_2\). Denote their accuracies \(A_1\) and \(A_2\). A compositional task \(T_{1\circ2}(x) = T_1(T_2(x))\) asks the model to apply \(T_2\) followed by \(T_1\). Define the compositional gap \(\Delta = \frac{A_1 + A_2}{2} - A_{1\circ2}\) where \(A_{1\circ2}\) is accuracy on the chained task. Empirically, \(\Delta\) is often large despite \(A_1, A_2 \approx 1.0\). This gap is what we call the compositional cliff. The novelty of Andreas’ measurement is that \(\Delta\) correlates with the low topographic score: if the embeddings do not respect the combinatorial structure, the model cannot compose the tasks even if it knows each individually.
Another intuition comes from systematic semantics versus compositional semantics (cmp-lg/9503024). In systematic semantics, the interpretation function is not merely compositional, it is systematic: the same meaning-combining rule applies uniformly to all allowable combinations. Neural models fail systematicity because they overfit specific combinations seen during training, instead of learning the underlying rule. For example, if the training data contains “add 3 and 4” and “reverse 34,” the model may memorize outputs rather than learn to apply “add” then “reverse” generically. The representation for “add 5 and 6” might live nowhere near the representation for “reverse 11,” so the prompt “reverse the sum of 5 and 6” fails. Worse, when we add new combinators such as conditional statements or multi-step reasoning, the same memorization trap causes cascades of failures.
The architectural consequence is that standard transformers do not implement explicit combinators. Each attention layer mixes tokens in ways that cannot be decomposed into interpretable functions. Training on massive corpora yields impressive pattern matching because there are abundant repeated contexts, but there are exponentially many unseen combinations. The model’s latent space simply has not seen enough coverage to generalize the combinator application. This also explains why finetuning on multi-step tasks can help: when we explicitly supervise the model on combinations, it learns the pattern. The goal is to extend this to zero-shot combinations, and that is where compositionality metrics and diagnostics become critical—they tell us whether generalization will hold before we add more data.
A practical way to inspect this is to instrument the inference stage. We treat each expression \(E\) as a prompt that describes a composition of two or more operations. We can gather a dataset of prompts \(P = \{p_1, \ldots, p_N\}\) such as “Add 12 to 45, then reverse the digits,” and evaluate the model’s output probability mass on the correct string. The model’s log-likelihood \(L(p)\) for prompt \(p\) is computed by the decoder, and we compute the accuracy \(A(p)\) for exact match. By comparing \(A(p)\) across pure and composite prompts, we estimate the systematic generalization. If \(A(p)\) on composites is much lower than the average \(A(p)\) on primitives, then the representation lacks systematicity. This ties back to Andreas’ metric because a representation space that mirrors the structure would assign similar scores to \(p\) and its decomposed versions; the lack of such similarity results in a compositional cliff at inference time.
The applied takeaway is that compositionality is not a single architectural trick but a set of evaluation and modeling practices: (1) define primitive tasks and compositional tasks explicitly, (2) measure the embedding distances (Andreas), (3) compute the compositional gap \(\Delta\) by running both primitives and composites, and (4) analyze failure cases to identify missing combinators or insufficient coverage in training.
Where the field is now¶
AgentCoMa (2024) [arxiv:2410.14817] is the latest research frontier in compositionality evaluation. It assembles benchmarks that purposely mix heterogeneous reasoning types—algebra, commonsense, and programmatic string manipulation—into single episodes. Its finding is sobering: state-of-the-art LLMs, including open checkpoints derived from the Llama-3 family, maintain near-perfect accuracy on the primitives but lose more than 25–35 points when those primitives must be sequenced in novel ways. The study attributes this to both representational miscues (low topographic scores) and decoding path failures; decoding beams that succeed on addition prompt still collapse when conditioned on the reversal instruction. The paper argues for specialized compositional diagnostics, which is what this page’s build recreates on a simpler scale.
The engineering frontier is marked by Meta AI’s release narrative around Llama 3 (2024) [https://ai.meta.com/research/llama-3]. The engineering write-up explains how the inference stack bundles multiple specialized reasoning contexts—arithmetic, summarization, and planning—into a single production endpoint. Their telemetry shows that guiding prompts through an instruction router reduces hallucination rates, but the report also candidly notes that accuracy still dilutes when unseen combinations surface in live chat. Deployers are already facing the same compositional cliff in mission-critical systems, so the practical response has been to build routing layers, safety filters, or retrieval modules that simplify the combination before feeding it to the core model. These architectural patches underscore that compositionality is not only an academic curiosity; it is a deployment challenge when a single endpoint must juggle arithmetic, rewriting, and reasoning simultaneously.
What's still open¶
Can we design inductive biases—either architectural or objective-based—that guarantee systematic generalization to out-of-distribution compositions, without relying on enumerating every possible combination in the training data? Is there a rigorous way to define a “compositional basis” in representation space such that the model can be certified to handle any permutation from that basis? How much of the compositional failure is due to decoding heuristics rather than representation, and can we construct probabilistic decoders whose loss matches the symbolic combinator semantics exactly? What would a benchmark look like that evaluates lifelong compositional learning, where the model continually incorporates new primitives and must compose them with the old ones without catastrophic forgetting?
Where to read next¶
If you are curious about the computational limits of systematic generalization, → [[systematic-generalization]] digs into the underlying probability theory behind extrapolating to novel combinations. The engineering counterpart is → [[reasoning-inference-pipelines]] which shows how routing meshes, retrieval, and adapter stacks form practical workarounds for the compositional cliff. For a different angle on representation evaluation, → [[representation-learning]] explains how disentanglement and factorization provide the substrate that compositionality diagnostics inspect next. The policy and alignment view is detailed in → [[agentic-composition]], where the focus is on verifying whether multi-agent systems can recombine skills safely.
Build it¶
Compositionality is visible in the wild as a measurable performance drop whenever primitives are chained: the build proves this directly by evaluating an open-source model on single-step tasks and their composition. The artifact is a Colab-ready evaluation script that runs on a free GPU and visualizes the cliff, so you can bring the concept to life rather than just read about it.
What you're building: an evaluation pipeline that quantifies the compositional gap \(\Delta\) for string reversal and addition tasks using meta-llama/Llama-3-8B-Instruct.
Why this is valuable: the script makes the compositional cliff concrete; the metric \(\Delta = \frac{A_1 + A_2}{2} - A_{1\circ2}\) exposes whether the model can maintain accuracy once the primitives are sequenced, and it isolates whether failure stems from representation (Andreas topographic scores) or from decoding (prompt chaining).
Stack:
- Model: meta-llama/Llama-3-8B-Instruct — 1.5M+ downloads
- Dataset: openai/mathematics_dataset — includes addition subtasks plus extensible templates for string manipulations
- Framework: transformers==4.42, datasets==2.13, accelerate==1.10, evaluate==0.4
- Compute: Free Google Colab T4 (16 GB) or equivalent, ~30 minutes runtime
The recipe:
1. pip install transformers==4.42 datasets==2.13 accelerate==1.10 evaluate==0.4 matplotlib and load the tokenizer plus meta-llama/Llama-3-8B-Instruct via AutoTokenizer and AutoModelForCausalLM with torch_dtype=torch.float16.
2. Download openai/mathematics_dataset, filter for decimal addition problems (operation == "addition") where the operands are two-digit numbers, and generate a parallel synthetic set of string reversal tasks (e.g., randomly sample 200 two-digit numbers, convert to strings, and reverse the characters). Tokenize the prompts with templates like “Add 12 and 34; answer only the sum.”
3. For evaluation, run inference on the addition prompts to compute accuracy \(A_2\), on reversal prompts for \(A_1\), and on composite prompts “Add 12 and 34, then reverse the digits of the sum” for \(A_{1\circ2}\); log the exact-match rates and the Spearman-based topographic score between the embedding distances of prompts and their structural derivations.
4. Plot the results: bar chart for \(A_1, A_2, A_{1\circ2}\) plus a line plot showing \(\Delta\). Annotate the chart with the computed topographic score to show whether embeddings respect structure. Capture the generated prompt strings and failures for manual inspection.
5. Package the outputs (CSV of accuracy per prompt, matplotlib figure) so you can share the compositional gap with teammates or embed it in a dashboard.
Expected outcome: a reproducible evaluation folder containing the compositional gap plot, the underlying evaluation table, and a short write-up summarizing whether meta-llama/Llama-3-8B-Instruct exhibits a >25% drop when combining addition and reversal.
- CS student: Run the same Colab recipe but extend the synthetic dataset to include conditional branching (e.g., “If the sum is even, reverse it; otherwise, append ‘odd’”) to see whether adding a new combinator increases \(\Delta\) on a single RTX 4070.
- Applied engineer: Wrap the evaluation script in a lightweight FastAPI service that receives prompts, measures \(A_1\), \(A_2\), and \(A_{1\circ2}\) at inference time, and serve it via vLLM on an A10; instrument latency and pipeline the plot to a monitoring dashboard to prove the composite task meets p50 < 400 ms.
- Applied researcher: Hypothesize that adding a small adapter fine-tuned on composed prompts will reduce \(\Delta\) by 10 points; implement low-rank adapters, fine-tune on 200 composite examples, and report the before/after gap along with the trade-off in primitive accuracy.
- Frontier researcher: Probe the open question from §What's still open by designing a contrastive decoder loss that explicitly penalizes \(\Delta\) across held-out combinators; your falsifier is that a new loss should reduce the drop without increasing decoding time by more than 15%.
If this build worked for you — a ⭐ on GitHub is the only signal we collect.