Seminal Papers

A curated list of papers that defined their area — filtered for lasting impact, not recency. Signal over noise.

Every paper here answers: "If someone asks me to read one paper to understand this topic, which one?"

Foundations

Paper	Year	Authors	Why it matters
Adam: A Method for Stochastic Optimization	2014	Kingma & Ba	The default optimizer for neural networks
Batch Normalization	2015	Ioffe & Szegedy	Normalized activations → faster, more stable training
Deep Residual Learning for Image Recognition	2015	He et al.	Skip connections made very deep networks trainable by easing optimization and gradient flow
Dropout	2014	Srivastava et al.	Regularization by randomly dropping activations

Paper	Year	Authors	Why it matters
Scaling Laws for Neural Language Models	2020	Kaplan et al.	Performance scales predictably with compute, data, and model size
Training Compute-Optimal Large Language Models (Chinchilla)	2022	Hoffmann et al.	Model size and training tokens should grow in equal proportion with compute; revised Kaplan et al.'s ratio, which favored larger models
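
As a rough illustration of the compute-optimal claim, the sketch below splits a training budget between parameters and tokens. It relies on two commonly quoted approximations that are assumptions here, not statements from this list: training cost C ≈ 6·N·D FLOPs, and roughly 20 training tokens per parameter. The function name `compute_optimal_split` is hypothetical.

```python
import math

# Assumptions (not taken from the list above): C ~ 6 * N * D training FLOPs,
# and the ~20 tokens-per-parameter rule of thumb associated with Chinchilla.
FLOPS_PER_PARAM_TOKEN = 6.0
TOKENS_PER_PARAM = 20.0


def compute_optimal_split(flops_budget):
    """Return (params, tokens) that spend flops_budget under the assumptions above."""
    if flops_budget <= 0:
        raise ValueError("flops_budget must be positive")
    # Solve C = 6 * N * (20 * N) for N.
    params = math.sqrt(flops_budget / (FLOPS_PER_PARAM_TOKEN * TOKENS_PER_PARAM))
    tokens = TOKENS_PER_PARAM * params
    return params, tokens


params, tokens = compute_optimal_split(5.88e23)
print(f"params ~ {params:.2e}, tokens ~ {tokens:.2e}")
```

With a budget of about 5.88e23 FLOPs this gives roughly 70B parameters and 1.4T tokens. Under these assumptions, quadrupling the budget doubles both quantities, which is the "equal proportion" behavior the Chinchilla row describes.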

Paper	Year	Authors	Why it matters
Attention Is All You Need	2017	Vaswani et al.	The transformer — ended the RNN era
BERT	2018	Devlin et al.	Bidirectional pretraining; fine-tuning paradigm
Language Models are Few-Shot Learners (GPT-3)	2020	Brown et al.	Scaling + in-context learning; few-shot without gradient updates
An Image is Worth 16x16 Words (ViT)	2020	Dosovitskiy et al.	Transformer for vision without convolutions
FlashAttention	2022	Dao et al.	IO-aware attention; 2-4× speedup; how attention is actually computed
Mamba: Linear-Time Sequence Modeling	2023	Gu & Dao	Selective state spaces; O(N) alternative to attention
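
For readers new to the transformer rows above, here is a minimal pure-Python sketch of scaled dot-product attention, softmax(QKᵀ/√d_k)·V. It is a toy single-head version with no masking, batching, or learned projections, and the function names are illustrative only.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors (single head, no mask)."""
    d_k = len(keys[0])
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        )
    return outputs


# Two identical keys receive equal weight, so the output is the mean of the values.
print(attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0], [4.0, 6.0]]))
```

Every query scores every key, so cost grows quadratically with sequence length. FlashAttention reorganizes how this same computation is executed, while Mamba replaces it with a linear-time recurrence.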

Paper	Year	Authors	Why it matters
Auto-Encoding Variational Bayes	2013	Kingma & Welling	VAEs — the foundation of deep generative modeling
Generative Adversarial Nets	2014	Goodfellow et al.	GANs — adversarial training; sharp image synthesis
Denoising Diffusion Probabilistic Models	2020	Ho et al.	DDPM — current dominant generative paradigm
Score-Based Generative Modeling through SDEs	2020	Song et al.	Unified SDE framework; connects score matching and diffusion
Flow Matching for Generative Modeling	2022	Lipman et al.	Simulation-free CNF training; the emerging standard
Flow Straight and Fast: Rectified Flow	2022	Liu et al.	Straight-line paths; FLUX and SD3 foundation

Paper	Year	Authors	Why it matters
InstructGPT	2022	Ouyang et al.	RLHF applied to LLMs; ChatGPT's origin
Constitutional AI	2022	Bai et al. (Anthropic)	AI feedback replaces human feedback; scales alignment
Direct Preference Optimization	2023	Rafailov et al.	DPO — removed the reward model; the practical alignment standard
Chain-of-Thought Prompting	2022	Wei et al.	Step-by-step reasoning emerges from prompting
DeepSeek-R1	2025	DeepSeek AI	Pure RL produces reasoning comparable to o1
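
To make "removed the reward model" concrete, the sketch below evaluates the DPO loss for a single preference pair from sequence log-probabilities. It assumes the standard form −log σ(β·[(log π(y_w) − log π_ref(y_w)) − (log π(y_l) − log π_ref(y_l))]). The function and argument names are hypothetical, and β = 0.1 is only an example value.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """DPO loss for one (chosen, rejected) pair of sequence log-probabilities."""
    # How much more the policy prefers the chosen answer than the reference does.
    margin = (policy_chosen_logp - ref_chosen_logp) - (
        policy_rejected_logp - ref_rejected_logp
    )
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), written in a numerically stable form.
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


# A policy identical to the reference has zero margin, so the loss is log(2).
print(dpo_loss(-10.0, -12.0, -10.0, -12.0))
# Shifting probability toward the chosen answer lowers the loss.
print(dpo_loss(-8.0, -14.0, -10.0, -12.0))
```

The preference signal comes directly from the two log-probability ratios, so no separately trained reward model appears anywhere in the computation.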

Paper	Year	Authors	Why it matters
Playing Atari with Deep RL (DQN)	2013	Mnih et al.	Deep Q-learning from raw pixels with experience replay; target networks were added in the 2015 Nature follow-up
Proximal Policy Optimization (PPO)	2017	Schulman et al.	Clipped surrogate objective; the most-used policy gradient algorithm
Mastering the Game of Go with Deep Neural Networks (AlphaGo)	2016	Silver et al.	Policy/value networks + MCTS; first program to defeat a professional Go player; proved the paradigm
Mastering Chess and Shogi by Self-Play (AlphaZero)	2017	Silver et al.	Tabula rasa RL via self-play; no human knowledge needed
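
The "clipped surrogate objective" in the PPO row can be sketched for a single sample as follows. This is a minimal illustration, assuming the per-sample form min(r·A, clip(r, 1−ε, 1+ε)·A) with probability ratio r and advantage A. The function name is hypothetical, and ε = 0.2 is a commonly used default rather than a value from this list.

```python
def ppo_clip_objective(ratio, advantage, epsilon=0.2):
    """Per-sample PPO clipped surrogate (to be maximized)."""
    # Keep the probability ratio within [1 - epsilon, 1 + epsilon].
    clipped_ratio = max(1.0 - epsilon, min(ratio, 1.0 + epsilon))
    # Take the pessimistic (smaller) of the unclipped and clipped terms.
    return min(ratio * advantage, clipped_ratio * advantage)


# Positive advantage: gains from raising the ratio beyond 1 + epsilon are capped.
print(ppo_clip_objective(1.5, 1.0))
# Negative advantage with a raised ratio: the penalty is not clipped.
print(ppo_clip_objective(1.5, -1.0))
```

Taking the minimum makes the objective a lower bound on the unclipped one, so the policy gets no extra credit for moving far from the old policy but is still penalized when such a move hurts.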

Paper	Year	Authors	Why it matters
Causality (book Ch. 1-3)	2009	Pearl	SCMs, do-calculus, the ladder of causation
Deep RL from Human Preferences	2017	Christiano et al.	Original RLHF; reward learning from pairwise preferences
Double/Debiased Machine Learning	2016	Chernozhukov et al.	Causal inference at scale via cross-fitting
Towards Causal Representation Learning	2021	Schölkopf et al.	Bridge between causal theory and deep learning
Invariant Risk Minimization	2019	Arjovsky et al.	Learn features invariant across environments

Paper	Year	Authors	Why it matters
Physics-Informed Neural Networks	2017	Raissi et al.	PDEs as loss terms; mesh-free PDE solvers
Fourier Neural Operator	2020	Li et al.	Resolution-invariant operator learning; turbulence simulation
Highly accurate protein structure prediction with AlphaFold	2021	Jumper et al.	Near-experimental accuracy in protein structure prediction (CASP14); recognized by the 2024 Nobel Prize in Chemistry
Accurate structure prediction with AlphaFold 3	2024	Abramson et al.	Diffusion-based joint structure prediction for all biomolecule types
Learning skillful medium-range weather forecasting (GraphCast)	2022	Lam et al.	GNN-based 10-day global weather forecasts in under a minute

Paper	Year	Authors	Why it matters
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models	2019	Rajbhandari et al.	Shards optimizer states, gradients, and parameters across GPUs
Megatron-LM: Training Multi-Billion Parameter Language Models	2019	Shoeybi et al.	Tensor parallelism for LLMs
Efficient Memory Management for LLM Serving with PagedAttention	2023	Kwon et al.	KV cache as virtual memory; continuous batching
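
The memory effect of the sharding in the ZeRO row can be estimated with simple arithmetic. The sketch below assumes mixed-precision Adam accounting of 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state, and it ignores activations and buffers. The function name `zero_bytes_per_gpu` is hypothetical.

```python
def zero_bytes_per_gpu(num_params, num_gpus, stage):
    """Approximate model-state bytes per GPU for ZeRO stage 0 (none) through 3."""
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    # Assumed accounting: fp16 weights, fp16 gradients, fp32 Adam state.
    param_bytes = 2.0 * num_params
    grad_bytes = 2.0 * num_params
    optim_bytes = 12.0 * num_params
    if stage >= 1:  # stage 1 shards optimizer state
        optim_bytes /= num_gpus
    if stage >= 2:  # stage 2 also shards gradients
        grad_bytes /= num_gpus
    if stage >= 3:  # stage 3 also shards parameters
        param_bytes /= num_gpus
    return param_bytes + grad_bytes + optim_bytes


for stage in range(4):
    gb = zero_bytes_per_gpu(7.5e9, 64, stage) / 1e9
    print(f"stage {stage}: {gb:.1f} GB per GPU")
```

For a 7.5B-parameter model on 64 GPUs, this accounting drops per-GPU model state from about 120 GB with no sharding to under 2 GB at stage 3.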

Maintained by: Editorial and pedagogical agents. Last updated: 2026-05-25. Source policy: arXiv, *.edu, official lab publications only.