Seminal Papers
A curated list of papers that defined their area — filtered for lasting impact, not recency. Signal over noise.
Every paper here answers: "If someone asks me to read one paper to understand this topic, which one?"
Foundations
Optimization & Training
| Paper | Year | Authors | Why it matters |
|---|---|---|---|
| Adam: A Method for Stochastic Optimization | 2014 | Kingma & Ba | The default optimizer for neural networks |
| Batch Normalization | 2015 | Ioffe & Szegedy | Normalized activations → faster, more stable training |
| Deep Residual Learning for Image Recognition | 2015 | He et al. | Residual (skip) connections made networks of 100+ layers trainable by easing optimization and gradient flow |
| Dropout | 2014 | Srivastava et al. | Regularization by randomly dropping activations |
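
As a concrete reference for the Adam entry above, here is a minimal sketch of one update step for a single scalar parameter. The function name `adam_step` and the scalar-only form are assumptions made for illustration; the default hyperparameters are the ones suggested in the paper.

```python
import math

def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count.

    adam_step is a hypothetical helper, not an API from any library.
    """
    m = beta1 * m + (1 - beta1) * grad          # running mean of gradients
    v = beta2 * v + (1 - beta2) * grad * grad   # running mean of squared gradients
    m_hat = m / (1 - beta1 ** t)                # bias correction for zero initialization
    v_hat = v / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v

# Usage: a few steps on f(x) = x^2, whose gradient is 2x.
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 11):
    x, m, v = adam_step(x, 2 * x, m, v, t, lr=0.05)
```

Because the update divides by the root of the second-moment estimate, early steps have magnitude close to `lr` regardless of gradient scale, which is a large part of why the defaults transfer across problems.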
Scaling
| Paper | Year | Authors | Why it matters |
|---|---|---|---|
| Scaling Laws for Neural Language Models | 2020 | Kaplan et al. | Performance scales predictably with compute, data, and model size |
| Training Compute-Optimal Large Language Models (Chinchilla) | 2022 | Hoffmann et al. | For a fixed compute budget, parameters and training tokens should grow in equal proportion; Kaplan et al. had favored growing model size faster than data |
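
The compute-optimal trade-off can be sketched with two widely quoted approximations, both treated as assumptions here: training cost C ≈ 6·N·D FLOPs for N parameters and D tokens, and the rule of thumb of roughly 20 tokens per parameter that is often read off the Chinchilla results. The function name `compute_optimal` is hypothetical.

```python
import math

def compute_optimal(flops, tokens_per_param=20.0):
    """Split a FLOP budget into (params, tokens).

    Assumes C ~ 6 * N * D and D ~ tokens_per_param * N; both are
    approximations, not exact fits from the paper.
    """
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

n, d = compute_optimal(5.88e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```

With a budget of 5.88e23 FLOPs this lands near 70B parameters and 1.4T tokens, which is approximately Chinchilla's own configuration.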
Architectures
Transformers
| Paper | Year | Authors | Why it matters |
|---|---|---|---|
| Attention Is All You Need | 2017 | Vaswani et al. | The transformer — ended the RNN era |
| BERT | 2018 | Devlin et al. | Bidirectional pretraining; fine-tuning paradigm |
| Language Models are Few-Shot Learners (GPT-3) | 2020 | Brown et al. | Scaling + in-context learning; few-shot without gradient updates |
| An Image is Worth 16x16 Words (ViT) | 2020 | Dosovitskiy et al. | Transformer for vision without convolutions |
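| FlashAttention | 2022 | Dao et al. | IO-aware exact attention; tiling avoids materializing the N×N score matrix in GPU memory; reported 2-4× speedup; how attention is actually computed |
| Mamba: Linear-Time Sequence Modeling with Selective State Spaces | 2023 | Gu & Dao | Selective state spaces; a linear-time, O(N), alternative to attention |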
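
For orientation, scaled dot-product attention from "Attention Is All You Need" can be written in a few lines of plain Python. This is a single-head sketch without masking or batching, and the function names are assumptions for illustration. The `scores` list built per query is one row of the N×N matrix that FlashAttention avoids storing in full.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V for lists of row vectors (single head, no mask)."""
    d_k = len(K[0])
    out = []
    for q in Q:
        # Similarity of this query to every key, scaled by sqrt(d_k).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d_k) for k in K]
        weights = softmax(scores)
        # Output is the attention-weighted average of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V)) for j in range(len(V[0]))])
    return out

# Two identical keys get equal weight, so the output is the mean of the values.
result = attention([[1.0, 0.0]], [[1.0, 0.0], [1.0, 0.0]], [[0.0, 2.0], [4.0, 6.0]])
```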
Generative Models
| Paper | Year | Authors | Why it matters |
|---|---|---|---|
| Auto-Encoding Variational Bayes | 2013 | Kingma & Welling | VAEs — the foundation of deep generative modeling |
| Generative Adversarial Nets | 2014 | Goodfellow et al. | GANs — adversarial training; sharp image synthesis |
| Denoising Diffusion Probabilistic Models | 2020 | Ho et al. | DDPM — current dominant generative paradigm |
| Score-Based Generative Modeling through SDEs | 2020 | Song et al. | Unified SDE framework; connects score matching and diffusion |
| Flow Matching for Generative Modeling | 2022 | Lipman et al. | Simulation-free CNF training; the emerging standard |
| Flow Straight and Fast: Rectified Flow | 2022 | Liu et al. | Straight-line paths; FLUX and SD3 foundation |
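
The "straight-line paths" idea can be illustrated with the training pair used along a linear interpolation path: a point between a noise sample and a data sample, plus the constant velocity that connects them. This is a sketch under the assumption that `x0` is noise and `x1` is data; the helper names are hypothetical, and a real model would predict the velocity from `(x_t, t)`.

```python
def linear_path_sample(x0, x1, t):
    """Return (x_t, target_velocity) on the straight path from x0 to x1.

    x_t = (1 - t) * x0 + t * x1, and the regression target is x1 - x0.
    """
    x_t = [(1 - t) * a + t * b for a, b in zip(x0, x1)]
    target_v = [b - a for a, b in zip(x0, x1)]
    return x_t, target_v

def velocity_loss(pred_v, target_v):
    """Mean squared error between predicted and target velocity."""
    return sum((p - q) ** 2 for p, q in zip(pred_v, target_v)) / len(target_v)

x_t, v_target = linear_path_sample([0.0, 0.0], [4.0, 8.0], 0.25)
```

No ODE or SDE has to be simulated to build this training pair, which is the sense in which such training is simulation-free.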
Language Models & Alignment
| Paper | Year | Authors | Why it matters |
|---|---|---|---|
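| Training Language Models to Follow Instructions with Human Feedback (InstructGPT) | 2022 | Ouyang et al. | RLHF applied to LLMs at scale; the direct precursor to ChatGPT |
| Constitutional AI: Harmlessness from AI Feedback | 2022 | Bai et al. (Anthropic) | AI feedback guided by written principles replaces human labels for harmlessness; scales alignment |
| Direct Preference Optimization | 2023 | Rafailov et al. | DPO: a classification-style loss on preference pairs, with no separate reward model or RL loop; the practical alignment standard |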
| Chain-of-Thought Prompting | 2022 | Wei et al. | Step-by-step reasoning emerges from prompting |
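| DeepSeek-R1 | 2025 | DeepSeek-AI | Large-scale RL elicits long-form reasoning; R1-Zero is trained with RL alone, and R1 (RL plus a small supervised cold start) reports results comparable to OpenAI o1 |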
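
To make the DPO entry concrete, the per-pair loss can be sketched from sequence log-probabilities under the policy and a frozen reference model. The function name and argument names are assumptions for illustration, and `beta=0.1` is a commonly used value, not a prescription.

```python
import math

def dpo_loss(pol_chosen, pol_rejected, ref_chosen, ref_rejected, beta=0.1):
    """DPO loss for one preference pair, given summed sequence log-probabilities.

    loss = -log sigmoid(beta * [(pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)])
    """
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    z = beta * margin
    # -log(sigmoid(z)) == log(1 + exp(-z)), evaluated in a numerically stable way.
    if z >= 0:
        return math.log1p(math.exp(-z))
    return -z + math.log1p(math.exp(z))

# When the policy equals the reference, the margin is zero and the loss is log(2).
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)
```

The loss falls as the policy raises the chosen response's log-probability relative to the reference more than it raises the rejected one's, which is how the preference signal is applied without an explicit reward model.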
Reinforcement Learning
| Paper | Year | Authors | Why it matters |
|---|---|---|---|
| Playing Atari with Deep Reinforcement Learning (DQN) | 2013 | Mnih et al. | Deep Q-learning from raw pixels with experience replay; target networks arrived in the 2015 Nature follow-up |
| Proximal Policy Optimization (PPO) | 2017 | Schulman et al. | Clipped surrogate objective; the most-used policy gradient algorithm |
| Mastering the Game of Go with Deep Neural Networks and Tree Search (AlphaGo) | 2016 | Silver et al. | Deep RL + MCTS; first program to defeat a professional Go player (Fan Hui), ahead of the 2016 match against Lee Sedol; proved the paradigm |
| Mastering Chess and Shogi by Self-Play (AlphaZero) | 2017 | Silver et al. | Tabula rasa RL via self-play; no human game data, only the rules |
| Deep RL from Human Preferences | 2017 | Christiano et al. | Original RLHF; reward learning from pairwise preferences |
Causal AI
| Paper | Year | Authors | Why it matters |
|---|---|---|---|
| Causality (book Ch. 1-3) | 2009 | Pearl | SCMs, do-calculus, the ladder of causation (association, intervention, counterfactual) |
| Double/Debiased Machine Learning | 2016 | Chernozhukov et al. | Valid inference on treatment effects with ML nuisance models, via Neyman-orthogonal scores and cross-fitting |
| Towards Causal Representation Learning | 2021 | Schölkopf et al. | Bridge between causal theory and deep learning |
| Invariant Risk Minimization | 2019 | Arjovsky et al. | Learn features invariant across environments |
Scientific AI
| Paper | Year | Authors | Why it matters |
|---|---|---|---|
| Physics-Informed Neural Networks | 2017 | Raissi et al. | PDEs as loss terms; mesh-free PDE solvers |
| Fourier Neural Operator | 2020 | Li et al. | Resolution-invariant operator learning; turbulence simulation |
| Highly accurate protein structure prediction with AlphaFold | 2021 | Jumper et al. | Near-experimental accuracy in protein structure prediction (CASP14); recognized in the 2024 Nobel Prize in Chemistry |
| Accurate structure prediction with AlphaFold 3 | 2024 | Abramson et al. | Diffusion-based joint structure prediction for all biomolecule types |
| Learning skillful medium-range weather forecasting (GraphCast) | 2022 | Lam et al. | GNN-based global 10-day weather forecasts in under a minute on a single TPU |
Systems
| Paper | Year | Authors | Why it matters |
|---|---|---|---|
| ZeRO: Memory Optimizations Toward Training Trillion Parameter Models | 2019 | Rajbhandari et al. | Shards optimizer states, gradients, and parameters across data-parallel GPUs to remove memory redundancy |
| Megatron-LM: Training Multi-Billion Parameter Language Models | 2019 | Shoeybi et al. | Tensor parallelism for LLMs |
| Efficient Memory Management for Large Language Model Serving with PagedAttention | 2023 | Kwon et al. | KV cache paged like OS virtual memory to cut fragmentation; the basis of vLLM, which pairs it with continuous batching |
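
The sharding idea behind ZeRO is easiest to see as arithmetic. The sketch below follows the accounting used in the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for fp16 weights, 2 for fp16 gradients, and 12 for fp32 optimizer state (master weights, momentum, variance). The function name `zero_memory_gb` is hypothetical, and activations and buffers are ignored.

```python
def zero_memory_gb(n_params, n_gpus, stage):
    """Approximate per-GPU model-state memory in GB for ZeRO stages 0-3.

    Stage 0 is plain data parallelism; stages 1, 2, 3 additionally shard
    optimizer state, gradients, and parameters respectively.
    """
    params = 2.0 * n_params      # fp16 weights, in bytes
    grads = 2.0 * n_params       # fp16 gradients
    optim = 12.0 * n_params      # fp32 master weights + Adam moments
    if stage >= 1:
        optim /= n_gpus
    if stage >= 2:
        grads /= n_gpus
    if stage >= 3:
        params /= n_gpus
    return (params + grads + optim) / 1e9

for stage in range(4):
    print(stage, round(zero_memory_gb(7.5e9, 64, stage), 1))
```

For a 7.5B-parameter model on 64 GPUs this gives roughly 120, 31.4, 16.6, and 1.9 GB per GPU, in line with the worked example in the paper.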
Maintained by: Editorial and pedagogical agents. Last updated: 2026-05-25. Source policy: arXiv, *.edu, official lab publications only.