# Sutskever 30 - Complete Implementation Suite

Comprehensive toy implementations of the 30 foundational papers recommended by Ilya Sutskever.

Buy the Google Colab Code for this Repo at Gumroad.

## Overview

This repository contains detailed, educational implementations of the papers from Ilya Sutskever's famous reading list - the collection he told John Carmack would teach you "90% of what matters" in deep learning.

Progress: 30/30 papers (100%) - COMPLETE! 🎉
Each implementation:

- ✅ Uses only NumPy (no deep learning frameworks) for educational clarity
- ✅ Includes synthetic/bootstrapped data for immediate execution
- ✅ Provides extensive visualizations and explanations
- ✅ Demonstrates core concepts from each paper
- ✅ Runs in Jupyter notebooks for interactive learning
## Quick Start

```bash
# Navigate to the directory
cd sutskever-30-implementations

# Install dependencies
pip install numpy matplotlib scipy

# Run any notebook
jupyter notebook 02_char_rnn_karpathy.ipynb
```
## The Sutskever 30 Papers

### Foundational Concepts (Papers 1-5)

| # | Paper | Notebook | Key Concepts |
|---|-------|----------|--------------|
| 1 | The First Law of Complexodynamics | ✅ 01_complexity_dynamics.ipynb | Entropy, Complexity Growth, Cellular Automata |
| 2 | The Unreasonable Effectiveness of RNNs | ✅ 02_char_rnn_karpathy.ipynb | Character-level models, RNN basics, Text generation |
| 3 | Understanding LSTM Networks | ✅ 03_lstm_understanding.ipynb | Gates, Long-term memory, Gradient flow |
| 4 | RNN Regularization | ✅ 04_rnn_regularization.ipynb | Dropout for sequences, Variational dropout |
| 5 | Keeping Neural Networks Simple | ✅ 05_neural_network_pruning.ipynb | MDL principle, Weight pruning, 90%+ sparsity |
### Architectures & Mechanisms (Papers 6-15)

| # | Paper | Notebook | Key Concepts |
|---|-------|----------|--------------|
| 6 | Pointer Networks | ✅ 06_pointer_networks.ipynb | Attention as pointer, Combinatorial problems |
| 7 | ImageNet/AlexNet | ✅ 07_alexnet_cnn.ipynb | CNNs, Convolution, Data augmentation |
| 8 | Order Matters: Seq2Seq for Sets | ✅ 08_seq2seq_for_sets.ipynb | Set encoding, Permutation invariance, Attention pooling |
| 9 | GPipe | ✅ 09_gpipe.ipynb | Pipeline parallelism, Micro-batching, Re-materialization |
| 10 | Deep Residual Learning (ResNet) | ✅ 10_resnet_deep_residual.ipynb | Skip connections, Gradient highways |
| 11 | Dilated Convolutions | ✅ 11_dilated_convolutions.ipynb | Receptive fields, Multi-scale |
| 12 | Neural Message Passing (GNNs) | ✅ 12_graph_neural_networks.ipynb | Graph networks, Message passing |
| 13 | Attention Is All You Need | ✅ 13_attention_is_all_you_need.ipynb | Transformers, Self-attention, Multi-head |
| 14 | Neural Machine Translation | ✅ 14_bahdanau_attention.ipynb | Seq2seq, Bahdanau attention |
| 15 | Identity Mappings in ResNet | ✅ 15_identity_mappings_resnet.ipynb | Pre-activation, Gradient flow |
### Advanced Topics (Papers 16-22)

| # | Paper | Notebook | Key Concepts |
|---|-------|----------|--------------|
| 16 | Relational Reasoning | ✅ 16_relational_reasoning.ipynb | Relation networks, Pairwise functions |
| 17 | Variational Lossy Autoencoder | ✅ 17_variational_autoencoder.ipynb | VAE, ELBO, Reparameterization trick |
| 18 | Relational RNNs | ✅ 18_relational_rnn.ipynb | Relational memory, Multi-head self-attention, Manual backprop (~1100 lines) |
| 19 | The Coffee Automaton | ✅ 19_coffee_automaton.ipynb | Irreversibility, Entropy, Arrow of time, Landauer's principle |
| 20 | Neural Turing Machines | ✅ 20_neural_turing_machine.ipynb | External memory, Differentiable addressing |
| 21 | Deep Speech 2 (CTC) | ✅ 21_ctc_speech.ipynb | CTC loss, Speech recognition |
| 22 | Scaling Laws | ✅ 22_scaling_laws.ipynb | Power laws, Compute-optimal training |
### Theory & Meta-Learning (Papers 23-30)

| # | Paper | Notebook | Key Concepts |
|---|-------|----------|--------------|
| 23 | MDL Principle | ✅ 23_mdl_principle.ipynb | Information theory, Model selection, Compression |
| 24 | Machine Super Intelligence | ✅ 24_machine_super_intelligence.ipynb | Universal AI, AIXI, Solomonoff induction, Intelligence measures, Self-improvement |
| 25 | Kolmogorov Complexity | ✅ 25_kolmogorov_complexity.ipynb | Compression, Algorithmic randomness, Universal prior |
| 26 | CS231n: CNNs for Visual Recognition | ✅ 26_cs231n_cnn_fundamentals.ipynb | Image classification pipeline, kNN/Linear/NN/CNN, Backprop, Optimization, Babysitting neural nets |
| 27 | Multi-token Prediction | ✅ 27_multi_token_prediction.ipynb | Multiple future tokens, Sample efficiency, 2-3x faster |
| 28 | Dense Passage Retrieval | ✅ 28_dense_passage_retrieval.ipynb | Dual encoders, MIPS, In-batch negatives |
| 29 | Retrieval-Augmented Generation | ✅ 29_rag.ipynb | RAG-Sequence, RAG-Token, Knowledge retrieval |
| 30 | Lost in the Middle | ✅ 30_lost_in_middle.ipynb | Position bias, Long context, U-shaped curve |
## Featured Implementations

### Must-Read Notebooks

These implementations cover the most influential papers and demonstrate core deep learning concepts:

#### Foundations

- `02_char_rnn_karpathy.ipynb` - Character-level RNN
  - Build RNN from scratch
  - Understand backpropagation through time
  - Generate text
- `03_lstm_understanding.ipynb` - LSTM Networks (sketched after this list)
  - Implement forget/input/output gates
  - Visualize gate activations
  - Compare with vanilla RNN
- `04_rnn_regularization.ipynb` - RNN Regularization
  - Variational dropout for RNNs
  - Proper dropout placement
  - Training improvements
- `05_neural_network_pruning.ipynb` - Network Pruning & MDL (sketched after this list)
  - Magnitude-based pruning
  - Iterative pruning with fine-tuning
  - 90%+ sparsity with minimal loss
  - Minimum Description Length principle
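For a taste of what the LSTM notebook builds, here is a minimal NumPy sketch of a single LSTM step; the stacked weight layout and toy shapes are illustrative assumptions, not the notebook's exact code:

```python
import numpy as np

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step. W has shape (4*H, D+H), b has shape (4*H,)."""
    H = h_prev.shape[0]
    z = W @ np.concatenate([x, h_prev]) + b   # all four gate pre-activations
    f = 1 / (1 + np.exp(-z[:H]))              # forget gate
    i = 1 / (1 + np.exp(-z[H:2*H]))           # input gate
    o = 1 / (1 + np.exp(-z[2*H:3*H]))         # output gate
    g = np.tanh(z[3*H:])                      # candidate cell update
    c = f * c_prev + i * g                    # cell state: gated memory
    h = o * np.tanh(c)                        # hidden state
    return h, c

# Toy usage: D=3 inputs, H=4 hidden units
rng = np.random.default_rng(0)
D, H = 3, 4
W, b = rng.normal(scale=0.1, size=(4*H, D+H)), np.zeros(4*H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H), W, b)
print(h.round(3))
```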
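And one round of magnitude pruning in the spirit of the pruning notebook; choosing the threshold by quantile is an illustrative assumption (the notebook iterates prune-then-fine-tune to reach high sparsity):

```python
import numpy as np

def magnitude_prune(W, sparsity=0.9):
    """Zero out the smallest-magnitude weights in one pruning round."""
    threshold = np.quantile(np.abs(W), sparsity)  # cutoff below which weights are dropped
    mask = np.abs(W) >= threshold                 # keep only large-magnitude weights
    return W * mask, mask

rng = np.random.default_rng(0)
W = rng.normal(size=(64, 64))
W_pruned, mask = magnitude_prune(W, sparsity=0.9)
print(f"sparsity: {1 - mask.mean():.2%}")         # ~90% of weights zeroed
```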
#### Computer Vision

- `07_alexnet_cnn.ipynb` - CNNs & AlexNet
  - Convolutional layers from scratch
  - Max pooling and ReLU
  - Data augmentation techniques
- `10_resnet_deep_residual.ipynb` - ResNet (sketched after this list)
  - Skip connections solve degradation
  - Gradient flow visualization
  - Identity mapping intuition
- `15_identity_mappings_resnet.ipynb` - Pre-activation ResNet
  - Pre-activation vs post-activation
  - Better gradient flow
  - Training 1000+ layer networks
- `11_dilated_convolutions.ipynb` - Dilated Convolutions
  - Multi-scale receptive fields
  - No pooling required
  - Semantic segmentation
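The core ResNet idea fits in a few lines. A minimal sketch, assuming a two-layer residual branch and toy dimensions:

```python
import numpy as np

def relu(x):
    return np.maximum(0, x)

def residual_block(x, W1, W2):
    """y = relu(x + F(x)): the skip connection carries the identity, so
    gradients flow through the addition even if F(x) learns nothing."""
    return relu(x + W2 @ relu(W1 @ x))   # F(x) = W2 . relu(W1 . x), plus identity

rng = np.random.default_rng(0)
d = 8
x = rng.normal(size=d)
W1 = rng.normal(scale=0.1, size=(d, d))
W2 = rng.normal(scale=0.1, size=(d, d))
print(residual_block(x, W1, W2).round(3))
```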
#### Attention & Transformers

- `14_bahdanau_attention.ipynb` - Neural Machine Translation
  - Original attention mechanism
  - Seq2seq with alignment
  - Attention visualization
- `13_attention_is_all_you_need.ipynb` - Transformers (sketched after this list)
  - Scaled dot-product attention
  - Multi-head attention
  - Positional encoding
  - Foundation of modern LLMs
- `06_pointer_networks.ipynb` - Pointer Networks
  - Attention as selection
  - Combinatorial optimization
  - Variable output size
- `08_seq2seq_for_sets.ipynb` - Seq2Seq for Sets
  - Permutation-invariant set encoder
  - Read-Process-Write architecture
  - Attention over unordered elements
  - Sorting and set operations
  - Comparison: order-sensitive vs order-invariant
- `09_gpipe.ipynb` - GPipe Pipeline Parallelism (bubble-time arithmetic sketched after this list)
  - Model partitioning across devices
  - Micro-batching for pipeline utilization
  - F-then-B schedule (forward all, backward all)
  - Re-materialization (gradient checkpointing)
  - Bubble time analysis
  - Training models larger than single-device memory
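The scaled dot-product attention at the heart of the Transformer notebook, as a minimal NumPy sketch (toy shapes; single head, no masking):

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)   # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)   # similarity of each query to each key
    weights = softmax(scores)         # rows sum to 1: a soft, differentiable selection
    return weights @ V, weights

rng = np.random.default_rng(0)
Q, K, V = (rng.normal(size=(5, 16)) for _ in range(3))  # 5 tokens, d_k = 16
out, w = attention(Q, K, V)
print(out.shape, w.sum(axis=-1).round(3))   # (5, 16); each weight row sums to 1
```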
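GPipe's bubble-time analysis reduces to one line of arithmetic. A sketch under the standard assumption of K equal pipeline stages and M micro-batches, where the idle "bubble" fraction is (K - 1)/(M + K - 1):

```python
# Illustrative numbers only: more micro-batches keep the pipeline fuller.
def bubble_fraction(K, M):
    """Idle fraction of an F-then-B pipeline with K stages, M micro-batches."""
    return (K - 1) / (M + K - 1)

for M in (1, 4, 32):
    print(f"K=4 stages, M={M:>2} micro-batches -> bubble = {bubble_fraction(4, M):.0%}")
```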
#### Advanced Topics

- `12_graph_neural_networks.ipynb` - Graph Neural Networks
  - Message passing framework
  - Graph convolutions
  - Molecular property prediction
- `16_relational_reasoning.ipynb` - Relation Networks
  - Pairwise relational reasoning
  - Visual QA
  - Permutation invariance
- `18_relational_rnn.ipynb` - Relational RNN
  - LSTM with relational memory
  - Multi-head self-attention across memory slots
  - Architecture demonstration (forward pass)
  - Sequential reasoning tasks
  - Section 11: Manual backpropagation implementation (~1100 lines)
  - Complete gradient computation for all components
  - Gradient checking with numerical verification
- `20_neural_turing_machine.ipynb` - Memory-Augmented Networks (addressing sketched after this list)
  - Content & location addressing
  - Differentiable read/write
  - External memory
- `21_ctc_speech.ipynb` - CTC Loss & Speech Recognition
  - Connectionist Temporal Classification
  - Alignment-free training
  - Forward algorithm
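Content-based addressing, the differentiable core of the NTM notebook, as a minimal sketch: a softmax over cosine similarity to a key, sharpened by a key strength beta (toy memory sizes assumed):

```python
import numpy as np

def content_addressing(memory, key, beta):
    """Read weights over memory slots: softmax(beta * cosine_sim(slot, key))."""
    sim = memory @ key / (
        np.linalg.norm(memory, axis=1) * np.linalg.norm(key) + 1e-8)
    w = np.exp(beta * sim)
    return w / w.sum()   # differentiable attention over slots

rng = np.random.default_rng(0)
memory = rng.normal(size=(6, 8))               # 6 slots of width 8
key = memory[2] + 0.05 * rng.normal(size=8)    # noisy copy of slot 2
w = content_addressing(memory, key, beta=5.0)
print(w.round(3), "-> read:", (w @ memory).round(2))  # mass concentrates on slot 2
```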
#### Generative Models

- `17_variational_autoencoder.ipynb` - VAE
  - Generative modeling
  - ELBO loss
  - Latent space visualization
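The reparameterization trick and the closed-form KL term of the ELBO, sketched in NumPy with toy batch and latent sizes:

```python
import numpy as np

rng = np.random.default_rng(0)

# Reparameterization trick: z = mu + sigma * eps with eps ~ N(0, I), so the
# sampling step is differentiable with respect to mu and log_var.
mu = rng.normal(size=(4, 2))        # encoder means: batch of 4, 2-D latent
log_var = rng.normal(size=(4, 2))   # encoder log-variances
eps = rng.standard_normal(mu.shape)
z = mu + np.exp(0.5 * log_var) * eps

# KL(q(z|x) || N(0, I)) has a closed form; it is the regularizer in the ELBO.
kl = 0.5 * np.sum(np.exp(log_var) + mu**2 - 1 - log_var, axis=1)
print(z.shape, kl.round(3))
```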
#### Modern Applications

- `27_multi_token_prediction.ipynb` - Multi-Token Prediction
  - Predict multiple future tokens
  - 2-3x sample efficiency
  - Speculative decoding
  - Faster training & inference
- `28_dense_passage_retrieval.ipynb` - Dense Retrieval (in-batch negatives sketched after this list)
  - Dual encoder architecture
  - In-batch negatives
  - Semantic search
- `29_rag.ipynb` - Retrieval-Augmented Generation
  - RAG-Sequence vs RAG-Token
  - Combining retrieval + generation
  - Knowledge-grounded outputs
- `30_lost_in_middle.ipynb` - Long Context Analysis
  - Position bias in LLMs
  - U-shaped performance curve
  - Document ordering strategies
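The in-batch negatives trick from the dense retrieval notebook, as a minimal sketch: with B (query, passage) pairs, the B x B similarity matrix makes every other passage in the batch a free negative. The toy embeddings here stand in for dual-encoder outputs:

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
B, d = 8, 32
Q = rng.normal(size=(B, d))               # query-encoder outputs (toy)
P = Q + 0.1 * rng.normal(size=(B, d))     # matching passage embeddings (toy)
S = Q @ P.T                               # S[i, j]: query i vs passage j
# Contrastive loss: diagonal entries are the positives, off-diagonals negatives.
loss = -np.log(softmax(S, axis=1)[np.arange(B), np.arange(B)]).mean()
print(f"in-batch contrastive loss: {loss:.3f}")
```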
#### Scaling & Theory

- `22_scaling_laws.ipynb` - Scaling Laws (power-law fit sketched after this list)
  - Power law relationships
  - Compute-optimal training
  - Performance prediction
- `23_mdl_principle.ipynb` - Minimum Description Length
  - Information-theoretic model selection
  - Compression = Understanding
  - MDL vs AIC/BIC comparison
  - Neural network architecture selection
  - MDL-based pruning (connects to Paper 5)
  - Kolmogorov complexity preview
- `25_kolmogorov_complexity.ipynb` - Kolmogorov Complexity (compression bound sketched after this list)
  - K(x) = shortest program generating x
  - Randomness = Incompressibility
  - Algorithmic probability (Solomonoff)
  - Universal prior for induction
  - Connection to Shannon entropy
  - Occam's Razor formalized
  - Theoretical foundation for ML
- `24_machine_super_intelligence.ipynb` - Universal Artificial Intelligence
  - Formal theory of intelligence (Legg & Hutter)
  - Psychometric g-factor and universal intelligence Υ(π)
  - Solomonoff induction for sequence prediction
  - AIXI: Theoretically optimal RL agent
  - Monte Carlo AIXI (MC-AIXI) approximation
  - Kolmogorov complexity estimation
  - Intelligence measurement across environments
  - Recursive self-improvement dynamics
  - Intelligence explosion scenarios
  - 6 sections: from psychometrics to superintelligence
  - Connects to Papers #23 (MDL) and #25 (Kolmogorov complexity)
- `01_complexity_dynamics.ipynb` - Complexity & Entropy
  - Cellular automata (Rule 30)
  - Entropy growth
  - Irreversibility (basic introduction)
- `19_coffee_automaton.ipynb` - The Coffee Automaton (Deep Dive)
  - Comprehensive exploration of irreversibility
  - Coffee mixing and diffusion processes
  - Entropy growth and coarse-graining
  - Phase space and Liouville's theorem
  - Poincaré recurrence theorem (the coffee will unmix, after a time of order e^N!)
  - Maxwell's demon and Landauer's principle
  - Computational irreversibility (one-way functions, hashing)
  - Information bottleneck in machine learning
  - Biological irreversibility (life and the 2nd law)
  - Arrow of time: fundamental vs emergent
  - 10 comprehensive sections exploring irreversibility across all scales
- `26_cs231n_cnn_fundamentals.ipynb` - CS231n: Vision from First Principles
  - Complete vision pipeline in pure NumPy
  - k-Nearest Neighbors baseline
  - Linear classifiers (SVM and Softmax)
  - Optimization (SGD, Momentum, Adam, learning rate schedules)
  - 2-layer neural networks with backpropagation
  - Convolutional layers (conv, pool, ReLU)
  - Complete CNN architecture (Mini-AlexNet)
  - Visualization techniques (filters, saliency maps)
  - Transfer learning principles
  - Babysitting tips (sanity checks, hyperparameter tuning, monitoring)
  - 10 sections covering the entire CS231n curriculum
  - Ties together Papers #7 (AlexNet), #10 (ResNet), and #11 (Dilated Convolutions)
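Because a power law L(N) = a * N^(-alpha) is a straight line in log-log space, fitting a scaling law is a one-line regression. A sketch on synthetic losses (the planted exponent is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
N = np.logspace(5, 9, 20)                                          # model sizes
L = 2.5 * N**-0.076 * np.exp(rng.normal(scale=0.01, size=N.size))  # noisy toy losses

# Linear fit in log-log space recovers the power-law exponent.
slope, intercept = np.polyfit(np.log(N), np.log(L), 1)
print(f"fitted exponent alpha = {-slope:.3f}")                     # ~0.076, as planted
```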
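K(x) itself is uncomputable, but any compressor gives a practical upper bound. A sketch using zlib as the stand-in compressor (the notebook's own estimator may differ):

```python
import zlib
import numpy as np

def compression_ratio(data: bytes) -> float:
    """Compressed size / raw size: a computable stand-in for K(x)/|x|."""
    return len(zlib.compress(data, 9)) / len(data)

rng = np.random.default_rng(0)
random_bytes = bytes(rng.integers(0, 256, 10_000, dtype=np.uint8))
repetitive = b"ab" * 5_000
print(f"random:     {compression_ratio(random_bytes):.2f}")  # near 1: incompressible
print(f"repetitive: {compression_ratio(repetitive):.2f}")    # near 0: short description
```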
## Repository Structure

```
sutskever-30-implementations/
├── README.md                            # This file
├── PROGRESS.md                          # Implementation progress tracking
├── IMPLEMENTATION_TRACKS.md             # Detailed tracks for all 30 papers
│
├── 01_complexity_dynamics.ipynb         # Entropy & complexity
├── 02_char_rnn_karpathy.ipynb           # Vanilla RNN
├── 03_lstm_understanding.ipynb          # LSTM gates
├── 04_rnn_regularization.ipynb          # Dropout for RNNs
├── 05_neural_network_pruning.ipynb      # Pruning & MDL
├── 06_pointer_networks.ipynb            # Attention pointers
├── 07_alexnet_cnn.ipynb                 # CNNs & AlexNet
├── 08_seq2seq_for_sets.ipynb            # Permutation-invariant sets
├── 09_gpipe.ipynb                       # Pipeline parallelism
├── 10_resnet_deep_residual.ipynb        # Residual connections
├── 11_dilated_convolutions.ipynb        # Multi-scale convolutions
├── 12_graph_neural_networks.ipynb       # Message passing GNNs
├── 13_attention_is_all_you_need.ipynb   # Transformer architecture
├── 14_bahdanau_attention.ipynb          # Original attention
├── 15_identity_mappings_resnet.ipynb    # Pre-activation ResNet
├── 16_relational_reasoning.ipynb        # Relation networks
├── 17_variational_autoencoder.ipynb     # VAE
├── 18_relational_rnn.ipynb              # Relational RNN
├── 19_coffee_automaton.ipynb            # Irreversibility deep dive
├── 20_neural_turing_machine.ipynb       # External memory
├── 21_ctc_speech.ipynb                  # CTC loss
├── 22_scaling_laws.ipynb                # Empirical scaling
├── 23_mdl_principle.ipynb               # MDL & compression
├── 24_machine_super_intelligence.ipynb  # Universal AI & AIXI
├── 25_kolmogorov_complexity.ipynb       # K(x) & randomness
├── 26_cs231n_cnn_fundamentals.ipynb     # Vision from first principles
├── 27_multi_token_prediction.ipynb      # Multi-token prediction
├── 28_dense_passage_retrieval.ipynb     # Dense retrieval
├── 29_rag.ipynb                         # RAG architecture
└── 30_lost_in_middle.ipynb              # Long context analysis
```

All 30 papers implemented! (100% complete!) 🎉
## Learning Path

### Beginner Track (Start here!)

1. Character RNN (`02_char_rnn_karpathy.ipynb`) - Learn basic RNNs
2. LSTM (`03_lstm_understanding.ipynb`) - Understand gating mechanisms
3. CNNs (`07_alexnet_cnn.ipynb`) - Computer vision fundamentals
4. ResNet (`10_resnet_deep_residual.ipynb`) - Skip connections
5. VAE (`17_variational_autoencoder.ipynb`) - Generative models
### Intermediate Track

1. RNN Regularization (`04_rnn_regularization.ipynb`) - Better training
2. Bahdanau Attention (`14_bahdanau_attention.ipynb`) - Attention basics
3. Pointer Networks (`06_pointer_networks.ipynb`) - Attention as selection
4. Seq2Seq for Sets (`08_seq2seq_for_sets.ipynb`) - Permutation invariance
5. CS231n (`26_cs231n_cnn_fundamentals.ipynb`) - Complete vision pipeline (kNN → CNNs)
6. GPipe (`09_gpipe.ipynb`) - Pipeline parallelism for large models
7. Transformers (`13_attention_is_all_you_need.ipynb`) - Modern architecture
8. Dilated Convolutions (`11_dilated_convolutions.ipynb`) - Receptive fields
9. Scaling Laws (`22_scaling_laws.ipynb`) - Understanding scale
### Advanced Track

1. Pre-activation ResNet (`15_identity_mappings_resnet.ipynb`) - Architecture details
2. Graph Neural Networks (`12_graph_neural_networks.ipynb`) - Graph learning
3. Relation Networks (`16_relational_reasoning.ipynb`) - Relational reasoning
4. Neural Turing Machines (`20_neural_turing_machine.ipynb`) - External memory
5. CTC Loss (`21_ctc_speech.ipynb`) - Speech recognition
6. Dense Retrieval (`28_dense_passage_retrieval.ipynb`) - Semantic search
7. RAG (`29_rag.ipynb`) - Retrieval-augmented generation
8. Lost in the Middle (`30_lost_in_middle.ipynb`) - Long context analysis
### Theory & Fundamentals

1. MDL Principle (`23_mdl_principle.ipynb`) - Model selection via compression
2. Kolmogorov Complexity (`25_kolmogorov_complexity.ipynb`) - Randomness & information
3. Complexity Dynamics (`01_complexity_dynamics.ipynb`) - Entropy & emergence
4. Coffee Automaton (`19_coffee_automaton.ipynb`) - Deep dive into irreversibility
## Key Insights from the Sutskever 30

### Architecture Evolution

- RNN → LSTM: Gating solves vanishing gradients
- Plain Networks → ResNet: Skip connections enable depth
- RNN → Transformer: Attention enables parallelization
- Fixed vocab → Pointers: Output can reference input
### Fundamental Mechanisms
- Attention: Differentiable selection mechanism
- Residual Connections: Gradient highways
- Gating: Learned information flow control
- External Memory: Separate storage from computation
### Training Insights
- Scaling Laws: Performance predictably improves with scale
- Regularization: Dropout, weight decay, data augmentation
- Optimization: Gradient clipping, learning rate schedules
- Compute-Optimal: Balance model size and training data
### Theoretical Foundations
- Information Theory: Compression, entropy, MDL
- Complexity: Kolmogorov complexity, power laws
- Generative Modeling: VAE, ELBO, latent spaces
- Memory: Differentiable data structures
## Implementation Philosophy

### Why NumPy-only?
These implementations deliberately avoid PyTorch/TensorFlow to:
- Deepen understanding: See what frameworks abstract away
- Educational clarity: No magic, every operation explicit
- Core concepts: Focus on algorithms, not framework APIs
- Transferable knowledge: Principles apply to any framework
### Synthetic Data Approach

Each notebook generates its own data, which enables:
- Immediate execution: No dataset downloads required
- Controlled experiments: Understand behavior on simple cases
- Concept focus: Data doesn't obscure the algorithm
- Rapid iteration: Modify and re-run instantly
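For example, a hypothetical generator in the spirit of the notebooks; the sorting task and the function name are illustrative, not taken from any specific notebook:

```python
import numpy as np

def make_sorting_batch(batch_size=32, seq_len=8, seed=0):
    """Generate a tiny sorting task: inputs and the indices that sort them."""
    rng = np.random.default_rng(seed)
    x = rng.random((batch_size, seq_len))   # random sequences in [0, 1)
    y = np.argsort(x, axis=1)               # targets: sorting permutation per row
    return x, y

x, y = make_sorting_batch()
print(x.shape, y.shape)                     # (32, 8) inputs, (32, 8) index targets
sorted_x = np.take_along_axis(x, y, axis=1)
print(np.all(sorted_x[:, :-1] <= sorted_x[:, 1:]))  # True: targets sort the inputs
```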
## Extensions & Next Steps

### Build on These Implementations
After understanding the core concepts, try:
- Scale up: Implement in PyTorch/JAX for real datasets
- Combine techniques: E.g., ResNet + Attention
- Modern variants:
  - RNN → GRU → Transformer
  - VAE → β-VAE → VQ-VAE
  - ResNet → ResNeXt → EfficientNet
- Applications: Apply to real problems
### Research Directions
The Sutskever 30 points toward:
- Scaling (bigger models, more data)
- Efficiency (sparse models, quantization)
- Capabilities (reasoning, multi-modal)
- Understanding (interpretability, theory)
## Resources

### Original Papers

See IMPLEMENTATION_TRACKS.md for full citations and links.

### Additional Reading
- Ilya Sutskever's Reading List (GitHub)
- Aman's AI Journal - Sutskever 30 Primers
- The Annotated Transformer
- Andrej Karpathy's Blog
### Courses
- Stanford CS231n: Convolutional Neural Networks
- Stanford CS224n: NLP with Deep Learning
- MIT 6.S191: Introduction to Deep Learning
## Contributing
These implementations are educational and can be improved! Consider:
- Adding more visualizations
- Extending implementations with additional experiments
- Improving explanations
- Finding bugs
- Adding comparisons with framework implementations
## Citation

If you use these implementations in your work or teaching:

```bibtex
@misc{sutskever30implementations,
  title={Sutskever 30: Complete Implementation Suite},
  author={Paul "The Pageman" Pajo, pageman@gmail.com},
  year={2025},
  note={Educational implementations of Ilya Sutskever's recommended reading list,
        inspired by https://papercode.vercel.app/}
}
```
## License
Educational use. See individual papers for original research citations.
## Acknowledgments
- Ilya Sutskever: For curating this essential reading list
- Paper authors: For their foundational contributions
- Community: For making these ideas accessible
## Latest Additions (December 2025)

### Recently Implemented (21 new papers!)
- ✅ Paper 4: RNN Regularization (variational dropout)
- ✅ Paper 5: Neural Network Pruning (MDL, 90%+ sparsity)
- ✅ Paper 7: AlexNet (CNNs from scratch)
- ✅ Paper 8: Seq2Seq for Sets (permutation invariance, attention pooling)
- ✅ Paper 9: GPipe (pipeline parallelism, micro-batching, re-materialization)
- ✅ Paper 19: The Coffee Automaton (deep dive into irreversibility, entropy, Landauer's principle)
- ✅ Paper 26: CS231n (complete vision pipeline: kNN → CNN, all in NumPy)
- ✅ Paper 11: Dilated Convolutions (multi-scale)
- ✅ Paper 12: Graph Neural Networks (message passing)
- ✅ Paper 14: Bahdanau Attention (original attention)
- ✅ Paper 15: Identity Mappings ResNet (pre-activation)
- ✅ Paper 16: Relational Reasoning (relation networks)
- ✅ Paper 18: Relational RNNs (relational memory + Section 11: manual backprop, ~1100 lines)
- ✅ Paper 21: Deep Speech 2 (CTC loss)
- ✅ Paper 23: MDL Principle (compression, model selection, connects to Papers 5 & 25)
- ✅ Paper 24: Machine Super Intelligence (Universal AI, AIXI, Solomonoff induction, intelligence measures, recursive self-improvement)
- ✅ Paper 25: Kolmogorov Complexity (randomness, algorithmic probability, theoretical foundation)
- ✅ Paper 27: Multi-Token Prediction (2-3x sample efficiency)
- ✅ Paper 28: Dense Passage Retrieval (dual encoders)
- ✅ Paper 29: RAG (retrieval-augmented generation)
- ✅ Paper 30: Lost in the Middle (long context)
## Quick Reference: Implementation Complexity

### Can Implement in an Afternoon

- ✅ Character RNN
- ✅ LSTM
- ✅ ResNet
- ✅ Simple VAE
- ✅ Dilated Convolutions
### Weekend Projects

- ✅ Transformer
- ✅ Pointer Networks
- ✅ Graph Neural Networks
- ✅ Relation Networks
- ✅ Neural Turing Machine
- ✅ CTC Loss
- ✅ Dense Retrieval
### Week-Long Deep Dives

- ✅ Full RAG system
- ⚠️ Large-scale experiments
- ⚠️ Hyperparameter optimization

> "If you really learn all of these, you'll know 90% of what matters today." - Ilya Sutskever

Happy learning! 🎉