ML By Hand
We are building a deep learning library from scratch (it evolved from a simple autograd engine). It is designed to demystify the inner workings of deep learning models by exposing every mathematical detail and stripping away the abstractions that polished ML libraries (e.g. PyTorch/TensorFlow) provide. The project offers an opportunity to learn deep learning from first principles, and to use the hand-built library to create and train state-of-the-art models (such as GPT-2).
“What I cannot create, I do not understand.” — Richard Feynman
Key Principles
- Learn By Doing: All formulas and calculations are derived in code, so you see exactly how gradients (or derivatives) are computed—no hidden black boxes!
- Learning Over Optimization: Focus on understanding the underlying mathematics and algorithms, rather than optimizing for speed or memory usage (though we can still train GPT models on a single CPU)
- PyTorch-Like API: The API closely mirrors PyTorch's to minimize adoption overhead
- Minimal Dependencies: Only uses `numpy` (and `pytorch` for gradient correctness checks in unit tests)
Why build a deep learning library from scratch?
This project initially took inspiration from Micrograd, which builds an autograd (Wikipedia) engine from scratch for educational purposes. An autograd engine computes exact derivatives by tracking every computation and applying the chain rule systematically; this is what lets a neural network learn from its errors and adjust its parameters automatically, and it is the core of deep learning. Once I had the initial building blocks (i.e. Tensor-level operations) implemented, adding more features felt straightforward, so I kept going.
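To make "tracking computations and applying the chain rule" concrete, here is a minimal, self-contained sketch of reverse-mode autograd on scalars. It is purely illustrative and is not this library's Tensor implementation; the `Scalar` class and its internals are invented for this example.

```python
# Illustrative sketch of reverse-mode autograd on scalars (NOT this library's code).
# Each operation records its inputs and a small function that applies the chain rule.
class Scalar:
    def __init__(self, data, _parents=()):
        self.data = data
        self.grad = 0.0
        self._parents = _parents          # nodes this value was computed from
        self._backward = lambda: None     # applies the local chain-rule step

    def __mul__(self, other):
        out = Scalar(self.data * other.data, (self, other))

        def _backward():
            # d(out)/d(self) = other.data, d(out)/d(other) = self.data
            self.grad += other.data * out.grad
            other.grad += self.data * out.grad

        out._backward = _backward
        return out

    def backward(self):
        # Seed the output gradient, then walk the graph in reverse topological order.
        self.grad = 1.0
        visited, order = set(), []

        def topo(node):
            if node not in visited:
                visited.add(node)
                for parent in node._parents:
                    topo(parent)
                order.append(node)

        topo(self)
        for node in reversed(order):
            node._backward()


a, b = Scalar(2.0), Scalar(-3.0)
c = a * b          # c = a * b = -6.0
c.backward()
print(a.grad, b.grad)  # dc/da = b = -3.0, dc/db = a = 2.0
```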
The primary motivation is to learn about neural networks from scratch and from first principles. There are many good ML libraries out there (e.g. TensorFlow, PyTorch, Scikit-learn) that are well optimized and feature-rich, but they introduce many abstractions that hide the underlying concepts and make it hard to understand how things actually work. I believe that to use those abstractions and libraries well, we must first understand how everything works from the ground up. That is the guiding principle of this project: all mathematical and calculus operations are explicitly derived in the code, without abstraction. Debugging a neural network, especially the backward() implementations of various functions (e.g. loss and activation functions), is also a rewarding learning experience.
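To give a flavor of what a hand-written backward() can look like, here is a minimal sketch of a sigmoid activation in plain numpy with its derivative derived by hand. The class name and method signatures are illustrative only and are not this library's actual API.

```python
import numpy as np


class Sigmoid:
    """Illustrative activation with a hand-derived backward pass (not this library's exact API)."""

    def forward(self, x):
        # sigma(x) = 1 / (1 + e^(-x))
        self.out = 1.0 / (1.0 + np.exp(-x))
        return self.out

    def backward(self, grad_output):
        # d(sigma)/dx = sigma(x) * (1 - sigma(x)), then apply the chain rule
        # with the gradient flowing in from the next layer.
        return grad_output * self.out * (1.0 - self.out)


sigmoid = Sigmoid()
x = np.array([-1.0, 0.0, 2.0])
y = sigmoid.forward(x)
dx = sigmoid.backward(np.ones_like(x))  # gradient of sum(y) w.r.t. x
print(y, dx)
```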
The goal is to keep the API as close as possible to PyTorch's, both to reduce onboarding overhead and to use PyTorch as a reference for validating correctness.
Demo/Examples
Explore the examples/ directory for real-world demonstrations of how this engine can power neural network training on various tasks:
📌 Transformers & GPT (newly added), among other examples.
Toy Example
```python
from autograd.tensor import Tensor
from autograd.nn import Linear, Module
from autograd.optim import SGD
import numpy as np


class SimpleNN(Module):
    def __init__(self, input_dim, output_dim):
        super().__init__()
        # A single linear layer (input_dim -> output_dim).
        # Mathematically: fc(x) = xW^T + b
        # where W is weight and b is bias.
        self.fc = Linear(input_dim, output_dim)

    def forward(self, x):
        # Simply compute xW^T + b without any additional activation.
        return self.fc(x)


# Create a sample input tensor x with shape (1, 3).
# 'requires_grad=True' means we want to track gradients for x.
x = Tensor([[-1.0, 0.0, 2.0]], requires_grad=True)

# We want the output to get close to 1.0 over time.
y_true = 1.0

# Initialize the simple neural network.
# This layer has a weight matrix W of shape (3, 1) and a bias of shape (1,).
model = SimpleNN(input_dim=3, output_dim=1)

# Use SGD with a learning rate of 0.03
optimizer = SGD(model.parameters, lr=0.03)

for epoch in range(20):
    # Reset (zero out) all accumulated gradients before each update.
    optimizer.zero_grad()

    # --- Forward pass ---
    # prediction = xW^T + b
    y_pred = model(x)
    print(f"Epoch {epoch}: {y_pred}")

    # Define a simple mean squared error function
    loss = ((y_pred - y_true) ** 2).mean()

    # --- Backward pass ---
    # Ultimately we need to compute the gradient of the loss with respect to the weights.
    # Specifically, if Loss = (pred - 1)^2, then:
    #   dL/d(pred) = 2 * (pred - 1)
    #   d(pred)/dW = d(xW^T + b) / dW = x^T
    # By chain rule, dL/dW = dL/d(pred) * d(pred)/dW = [2 * (pred - 1)] * x^T
    loss.backward()

    # --- Update weights ---
    optimizer.step()

# See the computed gradients for the linear layer's weight matrix:
weights = model.fc.parameters["weight"].data
bias = model.fc.parameters["bias"].data
gradient = model.fc.parameters["weight"].grad
print("[After Training] Gradients for fc weights:", gradient)
print("[After Training] layer weights:", weights)
print("[After Training] layer bias:", bias)

assert np.isclose(x.data @ weights + bias, y_true)
```
Documentation
Check out this project's modules on the docs website, which is built from the docs/ directory.
Environment Setup
Run the bootstrap script to install dependencies:
```bash
./bootstrap.sh
source .venv/bin/activate
```

This sets up your virtual environment.
Tests
Comprehensive unit tests and integration tests are available in test/autograd.
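To give an idea of how gradient correctness is typically checked against PyTorch, here is a hedged sketch of comparing a hand-derived gradient with torch.autograd. The specific operation and values are illustrative only; see test/autograd for the actual tests.

```python
import numpy as np
import torch

# Hand-derived gradient for y = sum(x ** 2): dy/dx = 2 * x
x_np = np.array([1.0, -2.0, 3.0])
manual_grad = 2 * x_np

# The same computation in PyTorch, used purely as a reference.
x_torch = torch.tensor(x_np, requires_grad=True)
y = (x_torch ** 2).sum()
y.backward()

assert np.allclose(manual_grad, x_torch.grad.numpy())
```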
Future Work
- Expanding the autograd engine to power cutting-edge neural architectures
- Further performance tuning while maintaining clarity and educational value
- Interactive tutorials for newcomers to ML and advanced topics alike
Contributing
Contributions are welcome! If you find bugs, want to request features, or add examples, feel free to open an issue or submit a pull request.
License
MIT
