I trained and uploaded the d34 model to Hugging Face: nanochat d34.
It was pretrained like this:
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=34 --device_batch_size=4 --target_param_data_ratio=40 --save_every=5000 --run=d34
This ran on an 8XH100 node for ~100 hours (~4 days) and cost ~$2,500. Note that it is 2X longtrained compared to Chinchilla, i.e. the tokens:params ratio is overridden from 20 up to 40. This means the model was trained longer than is compute-optimal, just to squeeze a bit more capability into a bit smaller package.
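Concretely, it is this ratio override that sets the training horizon. Here is a minimal sketch of the arithmetic (not nanochat's actual code), using the run stats listed below:

```python
# A minimal sketch of how --target_param_data_ratio sets the training horizon.
# Numbers are taken from the run stats below; this is not nanochat's actual code.
num_params = 2_217_082_880          # d34 parameter count
target_param_data_ratio = 40        # tokens per parameter; Chinchilla-optimal is ~20
total_batch_size = 524_288          # tokens consumed per optimizer step

target_tokens = num_params * target_param_data_ratio    # 88,683,315,200 tokens
num_iterations = target_tokens // total_batch_size      # 169,150 steps

print(f"{target_tokens:,} training tokens over {num_iterations:,} steps")
```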
Some of the notable stats of the model are as follows:
- depth: 34
- max_seq_len: 2048
- target_param_data_ratio: 40
- device_batch_size: 4
- total_batch_size: 524,288
- Number of parameters: 2,217,082,880
- Number of FLOPs per token: 1.426509e+10
- Calculated number of iterations: 169,150
- Number of training tokens: 88,683,315,200
- Tokens : Params ratio: 40.0000
- DDP world size: 8
- Minimum validation bpb: 0.7045
- Final validation bpb: 0.7045
- MFU %: 47.40%
- Total training flops: 1.265075e+21
- Peak memory usage: 69,811.85 MiB
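As a quick sanity check, the FLOPs and timing numbers above are mutually consistent. A back-of-the-envelope estimate, assuming an H100 bf16 dense peak of ~989 TFLOPS per GPU (an assumption on my part, not a number from the logs):

```python
# Back-of-the-envelope check that total FLOPs, MFU and the ~100h wall-clock agree.
flops_per_token = 1.426509e10
num_tokens = 88_683_315_200
total_flops = flops_per_token * num_tokens             # ~1.265e21 FLOPs

num_gpus = 8
h100_peak_flops = 989e12                               # bf16 dense peak per GPU (assumed)
mfu = 0.474

achieved_cluster_flops = num_gpus * mfu * h100_peak_flops
wall_clock_hours = total_flops / achieved_cluster_flops / 3600
print(f"total FLOPs ~ {total_flops:.3e}, wall clock ~ {wall_clock_hours:.0f} hours")
# -> total FLOPs ~ 1.265e+21, wall clock ~ 94 hours, close to the reported ~100h
```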
This base model achieves a CORE score of 0.3382 (for comparison: d32 is 0.3168, the d20 "speedrun" is 0.22, GPT-2 is 0.25).
The wandb logs look like this (d32 = brown, d34 = blue, the resumed final 5% of training = green at the very end):
Notice that:
- d34 training was fairly stable except for a tiny bump near the beginning that recovered quickly.
- the training crashed with some cryptic NCCL error with the last 5% remaining. So I used the new optimization resumption logic to resume training from that point, which worked perfectly (a generic sketch of the idea is below).
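For context, the general shape of that kind of crash-safe resume in a PyTorch training loop is roughly as follows. This is only a generic sketch with hypothetical paths and helpers, not nanochat's actual implementation:

```python
import os
import torch

CKPT_PATH = "checkpoints/latest.pt"   # hypothetical location

def save_checkpoint(model, optimizer, step):
    # Saving the optimizer state (not just the weights) is what makes resumption
    # seamless: optimizer moments and the step counter are preserved.
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
    }, CKPT_PATH)

def maybe_resume(model, optimizer):
    """Return the step to resume from (0 if no checkpoint exists)."""
    if not os.path.exists(CKPT_PATH):
        return 0
    ckpt = torch.load(CKPT_PATH, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"] + 1
```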
Midtraining and SFT then look like this:
| Metric | BASE | MID | SFT | RL |
|-----------------|----------|----------|----------|----------|
| CORE | 0.3382 | - | - | - |
| ARC-Challenge | - | 0.5333 | 0.5435 | - |
| ARC-Easy | - | 0.7071 | 0.7365 | - |
| GSM8K | - | 0.1137 | 0.1296 | - |
| HumanEval | - | 0.1159 | 0.0488 | - |
| MMLU | - | 0.4297 | 0.4311 | - |
| ChatCORE | - | 0.4087 | 0.4100 | - |
For comparison, d32 (the $1,000 run) looked like this:
| Metric | BASE | MID | SFT | RL |
|-----------------|----------|----------|----------|----------|
| CORE | 0.3168 | - | - | - |
| ARC-Challenge | - | 0.4787 | 0.4991 | - |
| ARC-Easy | - | 0.6233 | 0.6797 | - |
| GSM8K | - | 0.1099 | 0.1274 | 0.1994 |
| HumanEval | - | 0.1098 | 0.1280 | - |
| MMLU | - | 0.3896 | 0.4049 | - |
| ChatCORE | - | 0.2417 | 0.2734 | - |
So our ChatCORE lifted quite a lot, from 0.27 to 0.41. You'll notice that HumanEval stubbornly plunged during SFT, which I attribute mostly to noise; I expect one should be able to tune things a bit to recover good performance. I expected a bit more of a lift on GSM8K too: midtraining shows a good lift, but SFT botches it a bit again. Possibly not tuned super well.
The d34 model is now deployed live at nanochat.karpathy.ai.