d34 model (~$2,500) · karpathy/nanochat · Discussion #314


I trained and uploaded the d34 model to huggingface: nanochat d34.
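
If you want the weights locally, a standard Hugging Face Hub snapshot download should work. Here is a minimal sketch; the repo id is my guess for illustration, so use the exact id from the model page linked above:

```python
from huggingface_hub import snapshot_download

# Fetch the released checkpoint files into the local HF cache.
# NOTE: the repo id below is an assumption for illustration; substitute the
# exact id from the actual Hugging Face model page.
local_dir = snapshot_download(repo_id="karpathy/nanochat-d34")
print("downloaded to:", local_dir)
```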

It was pretrained like this:

```bash
torchrun --standalone --nproc_per_node=8 -m scripts.base_train -- --depth=34 --device_batch_size=4 --target_param_data_ratio=40 --save_every=5000 --run=d34
```

This ran on an 8XH100 node for ~100 hours (~4 days) and cost ~$2,500. Note that it is 2X longtrained compared to Chinchilla, i.e. the param:token ratio is overridden from 20 up to 40. This means the model was trained for longer than is compute-optimal, just to squeeze a bit more capability into a bit smaller package (the arithmetic behind the derived stats is sketched right after the list below).

Some of the notable stats of the model are as follows:

- depth: 34
- max_seq_len: 2048
- target_param_data_ratio: 40
- device_batch_size: 4
- total_batch_size: 524,288
- Number of parameters: 2,217,082,880
- Number of FLOPs per token: 1.426509e+10
- Calculated number of iterations: 169,150
- Number of training tokens: 88,683,315,200
- Tokens : Params ratio: 40.0000
- DDP world size: 8
- Minimum validation bpb: 0.7045
- Final validation bpb: 0.7045
- MFU %: 47.40%
- Total training flops: 1.265075e+21
- Peak memory usage: 69811.85MiB
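
These derived numbers follow directly from the configured ones. The sketch below reproduces them in plain Python; it assumes total_batch_size is counted in tokens and that the gap between the per-device batch and the total batch is closed by gradient accumulation (both inferences on my part, not something stated above):

```python
# Back-of-the-envelope check of the derived stats above.
num_params       = 2_217_082_880
ratio            = 40                  # --target_param_data_ratio
flops_per_token  = 1.426509e10
total_batch_size = 524_288             # assumed to be tokens per optimizer step
max_seq_len      = 2048
device_batch     = 4                   # sequences per GPU per micro-step
world_size       = 8                   # GPUs in the node

train_tokens = num_params * ratio                     # 88,683,315,200
iterations   = train_tokens // total_batch_size       # 169,150
total_flops  = flops_per_token * train_tokens         # ~1.265e21
grad_accum   = total_batch_size // (max_seq_len * device_batch * world_size)  # 8 (inferred)

print(f"{train_tokens=:,} {iterations=:,} {total_flops=:.3e} {grad_accum=}")
```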

The achieved CORE score of this base model is 0.3382. (For comparison, d32 is 0.3168, d20 "speedrun" is 0.22, GPT-2 is 0.25).

The wandb logs look like this (d32 = brown, d34 = blue, 5% resumed training = green at the very end)

(image: wandb training curves)

Notice that:

  • d34 training was fairly stable except for a tiny bump that recovered quickly near the beginning.
  • the training crashed with ~5% of the run remaining, with some cryptic NCCL error. So I used the new optimization resumption logic to resume training from that point for the last 5%, which worked perfectly (a generic sketch of the save/resume pattern follows below).
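
For context, resuming after a crash follows the usual PyTorch pattern of checkpointing the model, optimizer, and step counter, then picking the loop back up from the saved step. This is a minimal generic sketch of that pattern, not nanochat's actual resumption code; the file name, toy model, and step counts are stand-ins:

```python
import os
import torch
import torch.nn as nn

CKPT = "ckpt.pt"  # hypothetical checkpoint path for this sketch

def save_checkpoint(model, optimizer, step):
    # Persist everything needed to continue training exactly where it left off.
    torch.save({"model": model.state_dict(),
                "optimizer": optimizer.state_dict(),
                "step": step}, CKPT)

def load_checkpoint(model, optimizer):
    # Restore weights, optimizer moments, and the step counter.
    ckpt = torch.load(CKPT, map_location="cpu")
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]

# Toy stand-ins so the sketch runs end to end (the real run used 169,150 steps
# and --save_every=5000).
model = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
total_steps, save_every = 100, 10

start_step = load_checkpoint(model, optimizer) if os.path.exists(CKPT) else 0
for step in range(start_step, total_steps):
    loss = model(torch.randn(8, 16)).pow(2).mean()   # dummy objective
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    if (step + 1) % save_every == 0:
        save_checkpoint(model, optimizer, step + 1)
```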

Midtraining and SFT then look like this:

| Metric          | BASE     | MID      | SFT      | RL       |
|-----------------|----------|----------|----------|----------|
| CORE            | 0.3382   | -        | -        | -        |
| ARC-Challenge   | -        | 0.5333   | 0.5435   | -        |
| ARC-Easy        | -        | 0.7071   | 0.7365   | -        |
| GSM8K           | -        | 0.1137   | 0.1296   | -        |
| HumanEval       | -        | 0.1159   | 0.0488   | -        |
| MMLU            | -        | 0.4297   | 0.4311   | -        |
| ChatCORE        | -        | 0.4087   | 0.4100   | -        |

For comparison, d32 (the $1,000 run) looked like this:

| Metric          | BASE     | MID      | SFT      | RL       |
|-----------------|----------|----------|----------|----------|
| CORE            | 0.3168   | -        | -        | -        |
| ARC-Challenge   | -        | 0.4787   | 0.4991   | -        |
| ARC-Easy        | -        | 0.6233   | 0.6797   | -        |
| GSM8K           | -        | 0.1099   | 0.1274   | 0.1994   |
| HumanEval       | -        | 0.1098   | 0.1280   | -        |
| MMLU            | -        | 0.3896   | 0.4049   | -        |
| ChatCORE        | -        | 0.2417   | 0.2734   | -        |

So our ChatCORE lifted quite a lot, from 0.27 to 0.41. You'll notice that HumanEval stubbornly plunged during SFT, which I attribute mostly to noise; I expect one should be able to tune things a bit to recover good performance. I expected a bit more of a lift on GSM8K too: midtraining shows a good lift, but SFT botches it a bit again. Possibly not tuned super well.

The d34 model is now deployed live on nanochat.karpathy.ai.
