As part of my "extra credit" projects after finishing the main body of Sebastian Raschka's book "Build a Large Language Model (from Scratch)", I've trained seven base models completely from scratch based on the book's GPT-2 code -- three locally, and four in the cloud. I plan to train more as I work on ways to improve the quality of the trained models, in the hope that I can get to something closer to the original OpenAI weights' loss on my own hardware, or at least on something I can rent without breaking the bank.
It makes sense to share these models somewhere, both so that other people can take a look if they like, and so that I build up the knowledge of how to do it; then, if I produce something more interesting in the future, I'll know how to share that too.
Raschka's code is all released under the Apache v2 open source license, so I can share my stuff under the same license without worrying about triggering any legal issues. So: I've put all of the models I've trained so far on Hugging Face under that license, and made them reasonably HF-native (I'll explain what I mean by that later).
From the post where I trained the models locally, we have:
- gpjt/1xrtx3090m24-fineweb -- the first model in that post, trained on a roughly Chinchilla-optimal number of tokens (20x the number of parameters; there's a quick sketch of that arithmetic just after this list) from FineWeb.
- gpjt/1xrtx3090m24-fineweb-edu -- the second model, trained on the same number of tokens from FineWeb-Edu.
- gpjt/1xrtx3090m24-fineweb-edu-2x -- the third one, which is the gpjt/1xrtx3090m24-fineweb-edu model trained further on another roughly Chinchilla-optimal number of tokens from the same dataset.
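As a quick aside, "Chinchilla-optimal" here just means the rule of thumb of roughly 20 training tokens per model parameter. Here's a rough sketch of that arithmetic for any PyTorch model; the helper function is purely illustrative, not the exact calculation my training scripts use:

import torch.nn as nn

def chinchilla_token_budget(model: nn.Module, tokens_per_param: int = 20) -> int:
    # Chinchilla rule of thumb: train on roughly 20 tokens per model parameter.
    return tokens_per_param * sum(p.numel() for p in model.parameters())

For a GPT-2-sized model like these, that comes out at a few billion training tokens.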
Then, from the post where I trained on a bunch of different kinds of machines on Lambda Labs, four models (with two checkpoints from one of them):
- gpjt/8xa100m40 -- trained on an 8x A100, 40 GiB/GPU machine.
- gpjt/8xb200m160 -- trained on an 8x B200, 160 GiB/GPU machine.
- gpjt/8xh100m80-best -- trained on an 8x H100, 80 GiB/GPU machine. The best validation loss for this train was not at the last iteration, so this is the checkpoint with the best loss.
- gpjt/8xh100m80-latest -- this one is the final checkpoint from the run above.
- gpjt/8xa100m80 -- trained on an 8x A100, 80 GiB/GPU machine.
You can see how they compare on my evals at the bottom of this post.
I wanted to make them all usable within the Hugging Face ecosystem -- that is, I didn't want to just dump a bunch of weights and code into repos there, but rather to have something that someone coming to them without much context could make sense of. Let's dig into that.
Here's the code I've been using as a smoke test after training a model to make sure it's not complete garbage. There's quite a lot of it.
import json
import math
from pathlib import Path

import click
import tiktoken
import torch
from safetensors.torch import load_file

from gpt import GPTModel


@click.command()
@click.argument("model_config_path")
@click.argument("model_safetensors_path")
def main(model_config_path, model_safetensors_path):
    # The config is the JSON dict of hyperparameters that GPTModel expects;
    # the safetensors file holds the trained weights.
    if not Path(model_config_path).is_file():
        raise Exception(f"Could not find model config at {model_config_path}")
    with open(model_config_path, "r") as f:
        model_config = json.load(f)

    if not Path(model_safetensors_path).is_file():
        raise Exception(f"Could not find model safetensors at {model_safetensors_path}")

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    model = GPTModel(model_config)
    model.load_state_dict(load_file(model_safetensors_path))
    model.to(device)
    model.eval()

    tokenizer = tiktoken.get_encoding("gpt2")

    input_text = "Every effort moves you"
    tokens = tokenizer.encode(input_text)

    num_tokens = 20
    temperature = 1.4
    top_k = 25

    with torch.no_grad():
        for ix in range(num_tokens):
            input_tensor = torch.tensor(
                tokens, dtype=torch.long, device=device
            ).unsqueeze(0)
            output_tensor = model(input_tensor)
            # We only want the logits for the last position -- the model's
            # prediction for the next token.
            logits = output_tensor[:, -1, :]
            # Top-k filtering: mask out everything below the k-th largest logit.
            top_logits, _ = torch.topk(logits, top_k)
            min_val = top_logits[:, -1]
            logits = torch.where(
                logits < min_val,
                torch.tensor(-math.inf).to(logits.device),
                logits
            )
            # Scale by temperature, then sample from the resulting distribution.
            logits /= temperature
            probs = torch.softmax(logits, dim=-1)
            next_token = torch.multinomial(probs, num_samples=1).item()
            tokens.append(next_token)

    print(tokenizer.decode(tokens))


if __name__ == "__main__":
    main()
That's a lot of faffing about to generate a continuation of "Every effort moves you"!
Disregarding the argument parsing and validation boilerplate, we have to load up the model, load up the tokeniser, encode our prompt, and then do a bunch of rather arcane stuff 1 to sample from the model and generate some tokens before we finally print out the result.
With the HF Transformers library, there are extra levels of abstraction that allow you to do things much more simply:
from transformers import pipeline
pipe = pipeline(task="text-generation", model="gpjt/some-model-name", trust_remote_code=True)
out = pipe(
"Every effort moves you",
max_new_tokens=20,
do_sample=True,
temperature=1.4,
top_k=25,
)
print(out[0]["generated_text"])
...and I wanted what I published to work with that -- and, indeed, to be trainable further using the associated training library, like I did during my fine-tuning experiments.
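I won't walk through the training side here (the notebook I mention below has the real code), but just as an illustration, here's a minimal sketch of what further training might look like with the Transformers Trainer. This assumes the repos' custom model and tokeniser load via the Auto classes with trust_remote_code; the model name is a placeholder like in the pipeline example above, and train.txt is a stand-in for whatever text you want to tune on:

from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "gpjt/some-model-name"  # placeholder, as in the pipeline example above
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, trust_remote_code=True)

# GPT-2's tokeniser has no pad token, so reuse the end-of-text token for padding.
tokenizer.pad_token = tokenizer.eos_token

# "train.txt" is a hypothetical stand-in for whatever dataset you want to use.
dataset = load_dataset("text", data_files={"train": "train.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=256)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="further-trained",
        num_train_epochs=1,
        per_device_train_batch_size=4,
    ),
    train_dataset=tokenized["train"],
    # mlm=False gives ordinary next-token (causal LM) labels.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()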
I managed to get that all to work, but it was quite a lot more effort than I expected. But in the end, both the pipeline code above and the training code that you can see in this notebook worked fine.
I'll write a follow-up blog post shortly about how to write the code to make a vanilla PyTorch model work within the Hugging Face ecosystem (probably not as part of this LLM from scratch series, as it's a bit of a tangent). But in the meantime, if you're using HF and want to take a look, have fun :-) I've put all of the models in a collection.
Update: here's the follow-up on how to upload custom models to Hugging Face.