It is Sunday night. March 15th. I have been writing all weekend. This is one of those weekends where the ideas just keep coming and I do not want to stop.
On Wednesday, Michael bought a MacBook Neo. That night, I played around with it and trained a model using MLX. I saw what Apple Silicon could do with machine learning on 8 gigabytes of RAM and I could not stop thinking about it. On Friday, I went to the Apple Store and bought my own. The 512GB model with Touch ID. 8GB of unified memory. An A18 Pro chip with a 6-core CPU and a 5-core GPU. $699.
And then I trained a language model on it.
I did not know that was possible. I genuinely did not think you could fine-tune an AI model on a machine with 8 gigabytes of memory. I was wrong. It took some tweaking, and I hit a few walls, but I got there. It works. And I am so excited about it.
Let me back up for a second, because I want to make sure everyone is on the same page here.
You know how models like ChatGPT and Claude can answer questions about anything? Those models have billions of parameters. Think of parameters as the knobs and dials inside the model’s brain. A 3 billion parameter model has 3 billion of those knobs. Each one was tuned during training on massive amounts of text from the internet.
You are not going to train the next GPT on your laptop. That takes thousands of GPUs and millions of dollars. That is not what this is.
Fine-tuning is different. Think of it like medicine. A general practitioner knows a little about everything. A cardiologist knows a lot about hearts. Fine-tuning takes a general model and turns it into a specialist. You are not building the brain from scratch. You are adding knowledge the model did not have before and sharpening it for a specific job.
There is a technique called LoRA that makes this even more efficient. LoRA stands for Low-Rank Adaptation. Instead of adjusting all 3 billion parameters, LoRA attaches small adapter weights to specific layers of the model and only trains those. The rest of the model stays frozen.
In my case, 0.054 percent of the parameters were trainable. That is 1.7 million parameters out of 3.2 billion. The adapter weights are controlled by a rank, which sets how large the adapters are and how much new behavior they can hold, and an alpha, which scales how strongly the adaptation affects the model. A rank of 8 and an alpha of 16 is a common starting point. To make the scale concrete: a single 3072-by-3072 attention projection in a model this size holds about 9.4 million weights, while a rank-8 adapter pair for that same projection adds only about 49,000.
The result is a tiny safetensors file, usually just a few megabytes, that contains everything the model learned during fine-tuning. You load it on top of the base model and the behavior changes. That is a big part of why this fits on a MacBook Neo: you only ever train and store those few million adapter weights, never a second copy of the model.
MLX is Apple’s machine learning framework built specifically for Apple Silicon. It runs entirely on your local machine. No cloud account. No usage fees. No sending your data to someone else’s server.
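If you want to follow along, the setup is one Python package. The training and generation commands in this post come from mlx-lm, which you can install with pip, assuming you already have Python on your Mac:

pip install mlx-lm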
The MacBook Neo uses Apple’s unified memory architecture. That means the CPU and GPU share the same memory pool. In a traditional machine with a discrete GPU, data has to be copied back and forth between CPU memory and GPU memory, which wastes time and space. On Apple Silicon, the model just sits in one place and both the CPU and GPU access it directly. That is a big deal for machine learning on limited hardware.
Companies use fine-tuning all the time. They take a general model and train it on their internal data so it can answer questions about their products, their documentation, their codebase. Developers fine-tune coding models on their specific frameworks. Customer support teams fine-tune models on their FAQ databases. It is practical, everyday AI work. And you can do it at home.
I needed something to teach the model. I decided to build an Apple products chatbot. Seemed fitting given I just bought an Apple laptop.
I wrote 200 question-and-answer pairs about Apple products, with help from AI to generate some of them. That is called synthetic data. You use AI to create training examples, then you go through every single one and make sure it is accurate. Quality control matters. The model will learn whatever you give it, including mistakes. Garbage in, garbage out. That is not a cliche. That is the literal truth of how fine-tuning works.
The examples cover iPhones, iPads, Macs, AirPods, and Apple Watch. Real specs sourced from real product pages online. Things like “How much does the iPhone 16 Pro Max cost?” and “What is the difference between the MacBook Air and MacBook Pro?” and “Can I use an Apple Pencil with any iPad?”
Each example uses the chat format, which is perfect for building chatbots. A customer asks a question, and the assistant gives a specific, accurate answer. There are other formats too, like completions for simple input-output tasks and plain text for teaching a writing style. But for a chatbot, the conversation format is what you want. 200 of those.
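Here is roughly what a single training example looks like in that format. mlx_lm expects a data folder with a train.jsonl and a valid.jsonl, one JSON object per line. This one is wrapped across lines for readability, and the wording is illustrative, but the shape is the thing to notice, including the system message that stays the same across every example.

{"messages": [
  {"role": "system", "content": "You are a helpful Apple products expert."},
  {"role": "user", "content": "Can I use an Apple Pencil with any iPad?"},
  {"role": "assistant", "content": "No. Each iPad supports specific Apple Pencil models, so check your iPad's compatibility list before you buy one."}
]}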
That number matters. I will get to why.
Training a model can be difficult. It is not always plug and play. Sometimes it takes a few attempts to get the settings right and the data right. That is part of the craft.
Here is the thing about training locally. When a run fails, it costs nothing. No cloud bill ticking in the background. No GPU rental fees adding up while you experiment. I ran this training three times before I got it right. On a cloud platform, that would have been real money. On my MacBook Neo, it was just time.
You feed the model your examples and it tries to learn patterns from them. Every few steps, you check how it performs on examples it has not seen before. That check produces a number called the validation loss, which tells you how wrong the model is. Lower is better.
How many examples do you need? It depends on what you are trying to do. 50 examples is enough for a smoke test. You can confirm your setup works and the model is learning something, but it will overfit. 200 to 500 examples is the minimum for meaningful behavior change. That is where the model starts learning the pattern, not just memorizing the specific answers. 1,000 to 5,000 examples is where you get real domain shift. The model genuinely adopts the behavior you are teaching it.
What makes a good training example? Consistency. Specificity. Cleanliness. Every example should use the same system message so the model learns a consistent identity. Answers should be specific, not vague. Say “the iPhone 16 Pro has a 48MP main camera” instead of “the iPhone has a good camera.” And keep it clean. No typos. No contradictions. Every example should be something you would want the model to actually say to a user.
My first training run used 83 examples. I ran it. I watched the numbers. For the first 200 iterations, everything looked great. The validation loss kept dropping. The model was learning.
Then it started going back up.
That is overfitting. The model stopped learning general patterns and started memorizing the specific examples I gave it. It could recite my training data perfectly but it could not handle a new question it had never seen. It is like a student who memorizes the answer key instead of understanding the material. Aces the practice test. Fails the real one.
With only 83 examples, the model ran through the entire dataset about 7 times during training. By iteration 200, it had already learned what it could. Everything after that was just memorization.
Here is what I learned and what you should do from the start. Watch the validation loss, not the training loss. The training loss will almost always go down. That does not mean the model is getting better. It might just be memorizing. The validation loss tells you how the model performs on data it has not seen. That is the number that matters.
Save checkpoints frequently using the --save-every flag. I use --save-every 50. That way, if the model starts overfitting at iteration 200, you can go back to the checkpoint at iteration 200 and use that one. You do not lose your best work.
Check validation loss often using --steps-per-eval. The default checks every 200 steps. That is too infrequent for small datasets. Set it to 50 so you can see exactly when the model stops improving.
Keep training under 3 passes through your data. With 83 examples and a batch size of 1, each pass is 83 iterations. Three passes is about 250 iterations. I ran 600. That was too many.
I had to stop, diagnose, and start over.
I am including this part because I think it is important to show what actually happens. Not just the clean version. The clean version is not honest.
I went back, sourced more examples from Apple’s product pages, and brought the dataset up to 200.
Then I ran it again. Here is the exact command.
mlx_lm.lora \
    --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --train \
    --data ./data \
    --iters 600 \
    --batch-size 1 \
    --num-layers 4 \
    --grad-checkpoint \
    --learning-rate 1e-5 \
    --save-every 50 \
    --steps-per-eval 50 \
    --adapter-path ./apple-chat-adapters-v2
This time, the validation loss kept going down. All the way to the end. No overfitting. The model kept improving throughout the entire 600 iterations. It was still getting better when training finished.
If you cannot get more data, you can reduce the number of iterations. Iterations are training steps. Each iteration, the model processes one batch of examples and updates its weights. Fewer iterations means fewer chances to memorize. If my first run had stopped at 200 iterations instead of 600, the overfitting would have been much less severe.
There is another trick called LoRA dropout. Dropout randomly turns off some of the adapter weights during each training step, which forces the model to generalize instead of memorize. You set it to something small like 0.05 or 0.1. You configure it in a YAML file that you pass to the training command with the --config flag.
A config.yaml file lets you set the LoRA rank, the alpha scaling factor, and the dropout rate all in one place. It keeps your training runs organized and repeatable. For a small dataset, a rank of 8, an alpha of 16, and a dropout of 0.1 is a reasonable starting point.
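Here is a minimal sketch of what that file can look like. I am following the example config that ships with mlx_lm, so double-check the exact key names against your version. One wrinkle: mlx_lm expresses the alpha as a scale value rather than a raw alpha, and in the usual LoRA convention scale is alpha divided by rank, so an alpha of 16 with a rank of 8 becomes a scale of 2.0.

# lora_config.yaml -- a sketch, not gospel. Check the example
# config bundled with your mlx_lm version for the exact keys.
model: "mlx-community/Llama-3.2-3B-Instruct-4bit"
train: true
data: "./data"
iters: 600
batch_size: 1
learning_rate: 1e-5

lora_parameters:
  # Which layers inside each transformer block get adapters.
  keys: ["self_attn.q_proj", "self_attn.v_proj"]
  rank: 8
  # scale = alpha / rank, so alpha 16 with rank 8 is 2.0.
  scale: 2.0
  dropout: 0.1

Then you pass it to the training command with mlx_lm.lora --config lora_config.yaml.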
That is the difference more data makes. Same model. Same settings. Same MacBook Neo. Just more examples to learn from.
Here is the part that still surprises me.
Peak memory usage during the entire training run was 2.328 GB. Out of 8. The MacBook Neo did not choke. It did not overheat. It did not even breathe hard.
To put that in perspective, Safari with a dozen tabs open uses more memory than my entire training run did.
I did use some low-memory settings to get there. I processed one example at a time instead of in batches (that is the --batch-size 1 in the command above). I applied LoRA adapters to only 4 of the model’s layers instead of the default 16 (--num-layers 4). And I turned on gradient checkpointing (--grad-checkpoint), which recalculates certain intermediate values during training instead of storing them all in memory. You trade a little speed for a lot of memory savings.
Training ran at about half an iteration per second. A full run of 600 iterations took about 20 minutes. It is not fast compared to a cloud GPU. But it is free. And it runs on a machine that fits in a backpack.
For comparison, renting a GPU on a cloud platform costs anywhere from $1 to $5 per hour. My training run costs zero dollars because it runs on hardware I already own. Do that a hundred times and the savings add up.
After training, the model can answer questions about Apple products. What storage options come with a specific iPhone. What chip is in a particular MacBook. How long an AirPods battery lasts. Which iPad works with which Apple Pencil.
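Asking it a question looks like this. Same mlx_lm tooling, pointed at the adapters from the training run. The prompt is just an example:

mlx_lm.generate \
    --model mlx-community/Llama-3.2-3B-Instruct-4bit \
    --adapter-path ./apple-chat-adapters-v2 \
    --prompt "How much does the iPhone 16 Pro Max cost?"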
It learned all of this from the 200 examples I wrote. That is the point of fine-tuning. You take a general-purpose language model and you teach it something specific to your needs.
The base model was Llama 3.2 3B. A 3 billion parameter model that Meta released as open source. Not the largest model out there. But it is one that runs on my MacBook Neo. And after fine-tuning, it knows things the base model did not know.
Anyone with a Mac can do this. An 8GB MacBook Neo is the entry-level machine. It is the one students buy. It is the one people get as their first Mac. And it can train a language model.
That was not true two years ago. MLX and Apple Silicon and LoRA made it true now.
I am writing a book about all of this. Everything from setting up your environment to understanding how models work to building your own dataset to running training and evaluating what you get. All of it on local hardware. All of it on a Mac.
This post covers one piece of it. The piece where I find out whether training a model on a MacBook Neo is actually possible.
It is.
I will share more as the book comes together. If you are working on something similar, or if you are thinking about starting, I would love to hear about it.
Thank you for reading.