GitHub - sciences44/mlx-lora-finetune: Fine-tune Qwen3.5 for Text-to-SQL on your Mac with MLX LoRA — no GPU required. 4-model comparison, prompt engineering benchmark, and honest limitations.

Your M-series Mac is a training rig. You just didn't know it yet.

uv sync → train → eval — that's it.

📑 Table of Contents

💡 The Problem
🎯 What You'll Build
🔬 The Experiment — our 4-step journey from 0% to 50%
🚀 Quickstart — up and running in 2 minutes
🎮 Experiment Ideas — where to play and what to tweak
📊 Training Benchmarks
🧬 How LoRA Works
📦 Dataset
⚖️ How We Compare — vs GPT-4o, Prem-1B-SQL, Oxen.ai
⚠️ Limitations
🛠️ Troubleshooting
📚 Resources

💡 The Problem Everyone Has

You want to fine-tune an LLM. Your options:

Option	Cost	Setup Time	Pain Level
☁️ Rent an A100 on AWS	$3.50/hr	2 hours	SSH, CUDA drivers, OOM
📓 Google Colab Pro	$10/mo	30 min	Timeouts, disk limits
🍎 Your Mac with MLX	$0	2 min	`uv sync`, done

The secret: Apple Silicon's unified memory means your 16/36/64/96 GB of RAM is your VRAM. No data transfers between CPU and GPU. No OOM surprises. The model lives in one place, and the GPU reads it directly.

🎯 What You'll Build

A fine-tuned LLM that generates SQL queries from natural language — in 600 iterations, on your Mac. We tested 4 models to find the sweet spot.

You: "What is the total population of cities in California?"
  → SELECT SUM(population) FROM cities WHERE state = 'California'

🔬 The Experiment

We asked one question: what's the cheapest way to teach a small model a new skill?

The task: given a database schema and a natural language question, generate the correct SQL query. We kept everything identical across all runs — same dataset, same hyperparameters, same 600 iterations — and only changed the model.

graph LR
    A["🤔 Base model<br/>knows SQL?"] -->|"0% SQL"| B["📝 Prompt eng<br/>fixes it?"]
    B -->|"1.5% SQL"| C["🔧 LoRA<br/>fine-tuning"]
    C -->|"86.5% SQL"| D["📏 Bigger<br/>= better?"]
    D -->|"2B wins"| E["🏆 50% semantic<br/>accuracy"]

    style A fill:#ff6b6b,color:#fff
    style B fill:#ff9f43,color:#fff
    style C fill:#1dd1a1,color:#fff
    style D fill:#54a0ff,color:#fff
    style E fill:#5f27cd,color:#fff

Step 1️⃣ — Does the base model already know SQL?

First sanity check — does Qwen3.5-0.8B (4-bit) already generate SQL without any fine-tuning?

Prompt:  CREATE TABLE employees (...); INSERT INTO employees VALUES (...);
         Q: Who earns more than 100k in Engineering?
         A:

Base model output:  "Based on the provided schema, there is no specific
                     column that indicates salary..." (wrong — salary IS in the schema)

On 200 test examples: 0% valid SQL. The model either hallucinates text answers, outputs raw numbers (42, 300), or generates infinite repetition.

Step 2️⃣ — Can prompt engineering fix it?

Before investing in fine-tuning, we tried the obvious shortcut — just add "Only answer in SQL" to the prompt:

Approach	SQL Rate	Exact Match
❌ Base model (no instruction)	0%	0%
⚠️ Base + "Only answer in SQL"	1.5%	0%
✅ LoRA fine-tuning (600 iters)	86.5%	24.5%

"Can't I just add 'answer in SQL' to the prompt?" — We tested it. The model still computes answers in its head (42, 300) instead of writing queries. Prompt engineering doesn't work at this model size. Fine-tuning teaches a deep behavioral pattern.

Step 3️⃣ — Fine-tune with LoRA

600 iterations. 15 minutes. 0.48% of parameters trained:

Metric	Before	After	Delta
Generates valid SQL	0%	86.5%	🟢 +86.5pp
Semantically correct	0%	47.0%	🟢 +47.0pp
Exact match	0%	24.5%	🟢 +24.5pp

Manual review of all 200 test outputs shows 47% are semantically correct — nearly double the 24.5% exact match, which penalizes valid rewrites like alias differences (t.name vs name), dialect variations (DATE_SUB vs DATEADD), and equivalent logic (ORDER BY DESC LIMIT 1 vs subquery).

Step 4️⃣ — Does bigger = better?

We ran the same experiment across 4 models — same data, same hyperparameters, same 600 iterations:

Model	Params	Baseline SQL	Fine-tuned SQL	Δ SQL	Semantic Accuracy	Peak RAM
Qwen3.5-0.8B	0.8B	0%	86.5%	+86.5pp	47.0%	3.9 GB
🏆 Qwen3.5-2B	1.9B	3.5%	95.0%	+91.5pp	50.0%	5.9 GB
Qwen3.5-4B*	4.2B	8.5%	71.0%	+62.5pp	40.0%	11.1 GB
Mistral-Nemo-12B	12.2B	26.5%	57.0%	+30.5pp	31.0%	8.2 GB

The 2B model wins. And the reason is counterintuitive:

🪞 The Format Compliance Paradox

graph TD
    subgraph "LoRA says: output SQL"
        direction LR
        L["LoRA adapter<br/>🔧 Format push"]
    end
    subgraph "Instruct training says: answer the question"
        direction LR
        I["Base model<br/>🎓 Instruct training"]
    end

    L --- SMALL["🟢 0.8B: LoRA wins<br/>86.5% SQL"]
    L --- MED["🏆 2B: Sweet spot<br/>95% SQL"]
    I --- BIG["🔴 12B: Instruct wins<br/>57% SQL, 43% plain answers"]

    style SMALL fill:#1dd1a1,color:#fff
    style MED fill:#5f27cd,color:#fff
    style BIG fill:#ff6b6b,color:#fff

Larger models are too smart to follow the format. The 12B Mistral can actually compute the answer in its head — so it outputs 42 instead of SELECT COUNT(*) FROM .... It answered with plain numbers 43% of the time.

When models DO write SQL, quality is surprisingly similar (~54%) across all sizes. The difference is entirely about format compliance. The 2B is the Goldilocks zone.

📄 Full analysis with error breakdowns: results/grand-comparison.md

*4B required manual vision weight stripping — Qwen3.5 is natively multimodal, and the vision weights crash MLX LoRA training.

🚀 Quickstart

Requirements: Apple Silicon Mac (M1+) with 16 GB+ RAM, Python 3.10+, uv. That's it.

1. Clone & install

git clone https://github.com/sciences44/mlx-lora-finetune.git
cd mlx-lora-finetune

# One command. No venv, no pip, no activate.
uv sync

📦 Don't have uv? Install it first

curl -LsSf https://astral.sh/uv/install.sh | sh

2. Prepare the dataset

uv run python scripts/prepare_data.py

Downloads gretelai/synthetic_text_to_sql from HuggingFace (100K examples across 100 domains), filters to basic SQL and single-join complexity, and outputs 5,000 train / 500 valid / 1,000 test examples.

📄 Data format (text JSONL)

{"text": "CREATE TABLE cities (name VARCHAR, state VARCHAR, population INT);\nQ: Total population in California?\nA: SELECT SUM(population) FROM cities WHERE state = 'California';"}

3. Fine-tune with LoRA

All parameters are configurable via environment variables:

# Try the 2B model (best results in our comparison)
MODEL=mlx-community/Qwen3.5-2B-4bit-OptiQ bash scripts/train.sh

# Low-memory mode for 16 GB Macs
BATCH_SIZE=1 NUM_LAYERS=4 bash scripts/train.sh

# Quick experiment (100 iters, ~2 min)
ITERS=100 bash scripts/train.sh

You'll see output like:

Trainable parameters: 0.479% (3.608M/752.392M)
Iter 1:   Val loss 1.390
Iter 100: Train loss 0.810, Tokens/sec 267.5, Peak mem 8.19 GB
Iter 200: Val loss 0.709
Iter 400: Val loss 0.690
Iter 600: Val loss 0.617, Train loss 0.593

💡 Memory guide: 16 GB → BATCH_SIZE=1 NUM_LAYERS=4. 32 GB+ → BATCH_SIZE=4. 64 GB+ → BATCH_SIZE=4 GRAD_CHECKPOINT="" (disable grad checkpoint for speed).

4. Evaluate

# Compare fine-tuned vs base model
uv run python scripts/evaluate.py --num-samples 200

# Base model only (no adapter needed)
uv run python scripts/evaluate.py --baseline --num-samples 200

# Evaluate a different model
uv run python scripts/evaluate.py --model mlx-community/Qwen3.5-2B-4bit-OptiQ --adapter-path adapters-2b

5. Test your fine-tuned model 🎉

uv run python -m mlx_lm generate \
  --model mlx-community/Qwen3.5-0.8B-4bit-OptiQ \
  --adapter-path adapters \
  --max-tokens 100 \
  --prompt "CREATE TABLE employees (Name VARCHAR, Department VARCHAR, Salary INT, Start_Date DATE);
Q: Who earns more than 100k in Engineering?
A: "

Try it without the adapter to see the difference:

# Same prompt, no adapter — see what the base model does
uv run python -m mlx_lm generate \
  --model mlx-community/Qwen3.5-0.8B-4bit-OptiQ \
  --max-tokens 100 \
  --prompt "CREATE TABLE employees (Name VARCHAR, Department VARCHAR, Salary INT, Start_Date DATE);
Q: Who earns more than 100k in Engineering?
A: "

6. Fuse adapters into a standalone model (optional)

uv run python -m mlx_lm fuse \
  --model mlx-community/Qwen3.5-0.8B-4bit-OptiQ \
  --adapter-path adapters \
  --save-path ./fused-model

🎮 Experiment Ideas

Done with the quickstart? Here's where it gets fun. Every experiment below takes under 15 minutes.

🔄 Try a different model

We tested 4 models — you can reproduce any of them:

Model	HuggingFace ID	RAM Needed	Our Result
Qwen3.5-0.8B	`mlx-community/Qwen3.5-0.8B-4bit-OptiQ`	4 GB	47% semantic acc.
🏆 Qwen3.5-2B	`mlx-community/Qwen3.5-2B-4bit-OptiQ`	6 GB	50% semantic acc.
Mistral-Nemo-12B	`mlx-community/Mistral-Nemo-Instruct-2407-4bit`	8 GB	31% semantic acc.

# Train the 2B model (our best performer)
MODEL=mlx-community/Qwen3.5-2B-4bit-OptiQ ADAPTER_DIR=adapters-2b bash scripts/train.sh

# Evaluate it
uv run python scripts/evaluate.py --model mlx-community/Qwen3.5-2B-4bit-OptiQ --adapter-path adapters-2b

📊 Change the dataset size

Edit scripts/prepare_data.py — line 16-18:

TRAIN_SIZE = 5000   # Try 10000 or 20000 (up to ~80K available)
VALID_SIZE = 500
TEST_SIZE = 1000

Then regenerate: uv run python scripts/prepare_data.py

⚡ Quick hyperparameter experiments

# Higher learning rate (faster learning, risk of instability)
LR=5e-5 ITERS=300 bash scripts/train.sh

# More LoRA layers (more capacity, more memory)
NUM_LAYERS=32 bash scripts/train.sh

# Speed run: 100 iters, see how far you get
ITERS=100 bash scripts/train.sh

🧪 Test the prompt engineering claim yourself

# Base model — outputs garbage
uv run python -m mlx_lm generate --model mlx-community/Qwen3.5-0.8B-4bit-OptiQ --max-tokens 100 \
  --prompt "CREATE TABLE sustainable_urbanism (id INT, city VARCHAR(50), project_type VARCHAR(50)); INSERT INTO sustainable_urbanism (id, city, project_type) VALUES (1, 'Oakland', 'Affordable Housing'), (2, 'Seattle', 'Green Spaces');
Q: How many sustainable urbanism projects are in Oakland?
A: "

# Same prompt + "Only answer in SQL" — still outputs garbage
uv run python -m mlx_lm generate --model mlx-community/Qwen3.5-0.8B-4bit-OptiQ --max-tokens 100 \
  --prompt "CREATE TABLE sustainable_urbanism (id INT, city VARCHAR(50), project_type VARCHAR(50)); INSERT INTO sustainable_urbanism (id, city, project_type) VALUES (1, 'Oakland', 'Affordable Housing'), (2, 'Seattle', 'Green Spaces');
Q: How many sustainable urbanism projects are in Oakland?
A: Note: Only answer in SQL.
"

# Fine-tuned — outputs correct SQL
uv run python -m mlx_lm generate --model mlx-community/Qwen3.5-0.8B-4bit-OptiQ --adapter-path adapters --max-tokens 100 \
  --prompt "CREATE TABLE sustainable_urbanism (id INT, city VARCHAR(50), project_type VARCHAR(50)); INSERT INTO sustainable_urbanism (id, city, project_type) VALUES (1, 'Oakland', 'Affordable Housing'), (2, 'Seattle', 'Green Spaces');
Q: How many sustainable urbanism projects are in Oakland?
A: "

📝 Use your own data

Create data/train.jsonl, data/valid.jsonl, data/test.jsonl with this format:

{"text": "CREATE TABLE orders (id INT, amount DECIMAL);\nQ: List all orders above $500\nA: SELECT * FROM orders WHERE amount > 500"}

Then train: bash scripts/train.sh. Works with any text-to-X task — code generation, translation, summarization.

📊 Training Benchmarks

Real numbers from an M1 64 GB:

Model	Size (4-bit)	Adapter	Trainable Params	Peak RAM	Tokens/sec	Time
Qwen3.5-0.8B	570 MB	14 MB	3.6M (0.48%)	3.9 GB	~475	~15 min
Qwen3.5-2B	1.1 GB	25 MB	6.8M (0.36%)	5.9 GB	~180	~8 min
Qwen3.5-4B	2.4 GB	42 MB	13.6M (0.32%)	11.1 GB	~115	~12 min
Mistral-Nemo-12B	6.9 GB	56 MB	18.0M (0.15%)	8.2 GB	~210	~8 min

📉 Loss converges by ~600 iters — we tested 1,000 iters and saw no improvement in eval metrics. 600 is the sweet spot.

🧬 How LoRA Works (30-second version)

Full fine-tuning:      Update ALL parameters  → huge memory
LoRA:                  Freeze model, train tiny adapters → 0.48% of params
QLoRA:                 Same, but model is 4-bit quantized → fits in 8 GB

┌──────────────────────────────────┐
│         Frozen LLM Weights       │  ← 4-bit quantized (QLoRA)
│         (752M parameters)        │     Untouched during training
├──────────────────────────────────┤
│  ┌──────┐          ┌──────┐     │
│  │LoRA A│ (rank r) │LoRA B│     │  ← Only these train
│  │64×4  │    ×     │4×64  │     │     ~0.48% of parameters
│  └──────┘          └──────┘     │
├──────────────────────────────────┤
│    Output = Frozen + (A × B)    │  ← Merged at inference
└──────────────────────────────────┘

The genius: instead of updating millions of weight values, you decompose the update into two tiny matrices. Same result, fraction of the compute.

📦 Dataset

gretelai/synthetic_text_to_sql — 100K synthetic text-to-SQL examples across 100 domains (healthcare, finance, education, etc.) with multiple complexity levels.

We filter to basic SQL and single join complexity — clean, learnable examples appropriate for a 0.8B parameter model:

Split	Size	Purpose
Train	5,000	LoRA fine-tuning
Valid	500	Loss monitoring during training
Test	1,000	Evaluation (200 used for benchmarks)

⚖️ How We Compare

Text-to-SQL is a well-studied benchmark. Here's how our results stack up:

Project	Model	Params	Data	Hardware	Metric	Score
🍎 This repo (0.8B)	Qwen3.5-0.8B	0.8B	5K	Mac, LoRA	Semantic acc.	47%
🍎 This repo (2B)	Qwen3.5-2B	1.9B	5K	Mac, LoRA	Semantic acc.	50%
Oxen.ai	Qwen3-0.6B	0.6B	5K	A10G, full FT	LLM-as-judge	42%
Oxen.ai	Qwen3-1.7B	1.7B	5K	A10G, full FT	LLM-as-judge	57%
Oxen.ai	GPT-4o	—	zero-shot	API	LLM-as-judge	45%
Prem-1B-SQL	DeepSeek-1.3B	1.3B	122M tok	4× A100, full FT	Exec acc. (Spider)	85%
Databricks	Llama3-8B	8B	Spider	A100, full FT	Exec acc. (Spider)	79.9%

Key takeaways:

✅ We're in the right ballpark. Our 0.8B at 47% is close to Oxen.ai's 0.6B at 42% — both use ~5K examples with similar-sized Qwen models.
🏆 Our 2B at 50% beats GPT-4o zero-shot (45%), which is a compelling result for a 15-minute local fine-tune.
📏 The gap to SOTA is real but explained. Prem-1B-SQL hits 85% on Spider, but uses full fine-tuning (not LoRA), 24× more data, 4× A100 GPUs, and execution-guided decoding. That's a different budget and scope.
📐 Metrics aren't directly comparable. We use semantic accuracy (manual review), others use execution accuracy (run SQL against a database) or LLM-as-judge. Execution accuracy is the gold standard.

The point of this repo isn't to beat SOTA. It's to show that a Mac, 15 minutes, and 5K examples gets you surprisingly far.

⚠️ Limitations & Honest Assessment

This is a learning project, not a production system. Here's what we didn't push:

🔍 What we tested

1 dataset (synthetic text-to-SQL)
1 set of hyperparameters (LR 1e-5, 600 iters, 16 LoRA layers, rank 8)
1 prompt format (text completion, no chat template)
4 models (Qwen3.5 0.8B/2B/4B, Mistral-Nemo 12B)
200 test examples per model (from 1,000 available)

🔧 What could improve results

Lever	Current	Could try	Expected impact
Training data	5K examples	20–50K (from 100K available)	🟢 High
Evaluation	String match + manual review	Execution accuracy (SQLite)	🟢 High
Prompt format	Text completion	Chat template (for 4B+ models)	🟡 Medium
LoRA layers	16 (all models)	32+ for larger models	🟡 Medium
LoRA rank	8	16 or 32	🟡 Medium
Hyperparameters	Single run (LR 1e-5)	Grid search	🟡 Medium
Training duration	600 iters (all models)	LR schedule + more iters for 4B+	🟠 Low-Medium
Fine-tuning method	LoRA (0.48% params)	Full fine-tune	🟢 High (but needs GPU)

✅ What this project does demonstrate

LoRA works on Apple Silicon — from 0% to 86.5% SQL in 15 minutes
Model size has a sweet spot — 2B beats both smaller and larger models
Prompt engineering isn't a substitute — 1.5% vs 86.5% SQL rate
Val loss is misleading — lowest val loss (4B) ranked 3rd in accuracy
Exact match underestimates quality — 24.5% exact match vs 47.0% actual correctness

🗂️ Project Structure

mlx-lora-finetune/
├── 📄 README.md
├── 📦 pyproject.toml           # uv managed dependencies
├── 📁 data/
│   ├── train.jsonl             # 5K training examples
│   ├── valid.jsonl             # 500 validation examples
│   └── test.jsonl              # 1K test examples
├── 📁 scripts/
│   ├── prepare_data.py         # Dataset download + formatting
│   ├── evaluate.py             # Exact match evaluation
│   └── train.sh                # Training wrapper script
├── 📁 adapters/                # LoRA adapter weights (after training)
└── 📁 results/
    ├── eval_results.json       # Evaluation metrics
    └── grand-comparison.md     # Full multi-model analysis

🛠️ Troubleshooting

Problem	Fix
`Insufficient Memory` (Metal OOM)	Reduce `--batch-size` to 1, add `--grad-checkpoint`
Slow training	Close Chrome and other apps (they share unified memory)
High val loss	More data, more iters, or try `--learning-rate 5e-6`
Model outputs numbers instead of SQL	Check data format — each line must be `{"text": "..."}`
`uv: command not found`	`curl -LsSf https://astral.sh/uv/install.sh \| sh`
Chat template errors	Use text format (not completions) — Qwen3.5 has a thinking template
2B+ model swapping to disk	Reduce `--batch-size` to 1 — we hit 97% RAM on M1 64 GB with batch 2

📚 Resources

Resource	Description
MLX Documentation	Apple's ML framework
MLX-LM	Production-ready LLM tools
MLX Community	Pre-converted models on HuggingFace
Qwen3.5 0.8B 4-bit	The primary model used here
gretelai/synthetic_text_to_sql	Training dataset
LoRA Paper	The original research
QLoRA Paper	Quantized adaptation
uv	Modern Python package management