Show HN: I fine-tuned Qwen 3.5 (0.8B–4B) on a Mac for text-to-SQL – 2B beats 12B
github.com

I wanted to test the new Qwen 3.5 Small models (released March 2) on a structured-output task. I fine-tuned the 0.8B, 2B, and 4B variants on text-to-SQL using LoRA on a Mac (64 GB, MLX), with Mistral-Nemo 12B as a baseline.
The 2B beat the 12B by 19 percentage points (50% vs 31% semantic accuracy). My best guess is that the larger model is "too smart": it computes the answer mentally and outputs "42" instead of writing SQL. 81% of the 12B's errors were plain numbers rather than queries.
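To make "semantic accuracy" concrete: instead of string-matching the generated SQL, you execute both the predicted and the gold query against the same database and compare the result sets. This is a minimal sketch of that idea using SQLite with a made-up schema (the actual schemas and harness in the repo differ); it also shows why a model that emits a bare number scores zero, since the "query" fails to execute.

```python
import sqlite3

def semantically_equal(db, pred_sql, gold_sql):
    """Score a prediction by the rows it returns, ignoring row order."""
    cur = db.cursor()
    try:
        pred_rows = cur.execute(pred_sql).fetchall()
    except sqlite3.Error:
        # A bare answer like "42" is not valid SQL, so it fails here --
        # the failure mode described above for the 12B baseline.
        return False
    gold_rows = cur.execute(gold_sql).fetchall()
    return sorted(pred_rows) == sorted(gold_rows)

# Tiny hypothetical schema for illustration only.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (id INTEGER, age INTEGER)")
db.executemany("INSERT INTO users VALUES (?, ?)", [(1, 30), (2, 42)])

print(semantically_equal(db, "SELECT age FROM users WHERE id = 2",
                         "SELECT age FROM users WHERE id = 2"))  # True
print(semantically_equal(db, "42",
                         "SELECT age FROM users WHERE id = 2"))  # False
```

Comparing sorted result sets is a deliberately loose check; stricter harnesses also compare column names or require order when the gold query has an ORDER BY.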
Everything runs locally with zero cloud compute. The repo has the scripts, data, and full results needed to reproduce it.