GLM-4.7-Flash: 30B MoE model achieves 59.2% on SWE-bench, runs on 24GB GPUs

curateclick.com

1 point by czmilo 2 months ago · 1 comment


czmilo (OP) 2 months ago

Z.AI's GLM-4.7-Flash is a 30B-parameter MoE model with only 3B active parameters per token, achieving 59.2% on SWE-bench Verified (vs 22% for Qwen3-30B and 34% for GPT-OSS-20B). It runs efficiently on consumer hardware: 24GB GPUs (RTX 3090/4090) or Apple M-series Macs at 60-80+ tokens/second with 4-bit quantization.
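Some back-of-envelope arithmetic (my own estimate, not from the guide) shows why 4-bit quantization makes the 24GB figure plausible:

```python
# Rough VRAM estimate for a 30B-parameter model at 4-bit quantization.
# All figures are illustrative assumptions, not measured numbers.
total_params = 30e9        # 30B total parameters (MoE; only ~3B active per token)
bytes_per_param = 0.5      # 4-bit quantization = half a byte per weight
weights_gb = total_params * bytes_per_param / 1e9
print(f"quantized weights: ~{weights_gb:.0f} GB")  # ~15 GB

# Leaving several GB of headroom for the KV cache and activations,
# the quantized model fits comfortably within a 24 GB RTX 3090/4090.
```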

The guide covers architecture details (MoE design, MLA attention for 200K context), benchmark comparisons, local deployment via vLLM/MLX/Ollama, API pricing ($0.07/$0.40 per 1M tokens), and real-world user feedback. Community reports highlight strong performance in UI generation and tool calling, though reasoning lags behind specialized models.
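At the listed rates ($0.07 input / $0.40 output per 1M tokens), a quick cost sketch (the function name here is my own, not from any SDK):

```python
def glm_flash_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate API cost at the rates quoted above:
    $0.07 per 1M input tokens, $0.40 per 1M output tokens."""
    return round(input_tokens / 1e6 * 0.07 + output_tokens / 1e6 * 0.40, 6)

# Example: a coding session with 500K input and 100K output tokens
print(glm_flash_cost_usd(500_000, 100_000))  # 0.075 -> about 8 cents
```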

Open weights are available on Hugging Face; use the free API tier or deploy completely offline.
