Next Grok model training with 10T parameter model

twitter.com

3 points by ramshanker 25 days ago · 4 comments

lifecodes 25 days ago

I guess we are reaching the point where “10T parameters” sounds more like a marketing number than a meaningful metric.

Between MoE, aggressive quantization, and synthetic data pipelines, it’s getting harder to tell whether bigger models are actually better, or just more expensive to train.

It would be more interesting to see capability per dollar or per watt, not parameter count.
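To make the suggestion concrete, here's a minimal sketch of what a "capability per dollar" metric could look like. All numbers, names, and the benchmark/cost figures below are entirely made up for illustration; nothing here reflects real model data.

```python
def capability_per_million_usd(score: float, training_cost_usd: float) -> float:
    """Benchmark points earned per million dollars of training spend.

    Both inputs are hypothetical: 'score' is any benchmark aggregate,
    'training_cost_usd' is the estimated total training cost.
    """
    return score / (training_cost_usd / 1_000_000)

# Made-up numbers purely for illustration:
big_dense = capability_per_million_usd(88.0, 2_000_000_000)  # huge dense run
small_moe = capability_per_million_usd(85.0, 150_000_000)    # smaller MoE run

# On these invented figures, the smaller model wins by a wide margin
# even though its raw score is slightly lower.
print(f"big dense: {big_dense:.4f} pts/$1M, small MoE: {small_moe:.4f} pts/$1M")
```

The same shape of metric works per watt by swapping the cost denominator for energy consumed.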

bfeynman 25 days ago

Aren't the leading labs currently chasing not pretraining and massive parameter counts, but enriched, deep fine-tuning and post-training for agentic tasks/coding? MoE combined with new post-training paradigms lets smaller models perform quite well, and is much more pragmatic to scale at inference time. Given that, this choice seems super odd: the frontier labs stay neck and neck, and I don't even see Grok showing up in benchmarks because of how poorly it performs.

ramshankerOP 25 days ago

This is the largest publicly posted model size since top AI labs started treating parameter counts as a trade secret. It should also guide the next generation of inference ASICs.

carolien 24 days ago

Sounds more like a marketing number.
