AI has a hype problem. Every few months, a new model drops, and people start asking if it’s the next big thing. OpenAI, Anthropic, Google, and now DeepSeek — all racing to one-up each other with bigger benchmarks and flashier announcements.
DeepSeek is the latest name in the game, and don’t get me wrong — it’s impressive. But let’s be clear: DeepSeek isn’t revolutionary — it’s optimized.
And that distinction matters.
DeepSeek’s Not a Breakthrough — It’s a Remix
DeepSeek didn’t come out of nowhere with novel AI research or groundbreaking architectures. It stands on the shoulders of Meta’s Llama models and heavily borrows from Meta’s optimization strategies for training large-scale AI.
Here’s what DeepSeek did differently:
Mixture of Experts (MoE) Architecture
- DeepSeek’s flagship model, DeepSeek-V3, has a massive 671 billion parameters, but it only activates 37 billion per token.
- This selective activation allows it to scale efficiently while keeping costs low.
- It’s a smart trick, but MoE itself isn’t new: Google’s Switch Transformer, Mistral’s Mixtral, and Meta’s own MoE research have explored it for years (a minimal routing sketch follows below).
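To make the routing idea concrete, here’s a minimal Mixture-of-Experts layer in PyTorch. The numbers (8 experts, top-2 routing, 64-dim tokens) are purely illustrative, not DeepSeek-V3’s actual configuration, which uses far more and finer-grained experts.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    """Toy MoE layer: every token is routed to only top_k of n_experts."""

    def __init__(self, d_model=64, d_hidden=256, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(d_model, n_experts)  # the router
        self.experts = nn.ModuleList([
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(),
                          nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        ])

    def forward(self, x):                               # x: (tokens, d_model)
        scores = self.gate(x)                           # (tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)  # keep only k experts/token
        weights = F.softmax(weights, dim=-1)            # renormalize their weights
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                # tokens routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot:slot + 1] * expert(x[mask])
        return out

x = torch.randn(10, 64)      # 10 tokens
print(TinyMoE()(x).shape)    # torch.Size([10, 64])
```

Total parameters grow with the number of experts, but per-token compute grows only with top_k. That is the same lever that lets DeepSeek-V3 carry 671 billion parameters while activating 37 billion.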
Pipeline Parallelism Done Right
Training massive AI models across thousands of GPUs is hard because GPUs can spend a large share of their time waiting on each other instead of computing (those idle gaps are called pipeline bubbles).
- DeepSeek tackled this with a custom DualPipe algorithm, ensuring GPUs work in parallel without wasting cycles on waiting.
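For a feel of how costly that bubble is, here’s the standard back-of-envelope for a plain GPipe-style schedule (this is the textbook formula, not DualPipe itself): with p pipeline stages and m micro-batches, roughly (p - 1) of every (m + p - 1) time slots sit idle.

```python
def bubble_fraction(stages: int, microbatches: int) -> float:
    """Idle-time fraction of a naive GPipe-style pipeline schedule."""
    return (stages - 1) / (microbatches + stages - 1)

# With 16 pipeline stages, more micro-batches shrink the bubble:
for m in (16, 64, 256):
    print(f"{m:>4} micro-batches -> {bubble_fraction(16, m):.0%} idle")
# prints roughly 48%, 19%, and 6% idle, respectively
```

DualPipe attacks the same waste from another direction, overlapping computation with communication across the forward and backward passes so stages stay busy even without huge micro-batch counts.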
Memory Efficiency Hacks
- DeepSeek trained on NVIDIA H800 GPUs, an export-compliant variant of the H100 with reduced chip-to-chip interconnect bandwidth.
- Like the H100, each card has 80GB of memory, which is still a tight fit for a model this large.
- Instead of brute force, DeepSeek optimized memory with low-precision FP8 training, recomputing activations during the backward pass instead of storing them, and offloading selected work to CPU memory (sketched below).
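As an illustration of the recompute idea (not DeepSeek’s exact recipe), PyTorch’s built-in gradient checkpointing discards intermediate activations on the forward pass and recomputes them during backward, trading extra compute for memory. The bfloat16 autocast below stands in for low-precision training in general; true FP8 needs specific hardware and library support beyond this sketch, and the layer sizes are arbitrary.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
x = torch.randn(32, 1024, requires_grad=True)

# Low precision: run the forward pass in bfloat16 ("cpu" here so the snippet
# runs anywhere; swap in "cuda" on a GPU).
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    # Gradient checkpointing: activations inside `block` are not kept around;
    # they are recomputed during the backward pass, cutting peak memory.
    y = checkpoint(block, x, use_reentrant=False)

y.float().sum().backward()
print(x.grad.shape)   # torch.Size([32, 1024])
```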
InfiniBand & NVLink Optimization
At scale, communication between GPUs is just as important as raw computation.
- DeepSeek developed custom all-to-all communication kernels to squeeze out every last bit of bandwidth from NVLink and InfiniBand, reducing delays in training.
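DeepSeek wrote its own kernels, but the shape of the problem shows up even with stock collectives. In expert-parallel MoE, every rank has to ship each token to whichever rank hosts its chosen expert, and that exchange is an all-to-all. A minimal sketch with PyTorch’s built-in NCCL path (launch with torchrun --nproc_per_node=<gpus>, one GPU per process; the tensor contents are arbitrary):

```python
import torch
import torch.distributed as dist

def main():
    # NCCL is the backend that actually rides NVLink / InfiniBand.
    dist.init_process_group(backend="nccl")
    rank, world = dist.get_rank(), dist.get_world_size()
    torch.cuda.set_device(rank)            # assumes one node, one GPU per rank
    device = torch.device("cuda", rank)

    # Each rank holds one equal-sized chunk of data destined for every rank.
    send = torch.arange(world * 4, dtype=torch.float32, device=device) + rank * 100
    recv = torch.empty_like(send)

    # The all-to-all exchange: chunk i of `send` goes to rank i,
    # and chunk i of `recv` arrives from rank i.
    dist.all_to_all_single(recv, send)
    print(f"rank {rank} received {recv.tolist()}")

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

At DeepSeek’s scale, the generic call above leaves bandwidth and compute-communication overlap on the table, which is why they hand-tuned their own all-to-all kernels instead.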
Training on a Budget
DeepSeek reportedly spent about $5.6 million on GPU compute for DeepSeek-V3’s final training run, compared to the $100M+ training budgets attributed to OpenAI and Google.
This cost efficiency is real — but it’s a product of smart optimizations, not magic.
The Real Takeaway: AI Is Getting Smaller, Not Bigger
DeepSeek’s biggest win isn’t its model size — it’s proof that efficiency beats brute force. The AI world used to be obsessed with bigger is better, but we’re seeing a shift:
- Smaller, optimized models are competing with the giants.
- Smarter architectures (like MoE) are making models more cost-effective.
- The real battle isn’t just size — it’s performance per dollar.
Enter Microsoft’s Phi 4: The Next Step in AI Efficiency
If DeepSeek proves we don’t need trillion-parameter giants, then Microsoft’s Phi 4 takes it to the next level.
- Phi 4 is roughly 5x cheaper to run than DeepSeek: on OpenRouter at the time of this writing, DeepSeek is $0.69/M output tokens vs Microsoft Phi 4 at $0.14/M (quick math after this list).
- It generates answers just as fast (or faster).
- At roughly 14 billion parameters, it’s tiny in comparison but still holds its own against much larger models.
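Back-of-envelope with the OpenRouter prices quoted above (prices change often, and the 50M-tokens-per-day workload below is made up purely for illustration):

```python
deepseek_price = 0.69 / 1_000_000   # $ per output token, at time of writing
phi4_price     = 0.14 / 1_000_000

daily_tokens = 50_000_000           # hypothetical workload: 50M output tokens/day
print(f"DeepSeek: ${deepseek_price * daily_tokens:,.2f}/day")  # $34.50/day
print(f"Phi 4:    ${phi4_price * daily_tokens:,.2f}/day")      # $7.00/day
print(f"ratio:    {deepseek_price / phi4_price:.1f}x")         # 4.9x
```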
Here’s the kicker: Phi 4 does all of this while being even smaller and cheaper. If DeepSeek’s goal was to prove that lean models can compete, then Phi 4 takes that logic to its natural conclusion.
The Future of AI: Lean, Not Large
We’re entering a new phase of AI development:
MoE and Parallelism Will Define the Next Generation
- DeepSeek and Phi 4 prove you don’t need trillion-parameter models.
- MoE is becoming the dominant way to make models powerful yet cost-efficient.
Compute Optimization Is More Important Than Model Size
- Companies that optimize their compute spend will dominate.
- Training massive models without waste is the real innovation.
The AI Arms Race Is About Cost, Not Just Intelligence
- Everyone talks about AI capabilities, but real-world AI is about who can afford to run it at scale.
- DeepSeek is cheaper than OpenAI, but Phi 4 is cheaper than both — and just as good.
DeepSeek vs Phi 4
DeepSeek’s Accidental Lesson
DeepSeek didn’t mean to prove that smaller models are the future — but that’s exactly what it did.
Yes, DeepSeek is impressive. But it’s not a revolution. It’s part of a larger trend where AI is becoming more efficient, not just more powerful.
And if you’re still chasing the biggest models, you’re missing the point.
The real winners will be the ones who do the most with the least.
And right now, Phi 4 is leading that game.