mezark
- Karma: 27
- Created: 3 years ago
Recent Submissions
- 1. MoE inference optimizations: 15% lower expert load by request reordering (blog.doubleword.ai)
- 2. Tensor Network Attention (mainlymatmul.com)
- 3. Redundant Information in LLM Weights (fergusfinn.com)
- 4. Tans: Precomputing RANS (fergusfinn.com)
- 5. Also-RANS: Asymmetric Numeral Systems for Entropy Coding (fergusfinn.com)
- 6. 70x faster cold(ish) starts for SGLang (fergusfinn.com)
- 7. QueueSpec – drafting speculation tokens while a request queues (blog.doubleword.ai)
- 8. ZeroDP: Just-in-Time Weight Offloading over NVLink for Data Parallelism (mainlymatmul.com)
- 9. Parallel Primitives for Multi-Agent Workflows (fergusfinn.com)
- 10. New fastest AI Model Gateway – 450x less overhead than LiteLLM (github.com)