Run DeepSeek from fast NVMe drives
Testing extreme NVMe offload (4 x Gen5 x4) for DeepSeek R1 (github.com)

Because PCIe 5.0 x16 (~60 GB/s) is close to dual-channel DDR5 bandwidth, this is the cheapest method to run huge models. Code: https://github.com/BlinkDL/fast.c
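To get a feel for the numbers: a rough upper bound on decode speed is aggregate read bandwidth divided by the bytes of active weights streamed per token. A back-of-envelope sketch, where the per-drive throughput, active parameter count, and FP8 weight size are all assumptions rather than measurements:

    # Back-of-envelope: tokens/sec upper bound when streaming MoE weights
    # from NVMe. All numbers below are assumptions, not measurements.

    ssd_read_gbps = 4 * 14.0    # assumed: 4 x Gen5 x4 drives, ~14 GB/s each
    active_params = 37e9        # assumed: DeepSeek R1 activates ~37B params/token
    bytes_per_param = 1.0       # assumed: FP8 weights

    bytes_per_token = active_params * bytes_per_param
    tokens_per_sec = ssd_read_gbps * 1e9 / bytes_per_token
    print(f"~{tokens_per_sec:.1f} tokens/sec upper bound")  # ~1.5

Real throughput also depends on how much of the model stays cached in RAM and how well reads overlap with compute.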
Have you run any benchmarks yet? I'm interested in how many tokens/sec you can reach. Though in the end it should be more efficient to run the model on distributed server clusters.
DeepSeek's open-source inference code, while correct, may not be fully efficient. For example, MLA needs the right matrix-multiplication associativity order to be efficient.
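Concretely: with MLA the cached KV is a low-rank latent, so absorbing the key up-projection into the query side lets you score in the small latent space instead of decompressing every cached key first. A toy NumPy sketch of the idea; the shapes, names, and dimensions are illustrative, not DeepSeek's actual code:

    # Sketch: why multiplication order matters for MLA-style attention.
    import numpy as np

    d, r, T = 512, 64, 4096    # model dim, latent dim, cached tokens (toy sizes)
    rng = np.random.default_rng(0)
    Wq = rng.standard_normal((d, d))  # query projection (illustrative)
    Wk = rng.standard_normal((d, r))  # key up-projection from the latent
    h  = rng.standard_normal(d)       # current token's hidden state
    C  = rng.standard_normal((T, r))  # cached compressed KV latents

    # Naive order: decompress all T cached keys back to width d, then score.
    # Cost ~ T*r*d multiply-adds for the decompression alone.
    scores_naive = (C @ Wk.T) @ (Wq @ h)

    # Better order: absorb Wk into the query once, score directly in the
    # latent space. Cost ~ d*d + d*r for the query, then only T*r for scores.
    scores_fast = C @ ((Wq @ h) @ Wk)

    print(np.allclose(scores_naive, scores_fast))  # True: same math, fewer FLOPs

Since the decompression cost grows with the number of cached tokens while the absorbed-query cost does not, the gap widens as the context gets longer.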