Unweight: We compressed an LLM 22% without sacrificing quality

blog.cloudflare.com

5 points by subset 2 months ago · 2 comments

ttd 2 months ago

I love these optimization tales. Memory throughput bottlenecks (extremely common, perhaps more so than they seem) are my favorite to tackle: there are frequently some juicy optimizations to be had there.

Do model weights have any spatial locality that can be exploited? If so, there are some more general pre-compression techniques that might be interesting to try, e.g. bitshuffle is one I've worked with (https://github.com/kiyo-masui/bitshuffle).
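For anyone who hasn't seen the idea, here is a rough pure-Python sketch of a bit-level shuffle. It is not the bitshuffle library's API or implementation (that one is C with SIMD and works block by block); the names `bitshuffle_sketch` and `bitunshuffle_sketch` are made up for illustration, and the demo data is synthetic small integers, not real model weights. The transform regroups bit 0 of every element, then bit 1, and so on, so bit positions that rarely change become long uniform runs for the compressor that follows.

```python
import struct
import zlib


def bitshuffle_sketch(data: bytes, itemsize: int) -> bytes:
    """Transpose an (elements x bits) matrix into bit planes (illustrative only)."""
    nbits = itemsize * 8
    if len(data) % nbits != 0:
        raise ValueError("need a whole number of elements, in multiples of 8")
    n = len(data) // itemsize
    elems = [
        int.from_bytes(data[i * itemsize:(i + 1) * itemsize], "little")
        for i in range(n)
    ]
    out = bytearray()
    for b in range(nbits):
        # Pack bit b of element j into bit j of this plane.
        plane = 0
        for j, e in enumerate(elems):
            plane |= ((e >> b) & 1) << j
        out += plane.to_bytes(n // 8, "little")
    return bytes(out)


def bitunshuffle_sketch(data: bytes, itemsize: int) -> bytes:
    """Inverse of bitshuffle_sketch: rebuild each element from the bit planes."""
    nbits = itemsize * 8
    if len(data) % nbits != 0:
        raise ValueError("input length does not match itemsize")
    plane_bytes = len(data) // nbits
    n = plane_bytes * 8
    planes = [
        int.from_bytes(data[b * plane_bytes:(b + 1) * plane_bytes], "little")
        for b in range(nbits)
    ]
    out = bytearray()
    for j in range(n):
        e = 0
        for b in range(nbits):
            e |= ((planes[b] >> j) & 1) << b
        out += e.to_bytes(itemsize, "little")
    return bytes(out)


if __name__ == "__main__":
    # Synthetic stand-in data: small 32-bit integers, so the high bits are all zero.
    raw = struct.pack("<4096I", *[(i * 37) % 1000 for i in range(4096)])
    shuffled = bitshuffle_sketch(raw, 4)
    assert bitunshuffle_sketch(shuffled, 4) == raw
    print("zlib on raw:     ", len(zlib.compress(raw)))
    print("zlib on shuffled:", len(zlib.compress(shuffled)))
```

Whether this helps on real weights depends on how much structure the individual bit positions actually have, so treat the demo numbers as a property of the toy data only.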

Another fun fact: in some scenarios (it depends a lot on CPU and memory characteristics), gzip+memcpy+gunzip can be faster end-to-end than just memcpy. I forget where I first heard this, but my familiarity comes from the Blosc compression library.
