Karatsuba Matrix Multiplication and Its Efficient Hardware Implementations
arxiv.org

Random anecdote: 15 years ago I was starting an online community for computer science students and needed to come up with a name. I made a survey for the community members to vote on options, with some boring ones like "CS Hub" and the like; one of the options was "Karatsuba": I had just learned about the Karatsuba multiplication algorithm and somehow got the idea into my head that his last name sounds cool and unique, and that my online community (which later grew into an ed-tech startup) should be named after this Russian mathematician.
Anatoly Karatsuba himself was already dead (he died in 2008). I emailed his daughter Yekaterina (who is also a mathematician, by the way), asking for permission to use their last name. She agreed, but asked me to be extra careful about potential implied affiliations, i.e. to make clear that the content has nothing to do with her father's research.
She also expressed an opinion that in the field of mathematics and computation, at least in Russia, actual researchers are rarely involved in writing textbooks, and the textbooks used in universities often contain conflicting or even wrong information about the authorship of research.
In the end a different name was chosen for the project.
Would this work have the potential to speed up encoding/decoding of the PAR2 format[0]? This format is widely used to protect datasets against bitrot and small losses, but is held back because of the significant compute overhead when dealing with large datasets.
Unless I am badly misunderstanding the paper, this is about doing matrix multiplication on matrices whose entries are large integers. So far as I can tell, PAR2 doesn't do any large-number arithmetic at all, so I don't think this work will have any relevance for implementations of PAR2.
[EDITED to add:] Reading the paper a bit more carefully, the above isn't exactly right. It is about doing matrix multiplication on matrices whose entries are "large", but they're interested in using the idea for hardware design, where "large" may mean not-very-large: building (say) 16-bit multiplies out of 8-bit multipliers.
But for software implementations (and I doubt anyone will be making special-case hardware for PAR2) small multiplications are not generally faster than large ones. Maybe you might be able to do more of them in parallel using SIMD, but I bet the extra additions and subtractions and general overhead would outweigh any gains in that setting.
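For concreteness, the decomposition in question is just scalar Karatsuba applied to the two halves of each operand; here is a minimal Python sketch (my own illustration, not code from the paper):

    def karatsuba16(x, y):
        """Multiply two 16-bit unsigned ints using three byte-sized multiplies."""
        xh, xl = x >> 8, x & 0xFF            # split operands into high/low bytes
        yh, yl = y >> 8, y & 0xFF
        hh = xh * yh                         # multiply #1 (8 x 8 bits)
        ll = xl * yl                         # multiply #2 (8 x 8 bits)
        mid = (xh + xl) * (yh + yl)          # multiply #3 (9 x 9 bits, since the sums can carry)
        cross = mid - hh - ll                # equals xh*yl + xl*yh
        return (hh << 16) + (cross << 8) + ll

    assert karatsuba16(0xBEEF, 0xCAFE) == 0xBEEF * 0xCAFE

The multiply count drops from four to three, but note the extra adds/subtracts and the slightly wider middle multiply, which is exactly the overhead being weighed above.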
They're proposing "new hardware architectures" to take advantage of this idea. Anybody with a background in GPU floating point math comment on how realistic this is?
First author here. The hardware architectures are realistic - we developed and evaluated real example hardware implementations, validated them on FPGA, and they achieved state-of-the-art ResNet performance in a deep learning accelerator system compared to prior accelerators evaluated on similar FPGAs. See the associated accelerator system source code here:
https://github.com/trevorpogue/algebraic-nnhw
The hardware architectures focused on in the paper are systolic array designs, an efficient type of hardware design for matrix multiplication (e.g., the Google TPU uses this), as opposed to more SIMD-like vector architectures such as GPUs. It may be possible to extend the proposed KMM algorithm to other types of hardware architectures in future work. Regarding floating point: this work applies to integer matrix multiplication acceleration; it may be possible to extend the concept to floating-point data types in future work as well.
> systolic array designs, an efficient type of hardware design for matrix multiplication (e.g., the Google TPU uses this), as opposed to more SIMD-like vector architectures like GPUs
this is wrong. TPUv4 has tensor cores just like NVIDIA has tensor cores just like AMD has tensor cores. no one uses a systolic array because bandwidth/connectivity is much scarcer than compute. the only people that keep talking about them are academics that don't actually fab/sell chips.
https://cloud.google.com/tpu/docs/v4
https://www.nvidia.com/en-us/data-center/tensor-cores/
https://rocm.docs.amd.com/projects/rocWMMA/en/latest/what-is...
ninja edit: before you gotcha me with "a tensor core is a systolic array!!!" - most tensor cores are actually outer-product engines, not riffle-shuffle engines (or whatever you want to call the topology corresponding to a systolic array).
https://cloud.google.com/tpu/docs/system-architecture-tpu-vm...
>The primary task for TPUs is matrix processing, which is a combination of multiply and accumulate operations. TPUs contain thousands of multiply-accumulators that are directly connected to each other to form a large physical matrix. This is called a systolic array architecture. Cloud TPU v3, contain two systolic arrays of 128 x 128 ALUs, on a single processor.
I don't see any contradiction between your claim that TPU v3 uses systolic arrays and the parent post's claim that TPU v4 does not.
The TPU obviously uses a systolic array: https://jax-ml.github.io/scaling-book/tpus/
Fair enough - my understanding was they moved away from systolic arrays. I stand corrected. I will also say it is well-known they're basically impossible to program/build a compiler for.
This is why Google has 500 people working on the TPU compiler team.
I have used Karatsuba's and Winograd's inner product [0] algorithms in my work for wide multi-SIMD integer multipliers and matrix multiplication HW for DSPs. The latter cuts the multiplication count roughly in half - about n^3/2 instead of n^3. I think the paper talks about its derivative - FFIP.
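For reference, a minimal sketch of Winograd's inner-product identity (my own reconstruction of the classic trick, not the paper's FFIP variant):

    def winograd_dot(x, y):
        """Dot product of even-length vectors using n/2 'coupled' multiplies."""
        n = len(x)
        assert n % 2 == 0
        # These two sums depend on only one operand each; in a matrix multiply
        # they are precomputed once per row of A / column of B and reused,
        # which is where the ~n^3/2 multiplication count comes from.
        px = sum(x[2*j] * x[2*j + 1] for j in range(n // 2))
        py = sum(y[2*j] * y[2*j + 1] for j in range(n // 2))
        coupled = sum((x[2*j] + y[2*j + 1]) * (x[2*j + 1] + y[2*j])
                      for j in range(n // 2))
        return coupled - px - py

    x, y = [1, 2, 3, 4], [5, 6, 7, 8]
    assert winograd_dot(x, y) == sum(a * b for a, b in zip(x, y))  # both give 70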
The issue is memory bandwidth. These techniques indeed help you save multiplier area; however, the performance is still bandwidth-limited - you'd need to be able to feed more data per cycle to increase performance.
One thing the paper doesn't talk about is energy. For DNNs, at the network level the energy consumed by integer MACs is not that high. Improving data locality would have a much bigger impact on energy than optimizing the MACs.
On an FPGA, integer adders are much more abundant than integer multipliers, so this algorithm definitely helps get more utilization out of the FPGA. Once the multiplier is small enough, say 3 bits by 3 bits, it can fit into several LUT6s.
The paper is about integer multiplication, not float
Why would that matter? I understood the point was to speed up matrix multiplication by doing the adds and multiplies in a different order. Shouldn't matter whether the datatype is int, float, complex, whatever.
One can use this technique to optimize the multiplier inside an FP FMA unit. However, it cannot be used to multiply two floating-point numbers directly, as FP arithmetic is not associative.
I'm pretty sure a lot of people doing matrix multiplications are ok with non-associative floating point arithmetic?
The basic idea of reducing 4 multiplications to 3 multiplications
(ax + b)(cx + d) = acx^2 + [(a + b)(c + d) - ac - bd]x + bd
holds pretty generally; there isn't any new math or algorithm here that I can see. Their own complexity analysis (eqns. 7 and 8) shows this performs about the same as using Karatsuba multiplication on the entries of the matrices (instead of on the matrices themselves).
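To see why it generalizes: a, b, c, and d can just as well be matrices, with x a power of two weighting the high halves of the integer entries, which appears to be essentially what the paper's KMM does. A quick NumPy sketch of that reading (my own illustration, not the paper's code):

    import numpy as np

    def kmm(A, B, w=8):
        """Multiply integer matrices holding 2w-bit entries via 3 matrix multiplies on w-bit halves."""
        Ah, Al = A >> w, A & ((1 << w) - 1)   # high/low halves of A's entries
        Bh, Bl = B >> w, B & ((1 << w) - 1)
        HH = Ah @ Bh                          # matrix multiply #1 (w-bit operands)
        LL = Al @ Bl                          # matrix multiply #2 (w-bit operands)
        MID = (Ah + Al) @ (Bh + Bl)           # matrix multiply #3 ((w+1)-bit operands)
        return (HH << (2 * w)) + ((MID - HH - LL) << w) + LL

    rng = np.random.default_rng(0)
    A = rng.integers(0, 1 << 16, size=(4, 4), dtype=np.int64)
    B = rng.integers(0, 1 << 16, size=(4, 4), dtype=np.int64)
    assert np.array_equal(kmm(A, B), A @ B)

The scalar multiplication count is indeed the same as applying Karatsuba to each entry; the difference the authors seem to be after is at the hardware level, where the three narrower products can reuse the same small multipliers.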
> this performs about the same as using Karatsuba multiplication on the entries of the matrices (instead of on the matrices themselves).
If it offers improvements in both, why wouldn't one do it in both?
the government uses specialized hardware that isn't sold on the market, right? would something like this be useful in developing said hardware?
FPGAs are widely available. They are not the cheapest hardware, but they give you something similar to custom silicon, quickly and in quantities starting from one.
Modern CPU architectures, for example the Versal architecture from AMD, have seamless integration with compute accelerators, namely FPGA, GPU, DSP, etc. [1].
They even have built-in ADCs/DACs for communication, sensing, and data acquisition (DAQ) [2].
On top of that, they have native support for dataflow libraries and frameworks, including linear algebra, which are better suited for data-intensive workloads in which AI is only one of the intended applications [3].
I was recently asked by my colleague why we still need CPUs, since GPUs are very dominant now and it seems they're all we need. I just pointed out that the GPU is only one of the many accelerators available to the CPU, but a very good one that happened to be very useful for, and biased towards, fancy killer applications, namely graphics (games), blockchain (Bitcoin), and AI (LLMs/ChatGPT).
[1] AMD Adaptive SoC Platform: Versal Architecture [pdf]:
https://www.amd.com/content/dam/amd/en/documents/products/ad...
[2] AMD adds RF-sampling data converters to Versal adaptive SoCs (2024):
https://news.ycombinator.com/item?id=42899304
[3] AIEBLAS: Open-Source Expandable BLAS Implementation for AMD/Xilinx Versal Device: