Meta MTIA v2 – Meta Training and Inference Accelerator

ai.meta.com

189 points by _yo2u 2 years ago · 62 comments

jsheard 2 years ago

I like the interactive 3D widget showing off the chip. Yep, that sure is a metal rectangle.

  • whilenot-dev 2 years ago

    Really annoys me that the loading animation of these before/after images doesn't finish on Firefox and that it won't let me drag the separator knob. ...no "Under the hood" for me.

  • TulliusCicero 2 years ago

    Exactly what I was thinking. Like showing off a model of a blank DVD.

  • huevosabio 2 years ago

    Hahaha, I thought the same! I thought maybe someone with hardware experience could make sense of this?

modeless 2 years ago

Intel Gaudi 3 has more interconnect bandwidth than this has memory bandwidth. By a lot. I guess they can't be fairly compared without knowing the TCO for each. I know in the past Google's TPU per-chip specs lagged Nvidia but the much lower TCO made them a slam dunk for Google's inference workloads. But this seems pretty far behind the state of the art. No FP8 either.

  • leetharris 2 years ago

    They are different architectures optimized for different things.

    From the Meta post: "This chip’s architecture is fundamentally focused on providing the right balance of compute, memory bandwidth, and memory capacity for serving ranking and recommendation models."

    Optimizing for ranking/recommendation models is very different from general purpose training/inference.

    • janalsncm 2 years ago

      Translation: you don’t need to serve 96-layer transformers for ranking and recommendation. You’re probably using a neural net with around 10-20 million parameters. But it needs to be fast and highly parallelizable, and perhaps perform well in lower precisions like fp16. And it would be great to have a very large vector LUT on the same chip.
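
      For a rough sense of scale, here is a minimal PyTorch sketch of the kind of model being described: one large embedding table (the "vector LUT") feeding a tiny MLP. The sizes and names are illustrative assumptions, not Meta's actual architecture.

          import torch
          import torch.nn as nn

          class TinyRanker(nn.Module):
              """Hypothetical CTR-style ranker; sizes are illustrative only."""
              def __init__(self, num_ids=1_000_000, emb_dim=16, num_slots=32):
                  super().__init__()
                  self.emb = nn.Embedding(num_ids, emb_dim)   # ~16M params live here
                  self.mlp = nn.Sequential(                   # the dense part is tiny
                      nn.Linear(num_slots * emb_dim, 256), nn.ReLU(),
                      nn.Linear(256, 64), nn.ReLU(),
                      nn.Linear(64, 1),
                  )

              def forward(self, ids):                         # ids: (batch, num_slots)
                  e = self.emb(ids).flatten(1)                # lookups, concatenated
                  return torch.sigmoid(self.mlp(e))

          model = TinyRanker()
          print(sum(p.numel() for p in model.parameters()))        # ~16.1M parameters
          scores = model(torch.randint(0, 1_000_000, (4096, 32)))  # big batch, small dense compute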

      • samspenc 2 years ago

        Is there a better way to compare performance across these high-end chips? The only comparable numbers I was able to find were the TFLOPS.

        Meta seems to be reporting these numbers for this v2 chip:

            708 TFLOPS/s (INT8) (sparsity)
            354 TFLOPS/s (INT8)
        
        And I see Nvidia reporting these numbers for its latest Blackwell chips https://www.anandtech.com/show/21310/nvidia-blackwell-archit...

            4500 T(FL)OPS INT8/FP8 Tensor 
        
        Am I understanding correctly that Nvidia's upcoming Blackwell chips are 5-10x faster than this one Meta just announced?

        • ipsum2 2 years ago

          To a rough approximation, yes. The Blackwell chip is also ~10x larger in surface area than MTIA, so the costs are proportional.
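
          For what it's worth, the ratios implied by the numbers quoted above work out as follows, assuming the 4500 figure is directly comparable to Meta's INT8 numbers:

              blackwell_int8 = 4500   # TOPS, the Anandtech figure quoted above
              mtia_v2_dense  = 354    # TOPS, INT8
              mtia_v2_sparse = 708    # TOPS, INT8 with sparsity

              print(blackwell_int8 / mtia_v2_dense)   # ~12.7x
              print(blackwell_int8 / mtia_v2_sparse)  # ~6.4x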

    • modeless 2 years ago

      Yeah, it may fit their current workload perfectly, but it doesn't seem very future proof with the limited bandwidth. Given how fast ML is evolving these days I question if it makes sense to design and deploy a chip like this. I guess they do have a very large workload that will benefit immediately.

      • tony_cannistra 2 years ago

        Don't mean to single you out at all, but I find this comment to be a great example of how the "ML Hype" is perceived by a certain segment of folks in our industry.

        The development of this chip shows that it doesn't (and shouldn't!) matter to the ML teams at Meta how 'fast ML is evolving.'

        Indeed what it demonstrates is that a huge, global, trillion-dollar business has operationalized an existing ML technology to the extent that they can invest into, and deploy, customized hardware for solving a business problem.

        How ML "evolves" is irrelevant. They have a system which solves their problem, and they're investing in it.

        • airstrike 2 years ago

          Not to mention the capabilities they developed by actually creating this and what they'll be able to do next thanks to this experience.

          You've gotta learn to walk before you can run

        • janalsncm 2 years ago

          In their defense, it’s because the article is (understandably) sparse on details about what makes the requirements of their ranking models different from image classification or LLMs. Unless you work in industry it’s unlikely you will have heard of DeepFM or ESMM or whatever Meta is using.

          And building out specialized hardware does lock you in to a certain extent. Want to use more than 128GB of memory? Too bad, your $10B chip doesn’t support that.

          • sangnoir 2 years ago

            > Want to use more than 128GB of memory? Too bad, your $10B chip doesn’t support that.

            Which is probably why Meta is also buying the biggest Nvidia datacenter cards by the shipload. There is no need to run inference for a small model - say for a text-ad recommendation system - on an H100 with attendant electricity and cooling costs.

            • namibj 2 years ago

              Also, like, FP tensor cores are way more expensive than fixed-point tensor cores, and with some care, it's very much practical to even train DNNs on them.

              E.g. it's common to have a full-width accumulator and e.g. s16 gradients with u8 activations and s8 weights, with the FMA (MAC) chain of the tensor multiply post-scaled by a learned u32 factor plus a follow-up learned shift, which effectively acts as a fixed-point factor with a learned position for its point, to re-scale the outcome to the u8 activation range.

              By having the gradients be sufficiently wider, it's practical to use a straight-through estimator for backpropagation. I read a paper (kinda two, actually) a few months ago that dealt with this (IIRC one was more about the hardware/ASIC aspects of fixed-point tensor cores, the other more about model-training experiments on existing low-precision integer-MAC chips, IIRC particularly with inference in mind). If requested, I can probably find it by digging through my system(s); I would have already linked it/them if my cursory search hadn't failed.
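
              A generic illustration of the straight-through-estimator idea mentioned here, written as PyTorch fake quantization. This is not the specific scheme from the papers the commenter recalls; on real fixed-point hardware the accumulation would happen in int32 with an integer requantization step rather than fp32.

                  import torch

                  class QuantSTE(torch.autograd.Function):
                      """Round to int8 levels in forward; pass gradients straight through."""
                      @staticmethod
                      def forward(ctx, x, scale):
                          q = torch.clamp(torch.round(x / scale), -128, 127)
                          return q * scale          # dequantized view for the rest of the graph
                      @staticmethod
                      def backward(ctx, grad_out):
                          return grad_out, None     # STE: ignore the rounding in the gradient

                  w = torch.randn(64, 128, requires_grad=True)  # plays the "s8 weights"
                  x = torch.relu(torch.randn(32, 128))          # non-negative "u8" activations
                  w_scale = w.detach().abs().max() / 127
                  x_scale = x.max() / 127

                  y = QuantSTE.apply(x, x_scale) @ QuantSTE.apply(w, w_scale).t()
                  loss = y.square().mean()
                  loss.backward()                   # wider-precision gradients reach w via the STE
                  print(w.grad.shape)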

        • prpl 2 years ago

          To me, it’s bizarre to see the HPC mindset taking hold again after the cloud/commodity mindset dominated the last 16 years.

          You don’t always need a Ferrari to go to the store

          • thorncorona 2 years ago

            WDYM by HPC mindset?

            • rfoo 2 years ago

              "The only meaningful benchmark in the world is LAPACK and only larger than ever monolithic problem instances matter, I don't know what you're talking about, 'embarrassingly parallel'? What a silly word! Serving web requests concurrently? Good for you, congratulations, but can you do parallel programming?"

              Sorry if this makes anyone feel bad. It certainly made me uncomfortable to type it out, though.

              • prpl 2 years ago

                Roughly this. Part of it is performance fetish. Part of it is one architecture for every purpose. I can’t tell you how many times I’ve seen people run embarrassingly parallel jobs coordinated by MPI on a Cray - because somebody spent all that money on that machine. Don’t forget about Gordon Bell Prize outages.

      • Aurornis 2 years ago

        > Yeah, it may fit their current workload perfectly, but it doesn't seem very future proof

        It’s custom silicon designed for a specific, known workload. It’s not designed to be a general purpose part or to be future proofed for unknown future applications.

        When a new application comes along with new requirements, the teams will use their experience to create a new chip targeting that new application.

        That’s the great part about custom silicon: You’re not hitting general specs for general applications that you may not even know about yet. You’re building one very specific thing to do a very specific job and do it very well.

        • noiseinvacuum 2 years ago

          Right, and they have a LOT of GPUs from Nvidia to handle the unknown. Custom silicon for custom workloads seems like a good strategy, especially considering the capabilities that the team will develop along the way.

          • giantrobot 2 years ago

            Offloading a known workload to a custom chip can also save a lot on operations costs, particularly power. Facebook is interested in workload operations per watt rather than raw floating point operations per watt. A GPU might have better raw specs but if the whole GPU package has worse workload ops per watt, a custom chip is likely better.

            At Facebook's scale the spherical cow raw performance stats don't matter nearly as much as real world workloads per ops dollar. They can also repurpose their GPUs to other workloads and let their custom chips handle the boring baseline stuff.

  • chabons 2 years ago

    > Intel Gaudi 3 has more interconnect bandwidth than this has memory bandwidth.

    LPDDR5 vs HBM2e. I'm guessing there's a 2-5x price difference between those, but even so it's an interesting choice; I don't know of any other accelerators that spec DDR. But yeah, without exact TCO numbers it's hard to compare exactly.

    • namibj 2 years ago

      Bandwidth is far more power hungry for DDR, but capacity is far cheaper.

      If the bandwidth capability of DDR suffices, HBM isn't worth it.

      At least with LPDDR; GDDR may well not be worth it under data center TCO considerations due to its high interface power usage. Feel free to correct me if I'm mistaken; the numbers in question aren't easy to search for, so I didn't confirm this (LPDDR vs. GDDR) part.

  • chessgecko 2 years ago

    Also, it's at 90 watts vs 900 watts for Gaudi 3, so the FLOPS and memory bandwidth per watt are much more comparable.

    • modeless 2 years ago

      With high end chips like that it's often possible to get dramatically better efficiency by running it at less than peak power consumption, like 90% performance at 50% power or something like that. It's hard to compare the numbers in a fair way.
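
      Using only figures already mentioned in this thread, plus an assumed ~1000W board power for Blackwell (not quoted anywhere above, so treat it as a placeholder), the perf-per-watt gap does look much smaller than the raw-TOPS gap:

          mtia_tops, mtia_watts = 354, 90      # dense INT8 and TDP, quoted upthread
          bw_tops,   bw_watts   = 4500, 1000   # INT8; the 1000W is an assumption

          print(mtia_tops / mtia_watts)        # ~3.9 TOPS/W
          print(bw_tops / bw_watts)            # ~4.5 TOPS/W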

    • moffkalast 2 years ago

      It would be interesting if this could be made into a reasonably priced (lmao) card for home inference if they intend to mass produce it.

      Can't imagine any reason other than cost as to why they went with LPDDR5; LPDDR5X has more bandwidth and GDDR6 even more.

      • chessgecko 2 years ago

        They didn't use GDDR because they wanted the memory capacity, which is really important for recommendation models. But I totally agree that this is a sort of perfect cost/perf-per-watt point for a home setup. I really hope they do it, if not for this one then at least for v3.

  • cma 2 years ago

    Only 48MB of SRAM per die on Gaudi 3 (96MB across both) vs 256MB here, which maybe increases Gaudi's memory bandwidth needs. Way different power consumption too.

mlsu 2 years ago

Certainly an interesting looking chip. It looks like it's for recommendation workloads. Are those workloads very specific, or is there a possibility to run more general inference (image, language, etc) on this accelerator?

And they mention a compiler in PyTorch; is that open sourced? I really liked the Google Coral chips -- they are perfect little chips for running image recognition and bounding-box tasks. But since the compiler is closed source, it's impossible to extend them beyond what Google had in mind when they came out in 2018, and they are completely tied to TensorFlow, with a very risky software support story going forward (it's a Google product, after all).

Is it the same story for this chip?
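
For context on the PyTorch side: the post doesn't describe Meta's MTIA toolchain, but PyTorch does expose a public hook for plugging a vendor graph compiler in behind torch.compile. A minimal sketch of that generic mechanism; the backend name and behavior here are made up.

    import torch

    def my_accelerator_backend(gm: torch.fx.GraphModule, example_inputs):
        """Hypothetical vendor backend: a real one would lower the FX graph
        to the accelerator's compiler; this sketch just prints it."""
        gm.graph.print_tabular()
        return gm.forward            # fall back to eager execution

    model = torch.nn.Sequential(torch.nn.Linear(16, 16), torch.nn.ReLU())
    compiled = torch.compile(model, backend=my_accelerator_backend)
    out = compiled(torch.randn(4, 16))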

chessgecko 2 years ago

I thought MTIA v2 would use the MX formats (https://arxiv.org/pdf/2302.08007.pdf); I guess they were too far along in the process to get them in this time.

Still, this looks like it would make for an amazing prosumer home AI setup. You could probably fit 12 accelerators on a wall outlet with change left for a CPU, with enough memory to serve a 2T model at 4-bit (rough numbers below) and reasonable dense performance for small training runs and image stuff. Potentially not costing too much to make either, without having to pay for CoWoS or HBM.

I'd definitely buy one if they ever decided to sell it and could keep the price under like $800/accelerator.
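
Rough numbers behind that claim, using the 90W TDP and 128GB capacity figures mentioned elsewhere in the thread and assuming a standard 15A/120V outlet:

    accelerators    = 12
    watts_each      = 90            # TDP figure quoted upthread
    outlet_watts    = 15 * 120      # assumed 15A/120V circuit

    params          = 2e12          # "2T model"
    bytes_per_param = 0.5           # 4-bit weights
    mem_per_card_gb = 128           # capacity mentioned upthread

    print(accelerators * watts_each, "W of", outlet_watts, "W")        # 1080 of 1800 W
    print(params * bytes_per_param / 1e9, "GB of weights")             # 1000 GB
    print(accelerators * mem_per_card_gb, "GB of accelerator memory")  # 1536 GB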

  • buildbot 2 years ago

    I suppose it might; there aren't a lot of details (what kind of sparsity, for example?) about what they mean in terms of INT8 support. It could be MXINT8, or something else.

    Glad someone was thinking the same thing I was though!

    • chessgecko 2 years ago

      It's gotta be the 2:4 sparsity that everyone has but that I haven't seen used anywhere, right? (The pattern is sketched after this comment.) If they put it in, though, they must be using it, but I'm not sure for what. And without details I think it's a good bet that the INT8 is standard INT8.

      Wishful thinking, but maybe they'll announce selling it alongside the giant Llama 3, because there's no good, cheap way to run inference on something like that at home at the moment, and this could change that.
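
      A minimal sketch of what the 2:4 pattern means, independent of whether MTIA actually implements it this way: the two smallest-magnitude weights in every contiguous group of four are zeroed.

          import torch

          def prune_2_of_4(w: torch.Tensor) -> torch.Tensor:
              """Zero the 2 smallest-magnitude values in each contiguous group of 4."""
              groups = w.reshape(-1, 4)
              keep = groups.abs().topk(2, dim=1).indices        # two largest per group
              mask = torch.zeros_like(groups).scatter_(1, keep, 1.0)
              return (groups * mask).reshape(w.shape)

          w = torch.randn(8, 16)
          print((prune_2_of_4(w) == 0).float().mean())          # 0.5: half the weights are zero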

teaearlgraycold 2 years ago

Still seems pretty primitive. Very cool though.

I can only imagine the lack of fear Jensen experiences when reading this.

  • airstrike 2 years ago

    It would be foolish to underestimate the long-term capabilities of a sufficiently funded and driven competitor.

  • moffkalast 2 years ago

    adjusts black leather jacket "Look at what they need to mimic a fraction of our power."

prng2021 2 years ago

3x performance but >3x TDP. Am I missing something or is that unimpressive?

jrgd 2 years ago

I find it weird that, while not everyone agrees Meta, Facebook, and social networks in general are doing any good for society and our democracies, they manage to spend incredible amounts of money/energy/time developing solutions to problems we aren't exactly sure are worth solving…

  • pptr 2 years ago

    What is worth solving in your opinion? Should they not make their service more efficient?

    I assume this helps reduce their server and electricity costs. At a certain scale these things pay off.

  • ixaxaar 2 years ago

    If all this turns out to be useless, burning their cash for nothing seems like a great way to accelerate tech while going down. I guess that would actually be a positive thing.

duchenne 2 years ago

Is it possible to buy it?

ein0p 2 years ago

Come on, Zuck, undermine Google Cloud and take NVIDIA down a few pegs by offering this for purchase in good quantities.

sroussey 2 years ago

Pretty large increase in performance over v1, particularly in sparse workloads.

Low power: 90W TDP (up from 25W for v1)

Could use higher bandwidth memory if their workloads were more than recommendation engines.

throwaway48476 2 years ago

It's interesting that they are not separating training and inference.

  • noiseinvacuum 2 years ago

    This is specifically designed for inference for recommendations models. It’s not for LLM training or inference.

xnx 2 years ago

My mind still boggles that a BBS+ads company would think it needs to design its own chips.

bevekspldnw 2 years ago

Pretty fascinating they mention applications for ad serving but not Metaverse.

I feel like Zuck figured out he’s just running an ad network, that the world is a long way away from some VR fever dream, and that he should focus on milking each DAU for as many clicks as possible.

  • ec109685 2 years ago

    It’s not a GPU, and these chips aren’t able to generate images fast enough at inference time to be usable in a VR context.

  • photonbeam 2 years ago

    He's always known what pays the bills.

    • bevekspldnw 2 years ago

      I dunno, he burned a lot of cash on the metaverse and wasn’t focused on FB. All the top talent was moved over to Metaverse and FB was treated as a career killer. My impression is that ads work is once again a good career play. People chase promo.
