Analyzing the performance of Tensorflow training on M1 Mac Mini and Nvidia V100 (wandb.ai)
When developing ML models, you rarely train "just one".
The article mentions that they explored a not-so-large hyper-parameter space (i.e. they trained multiple models with different parameters each).
It would be interesting to know how long the whole process takes on the M1 vs the V100.
For the small models covered in the article, I'd guess that the V100 can train them all concurrently using MPS (multi-process service: multiple processes can concurrently use the GPU).
In particular, it would be interesting to know whether the V100 trains all models in the same time that it trains one, and whether the M1 does the same, or whether the M1 takes N times longer to train N models.
This could paint a completely different picture, particularly from the user's perspective. When I go for lunch, coffee, or home, I usually spawn jobs training a large number of models, so that when I get back, all these models are trained.
I only start training a small number of models in the later phases of development, when I have already explored a large part of the model space.
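As a rough illustration of the MPS point above, something like this minimal sketch (not from the article; the 4 GB slice, the toy model, and TF 2.x are my assumptions) is what I have in mind:

    # Cap TensorFlow's per-process GPU memory so that several independent
    # training processes (one per hyper-parameter setting) can share a
    # single V100 via CUDA MPS. Launch N copies of this script in parallel.
    import tensorflow as tf

    gpus = tf.config.list_physical_devices("GPU")
    if gpus:
        # Hypothetical 4 GB slice; tune to model size and GPU capacity.
        tf.config.set_logical_device_configuration(
            gpus[0],
            [tf.config.LogicalDeviceConfiguration(memory_limit=4096)])

    # Stand-in for one of the small models under discussion.
    model = tf.keras.Sequential([
        tf.keras.layers.Dense(64, activation="relu", input_shape=(32,)),
        tf.keras.layers.Dense(10)])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))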
---
To make the analogy concrete: what this article is doing is something like benchmarking a 64-core CPU against a 1-core CPU using a single-threaded benchmark. The 64-core CPU happens to be slightly beefier and faster than the 1-core CPU, but it is more expensive and consumes more power because... it has 64x more cores. So to put things in perspective, it would make sense to also show a benchmark that can use all 64 cores, which is the reason somebody would buy a 64-core CPU, and see how the single-core one compares (typically 64x slower).
---
To me, the only news here is that Apple GPU cores are not very far behind NVIDIA's cores for ML training, but there is much more to a GPGPU than just the perf that you get for small models in a small number of cores. Apple would still need to (1) catch up, and (2) extremely scale up their design. They probably can do both if they set their eyes on it. Exciting times.
The low gpu utilization rate in the first graph is kind of a tell... Seems like the M1 is a little bit worse than 40% of a v100?
If that's the case that would be very good. One can buy lots of M1 mac minis for the price of a V100..
Well, you can also get many RTX 3080's (~$700) for the price of a V100 (~$6000), and the RTX 3080's are faster: https://browser.geekbench.com/cuda-benchmarks
As I understand it, the V100 price is mostly artificial datacenter markup, enabled by lack of competition...
> When developing ML models, you rarely train "just one".
Depends on your field. In Reinforcement Learning you often really do train just one, at least on the same data set (since the data set often is dynamically generated based on the behavior of the previous iteration of the model).
Even in reinforcement learning you can train multiple models with different data sets concurrently and combine them for the next iteration.
Do you really train more than one model at the same time on a single GPU? In my experience that's pretty unusual.
I completely agree with your conclusion here.
Depends on model size, but if the model is small enough that I actually do training on a PCIe board, I do. I partition an A100 into 8 and train 8 models at a time, or just use MPS on a V100 board. The bigger A100 boards can fit multiple copies of models that fit on a single V100.
Also, I tend to do this initially, when I am exploring the hyperparameter space, for which I tend to use more but smaller models.
I find that using big models initially is just a waste of time. You want to try many things as quickly as possible.
I found that training multiple models on the same GPU hits other bottlenecks (mainly memory capacity/bandwidth) fast. I tend to train one model per GPU and just scale the number of computers. Also, if nothing else, we tend to push the models to fill the GPU memory.
Memory became less of an issue for me with V100, and isn't really an issue with A100, at least when quickly iterating for newer models, when the sizes are still relatively small.
I had the same experience. My M1 system does well on smaller models compared to a NVidia 1070 with 10GB of memory. My MacBook Pro only has 8GB total memory. Large models run slowly.
I found setting up Apple’s M1 fork of TensorFlow to be fairly easy, BTW.
I am writing a new book on using Swift for AI applications, motivated by the “niceness” of the Swift language and Apple’s CoreML libraries.
Do you happen to have a draft version available somewhere? I'm diving into ML with Swift soon.
If you are interested in just the iOS/iPadOS/macOS platforms, then work through the tutorial articles on ML that Apple provides to devs.
If you are on Linux, then Swift for TensorFlow is OK. You will save some effort by using Google Colab notebooks, that support Swift and Swift for TensorFlow.
I think this is the book https://leanpub.com/SwiftAI
Thanks for linking that, but there is not much of the book written yet. I have mostly been working on the examples.
> I chose MobileNetV2 to make iteration faster. When I tried ResNet50 or other larger models the gap between the M1 and Nvidia grew wider.
(and that's on CIFAR-10). But why not report those results and also test on more realistic datasets? The internet is full of M1 TF benchmarks on CIFAR or MNIST; has anyone seen anything different?
Hehe. That criticism could be applied to ML itself. :)
I wish ML used more than CIFNISTNet, but unfortunately there's not a lot of standard datasets yet. (Even Imagenet is an absolute pain to set up.)
Tensorflow Datasets includes a lot of the 'standard' datasets in a way that's dead simple to call up and use (including ~10 variants of imagenet): https://www.tensorflow.org/datasets/catalog/overview
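For example, a quick sketch (cifar10 is just a stand-in for any catalog name):

    # Pull a 'standard' dataset from the TFDS catalog and build a simple
    # input pipeline; swap "cifar10" for any other catalog entry.
    import tensorflow_datasets as tfds

    ds_train, ds_test = tfds.load("cifar10", split=["train", "test"],
                                  as_supervised=True)
    ds_train = ds_train.shuffle(10_000).batch(128).prefetch(1)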
This is on a model designed to run faster on CPUs. It's like dropping a bowling ball on your foot and claiming excitement that you feel bruised after a few days.
Maybe there's something interesting there, definitely, but the overhype of the title takes away any significant amount of clout I'd give to the publishers for research. If you find something interesting, say it, and stop making vapid generalizations for the sake of more clicks.
Remember, we only feed the AI hype bubble when we do this. The results might be good, but we need to be at least realistic about them, or there won't be an economy of innovation for people to listen to in the future, because they'll have tuned it out with all of the crap marketing that came before it.
Thanks for coming to my TED Talk!
I don't think MobileNetV2 is designed to run faster on CPUs - according to this https://azure.microsoft.com/en-us/blog/gpus-vs-cpus-for-depl... MobileNetV2 gets bigger gains from GPUs vs. several CPUs than ResNet does. You could argue the batch size doesn't fully use the V100, but these comparisons are tricky and this looks like fairly normal training to me.
It's pretty surprising to me that an M1 performs anywhere near a V100 on model training and I guess the most striking thing is the energy efficiency of the M1.
MV2 is memory-limited; the depthwise + grouped + 1x1 convs have long launch times on GPU. Shattered kernels are fine for a CPU, but not for a GPU.
Though per your note on the scales, that's really interesting empirical results. I'll have to look into that, thanks for passing that along.
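To make the "shattered kernels" point concrete, here is a rough sketch of the depthwise + 1x1 structure (shapes and channel counts are illustrative, not the article's exact model):

    # A MobileNetV2-style separable block: a 3x3 depthwise conv followed by
    # a 1x1 pointwise conv, each with its own BN/activation. Every step is a
    # separate small kernel launch - cheap arithmetic, but dominated by
    # launch/memory overhead on a large GPU.
    import tensorflow as tf

    def separable_block(x, filters):
        x = tf.keras.layers.DepthwiseConv2D(3, padding="same", use_bias=False)(x)
        x = tf.keras.layers.BatchNormalization()(x)
        x = tf.keras.layers.ReLU(max_value=6.0)(x)
        x = tf.keras.layers.Conv2D(filters, 1, use_bias=False)(x)  # pointwise
        x = tf.keras.layers.BatchNormalization()(x)
        return x

    inputs = tf.keras.Input((32, 32, 3))
    model = tf.keras.Model(inputs, separable_block(inputs, 64))
    model.summary()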
No, but it's pretty good at retraining the final layer of low memory networks like MobileNet - weirdly a workload that the V100 is very poorly suited for...
Not surprising since this is a training use case that Apple very much focuses on with CreateML.
What about the M1X that will come with 64GB RAM? I’m thinking of waiting for that to come out. Ah...I just see that the article authors are waiting for it as well
>We can see better performance gains with the m1 when there are fewer weights to train likely due to the superior memory architecture of the M1.
Wasn't this whole "M1 memory" thing decided to be a myth now some more technical people have dissected it?
I think two different memory things are being talked about:
1. There is an idea that the M1 has RAM with vastly higher bandwidth than Intel/AMD machines. In reality it is the same laptop DDR RAM that other machines have, though at a very high clock rate, but not higher than the best Intel laptops. So the bandwidth is not any more amazing than a top-end Intel laptop, and latency is no different.
2. But in this case I believe they are talking about the CPU and GPU both being able to freely access the same RAM, as compared to a setup where you have a discrete GPU with its own RAM, where data must first be copied to the GPU's RAM for the GPU to do something with it. In some workloads this can be an inferior approach, in others it can be superior, as the GPU's own RAM is faster. The M1 model again isn't unique; it's similar to how game consoles work, I believe.
> In reality it is the same laptop ddr ram that other machines have
LPDDR4 is more well known for cell phones than laptops actually. I think it shows the stagnation of the laptop market (and DDR4) that LPDDR4 is really catching up (and then some). Or maybe... because cell phones are more widespread these days, cell phones just naturally get the better tech?
On the other hand, M1 is pretty wide. Apple clearly is tackling the memory bottleneck very strongly in its design.
DDR5 is going to be the next major step forward for desktops/laptops.
> 2. But in this case I believe they are talking about the CPU and GPU both being able to freely access the same RAM, as compared to a setup where you have a discrete GPU with its own RAM, where data must first be copied to the GPU's RAM for the GPU to do something with it. In some workloads this can be an inferior approach, in others it can be superior, as the GPU's own RAM is faster. The M1 model again isn't unique; it's similar to how game consoles work, I believe.
More than just the "same RAM": it probably even shares the same last-level cache. Both AMD's chips and Intel's iGPUs share the cache between CPU and GPU in their hybrid architectures.
However: it seems like on-core SIMD units (aka: AVX or ARM NEON / SVE) are even lower latency, since those share L1 cache.
Any situation where you need low latency but SIMD, it makes more sense to use AVX / SVE than even waiting for L3 cache to talk to the iGPU. Any situation where you need massive parallelism, a dedicated 3090 is more useful.
It's going to be tough to figure out a good use for iGPUs: they're being squeezed on the latency front (by things like A64FX: 512-bit ARM SIMD, as well as AVX-512 on the Intel side), and also on the bandwidth front (by classic GPUs).
Myth or not, its memory bandwidth is amazing, so I guess that helps.
As with many things, there isn't one "M1 memory" thing. It's a combination of myth and real stuff. No, it isn't ultra-low latency or high-bandwidth. But on the other hand, single core achievable bandwidth is very high.
No. Some technical people just gave their non-definitive two cents.
Could you please provide some resources on how the unified memory model supposedly works? Why is it a "myth"?
I believe it's referring not to unified memory, but some speculation that the memory being closer to the CPU makes some notable difference. That line was in a fair amount of the initial articles about the M1, and fits the "myth" description.
CPUs often outperform specialized hardware on small models. This is nothing new. You'd need to go to a larger model, and then power consumption curves change too.
One thing I haven’t seen much mention of is getting things to run on the M1’s neural engine instead of the GPU - it seems like the neural engine has ~3x more compute capacity and is specifically optimized for this type of computation.
Has anyone spotted any work allowing a mainstream tensor library (e.g. jax, tf, pytorch) to run on the neural engine?
George Hotz got his "for play" tensor library[a] to run on the Apple Neural Engine (ANE). The results were somewhat disappointing, however, and currently it only does ReLU.
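For completeness, the usual route onto the ANE today is inference via Core ML rather than training. A hedged sketch (the compute_units argument is a newer coremltools option, so treat the exact API as an assumption):

    # Convert a Keras model with coremltools and let Core ML schedule it on
    # ANE / GPU / CPU as it sees fit. Inference only; no mainstream tensor
    # library trains on the ANE.
    import tensorflow as tf
    import coremltools as ct

    keras_model = tf.keras.applications.MobileNetV2(weights=None)
    mlmodel = ct.convert(keras_model,
                         compute_units=ct.ComputeUnit.ALL)  # allow the ANE
    mlmodel.save("mobilenetv2.mlmodel")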
I categorize this as an exploration of how to benchmark desktop/workstation NPUs [1], similar to the exploration Daniel Lemire started with SIMD. Mobile SoC NPUs are used to deploy inference models on smartphones and IoT devices, while discrete NPUs like the Nvidia A100/V100 target cloud clusters.
We don’t have apples-to-apples benchmarks like SPECint/SPECfp for the SoC accelerators in the M1 (GPU, NPU, etc.), so these early attempts are both facile and critical as we try to categorize and compare the trade-offs between the SoC/discrete and performance/perf-per-watt options available.
Power efficient SoC for desktops is new and we are learning as we go.
> We don’t have apples-to-apples benchmarks
We do: https://mlperf.org/
Just run their benchmarks. Submitting your results there is a bit more complicated, because all results there are "verified" by independent entities.
If you feel like your AI use case is not well represented by any of the MLPerf benchmarks, open a discussion thread about it, propose a new benchmark, etc.
The set of benchmarks there increases all the time to cover new applications. For example, on top of the MLPerf Training and MLPerf Inference benchmark suites, we now have a new MLPerf HPC suite to capture ML of very large models.
Those benchmarks are absurdly tuned to the hardware. Just look at the result Google gets with BERT on V100s vs the result NVIDIA gets with V100s. It's an interesting measurement of what experts can achieve when they modify their code to run on the hardware they understand well, but it isn't useful beyond that.
> Just look at the result Google gets with BERT on V100s vs the result NVIDIA gets with V100s.
These benchmarks measure the combination of hardware+software to solve a problem.
Google and NVIDIA are using the same hardware, but their software implementation is different.
---
The reason mlperf.org exists is to have a meaningful set of relevant practical ML problems that can be used to compare and improve hardware and software for ML.
For any piece of hardware, you can create an ML benchmark that's irrelevant in practice but performs much better on that hardware than on the competition. That's what we used to have before mlperf.org was a thing.
We shouldn't go back there.
> on top of the MLPerf Training and MLPerf Inference benchmark suites, we now have a new MLPerf HPC suite to capture ML of very large models.
I think the challenge is selecting the tests that best represent the typical ML/DL use cases for the M1 and comparing it to an alternative such as the V100 using a common toolchain like Tensorflow. One of the problems I see is that the optimizer/codegen of the toolchain is a key component; the M1 has both a GPU and a Neural Engine, and we don't know which accelerator is targeted, or possibly both. Should we benchmark Create ML on M1 vs A14 or A12X? Perhaps it is my ignorance, but I don't think we are at a point where our existing benchmarks can be applied meaningfully to the M1, though I'm sure we will get there soon.
> The challenge is selecting the tests that best represent the typical ML/DL use cases for the M1 and comparing it to an alternative such as the V100 using a common toolchain like Tensorflow.
The benchmarks there are actual applications of ML, that people use to solve real world problems. To get a benchmark accepted you need to argue and convince people that the problem the benchmark solves must be solved by a lot of people, and that doing so burns enough cycles worldwide to be helpful to design ML hardware and software.
The hardware and software then gets developed to make solving these problems fast, which then in turns make real-world applications of ML fast.
Suggesting that the M1 is a solution, and that we now just need to find a good problem it solves well and add that as a benchmark, is the opposite of how mlperf works; hardware vendors suggesting this kind of thing is the reason mlperf exists. We already have common ML problems that a lot of people need to solve. Either the M1 is good at those or it isn't. If it isn't, it should become better at them. Being better at problems people don't want or need to solve does not help anybody.
Well, putting out a tl;dr and then a graph that does not mention FP16/FP32 performance differences or anything related to TensorRT cannot be taken seriously if we're talking about performance per watt. We need to see a comparison that includes multiple scenarios so we can determine something like a break-even point between Nvidia GPUs and the Apple M1 GPU, possibly even for several SotA models.
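For reference, the FP16/FP32 switch in TensorFlow is essentially one line, so reporting both shouldn't be hard. A minimal sketch, assuming TF 2.4+ and using a stand-in model:

    # Keras mixed precision: matmuls/convs run in float16 (hitting the
    # V100's tensor cores) while variables stay float32.
    import tensorflow as tf

    tf.keras.mixed_precision.set_global_policy("mixed_float16")

    model = tf.keras.Sequential([
        tf.keras.layers.Dense(512, activation="relu", input_shape=(784,)),
        tf.keras.layers.Dense(10, dtype="float32")])  # keep logits in fp32
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
    print(model.layers[0].compute_dtype)  # float16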
Can someone with more knowledge of Nvidia GPUs please say how much the V100 costs ($5-10K?) compared with the $900 Mac Mini.
You would instead buy a used 1080 (no ti) for similar performance.
The special thing about the V100 is that its driver EULA allows data center usage. If you don't need that, there are other, much cheaper options.
"Similar performance" still means 30%-50% slower [1] and half the RAM, not really that comparable.
For much closer performance you should get a 2080ti, which should be roughly comparable in speed and have 11GB [edit: wrongly wrote 14GB before] of memory (against the 16GB for the V100). Price-wise you still save a lot of money, after quickly googling around, roughly $1200 vs. $15k-$20k.
But you still lose something, e.g. if you use half precision on V100 you get virtually double speed, if you do on a 1080 / 2080 you get... nothing because it's not supported.
(and more importantly for companies, you can actually use only V100-style stuff on servers [edit: as you mentioned already, although I'm not 100% sure it's just drivers that are the issue?])
[1] I've not used 1080 myself, but I've used 1080ti and V100 extensively, and the latter is about 30% faster. Hence my estimate for comparison with 1080
For my workload (optical flow) I was honestly surprised to see that the Google Cloud V100 was not faster than my local GTX 1080. So I guess that varies a lot by how you're training, too.
For many of my AI training workloads, already the 1080 is "fast enough" and the CPU or SSDs are the bottleneck. In that case, GPU doesn't really matter that much.
Yes that might be the case. In my case I mostly trained big (tens to hundreds of millions of parameters) networks mostly made of 3x3 convolutions, and I think the V100 has dedicated hardware for that. Then as I mentioned you can get a further 2x speedup by using half precision.
If you train smaller models, or RNN, you probably lose most of the gains of dedicated hardware. But I guess that for this same reason the experiments in the article are little more than a provocation, I don't know if you could train a big network in finite time on M1 chips...
That said, of course, if the budget was mine, I wouldn't buy a V100 :-)
> But you still lose something, e.g. if you use half precision on V100 you get virtually double speed, if you do on a 1080 / 2080 you get... nothing because it's not supported.
That's not true. FP16 is supported and can be fast on 2080, although some frameworks fail to see the speed-up. I filed a bug report about this a year ago: https://github.com/apache/incubator-mxnet/issues/17665
What consumer GPUs lack is ECC and fast FP64.
How does AMD stuff like Radeon VII or MI100 hold up?
Can't use it, because most AI frameworks won't run on AMD, since they have not implemented suitable back-ends (yet).
There's one for PyTorch, I tested it about a year ago. You have to compile it from scratch and IIRC it translates/compile CUDA to ROCm at runtime which causes noticeable pauses on the first run. There may be other tweaks you have to do too. Once set up it performs decently, though.
> The special thing about the V100 is that it's driver EULA allows data center usage.
Wait what? Is it the only thing?
That sounds hard to believe: if true, using the open driver (Nouveau) instead of Nvidia's proprietary one would be a massive money saver for datacenter operators (and even if Nouveau doesn't already support the features you'd want, supporting their development would be much cheaper for a company like Amazon than paying a premium on every GPU they buy)
No, that's not the only thing.
Other characteristics of V100 that may be interesting to people buying GPUs for data centers:
- higher capacity GPU memory. 1080 has 8 GB, V100 has 16 or 32 GB.
- higher bandwidth GPU memory. V100 has HBM2 with a peak of 900 GB/s, 1080 has G5X with a peak of ~300 GB/s.
- ECC support.
- data center certification + warranty
(The geforce warranty covers normal consumer usage, like gaming, and does not cover datacenter use)
- availability of enterprise support contracts.
(If you are buying a ton of GPUs to put in a datacenter, you probably don't want to end up on the normal consumer support line when something goes wrong)
- fast fp64
There are probably others
A GTX 1080 manages about ~9 TFLOPS (fp32) (and has terrible fp16 support), whereas the V100 gets ~15 TFLOPS (fp32), ~30 TFLOPS (fp16), and ~120 TFLOPS (tensor cores).
Apart from one being a gaming product and the other being designed for computational tasks, they're a generation apart and have various small differences that may be quite relevant for individual tasks (such as V100 allowing twice the shared memory - 96 KiB - per thread block)
Thanks, that makes much more sense!
Nouveau does not support CUDA and is therefore not usable for GPU computing on Nvidia.
NVIDIA's driver EULA prevents data centre use of their consumer hardware. Also, NVIDIA does not allow bulk buying of the RTX series.
They barely allow single buying for the 30 series :(
Took me quite a while to get my hands on a 3080.
What ended up working for you?
I bought from a (relatively) small German commerce site[1] rather than a bigger site like Amazon, OCUK, or Scan. I'm in EU though, probably doesn't help if you're US. I think I paid a €50 or so premium over the retail price but I didn't mind that too much.
I used this[2] site to keep an eye open for stock, as you can see it's pretty much empty now but I just checked every day and finally found one.
[1] https://www.reichelt.de/ [2] https://www.gputracker.eu/en/search/category/1/graphics-card...
Thanks for the insights...frustrating times to be searching for one.
Don't buy hardware in general for AI work, IMO. It'll be out of date in a year and you'll end up training in the cloud anyway.
If you properly utilize your hardware, on premise (or colocation in an area with cheap electricity prices) is vastly cheaper and will likely continue to be for a while. I don't see how training models in the cloud makes financial sense for organizations that can utilize their hardware 24/7.
For all others with burst workloads training in the cloud can make sense, but that has been the case for a while already.
We're not talking about organizations, though. I don't agree with your premise, either. People aren't training models 24/7, so the idea that it's "vastly cheaper and will continue to be for a while" isn't true.
> People aren’t training models 24/7
... uh, you sure about that? Let me go check on the 3 models I have concurrently training for my organization on 3 separate GPU servers (all 2 year old hardware to boot) that have been running continuously for the past 36 hours. It pretty much works out to 24/7 training for the past several months.
And BTW, this is massively cheaper for us than training in the cloud.
Instead of arguing back and forth, how about a test case instead?
Pretraining BERT takes 44 minutes on 1024 V100 GPUs [1]
This requires dedicated instances, since shared instances won't be able to get to peak performance if only because of the "noisy neighbour"-effect.
At GCP, a V100 costs $2.48/h [2], so Microsoft's experiment would've cost $2,539.52.
Smaller providers offer the same GPU at just $1.375/h [3], so a reasonable lower limit would be around $1,408.
For a single BERT pretraining, provided highly optimised workflows and distributed training scripts are already at hand, renting a GPU for single training tasks seems to be the way to go.
The cost of V100-equivalent end-user hardware (we don't need to run in a datacentre, dedicated workstations will do), is about $6,000 (e.g. a Quadro RTX 6000), provided you don't need double precision. The card will have equal FP32 performance, lower TGP and VRAM that sits between the 16 GB and 32 GB version of the V100.
Workstation hardware to go with such card will cost about $2,000, so $8,000 are a reasonable cost estimation. The cost of electricity varies between regions, but in the EU the average non-household price is about 0.13€/kWh [4].
Pretraining BERT therefore costs an estimated 1024 h * 0.5 kW * 0.13€/kWh ≈ 67€ in electricity (power consumption estimated from TGP plus the typical power consumption of an Intel Xeon workstation, from my own measurements when training models).
To get the break-even point we can use the following equation: t * $1,408 = $8,000 + t * $81 (≈ 67€ at the same exchange rate), which gives t = 8,000/(1,408 - 81), i.e. t > 6.
In short, if you pretrain BERT 7 or more times, you save money by BUYING a workstation and running it locally over renting cloud GPUs from a reasonably cheap provider.
This example only concerns BERT, but you can use the same reasoning for any model that you know the required compute time and VRAM requirements of.
This only concerns training, too - inference is a whole different can of worms entirely.
[1] https://www.deepspeed.ai/news/2020/05/27/fastest-bert-traini...
[2] https://cloud.google.com/compute/gpus-pricing
[3] https://www.exoscale.com/syslog/new-tesla-v100-gpu-offering/
[4] https://ec.europa.eu/eurostat/statistics-explained/index.php...
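If you want to redo the arithmetic with your own numbers, it boils down to something like this (the figures are just the rough ones quoted above, not authoritative prices):

    # Break-even between renting cloud GPUs and buying a workstation.
    CLOUD_COST_PER_RUN = 1408.0   # $ per BERT pretraining (1024 V100-hours)
    WORKSTATION_COST   = 8000.0   # $ one-off (RTX 6000 card + workstation)
    POWER_COST_PER_RUN = 81.0     # $ electricity per run (~67 EUR)

    def runs_to_break_even(cloud, capex, power):
        """Smallest number of training runs at which owning beats renting."""
        n = 1
        while n * cloud < capex + n * power:
            n += 1
        return n

    print(runs_to_break_even(CLOUD_COST_PER_RUN, WORKSTATION_COST,
                             POWER_COST_PER_RUN))  # -> 7 with these numbers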
I'm seeing a lot of M1 hype, and I suspect most of it is unwarranted. I looked at comparisons between the M1 and the latest Ryzens, and it looks like they're comparable? Does anyone know the details? I only looked summarily.
The main hype is that performance is similar, but the M1 does it with a lot less power draw. The performance itself isn't too crazy; what's crazy is that it achieves it with a power draw somewhat similar to a high-end phone's.
"trainable_params 12,810"
laughs
(for comparison, GPT3: 175,000,000,000 parameters)
Can Apple's M1 help you train tiny toy examples with no real-world relevance? You bet it can!
Plus it looks like they are comparing Apples to Oranges ;) This seems to be 16 bit precision on the M1 and 32 bit on the V100. So the M1-trained model will most likely yield worse or unusable results, due to lack of precision.
And lastly, they are plainly testing against the wrong target. The V100 is great, but it is far from NVIDIA's flagship for training small low-precision models. At the FP16 that the M1 is using, the correct target would have been an RTX 3090 or the like, which has 35 TFLOPS. The V100 only gets 14 TFLOPS because it lacks the dedicated TensorRT accelerator hardware.
So they compare the M1 against an NVIDIA model from 2017 that lacks the relevant hardware acceleration and, thus, is a whopping 60% slower than what people actually use for such training workloads.
I'm sure my bicycle will also compare very favorably against a car that is lacking two wheels :p
GPT-3 is so big it would take 355 years to train on an Nvidia V100, so your example is not really useful for comparison either. It would be interesting to see some mid-sized NN benchmarks, though.
This, not to mention one could get the GPU usage on the V100 way higher by training with larger batch sizes, which would also make training much faster.
Thanks for the thorough comment. The article is, unfortunately, just clickbait.
It seems like a common trend with M1 articles on HN lately.
The comment is bogus empty snark (and factually wrong).
The arguments made (and I use the word arguments loosely):
"Too few trainable_params compared to GTP3".
GTP3 is several orders of magnitude higher than what people train, and so it's a useless comparison. It's like we're comparing a bike to an e-bike, and someone says "yeah, but can the e-bike run faster than a rocket?"
Second argument "Sure, it's faster than a machine that costs 3-4 fives more, but you should instead compare it to a machine that costs even more than that".
I can only take it as a troll comment.
Thorough? Their comment is noisy snark.
A huge number of models are "small". I'm currently training game units for autonomous behaviors. The M1 is massively oversized for my need.
Saying "Oh look, GPT-3" just stupidifies the conversation, and is classic dismissive nonsense.
Hard disagree. V100s are a perfectly valid comparison point. They're usually what's available at scale (on AWS, in private clusters, etc.) because nobody's rolled out enough A100s at this point. If you look at any paper from OpenAI et al. (basically: not Google), you'll see performance numbers for large V100 clusters.
Yes and you'll see parameters tuned for V100, not parameters tuned for m1 somehow limping along on a V100 in emulation mode.
I wouldn't complain about a benchmark executing any real world SOTA model on m1 and V100, but those will most likely not even run on the M1 due to memory constraints.
So this article is like using an ios game to evaluate a Mac pro. You can do it, but it's not really useful.
You can count the number of GPUs with more memory than the M1 (16 GB) on one hand.
Isn't the M1's GPU memory shared with everything else? Can the GPU realistically use that much? Won't the OS and base apps use up at least 2-3 GB?
The M1 can only address 8 GB with its NPU/GPU.
> The V100 only gets 14 TFLOPS because it lacks the dedicated TensorRT accelerator hardware.
V100 has both vec2 hfma (i.e. fp16 multiply-add is twice the rate of fp32), getting ~30 TFLOPS, and tensor cores which can achieve up to 4x that for matrix multiplications.
For the first graph:
trainable parameters: 2,236,682
So it's a toy model...
Many models of that size are in serious productive use.
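For anyone wondering where a count like that comes from, it is just the sum over the trainable weights. A hedged sketch (MobileNetV2 and the 10-class head are only illustrative, not the article's exact setup):

    # Count trainable parameters of a small transfer-learning style model.
    import tensorflow as tf

    base = tf.keras.applications.MobileNetV2(
        input_shape=(96, 96, 3), include_top=False, weights=None)
    model = tf.keras.Sequential([
        base,
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10)])

    trainable = sum(int(tf.size(w)) for w in model.trainable_weights)
    print(f"trainable params: {trainable:,}")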
Even the RTX 3090 is double the price of an M1 for just 1 card.
The V100 is almost 5-10x the price of an M1.
The first graph includes "Apple Intel", which is not mentioned anywhere else in the post. Any idea what hardware that was, and whether it used the accelerated TensorFlow?
My bad, this was using non-Accelerated TensorFlow on a 2.3GHz 8-Core i9.
Betteridge says no.
And Betteridge is wrong.
"Can Apple's M1 do a good job? We cut things down to unrealstic sizes, turned off cores, and p-hacked as hard as we could until we found a way to pretend the answer was yes"