Ask HN: Running production server on M1 mini?

23 points by groundthrower 4 years ago · 64 comments

We have a performance-intensive application running on an AMD Epyc dedicated instance with 32 cores (our application is highly parallelizable).

We just noticed in our local dev environment that our M1 is actually performing better (don’t ask me how).

We are now considering switching our production servers to M1 minis, which are also offered by our cloud provider. Do you have any experience running M1s / Macs in a production environment regarding stability / uptime etc.?

Edit: it’s a Rust application which uses the Rayon crate. The application gets on average one request a minute which crunches some numbers for an average of 2 seconds - so it’s mostly idle. No disk IO.

al2o3cr 4 years ago

    don’t ask me how
_You_ should be asking you how - there are lots of reasons why this could be happening and knowing which one is important if you're changing stuff.

Based on a "highly parallelizable" application performing better on 8 cores than 32, I'd guess you're running out of something else: memory or disk bandwidth.

  • gwbas1c 4 years ago

    Probably the hardest thing to clean up is a codebase where very complicated "optimizations" were built because someone didn't understand some very basic bottlenecks.

    I recently inherited an app that makes heavy use of Redis caching because someone didn't first try optimizing the SQL. The complexity that Redis caching adds is insane to maintain compared to spending a few minutes optimizing SQL.

    The original poster really needs to hook up a profiler.

    Also: having written lots of parallel code: Parallelization isn't a magic way to make things faster. If the codebase is breaking up tasks into lots of tiny tasks that run in parallel, there might be more overhead in parallelization than needed. Sometimes the fastest (performance and implementation) way to parallelize is to keep most of the codebase serial, but only parallelize at the highest level and never share data among operations.
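    The "parallelize only at the highest level" idea can be sketched roughly like this (plain `std::thread::scope` here so the sketch is self-contained; Rayon's `par_iter` over coarse chunks has the same shape; the data and the `sum_of_squares` workload are made up for illustration):

```rust
use std::thread;

// Parallelize only at the top level: one contiguous chunk per worker,
// each chunk processed fully serially, no data shared between chunks.
fn sum_of_squares(data: &[u64], workers: usize) -> u64 {
    let chunk = ((data.len() + workers - 1) / workers.max(1)).max(1);
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|c| s.spawn(move || c.iter().map(|x| x * x).sum::<u64>()))
            .collect();
        // Join at the very end; results are combined once, not per-task.
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let data: Vec<u64> = (1..=1000).collect();
    // Same answer regardless of worker count; only the split changes.
    assert_eq!(sum_of_squares(&data, 8), sum_of_squares(&data, 1));
    println!("{}", sum_of_squares(&data, 8));
}
```

    The point is the shape: a handful of coarse tasks with no cross-talk, rather than thousands of tiny tasks whose scheduling overhead can swamp the work itself.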

    • tmnstr85 4 years ago

      The old anything-but-reviewing-the-execution-plan approach... throw more vCPUs at it! Thank god for Query Store.

  • gjsman-1000 4 years ago

    If his application is running better on 8 instead of 32, that reeks to me of a dependency on single-core performance somewhere. An example of this would be Minecraft, which performs worse on heavily-multi core systems compared to a few fast cores (like M1).

    • alpaca128 4 years ago

      Also Dwarf Fortress, which runs tons of simulations but is a single-threaded 32-bit application, which makes multithreaded performance and RAM beyond ~2GB meaningless.

  • Matthias247 4 years ago

    +1. They should start profiling their application. If it's running on Alpine Linux, for example, the default memory allocator is extremely bad and would degrade performance - but it could also be tons of other things. Taking random actions without understanding what the current bottleneck is will never be great long term.

  • groundthrowerOP 4 years ago

    It does not consume much memory but does lots of allocations/deallocations. No disk operations whatsoever.

    • dragontamer 4 years ago

      M1 has a larger L1 cache, but smaller L3 cache.

      It could very well be that your application is hitting a memory pattern that favors larger L1 cache, while the huge L3 cache of EPYC is not useful.

      ------

      If you really wanted to know, you should learn how to use hardware performance counters and check out the instructions-per-clock. If you're around 1 or 2 instructions per clock tick, then you're CPU-bound.

      If you're less than that, like 0.1 instructions per clock (ie: 10 clocks per instruction), then you're Cache and/or RAM-bound.
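      As a worked example of that rule of thumb (the counter readings below are made up for illustration, not from a real run):

```rust
fn main() {
    // Hypothetical hardware performance counter readings over one interval:
    let instructions: u64 = 600_000_000;
    let cycles: u64 = 4_000_000_000;

    let ipc = instructions as f64 / cycles as f64; // 0.15 IPC
    // Rough rule of thumb from above: ~1-2 IPC suggests CPU-bound,
    // well below 1 (e.g. ~0.1) suggests cache/RAM-bound.
    let verdict = if ipc >= 1.0 { "likely CPU-bound" } else { "likely cache/RAM-bound" };
    println!("IPC = {:.2} -> {}", ipc, verdict);
}
```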

      -----

      From there, you continue your exploration. You count up L1 cache hits, L2 cache hits, L3 cache hits and cache-misses. IIRC, there are some performance counters that even get into the inter-thread communications (but I forget which ones off the top of my head). Assuming you were cache/ram bound of course (if you were CPU-bound, then check your execution unit utilization instead).

      EPYC unfortunately doesn't have very accurate default performance counters, and I'd bet that no one really knows how to use M1 performance counters yet either.

      While the default PMC counters of AMD/EPYC are inaccurate (but easy to understand), AMD has a second set of hard-to-understand, but very accurate profiling counters called IBS Profiling: https://www.codeproject.com/Articles/1264851/IBS-Profiling-w...

      Still, having that information ought to give you a better idea of "why" your code performs the way it does. You may have to activate IBS-profiling inside of your BIOS before these IBS-profiling tools work.

      By default, AMD only has the default performance counters available. So you may have a bit of a struggle juggling the BIOS + profiler to get things working just right, and then you'll absolutely struggle at understanding what the hell you're even looking at once all the data is in.

      • reacharavindh 4 years ago

        This.

        I have dabbled with the AMD & Intel Xeon side of this, but never on macOS. Do you have an idea how one would go about getting performance counters on macOS? IPC, L1 hit/miss, L2 hit/miss, etc.

        • dragontamer 4 years ago

          Unfortunately not. I only have experience on the AMD-side as I played around on my own personal computer.

      • groundthrowerOP 4 years ago

        Thanks, appreciated!

    • gjsman-1000 4 years ago

      I’d suggest investigating single core performance. If you have the money, buy an i9-12900K (slightly faster single-core than M1 but much hotter) and do some testing on that. If my theory is correct, performance will be even better.

      • groundthrowerOP 4 years ago

        We have examined that as well; last week we tried an AMD 5950X, which has half the number of cores but much better single-core performance. The result was still at 60% of the Epyc performance.

        • gjsman-1000 4 years ago

          What was the M1 % relative to your Epyc?

          • groundthrowerOP 4 years ago

            Roughly 10% faster

            • CobaltFire 4 years ago

              Have you investigated memory constraints?

              Ryzen is 2 channels; Epyc is 4-8 (depending on CPU). M1 has that stupidly fast/wide setup.

              If your Epyc is one of the 4 channel optimized SKUs or is only running in 4 channel mode, you would get pretty close to the quoted ratios on a memory bandwidth test.

              Correlation, not causation, but worth looking into.

            • gjsman-1000 4 years ago

              HN makes us wait for replies… so if we need to continue this further I’m open at muse.theses-0z@icloud.com.

              My next question would be if you ran the 12900K in dual-channel memory.

    • Zagitta 4 years ago

      As others have noted, this sounds like a contention issue that you should fix by not allocating in your hot path if at all possible. The easiest fix would probably be to switch out your global allocator for something like https://github.com/gnzlbg/jemallocator and see if that doesn't give you a nice performance boost.
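      The mechanism is Rust's `#[global_allocator]` attribute; swapping allocators is a one-line change. The sketch below uses the stdlib `System` allocator only so it builds without external crates — with the jemallocator crate, `Jemalloc` drops into the same slot:

```rust
use std::alloc::System;

// With the jemallocator crate this would read:
//   use jemallocator::Jemalloc;
//   #[global_allocator]
//   static GLOBAL: Jemalloc = Jemalloc;
// System stands in here so the sketch compiles with the stdlib alone.
#[global_allocator]
static GLOBAL: System = System;

fn main() {
    // Every heap allocation in the program now routes through GLOBAL,
    // so an allocation-heavy hot path exercises the chosen allocator.
    let v: Vec<u64> = (0..1_000).map(|x| x * 2).collect();
    println!("allocated {} elements, sum {}", v.len(), v.iter().sum::<u64>());
}
```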

    • userbinator 4 years ago

      It sounds like you might be running into some sort of contention.

hvgk 4 years ago

They are perfectly stable machines for running batch jobs. I have had one running a bunch of build automation as a Jenkins slave for about 9 months now. Never skipped a beat. It just works, and the thing is damn fast.

If it’s doing it offline, it’s probably cheaper to buy one and chuck it in your office than borrow one from a cloud provider. The ass-end ones are really, really, really cheap - much cheaper than just the CPU in an equivalent server machine. If they blow up, just mill down to the Apple Store and buy another one.

Disclaimers of course: (1) it doesn’t have ECC RAM (2) it doesn’t have redundant power. We ignore (1) and solve (2) by running a prometheus node exporter on it and seeing if it disappears.

Someone1234 4 years ago

No currently offered M1 Mini has redundant fail-over power or storage. Also, without knowing how your cloud provider has cooling set up, it is unclear how well it will operate under heavy load for extended periods of time (blade servers are designed for that specific workload and have cooling solutions to match).

My point is: If your workload is time critical, and you cannot afford downtime/outages then it may not be for you. If your workload can afford the time it would take to adopt a new M1 Mini when the old one dies, then maybe?

  • jagger27 4 years ago

    > No currently offered M1 Mini has redundant fail-over power or storage.

    It's kind of funny, but an M1 MacBook does. In fact it comes with a solid >12 hour UPS built-in.

    • jb1991 4 years ago

      I assume you are just talking about the battery, which I suppose is sort of a built-in UPS, though I've never heard it referred to as such!

      • jtbayly 4 years ago

        I’ve explicitly considered laptop batteries as UPS’s before in designing certain systems.

    • Someone1234 4 years ago

      So does a data center. Neither one has a redundant PSU.

      • jagger27 4 years ago

        A MacBook can actually have multiple power supplies plugged in at once and will use the more powerful one. I bet having two of the same wattage would work fine. It also works with the new MacBook Pros with MagSafe.

        In fact, if you plug the type-C end of the MagSafe cable into the MacBook, it will "charge" itself. USB-PD is pretty great.

        It's too bad the Mac mini can't be powered over USB-C though.

  • bpicolo 4 years ago

    Does that include the AWS launched M1 instances last week?

    • Someone1234 4 years ago

      I don't know, Amazon's press releases don't talk very much about how the offering works under the hood.

  • groundthrowerOP 4 years ago

    Well, it waits for calculations which take about 2 seconds to complete on average - the vast majority of the time it’s idle

    • gjsman-1000 4 years ago

      Could this maybe, someday, be simplified or reimplemented as an AWS Lambda function?

      • groundthrowerOP 4 years ago

        We have already tried that and use Lambdas as a last line of backup if the other servers aren’t available - however, the performance is about 1/50 of our current production server.

joshdev 4 years ago

Have you considered looking at Amazon's ARM offering (Graviton)? I'd be hesitant to use M1 minis for a production workload as they are not really production grade (lacking ECC memory, not sure how long they are rated to run at high CPU, no user-replaceable disks, no RAID, etc.).

svacko 4 years ago

How do you actually compare performance / benchmark the app - are you testing/benchmarking both prod and dev directly on the box itself? I'm thinking there might be other infrastructure shielding production, like load balancers, proxies, and other components (observability/security tooling running and slowing the prod server), compared to accessing the dev M1 directly.

crankyadmin 4 years ago

Knowing what the development language is as well would help a lot - but the first thing you want to do is get some instrumentation on both your 32 Core AMD box and your M1 and compare the two.

The M1 is very fast at doing certain things and your application may just be making good use of the M1 instruction set... but without knowing a bit more it's difficult to tell.

krageon 4 years ago

If you do not understand why your performance profile is as it is, how do you know next week's patch won't make it perform better on AMD machines suddenly? You should understand your problem before you solve it.

  • DrBenCarson 4 years ago

    I don’t think any amount of historical or present-state analysis will shed light on next week’s patch.

    That being said, it would prepare one to better analyze next week’s patch.

    • krageon 4 years ago

      If you know how your application performs and why, you are well equipped for estimating potential impact of patches and hardware. Obviously you still need to profile. In any case complete ignorance should not be the accepted approach.

poulsbohemian 4 years ago

>our M1 is actually performing better performance wise

I did performance analysis work for a long span of my career. While I'm reading between the lines of what you wrote, my first question is - what do you mean by performing better? As in, is it somehow able to process more of these tasks over a given timeframe? If so, I'd want to understand more about the workloads you are running to make sure it's a proper comparison.

There's a whole lot more questions we need to answer here to understand the results you are seeing before we can have any kind of discussion of whether M1s would be "better."

marban 4 years ago

I have one sitting on my desk that generates videos 24/7 and hasn't been down in a year.

https://imgur.com/a/VAxpGCL

nobbis 4 years ago

We use MacStadium's M1 mini servers for Metascan's photogrammetry batch processing. They've only been running a few months, but no downtime yet and I'm impressed with MacStadium's customer support, responsiveness, and price.

toast0 4 years ago

Unless it's changed recently, OS X has essentially no protection from synfloods. The TCP stack predates FreeBSD's syncache, and it was never ported. It doesn't have syncookies either. The pf port's synproxy stuff doesn't seem to work either.

You've got to put some sort of firewall or something in front; don't let it accept TCP connections directly. You might be OK, though not great, if you just set the listen queue really short; at least that should prevent the machine from falling over when it's flooded, but without syncookies, chances are you won't be able to make new connections either.

DarthNebo 4 years ago

Feels like your provisioned disk or IOPS could be the missing factor instead of core counts.

  • groundthrowerOP 4 years ago

    We do not do any disk operations at all

    • DarthNebo 4 years ago

      Ohkk, just see how the network stats compare on both setups. How are you testing the remote env? Is the traffic from the local env or the same cloud env?

gjsman-1000 4 years ago

No - but I can give a few suggestions.

One would be to look, if you haven’t, at MacStadium and what they’ve got there. You can get an M1 Mini there and it will be run by experts who know all about using M1 minis for servers. Considering your application is highly parallelizable, this would also make it easy to upgrade to the M1 Pro with double the performance cores down the line.

Secondly, if your application is running better on M1, that reeks of an application that is somehow greatly benefiting from single-threaded performance somewhere, at which the M1 excels and the Epyc is poor. That probably needs some investigation.

errcorrectcode 4 years ago

Terrible idea. I supported a dozen Xserves back in the day. They were crap because they weren't designed for production use. They used nonswappable, commodity retail IDE drives not meant for 100% duty cycle operation. Fixed power supplies. Real enterprise servers were cheaper.

Mac minis don't have redundant power or ECC. You might as well run a bunch of RPis or PICs. Get yourself some real enterprise servers or rent some via a VPS.

Disclaimer: I use a Mac mini as my living room HTPC. I wouldn't run anything real on it. That's what I have a 96 thread EPYC virtualized box for.

caeril 4 years ago

No personal experience with M1s, but a slightly different one: running production services (involving money!) on another box without ECC DRAM (to save money!) and experiencing random permission-flag flips and actual balance/amount flips. Only a small handful over many years, but it does happen, and when it matters, it REALLY matters.

My advice is to always use ECC DRAM in production unless you're serving cat photos, porn, social media posts, or other societally useless applications. For anything that actually matters, please use ECC.

  • groundthrowerOP 4 years ago

    Yes, this is one concern. Are you sure it was a result of using non-ECC memory, and how did you find out it was because of that?

    • caeril 4 years ago

      We could never be absolutely sure, due to the true Heisenbug nature of the behavior, but after tons of code audits, plus the observation from reverse-proxy traffic analysis that it only occurred on processing by the non-ECC hosts and never on the ECC hosts, it was the most likely culprit.

      The fact that the errors were single bit errors also strongly pointed in that direction.

skw-hn 4 years ago

Scaleway is also providing M1 Mac minis. The price is around 0.10€/hr, which is quite a bit cheaper.

As for stability, my Scaleway M1 has never had any issues; it works just fine for some CI.

jagger27 4 years ago

I'd be curious to know if your application scales even further onto an M1 Pro/Max. If that's the case, then something about Apple silicon makes your application scream.

throwaway4good 4 years ago

Mac OS X will require updates from time to time; otherwise they will run 24/7 with no problem. You could consider building a hybrid setup where you leave the stuff that requires no/little downtime at your cloud provider.

maksimpiriyev 4 years ago

I was thinking the same about switching to an M1 server. Also, the next version of the M1 Mac mini will probably be 2x faster than the current one, so buying a Mac mini next year would be a double benefit :)

tyingq 4 years ago

Have you tried using taskset or similar to force the production application onto fewer cores? Perhaps something about thread/IPC/locking overhead?

  • usefulcat 4 years ago

    This is a good thing to check. However, do be aware that a lot of apps check the number of physical processors in the system rather than the CPU affinity mask for the process, even though the latter is almost always what they ought to be using.
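    For Rust specifically, the stdlib's `std::thread::available_parallelism` does consult the process's affinity mask on Linux (via sched_getaffinity, plus cgroup quotas), and recent Rayon versions size their default pool from it — so a quick check under `taskset` shows what your runtime will actually see (a small sketch, assuming a Linux host):

```rust
use std::thread;

fn main() {
    // On Linux this reflects the CPU affinity mask and cgroup limits,
    // not just the physical core count -- running the binary under
    // e.g. `taskset -c 0-3 ./app` should make the number drop to 4.
    let n = thread::available_parallelism().map(|n| n.get()).unwrap_or(1);
    println!("parallelism visible to this process: {}", n);
}
```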

BizarroLand 4 years ago

"Dedicated instance" makes me think it's a cloud-based system. Are you actually getting what you're paying for?

p_papageorgiou 4 years ago

Any more details about the workload type of your application? Single-threaded / multithreaded / AI, etc.?

Andys 4 years ago

There are now 3 generations of AMD Epyc.

The latest is noticeably quicker than the older ones, and competitive with the M1.

awinter-py 4 years ago

can you describe the program? just broad strokes about language, framework, what kind of traffic it's receiving?

  • groundthrowerOP 4 years ago

    It’s written in Rust and uses Rayon to a big extent. It’s receiving data to crunch maybe once every 5 minutes

    • awinter-py 4 years ago

      the msg from dragontamer to set up a profiler seems like one approach to diagnose this

      and also from joshdev to try aws graviton, which is also arm based but potentially more suited for cloud hosting than an m1

      if you figure this out, definitely write it up -- very cool tech blog topic, most people never get to debug cpu architecture firsthand
