AMD Prepares 32-Core Naples CPUs for 1P and 2P Servers: Coming in Q2
anandtech.com

I think Naples is a very exciting development, because:
- 1S/2S is obviously where the pie is. Few servers are 4S.
- 8 DDR4 channels per socket is twice the memory bandwidth of 2011, and still more than LGA-36712312whateverthenumberwas
- First x86 server platform with SHA1/2 acceleration
- 128 PCIe lanes in a 1S system is unprecedented
All in all Naples seems like a very interesting platform for throughput-intensive applications. Overall it seems that Sun with its Niagara approach (massive number of threads, lots of I/O on-chip) was just a few years too early (and likely a few thousand per system too expensive ;)
> 128 PCIe lanes in a 1S system is unprecedented
Yes, definitely drooling at this. Assuming a workload that doesn't eat too much CPU, this would make for a relatively cheap and hassle-free non-blocking 8 GPU @ 16x PCIe workstation. I wants one.
That does sound pretty spectacular, and really loud. What kind of case would you put that in? Would you work with ear protection?
Wrt. noise: decibels (dB, "perceived loudness") are logarithmic in sound energy, so going from e.g. a single GTX 1080 at 47 dB to 8x GTX 1080 only increases the noise to 56 dB, which is noticeable but not really annoying, and very far from requiring ear protection (rough math below). The recommendation for office spaces is that noise be kept < 60 dB IIRC.
Wrt. cases: I think a regular E-ATX compatible case should be enough, but it all depends on the motherboard, and those don't exist yet. Existing 8x GPU servers have been 4U rack mount dual socket affairs; you can also already get 7x GPU dual socket "EEB" motherboards and workstation style cases, but none that will do full 16x for all the GPUs.
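To put rough numbers on the decibel point above (a back-of-the-envelope sketch; the 47 dB single-card figure is just the example from the comment):

```python
import math

def combine_identical_sources(level_db, n):
    """Combined level of n identical, uncorrelated noise sources.

    Levels add in energy, not in dB: L_total = L + 10*log10(n).
    """
    return level_db + 10 * math.log10(n)

print(combine_identical_sources(47, 1))  # 47.0 dB, one GPU
print(combine_identical_sources(47, 2))  # ~50 dB, doubling adds ~3 dB
print(combine_identical_sources(47, 8))  # ~56 dB, 8 GPUs add ~9 dB
```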
Your noise scale is off. 60dB is restaurant conversation level noise.
For comparison, Notebookcheck's system noise scale is 30dB=silent, 40dB=audible, 50dB=loud.
Yup, sorry, my bad. 60 dB is definitely loud enough to be annoying. Still, 8 GPUs are not 8x louder than one, which was the main point.
That's not the point. 80 dB isn't twice as loud as 40 dB, either, it's much more than that.
> Recommendations for office spaces is that noise be kept < 60 dB IIRC.
I'd quit if I had to work in an office space with 60 dB noise. That's like sitting next to a rack of 1U servers at "moderately angry bee swarm" fan level.
I personally cannot stand to be near a noise source above 40 dB for any extended length of time (more than a few hours).
But 60 dB... wow. Can't imagine how shitty that must be to work in for 8 hours per day.
Seriously. Here in my country (Spain) the recommended maximum level for office spaces is 45 dB(A) of equivalent continuous sound level (NBE-CA-88 regulation, only in Spanish: http://www.ual.es/Depar/proyectosingenieria/descargas/Normas...)
> 8 DDR4 channels per socket is twice the memory bandwidth of 2011, and still more than LGA-36712312whateverthenumberwas
This one will be interesting. The current Ryzen (like most of the Intel desktop range) has two channels, but everyone has been benchmarking it against the i7-6900K because they both have eight cores. The i7-6900K is the workstation LGA 2011 with four channels. If the workstation Ryzen will have eight channels...
Let's hope this isn't Niagara again: it needs to have decent clock speeds, as IPC is still worth something today. But yes, I totally agree, this is an exciting chip.
It's not: AMD moved away from the CMT (clustered multi-threading) design used in the previous Bulldozer microarchitecture, and Zen is now an SMT (simultaneous multithreading) architecture allowing 2 threads per core.
By comparison, the performance of SPARC substantially improved moving from the T1 and T2 to the T3 and later. The T1 used a round-robin policy to issue instructions from the next active thread each cycle, supporting up to 8 fine-grained threads in total. That made it more like a barrel processor.
Starting with the T3, two of the threads could be executed simultaneously. Then, starting with the T4, SPARC added dynamic threading and out-of-order execution. Later versions are even faster and clock speeds have also risen considerably.
I didn't know about this. Are there benchmarks that aren't canned by Oracle that you know of? I'm intrigued by this round-robin way of threading. I'm not a cpu expert, but how does this compare with the Power arch's way of threading?
Think of it this way, the original Niagara (T1) was an in-order CPU. That is, instructions were executed in the order they occur in the program code. This is simple and power efficient but doesn't produce very good single thread performance, since the processor stalls if an instruction takes longer than expected. Say, a load instruction misses L1 cache and has to fetch the data from L2/L3/Lwhatever/memory. Now, one way to drive up the utilization of the CPU core is to add hardware threads. And the simplest way to do that? Well, just run an instruction from another available thread every cycle (that is, if a thread is blocked e.g. waiting for memory, skip it). So now you have a CPU that is still pretty small, simple and power efficient, but can still exploit memory level parallelism (i.e. have multiple outstanding memory ops in flight).
Now, the other approach is that you have a CPU with out of order (OoO) execution. Meaning that the CPU contains a scheduler that handles a queue of instructions, and any instruction that has all its dependencies satisfied can be submitted for execution. And then later on a bunch of magic happens so that externally to the CPU it still looks like everything was executed in order like the program code specified. This is pretty good for getting good single thread performance, and can exploit some amount of MLP as well, e.g. if a bunch of instructions are waiting for a memory operation to complete, some other instructions can still proceed (perhaps executing a memory op themselves). So in this model the amount of MLP is limited by the inherent serial dependencies in the code, and on the length of the instruction queues that the scheduler maintains. The downside of this is that the OoO logic takes up quite a bit of chip area (making it more expensive), and also tends to be one of the more power-hungry parts of the chip. But, if you want good single-thread performance, that's the price you have to pay.

Anyway, now that you have this OoO CPU, what about adding hardware threads? Well, now that you already have all this scheduling logic, turns out it's relatively easy. Just "tag" each instruction with a thread ID, and let the scheduler sort it all out. So this is what is called Simultaneous Multi-Threading (SMT). So in a way it's a pretty different way of doing threading compared to the Niagara-style in-order processor. Also, since you already have all this OoO logic that is able to exploit some MLP within each thread, you don't need as many threads as the Niagara-style CPU to saturate the memory subsystem. So, this SMT style of threading is what you see in contemporary Intel x86 processors (they call it hyperthreading (HT)), IBM POWER, and now also AMD Zen cores.
As for benchmarks, I'm too lazy to search, but I'm sure you can find e.g. some speccpu results for Niagara.
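A toy model of the fine-grained round-robin issue described above (purely illustrative; it has nothing to do with the real T1 microarchitecture, and the stall pattern is made up):

```python
from collections import deque

def barrel_issue(threads, cycles):
    """Toy fine-grained multithreading: each cycle, issue one instruction
    from the next thread that is not stalled waiting on "memory".

    threads: dict of thread_id -> remaining stall cycles (0 = ready).
    Returns a list of (cycle, thread_id) issue events.
    """
    order = deque(threads)          # round-robin order of thread ids
    issued = []
    for cycle in range(cycles):
        # outstanding "memory" stalls tick down every cycle
        for t in threads:
            if threads[t] > 0:
                threads[t] -= 1
        # walk the ring once: skip stalled threads, issue from the first ready one
        for _ in range(len(order)):
            t = order[0]
            order.rotate(-1)
            if threads[t] == 0:
                issued.append((cycle, t))
                # pretend every 4th issued instruction misses cache: stall 10 cycles
                if len(issued) % 4 == 0:
                    threads[t] = 10
                break
    return issued

print(barrel_issue({0: 0, 1: 0, 2: 0, 3: 0}, cycles=16))
```

The point of the toy model: as long as at least one thread is not waiting on memory, the core issues something every cycle, which is how this style of core stays busy despite having no out-of-order machinery.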
Doing that now. Thanks for the write-up.
So although separated in time but not by clocks (the Intel setup has roughly the same base clock and the same RAM as the T4 setup), the 40-thread Xeon system had roughly double the performance of the 128-thread T4 setup running SPECjvm2008 https://www.spec.org/jvm2008/results/jvm2008.html
The T7 and S7 are even faster than the T4, and unfortunately I haven't seen newer results published for them.
Naples is based on Ryzen which, if you look at early benchmarks, is beating the competition on all fronts except gaming (suspected to be due to software optimisation and motherboard issues).
Yes, but four modules of Ryzen making up this beastly Naples chip aren't going to be clocked at the same frequencies. The top end Intel chips have TDPs of 165W, but 4 Ryzen chips at 3.6GHz have a TDP of 65W apiece, and you're not going to see a 260W server chip if you want to sell into the datacenter.
The Ryzen 7 1800X has a TDP of 95W and beats the 140W Intel i7-6900K by 4% in performance. They've made some huge jumps in power efficiency.
I don't know if AMD will make a new architecture or not, but I can't see why they wouldn't just release 32 Ryzen cores side-by-side and underclocked at the stock configuration.
The 1800X will use 130W+ in the same scenarios as the 6900k. AMD just seems to be defining TDP differently.
The thermal design power is the maximum amount of heat generated by a computer chip or component that the cooling system in a computer is designed to dissipate in typical operation. TDP =/= power consumed
> TDP =/= power consumed
Where do you think the heat comes from? Or where do you think the power that doesn't turn into heat goes?
I think he is trying to argue that TDP is a figure about the cooling requirement during peak power usage. Actual power usage may or may not be less during typical workload.
As far as TDP goes, the transference of electrical energy to heat is equivalent to that of a space heater, as Puget Systems demonstrated here: https://www.pugetsystems.com/labs/articles/Gaming-PC-vs-Spac...
But TDP is also a function of power consumption, directly proportional. So comparing the TDP of processors of similar fabrication should tell you about comparative power consumption, if not the exact difference.
I've seen benchmarks showing the 95W number is very unrealistic and that it actually draws about as much as the Intel processor under certain loads.
Ryzen is highly-efficient up to 3.3GHz:
https://forums.anandtech.com/threads/ryzen-strictly-technica...
Wow, that's pretty insane. Underclocked to 3.3Ghz it runs at ~42 watts and is benching at ~178.5% performance per watt vs stock clock. This CPU will be very interesting to see in the datacenter space.
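A rough back-of-the-envelope of why backing off the clock pays so disproportionately, assuming the usual dynamic-power scaling P ≈ C·V²·f (the voltages below are hypothetical, not measured Ryzen values):

```python
def dynamic_power(c, v, f):
    """Approximate dynamic power: P ~ C * V^2 * f (capacitance, volts, Hz)."""
    return c * v**2 * f

# Hypothetical operating points: 3.7 GHz @ 1.35 V vs 3.3 GHz @ 1.05 V
# (made-up voltages, for illustration only).
p_stock = dynamic_power(1.0, 1.35, 3.7e9)
p_under = dynamic_power(1.0, 1.05, 3.3e9)

# Performance scales roughly linearly with frequency; power does not.
perf_per_watt_gain = (3.3e9 / p_under) / (3.7e9 / p_stock)
print(f"power ratio: {p_under / p_stock:.2f}")    # ~0.54
print(f"perf/W gain: {perf_per_watt_gain:.2f}x")  # ~1.65x
```

Lowering frequency lets voltage drop too, and the V² term is what produces these outsized efficiency gains.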
They will likely drop the clock. I don't think the market cares at all about how much power a CPU takes. If a 1U box has competitive performance AND better performance/watt then it's attractive. If it has worse performance/watt then it's not.
AMD might well steal some of the dual socket market with a single socket, and maybe some of the quad socket market with dual sockets.
Considering that the current Ryzen at $500 is relatively competitive with the $1,000 Intel (basically a relabeled Xeon with 4 memory buses in the LGA2011 server socket), a quad module (32 core/64 thread) in a socket sounds pretty good. Even if it's more watts than the Intel.
The E7s go to 160W each. If you can drive better than 1.7x the performance and stay within the maximum thermal output per physical volume, I see no reason not to use this.
One reason, perhaps, is if my binaries are compiled with Intel-specific optimizations and it's inconvenient to deploy separate AMD-optimized binaries.
I can see a use case for it, as long as it delivers on performance. No one minds high TDP, as long as it offers a performance advantage. Hell, some servers have 4-8 Titans in them, and no one is complaining about their TDP. If a 260W CPU TDP is justified by the performance, no one will care.
The bigger E7s scratch the 200 W mark pretty hard, and IBM already had POWER chips go beyond 200 W. However, cooling and power density are ... problematic. The same goes for accelerators. Supermicro will happily deliver you a 1U box with four Pascals and two Xeon sockets, but there is no datacenter in the world where you can stuff 42 of those in a cabinet. [Which doesn't mean that these don't make sense]
However, high end systems don't lend themselves well to mass-deployment (i.e. scale out).
maybe for single server or academia setups, but in datacenters TDP and power consumption absolutely do matter.
And that was my point as well.
I'm more than willing to admit I could be wrong. And maybe Intel will push the TDP envelope with servers as well if Naples proves a threat when things all shake out. But if Intel hasn't put a ~250W server chip into production, I doubt AMD will; then again, if it performs that much better, there's a calculus that will need to be done. My prediction, based on no evidence, is that this 32-core chip will be clocked at 2.6GHz and boost to 3.2. Shot in the dark, but given current TDPs that's where I think things might shake out to.
32 cores in one socket may also take a bite out of some servers that are currently 2 sockets.
My shallow understanding of big servers and IBM Z series amounted to "lots of dedicated IO processors". Seems like "mainstream" caught up with big blue.
Sort of. It ebbs and flows; it's generally more maintainable to do more in CPU/kernel and less in HW/firmware for PCs, and of course price runs the market, so there's a race to do less. Part of the mainframe price tag is getting long term support on the whole system stack, whereas PC vendors actively abandon stuff after a few years. That is a big risk for something like a TCP offload engine.
Every mainframe interface is basically an offload interface: "computers" DMAing and processing to the CPs and each other. Every I/O device has a command processor, so it can handle channel errors and integrated PCIe errors in a way PCs cannot.
A PC with Chelsio NICs doing TCP offload with direct data placement or RDMA, as well as Fibre Channel storage, would be mini/mainframe-ish.
Pretty much. Mainframes have been very I/O oriented from the start. Channel I/O (more or less DMA) with dedicated channel programs and processors can be very high-throughput.
Also, I suppose it frees the logic processors from all I/O (caching too?) related processing and allows for fancier strategies downstream... (all guess fest)
Intel doesn't have SHA2 acceleration? ARMv8 has had it for like 2-3 years now...
And AMD should dump SHA1 acceleration in the next generation.
>And AMD should dump SHA1 acceleration in the next generation.
The cost to have that on silicon is probably close to zero. If you think SHA1 is just going to magically disappear because you want it to, well, you'll be in for a SHA1 sized surprise. Our grandkids will still have SHA1 acceleration.
>ARMv8 has had it for like 2-3 years now...
Because ARM cores don't remotely have the CPU heft an Intel x86/64 chip has; ARM needs all this acceleration because it's typically used in very low power mobile scenarios. On top of that, Intel claims AES-NI can be used to accelerate SHA1.
https://software.intel.com/en-us/articles/improving-the-perf...
Why should it be dropped? Isn't it just a hash function?
If you remove things from the instruction set, any code that uses them will either crash or run very slowly in emulation.
Most uses of special instructions will check feature bits or CPU version, but not all will do so correctly.
(I'd say that the additional area cost of something like this is small, and the big cost of special instructions is reserving opcodes and feature bits)
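A minimal sketch of the feature-bit check mentioned above; on Linux you can simply read the flags the kernel exposes (sha_ni is the flag Linux reports for the SHA extensions, aes for AES-NI; other platforms need a real CPUID query):

```python
def cpu_flags():
    """Return the CPU feature flags reported in /proc/cpuinfo (Linux only)."""
    with open("/proc/cpuinfo") as f:
        for line in f:
            if line.startswith("flags"):
                return set(line.split(":", 1)[1].split())
    return set()

flags = cpu_flags()
print("SHA extensions:", "sha_ni" in flags)
print("AES-NI:        ", "aes" in flags)
```

Code that skips this kind of check and just assumes the instruction exists is exactly the code that breaks if the instruction is ever dropped.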
Short story: because its role as a crypto hash function is sort of obsolete given that it's been proven to be broken, and faster, more secure alternatives exist.
But for all practical purposes, SHA1 isn't about to disappear. MD5 has been shown to be broken since forever and people still write new code using it today.
The thing with SHA-1 is that we know (and have known for a decade) that is not a good cryptographic hash function. It is still, along with MD5, a good hash function if you control the input, i.e. in a hash table.
There are better functions than SHA1 to use for hash tables. Candid question: really what is the use for MD5/SHA1 these days?
Yes, but those are widely implemented and thus available more or less everywhere. I'd rather use a slightly less ideal hash function, than use an untested one.
ARM cores are much weaker, and crypto performance without NEON is abysmal across the board. Of course, compared to hardware acceleration, software always seems slow; Haswell manages AES-OCB at <1 cpb.
As a side note, XOP had rotate instructions. Sadly it is no longer supported in Ryzen.
Intel has had SHA1/2 acceleration for YEARS via the AES-NI instruction set.
https://en.wikipedia.org/wiki/Intel_SHA_extensions
>There are seven new SSE-based instructions, four supporting SHA-1 and three for SHA-256:
>SHA1RNDS4, SHA1NEXTE, SHA1MSG1, SHA1MSG2, SHA256RNDS2, SHA256MSG1, SHA256MSG2
This is not part of AES-NI and has never been released in a mid-range+ server/desktop CPU, only part of some Atom parts (Goldmont). Therefore software support is poor (I think OpenSSL does not support it). It is said to be included in 2018+ Cannonlake, though.
haha nope. This is not a part of AES-NI.
The only processors so far with these extensions are low power Goldmont chips.
Goldmont probably has them because it doesn't have the wide AVX pipelines necessary for fast software crypto.
Skylake can compute SHA1 at 3.4-4.3 cycles/B and SHA256 at 7-9 cycles/B [1]. That's ~1GB/s SHA1 and ~500MB/s SHA256.
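The cycles/byte-to-throughput conversion is just clock divided by cycles per byte; a quick sketch at a nominal 4 GHz with the rounded figures from the comment above:

```python
def throughput_gb_per_s(clock_hz, cycles_per_byte):
    """Bytes/s = clock / cycles-per-byte, shown here in GB/s."""
    return clock_hz / cycles_per_byte / 1e9

print(throughput_gb_per_s(4e9, 4.0))  # ~1.0 GB/s  (SHA1 at ~4 cycles/B)
print(throughput_gb_per_s(4e9, 8.0))  # ~0.5 GB/s  (SHA256 at ~8 cycles/B)
```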
This is what I have really been looking forward to. I theorycrafted a more ideal system for the genetics work a former employer was doing, but didn't get to build it until after I had left there. A quad 16-core Opteron system for a total of 64 cores (for physics calculations in COMSOL). I think that there is more potential use for high actual core count servers than many people realize, so I can't wait to build one. (For my purposes these days it's as a game server in a colo; one of my projects is a multiplayer UE4 game.)
At the previous job where I built the 64-core system, I even emailed the AMD marketing department to see if we could do some PR campaign together, but I think it was too soon before the Naples drop, because I never got a response. Here's to hoping Supermicro does a 4 CPU board for this... 128 cores would be amazing. (But I'll take 64 Naples cores as long as it gets rid of the bugs and issues I found with the Opterons.)
Out of curiosity, I thought that genetics was the domain of gpus?
I did sequence-based bioinformatics back around 2006 or so.
Very few of the operations used GPU. Things may have changed since I was working there, but the work at the time wasn't suited for a GPU architecture.
Initial step was sequence cleanup, which is a hidden Markov model executed over a collection of sequences of varying length, so hard to parallelize. Sequence annotation is embarrassingly parallel on a per-library basis (each sequence can be annotated independently of the others), but the computational work is fuzzy string matching, which is once again hard to GPU-ize. Another major computational job was contig assembly, which is somewhat parallelizable (pairwise sequence comparisons), but once again involves fuzzy string matching, so not GPU-izable.
So that's just sequence genetics. Don't know if GPUs are used in other areas.
Lots of cores, lots of threads, and lots of main memory. That was the key.
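To make the "fuzzy string matching is hard to GPU-ize" point concrete: the core of most alignment-style tools is a dynamic program like the one below, where every cell depends on its left, upper and diagonal neighbours, so the available parallelism is narrow wavefronts rather than the wide uniform batches GPUs like (a generic Levenshtein sketch, not any particular aligner):

```python
def edit_distance(a, b):
    """Classic Levenshtein DP: cell (i, j) depends on (i-1, j), (i, j-1)
    and (i-1, j-1), so cells can only be filled wavefront by wavefront."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

print(edit_distance("ACGTACGT", "ACGTTCGA"))  # 2
```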
"Lots of cores, lots of threads, and lots of main memory. That was the key."
Very much this. Which is why I ended up theorycrafting that the AMD many core CPU's would be so useful.
And still is ;) Partly because some key workloads just did not run well on GPUs due to lack of addressable memory. Lots of Amdahl's law getting in the way. Some of the key use cases required stupendously large memory machines (genome assembly using only short reads).
Then a lot of code is very branchy but massively parallel, which makes clusters of pure CPUs more flexible, which is important in research settings, and gives higher utilization than mixed CPU/GPU clusters.
GPU code takes longer to get to market and requires more specialized skills than standard CPU-oriented programming. Late to market means you miss a whole wave of experimental methods from the lab, i.e. GPU short-read aligners came when long reads started to come out of the sequencing lab, leading people to stop doing short reads, or at least stop doing pure short reads.
Secondly, quite a bunch of the key staff at the large research institutes had been burned by previous hardware acceleration attempts and were not going to throw money at it until market proven.
Bioinformatics tends to be cutting edge (the hemorrhaging kind) on the bio/lab tech side, yet the production IT tends to balance that by doing the things we know, as we already have enough risks, i.e. focusing on the algorithms and robustness, not on pure power.
Hmmm, isn't deep learning starting to pick up for genetics? No idea if it actually is, but everyone in DL seems to be talking about it, so I thought I'd ask someone actually in bioinformatics :)
I wish I could say, but it's been a good two+ years since I left the genetics company, so I've been in a different industry for a while. I would say there's probably plenty of room, if people start taking more novel approaches that use more data, e.g. full microbiome analysis. Also, I was just a sysadmin, so I don't really know anything other than keeping systems running, so take what I say with a grain of salt.
I suspect DL will have a limited to modest role in the actual search / alignment part, and a lot more to do with the analysis part. This includes medical diagnosis, identifying regulatory patterns based on high throughput expression data, such stuff.
Not necessarily in comp. genetics / sequencing.. / the DNA stuff..
Xeon Phi tried to crack this nut and seems to have mostly failed so far.
I think there is plenty of room for GPU usage in bioinformatics, but there are some barriers that prevent it from gaining prevalence, such as cost vs CPUs, and lack of updates (for example, gpu-blast is still 1.1, blast is 1.2).
For most of the really time consuming steps the speedup isn't spectacular, 1.6x is not worth the effort
http://ce-publications.et.tudelft.nl/publications/1520_gpuac...
It's quite rare to find GPUs being used in genetics.
Is that because the workloads are fundamentally unsuitable for current GPU architectures or because no one has taken a good stab at it yet?
I know very little about computation genetics/biology but it sounds interesting.
AFAIK probably a bit of both. A majority of genetics/biology workloads are I/O bound (mapping, blast, etc) and/or require a lot of memory (i.e. de novo assembly of genome)
On the other hand many of the bioinformatics software solve a specific scientific question and usually are written by people with mostly non-computational background. They use higher level languages such as Python/Perl/R and people often don't have the expertise or time to implement them for GPUs.
However, now that machine learning and deep neural network approaches are being picked up by the field, the workloads might change, and there are also frameworks that make it easier to leverage GPUs (TensorFlow, etc.)
> They use higher level languages such as Python/Perl/R and people often don't have the expertise or time to implement them for GPUs.
That's an interesting thought, has anyone ever attempted to get 'regular' programmers interested in this stuff as a 'game'/code golf kind of thing?
(Too many) Years ago, one of the programming channels I was active in got distracted for 3 weeks while everyone tried to come up with the fastest way to search a 10Mb random string for a substring, not in the theoretical sense but in the actual how-fast-can-this-be-done sense. That was the point I found out that Delphi (which was my tool of choice at the time) had relatively slow string functions in its 'standard' library, and I ended up writing KMP in assembly or something equally insane. I got my ass handed to me by someone who'd written a bunch of books on C, but eh, it was damn good fun. It was also one of the first realizations I had of just how fast machines (back then) had gotten and just how slow 'standard' (but very flexible) libraries could be.
Obviously the total scope of re-writing researchers code would probably be far far beyond that but if they could define the parts they know are slow with their code and some sample data I know a few programmers who would find that an interesting challenge.
Thanks for the response.
I don't think it is because no one has tried it as much as the fact that the workloads need the CPU architecture / are not easily parallelizable (as far as I understand). Comp bio in genetics is largely sequence alignment & search, which is still largely CPU / memory bound; but I don't understand programming enough to speculate whether developments in algorithms will allow GPUs to be used, because the problem itself is not parallelizable. I think of it as the difference between a supercomputer & a cluster..
(More than a decade ago, I struggled to / barely succeeded in building a Beowulf cluster; I am just amazed at how far both the hardware & the software tools have come..)
In other areas of comp bio though, GPUs I think are finding use. Protein folding, molecular dynamics. Also, with STORM & such: super resolution microscopy? I think increasingly, gpus will become important.
Also, whole cell simulations?
What you wrote about supercomputer vs cluster is quite right. Recently I attended an HPC meeting where we were the only DevOps of an HPC for a biological institute and most of the other people were from physics & chemistry. They usually don't consider the biology workloads as High Performance Computing but as big resource/data computing. The physics & chemistry guys run simulations using hundreds of thousands of cores and are mostly CPU bound. They use MPI and their nodes typically have not more than 64 GB, and they consider 120 GB memory usage as a lot. Biologists on the other hand hardly use MPI because they can just parallelize the workload on the data level (i.e. sample or chromosome) and run the jobs independently on each node. For that reason, high-memory NUMA machines from SGI can also relatively often be found.
You are also right that some of the comp bio areas (CryoEM, protein folding, molecular dynamics) are well suited for GPUs
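The data-level parallelism described above usually looks no fancier than this (a schematic sketch; process_chromosome is a hypothetical stand-in for whatever per-chromosome tool you'd actually run):

```python
from multiprocessing import Pool

def process_chromosome(chrom):
    """Placeholder for a per-chromosome job (alignment, variant calling, ...)."""
    return chrom, f"results for {chrom}"

if __name__ == "__main__":
    chromosomes = [f"chr{i}" for i in range(1, 23)] + ["chrX", "chrY"]
    # No MPI, no inter-node communication: each chromosome is an independent
    # task, so a process pool (or one cluster job per chromosome) is all you need.
    with Pool(processes=8) as pool:
        for chrom, result in pool.imap_unordered(process_chromosome, chromosomes):
            print(chrom, result)
```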
Thank you for your response, it was extremely interesting.
One of the nice things about HN is you get to look outside your own bubble (I mostly do Line of Business/SME stuff so this stuff isn't just outside my wheelhouse it's on the other side of the ocean).
FPGA type applications will probably pay way bigger dividends than GPU acceleration ever will.
GPUs excel at problems where you can apply exactly the same logic to lots of data in parallel. CPUs can handle branching cases, where each operation requires a lot of decisions, a lot better.
Sufficiently large FPGA chips could accelerate certain parts of the workflow, if not the whole thing, since they're extremely good at branching in parallel. This is why early FPGA Bitcoin implementations blew the doors off of any GPU solution, each round of the SHA hashing process can be run in parallel on sequentially ordered data if you organize it correctly.
I've heard that annually for a decade or so.
FPGAs run hot, don't have many transistors, have limited clock rates, and are a pain to program.
So yeah a "Sufficiently large" chip, a "sufficiently fast clock", and a "sufficiently well written app" could theoretically do well. Problem is in the real world they aren't and developers aren't targeting them.
CPUs and GPUs are a pain to program if you don't have the right tools. If it's tooling that's the huge impediment then maybe Intel's acquisition and (hopefully) tool realignment will help.
That the FPGAs use this proprietary and, for all intents and purposes, opaque binary format is not very helpful and is probably the biggest barrier.
In the past several years, quite a few developers have tried GPUs for analyzing bio-sequences, but found the speedup is modest. Good GPUs are expensive. It is usually better to put that money on CPUs or RAM.
//OT::
Your user name: a fan of the cre-lox system, or the enzyme itself?
Cool uid!
In my past life, I've used flp/frt & cre/lox; and studied mismatch repair enzymes. And topoisomerases.. :)
Wouldn't Xeon phi be a good choice for that kind of usage?
I'm looking forward to the benchmarks since the performance per watt of the desktop parts (Ryzen R7) seems to be really good. Quite curious how it will compare against Skylake-EP.
A quote from an anandtech forum post [0] is promising:
"850 points in Cinebench 15 at 30W is quite telling. Or not telling, but absolutely massive. Zeppelin can reach absolutely monstrous and unseen levels of efficiency, as long as it operates within its ideal frequency range."
A comparison against a Xeon D at 30W would be interesting.
The possibility of this monster maybe coming out sometime in the future is also quite nice: http://www.computermachines.org/joe/publications/pdfs/hpca20...
[0] https://forums.anandtech.com/threads/ryzen-strictly-technica...
The important thing here, from my perspective, is how NUMA-ish a single socket configuration will be. According to the article, a single package is actually made up of 4 dies, each with its own memory (and presumably cache hierarchy, etc). While trivially parallelizable workloads (like HPC benchmarks) scale quite well regardless of system topology, not all workloads do so. And teaching kernel schedulers about 2 levels of numa affinity may not be trivial.
With that said, I'm looking forward to these systems.
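For reference, keeping a process inside one NUMA domain from userspace is already easy even when the kernel scheduler doesn't get it right; a minimal Linux sketch (the CPU range is hypothetical, check your own topology with `numactl --hardware`):

```python
import os

# Hypothetical topology: node 0 owns logical CPUs 0-7, node 1 owns 8-15.
NODE0_CPUS = set(range(0, 8))

# Restrict this process (pid 0 = self) and its future threads to node 0's CPUs;
# with the default local allocation policy, first-touch memory then tends to
# land on the same node, keeping traffic on that die.
os.sched_setaffinity(0, NODE0_CPUS)
print("now running on CPUs:", sorted(os.sched_getaffinity(0)))
```

With two levels of affinity (die within package, package within system), the same idea applies, just with more, smaller CPU sets.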
Intel's largest CPUs are already explicitly NUMA on a single socket. They call it Cluster On Die: http://images.anandtech.com/doci/10401/03%20-%20Architectura...
Very true, I should have mentioned that. At least for us, COD doesn't seem to impact our performance at all, while NUMA does. I'm hoping that Naples is the same for us.
However, there is an important difference. AMD seems to be putting multiple dies into the same package, whereas Intel seems to have (as the Cluster on Die name implies) everything on the same die. So my fear is that the interconnect between dies may not be fast enough to paper-over our NUMA weaknesses.
Sounds like your application is latency sensitive, and not bandwidth sensitive, take a look at the graphs towards the end of this article:
https://www.starwindsoftware.com/blog/numa-and-cluster-on-di...
There's not much difference in memory bandwidth between crossing domains on the same die (COD) vs crossing domains system wide (accessing memory for a different socket). What kind of computation are you running?
I'm talking about Netflix CDN servers. The workload is primarily file serving. The twist is that we use a non-NUMA aware OS (FreeBSD).
We're not latency sensitive at all. The problem we run into with NUMA is that we totally saturate QPI due to FreeBSD's lack of NUMA awareness.
The results you link to don't match with what we've seen on our HCC Broadwell CPUs, at least with COD disabled. Though we only really look at aggregate system bandwidth, so potentially the slowness accessing the "far" memory on the same socket is latency driven, and falls away in aggregate.
Sorry, my google-fu isn't on point today; what's the difference between 1P and 1U, or 2P and 2U? My nomenclature knowledge is lacking ...
P = Processor and S = Socket (they're pretty interchangeable). U = rack Unit https://en.wikipedia.org/wiki/Rack_unit
n-P / n-S / n-way = how many sockets/processors a system has. A 1S system has one socket / processor, a 2S system two, a 4S four and so on.
x U (or x HE, if you're talking with a German manufacturer, they like to make that mistake ... ;) are rack-units, i.e. how large the case is.
The title should be re-written to say "single and dual socket" not "1P and 2P".
Nice. This is the more interesting market for AMD rather than the gaming market in my opinion. 128 PCIe lanes and up to 4TB of ram will be awesome.
Gaming? More like the consumer market; Ryzen 7 is definitely not suited for gamers, and advertising it as such was IMO a mistake. Nevertheless, Naples can be a big innovation in the server segment.
Also, what about ECC? Does Ryzen support it or not?
"Ryzen 7 is definitely not suited for gamers"
The underperformance in gaming was tracked down to software issues according to AMD. Namely:
- bugs in the Windows process scheduler (scheduling 2 threads on same core, and moving threads across CPU complexes which loses all L3 cache data since each CCX has its own cache)
- buggy BIOS accidentally disabling Boost or the High Performance mode (feature that lets the processor adjust voltage and clock every 1 ms instead of every 40 ms.)
- games containing Intel-optimized code
More info: http://wccftech.com/amd-ryzen-launch-aftermath-gaming-perfor...
Furthermore hardcore gamers usually play at 1440p or higher in which case there is no difference in perf between Intel or AMD, as demonstrated by the many benchmarks (because the GPU is always the bottleneck at such high resolutions.)
"Hardcore" is different from "hardware enthusiast".
Hardcore is that guy who plays Call of Duty 24x7 on his Xbox 360 and mediocre 720p television. You can't deny the determination or enthusiasm. Hardware's irrelevant.
> bugs in the Windows process scheduler
Blaming Windows is just a desperate excuse from AMD to justify its lack of performance. Don't be tricked by that.
It's possible, and rather common, that there are motherboard issues on the first generation of motherboards, which again is not a valid excuse but a bad thing that desperately needs fixing from AMD, and a sign that it's still in a testing phase.
Were you around for when bulldozer came out? There were huge problems with Windows task scheduling that were later fixed with updates.
Or when Intel HT first appeared. Or when Intel HT reappeared. Or when the first dual core appeared. Every time Windows needed updates to perform properly; Linux also needed patches to adjust scheduling for Zen and also received patches in many other instances.
This is nothing new or outstanding at all.
The scheduling decisions of Windows are not unknowable. It was entirely AMD's call to make a CPU that was effectively hyperthreaded but to still mark the cores as fully independent.
A Google search indicates that this is not accurate.
Not being the top single-threaded performer which is required to push many many hundreds of frames per second != "not suited for gamers". Games in general are more likely to be GPU-bound!! Intel's quad cores are only really required for the pro Counter-Strike players who want 600fps at 1080p just to get the absolute latest frame.
BTW they advertised it as good for gaming + streaming (h264 CPU encoding at the same time on the same machine). And "content creation", which pretty much always means video editing.
IIRC Ryzen supports unbuffered ECC if the mainboard supports it.
> Intel's quad cores are only really required for the pro Counter-Strike players who want 600fps at 1080p just to get the absolute latest frame.
The source engine isn't exactly the pinnacle of engine development.
It doesn't really know what to do with more than 2-ish cores, so you probably get more FPS by using a dual core instead of a quad core, since dual cores tend to go farther in terms of overclocking.
Pretty much all games are CPU intensive and it's not getting better.
Try running on a cheap i3 from a few years ago and you'll understand your pain quickly.
AdoredTV has a pretty objective video on this subject[1]. TL;DW he expects it to move past Intel perf in the future - based on how an older AMD chip is now beating a then-better Intel chip.
My opinion: if Microsoft is able to pivot the Scorpio over to Ryzen (or indeed, any CPU with more than 4C/8T) it will drastically alter the lowest common denominator in terms of what game developers target - i.e. we'll see games moving towards more modern threading architectures (e.g. futures/jobs as per Star Citizen, which more thoroughly exploit CPU resources).
Furthermore, there is hearsay evidence that supports AMD's claims. Ashes of the Singularity currently runs better on Intel but the developers claim:
[2]> Oxide games is incredibly excited with what we are seeing from the Ryzen CPU. Using our Nitrous game engine, we are working to scale our existing and future game title performance to take full advantage of Ryzen and its 8-core, 16-thread architecture, and the results thus far are impressive.
In addition to that, if you look at the CPU usage/saturation alongside the benchmarks (13:08 in [1]) it's strikingly obvious that the CPU is not the bottleneck - Intel is upwards of 90% on all cores while the Ryzen hovers around ~60%. I'm holding my credit card close until the aforementioned optimizations and rumored bios patches land, but I'm willing to give AMD a little benefit of the doubt - what we're seeing largely matches what they are saying.
[1]: https://youtu.be/ylvdSnEbL50 [2]: http://wccftech.com/amd-ryzen-launch-aftermath-gaming-perfor...
Sorry, but for years the mantra was for a gaming PC to invest in an i5 or even an i3 and spend the extra money on a good GPU. But for some bizarre reason suddenly everything that does not perform like an i7 7700K is a "bad cpu for gaming". It's ridiculous.
Hell in 30 million households there are 8 jaguar x86 core gaming machines active now with an IPC that is probably (I assume) atrocious.
I built my i7 4770 4 years ago and the sad part is that it will probably still take a long time for it to become a bottleneck in 90% of games.
If you check Digital Foundry's excellent i5 vs i7 benchmarks, if building a machine to game on you want 8 threads minimum today. Times have indeed changed on the old i5 recommendation.
That said, completely tangential to what you're saying: Ryzen may (at worst) perform like an i5 in gaming, but it has more than 8 threads. I do everything with my machine and am going with an R7 1700 overclocked.
I use an FX8730E and the only bottleneck that I have is my old GTX 660 GPU.
CPU intensive != single threaded
It's just as suited for gaming as it is for anything else. The problem is everyone expected all games to run buttery smooth on day one with no hiccups. Ryzen specific game engine optimisations are coming according to AMD, as well as a Windows 10 scheduler patch. There are also other issues on the motherboard/BIOS side which manufacturers are working on.
You can't say this enough to people, they just don't want to get it. New CPU arch, new platform, beta bios, new node, no OS scheduler patch, no engine optimizations. Considering all this, the performance is truly amazing.
1. Most of the benchmarks are not even compiled or made with Zen optimization in mind. But the results are already promising, or even surprising.
2. Compared to the Desktop / Windows ecosystem, there is much more open source software on the server side, along with the usual open source compilers. Which means any AMD Zen optimization will be far easier to deploy compared to games and apps on desktop coded and compiled with Intel / ICC.
3. The sweet spot for Server Memory is still at 16GB DIMMs. A 256GB Memory for your caching needs or In-memory Database will now be much cheaper.
4. When are we going to get much cheaper 128GB DIMMs? Fitting 2TB of memory per socket, and 4TB per U, along with 128 lanes for NVMe SSD storage, the definition of Big Data just grew a little bigger.
5. Between now and 2020, the roadmap has Zen+ and 7nm. Along with PCI-E 4.0. I am very excited!
> 5. Between now and 2020, the roadmap has Zen+ and 7nm. Along with PCI-E 4.0. I am very excited!
Yes, and it's rumored that the top end 7nm chip will be 48 cores (codename starship). Exciting times ahead now that the competition is back.
In previous threads there was discussion about Intel processors, specifically Skylake (which is a desktop processor), being superior for server workloads involving vectorization.
How will Naples fare on this front?
That front remains to be seen. However, with 128 lanes and 8-channel RAM, it will make a mess of Intel in the VM hosting arena.
I'm glad I don't own any Intel stock atm :)
The VM hosting arena is exactly where cloud providers play.
A high core count, energy efficient CPU with IO out the wazoo?
I'm happy I bought AMD stock over the summer (:
Outside of specialized workloads, not a lot of software is vectorized. Maybe your database server can take advantage, but your application server will probably not benefit one bit.
Desktop Skylake doesn't support AVX-512. Server Skylake will, when it ships. (The Xeon E3 v5 doesn't, because it's the same chip as desktop Skylake.)
Removed the incorrect information from my post. Thanks for the correction.
Naples might not fare well, but AMD is betting on vector operations being offloaded to a GPU-like accelerator connected via Infinity Fabric.
Badly, but it doesn't matter because it's still just a tiny portion of the market.
I've long been advocating for a high I/O CPU with plenty of PCIe lanes. 128 lanes will support 8 GPUs at max bandwidth. AMD has positioned itself well.
How well does, say, Postgres scale on such hardware? Is anything more than 8 cores overkill or can we assume good linear increases in queries per second...
Depends on your queries. I am looking at a server right now that uses 80% of 32 cores with Postgres 9.6. It's doing lots of upserts and small selects. Averages 76k transactions per second. I think it could easily take advantage of a 64 core system.
The main scalability issue I have with Postgres is its horrible layout of data pages on disk. You can't order rows to be laid out on disk according to primary key. You can CLUSTER the table every now and then but that's not really practical for most production loads.
I think I saw a proposal recently for something that would cover this use case. IIRC it was for an index organized table that stores the entire contents in a btree (so it would naturally be stored in primary key order).
I don't think there's been any work on it yet though.
The main utility of CLUSTER is to mark a table as sorted - that is, few disc seeks will be required on an index-scan of the table. This is important when doing (for example) a merge join with another table, or just when you are requesting a large proportion of the table in sorted order. Postgres knows enough statistics about the order of entries in the table to know that it can read the table faster in order using the index with the occasional seek for recently added elements than if it was to do a sequential scan and re-sort the table in memory (spilling to disc if necessary).
A B-tree can in no terms be described as being laid out on disc in primary key order. The individual pages of the tree are placed on disc randomly, as they are allocated. Therefore an index scan won't return the rows in index order as quickly as the current scheme of having the rows separate from the index and sorting them every now and again.
Ultimately, for the goal of fast in-order scan of a table while adding/removing rows, you need the rows to be laid out on disc in that order, so that a sequential scan of the disc can be performed with few seeks. This requires that inserted rows are actually inserted in the space they should be, which is not always possible - often there isn't space in the page, and you don't want to spend lots of time shifting the rest of the rows rightwards a little bit to make space. To a certain extent Postgres already does insert in the right place if there is space in the right disc page (from deleted rows), but because this is not always possible, the solution is to re-CLUSTER the table every now and again.
I think the Postgres way is actually very well thought out.
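For anyone following along, the periodic re-CLUSTER described above is essentially a one-liner; the pain point is the exclusive lock it holds while rewriting the table. A sketch using psycopg2, with made-up table and index names:

```python
import psycopg2

# Hypothetical names: CLUSTER rewrites "orders" in "orders_pkey" order and
# takes an ACCESS EXCLUSIVE lock for the duration, which is why running it
# regularly is awkward on busy production tables.
conn = psycopg2.connect("dbname=mydb")
conn.autocommit = True
with conn.cursor() as cur:
    cur.execute("CLUSTER orders USING orders_pkey;")
    cur.execute("ANALYZE orders;")  # refresh planner stats, incl. correlation
conn.close()
```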
Have you looked into pg_repack [1]? It's a PostgreSQL extension that can CLUSTER online, without holding an exclusive lock. I haven't used it, but it looks interesting as an alternative to the built-in CLUSTER.
This is from 2012: http://rhaas.blogspot.com/2012/04/did-i-say-32-cores-how-abo...
My guess is the 1 socket option scales great. 2 sockets are less than ideal, and you will not double the 1 socket performance.
Yeah, every single release since then has had work done on scaling Postgres up on a single machine. They have been working on eliminating bottlenecks one after another to allow it to scale on crazy numbers of cores.
I've seen benchmarks on the -hackers mailing list with 88 core Intel servers (4s 22c) in regard to eliminating bottlenecks when you have that many cores. So even if it's not 100% there yet, it will be soon.
I'd like to see this data on Postgres scaling updated, with more info on the write scaling as well. (the chart appears to be for SELECT queries only)
The other change is now a single select can use multiple cores, so you could see how that scaled to 32, 64, 128 cores...
On highly concurrent PG systems, when using parallel queries you are sacrificing throughput for better latency. You really don't want to use more than a fairly small number of workers per single select.
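A sketch of what that tuning looks like in practice, using the 9.6+ max_parallel_workers_per_gather setting (the value and table name are just illustrative):

```python
import psycopg2

conn = psycopg2.connect("dbname=mydb")
with conn, conn.cursor() as cur:
    # Cap parallelism for this session: on a busy, highly concurrent box a
    # couple of workers per Gather is usually plenty; more mostly steals
    # throughput from other backends.
    cur.execute("SET max_parallel_workers_per_gather = 2;")
    cur.execute("SELECT count(*) FROM big_table;")  # hypothetical table
    print(cur.fetchone())
conn.close()
```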
Outside of some very exotic scenarios you are IOPS bound on writes and not CPU bound.
Is that still a problem with cheap NVMe drives that can do 500k IOPS?
Not a problem at all if you provision your server properly so you aren't IOPS bound, though there are plenty of databases which simply won't fit on NVMe drives, so in those cases, yes it could still be a problem.
Not likely to be a problem if you have FusionIO drives. Likely to be a problem sooner or later on everything else. A definite recurrent issue on all cloud providers and NAS/network drives.
Nope but everyone is running in the "cloud" and there you are lucky to get 50K IOPS
This is one of the reasons I don't run in the cloud.
same here
I've seen decent enough scaling on 8 socket servers. There's still some bottlenecks if you have a lot of very short queries (because there's some shared state manipulations per query & transaction), but in a lot of cases postgres scales quite well even at those sizes.
If they have a much better performance/$ than Intel, which they likely will have, it sounds like a good opportunity for AWS to significantly undercut Microsoft and Google (which recently bragged about purchasing expensive Skylake-E chips).
There's opportunity cost to consider. Google has Skylake-E now which is not even available at retail yet.
Well, it also seems that Intel prioritized its customers. If I were Amazon or Microsoft (the rumors said Google and Facebook were the priority customers), I would get Naples just to spite Intel (it doesn't hurt that AMD's Naples likely offers better perf/$, too, though):
https://semiaccurate.com/2016/11/17/intel-preferentially-off...
Really dumb move by Intel; this is what happens when a company becomes too arrogant. They knew about Zen but probably just laughed it off as nothing. At the high level business decisions, things like this matter just as much as technical details like performance/$. Hard to believe they were dumb enough to piss off Amazon AWS, MS Azure, and others.
This is the first I'm reading about the 32 cores being 4 dies on a package - Not sure how well that will work out in practice. IBM does something similar with Power servers where 2 dies on a package are used for lower end chips.
Basically, using multiple dies increases latency significantly between the cores on different dies. This will affect performance. I will not judge till I see the benchmark though :-)
With how big these chips are getting, I wonder if the next iteration will have an HBM last-level cache on chip.
That's the old EHP concept.
http://wccftech.com/amd-exascale-heterogeneous-processor-ehp...
I'd like to have that in the old project quantum package: http://wccftech.com/amd-project-quantum-not-dead-zen-cpu-veg...
That would be a TFLOPS level supercomputer on your desk.
Here is the newest PDF about something like that: http://www.computermachines.org/joe/publications/pdfs/hpca20...
"IBM did it first"
Well not with HBM (which is DRAM), but huge amounts of L3 SRAM on a MCM... POWER5 I believe.
Just having one of those in a workstation gets me all warm and fuzzy.
I think Naples will be a very serious threat to Intel in the server market. As Ryzen benchmarks & reviews have shown, Zen really shines in heavy-multithreaded applications. The typical workload of a server.
Though I am kind of worried concerning memory access. Latency penalties when accessing non-local memory are very high on Zen CPUs due to the multi-die architecture design.
Does that mean we will finally see some serious interest in Shared-Nothing design and alike in the future ?
Semi-ironically this looks like just the thing to use in a supercomputer controlling a good number of NVidia GPUs.
Was thinking the same thing. Like the CPU market, it's good to have competition with GPUs, but it would be interesting if Nvidia picked up / partnered with AMD. Oh well, let's see how OpenPOWER pans out.
What would be really interesting is if AMD CPUs and GPUs both support their Infinity Fabric concept. Heterogeneous systems with high-performance direct memory access are a huge deal.
This is a multi-chip module (MCM). Are the high core-count Xeons now all single die? Will be interesting to see what impact the MCM approach has on benchmarks, as I suppose it could have a latency impact in certain use cases?
In other words, we have a faster server chip coming
This is when things will get interesting. Ryzen appears to do better with hot and server workloads than gaming.
Should read HPC instead of "hot"
can anyone chime in as to why use PCIe over something more core to core direct? As I understand it, the CPU still needs to talk to a PCIe host/bridge controller. Why not have something that is more direct between processors?
Hypertransport is an AMD technology that's high bandwidth per line, low latency, and scalable. It's also cache-coherent (well there's a version that is), so it's great for connecting CPUs. But the AMD hardware is flexible and can use the same pins for either.
So the single socket systems can have more pci-e lanes available, but the dual socket has less per socket because some of those lanes are used for hypertransport.
What I can't figure out is why Intel and AMD aren't using similar (Hypertransport for AMD and QPI for intel) to connect directly to GPUs in a cache coherent way. These days the faster interconnects spend a decent fraction of their latency just getting across the PCI-e bus twice.
So 100 Gbit networks, Infiniband, GPUs, etc all could take advantage of a lower latency cache coherent interface, but it's not available.
I suspect mainly because qpi and hypertransport are incompatible and pci-e is good enough for the high volume cases.
Well, AMD is one of the founding members of OpenCAPI, http://opencapi.org/ , so I guess there's some hope. It seems they haven't talked about it wrt Zen/Naples, maybe some later iteration will have it?
Licensing Windows 2016 Datacenter would cost a fortune for the 2P server.
I'm wondering how they will stack up against XeonPhi.
How feasible will a Naples desktop build be?