Using Intel’s Xeon Phi for Brain Research Visualization
The lowest-priced Xeon Phi in this generation is $2,348 (1.3 GHz, 64 cores) - I can't help but feel Intel would do well to introduce an enthusiast product into the lineup. Even 1.0 GHz, 48 cores for $1,000.
They're Tesla-priced without an equivalent desktop gamer graphics card, and that means you can't just dip your toe into the water; you've got to buy the canoe up front.
Programming on a normal x86 doesn't really count, because there's no way to get a feel for what is fast and slow when you're using a monster of a core capable of running your poor code more quickly than it deserves.
I agree with you completely. One other thing that I think Intel could/should do is to cooperate with one of the major cloud providers to offer reasonably priced by-the-hour remote access.
There is one wonderful opportunity, though, that deserves to be better known. Intel has sponsored Colfax Research to offer free online introductory courses, which include two weeks of remote access. The next session begins August 29th: http://colfaxresearch.com/how-16-08/
(I'm unaffiliated, but enjoyed the course a few months ago.)
As someone who has programmed both Phis and conventional x86 CPUs, I can confirm that the Phi is more sensitive to data traversal order and to NUMA effects (which core accesses which memory). Also, the latest generation (Knights Landing) has much better-performing cores than the previous generation.
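(To make "traversal order" concrete, it's roughly the difference between the two loops in this toy sketch - not the code being described above, just an illustration; the stride-n version hurts far more on a Phi core than on a big out-of-order Xeon core.)

    #include <cstddef>
    #include <vector>

    // Toy illustration of traversal order on a row-major n x n matrix.
    // The first loop walks memory contiguously; the second strides by n
    // and thrashes the caches/TLB.
    double sum_unit_stride(const std::vector<double>& a, std::size_t n) {
        double s = 0.0;
        for (std::size_t i = 0; i < n; ++i)      // unit-stride inner loop
            for (std::size_t j = 0; j < n; ++j)
                s += a[i * n + j];
        return s;
    }

    double sum_strided(const std::vector<double>& a, std::size_t n) {
        double s = 0.0;
        for (std::size_t j = 0; j < n; ++j)      // stride-n inner loop
            for (std::size_t i = 0; i < n; ++i)
                s += a[i * n + j];
        return s;
    }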
Well, if it weren't the case and you had that core count without compromise, Phis would have come along a lot sooner with a price tag to match :)
Did you happen to use the Knights Corner or the new Knights Landing variant? I'd be quite interested to know how KNL stacks up, as naively from the specs it seems like it should be a lot more tolerant of unoptimized code (but not of poor memory access patterns).
Both, and it agrees with your prediction. KNL's cores are each much faster than KNC's cores. A KNC core was over 10X slower than a mainstream CPU core, and a KNL core seems to only be about 4.5X slower (on my particular code). I also get linear OpenMP scaling from 1 to 64 threads on KNL, so the parallelism is all there.
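(A generic sketch of how that kind of OpenMP scaling measurement looks - not the code being discussed, just a minimal triad-style probe you'd run with OMP_NUM_THREADS=1,2,...,64 and compare wall-clock times.)

    #include <cstddef>
    #include <cstdio>
    #include <vector>
    #include <omp.h>

    int main() {
        const std::size_t n = 1u << 26;          // ~64M doubles, ~512 MB per vector
        std::vector<double> x(n, 1.0), y(n, 2.0);

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (std::size_t i = 0; i < n; ++i)
            y[i] += 3.0 * x[i];                  // simple bandwidth-ish update
        double t1 = omp_get_wtime();

        std::printf("%d threads: %.3f s\n", omp_get_max_threads(), t1 - t0);
        return 0;
    }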
Some questions out of curiosity: Is your application bandwidth-bound / compute-bound or something else? Also what modes have you been operating the KNL chip in?
Are you using the socketed version or the PCIe version?
I agree - but there appears to be a more significant shift underpinning this. I suspect that we are beginning to see an architectural divergence between server and client.
This reverses the last 20 years, during which Intel made inroads into the datacenter and there were few fundamental differences between Xeons and their desktop brethren (the i5/i7 etc.). Intel will have vastly different ISAs on server and client this coming generation (desktop is not getting AVX-512). I suspect the storage layer will get bifurcated as well, since it's unclear whether clients will see much benefit from things like XPoint. In short, on the client side the only tangible gains that seem to matter of late are: will the hardware change improve battery life, will it enable thinner form factors, and will it make a browser run measurably faster? I watch with great interest how Intel will push adoption of hardware features on the client going forward.
It is rather silly for the Phi to be positioned as "it's just like x86, oh wait, except for needing to use special SIMD instructions to get max performance". Kind of like Atom being x86 for ultra-mobile platforms, just not being able to match the power/performance of ARM. Once you start sacrificing things to maintain x86 compatibility, you really lose its benefits.
You really do need these changes to get max parallelism, though. Where it shines is in situations where you'd otherwise be porting to a GPU. On the Phi it's a recompile and adding a few intrinsics to your inner loops. This is much faster than getting reasonable performance on a heterogeneous architecture, and you don't have to micro-manage the slow PCIe link between the CPU and the GPU.
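(For a flavour of what "a few intrinsics in your inner loops" means in practice - a hedged sketch, assuming AVX-512F and a compiler flag like icc's -xMIC-AVX512 or gcc's -mavx512f.)

    #include <cstddef>
    #include <immintrin.h>

    // y = a*x + y, 16 floats per iteration via AVX-512; the scalar tail
    // handles whatever a plain recompile would have handled anyway.
    void saxpy_avx512(float a, const float* x, float* y, std::size_t n) {
        const __m512 va = _mm512_set1_ps(a);
        std::size_t i = 0;
        for (; i + 16 <= n; i += 16) {
            __m512 vx = _mm512_loadu_ps(x + i);
            __m512 vy = _mm512_loadu_ps(y + i);
            _mm512_storeu_ps(y + i, _mm512_fmadd_ps(va, vx, vy));
        }
        for (; i < n; ++i)                       // scalar remainder
            y[i] = a * x[i] + y[i];
    }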
Programming a modern Xeon x86 does count. Modernising your software for a Haswell/Skylake server Xeon (the ISA has been made public) also modernises it for the new Xeon Phi. You have a nearly identical ISA and programming model. In other words, modernising your code to scale well on a 16C/2P Xeon system is essentially dipping your toe in for a full-blown KNL Xeon Phi.
PS. For pricing, take into account that the new-generation Xeon Phis are bootable; you do not need a host CPU to babysit them, as in the Tesla case.
Why not just use a Xeon 2697-v2 for the same price as the Phi?
It's 12 cores, so performance in an all-core situation would be about the same as this one, but on non-parallelized code it would be ~5x faster.
Memory bandwidth is important too. The Knights Landing processors have 16GB of on-chip memory that gives the cores significantly higher bandwidth than you'd get with DDR4; for some algorithms, that additional memory bandwidth makes more of an impact on runtime than raw compute performance does.
The optional 16GB L3 is on separate chips, but it's colocated inside the same package. This kind of MCM (multi-chip module) has been used in the semiconductor industry for a long time, since the 70s. Recent examples include AMD's Xenos in the Xbox 360, the Wii U CPU, and IBM POWER chips.
Nope. First, it's not L3 cache, and second, comparing 3D-stacked in-package memory (MCDRAM, HBM, HBM2) with your examples is misleading. https://en.wikipedia.org/wiki/High_Bandwidth_Memory
You can configure the near memory to be used as cache or directly addressed memory as desired. Users of existing codes will configure it as cache.
Direct addressing is the preferred configuration. Only if your existing code's working set does not fit in MCDRAM does the cache configuration make sense.
It might sound pedantic on my part, but 'it can act as cache' is very different in practice from 'It is a cache'.
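(For anyone curious what the flat-mode model looks like in code: a minimal sketch using the memkind library's hbwmalloc interface, one common way to place a hot working set in MCDRAM explicitly. Running the whole process under numactl --membind on the MCDRAM node is the zero-code alternative.)

    #include <cstddef>
    #include <cstdio>
    #include <hbwmalloc.h>    // memkind's high-bandwidth-memory allocator

    int main() {
        const std::size_t n = 100000000;         // ~800 MB working set

        // Returns 0 when a high-bandwidth NUMA node (MCDRAM in flat mode)
        // is visible; otherwise hbw_malloc falls back to ordinary DDR
        // under the default allocation policy.
        if (hbw_check_available() != 0)
            std::fprintf(stderr, "no HBM node visible; falling back to DDR\n");

        double* hot = static_cast<double*>(hbw_malloc(n * sizeof(double)));
        if (!hot) return 1;

        // ... run the bandwidth-bound kernel over `hot` ...

        hbw_free(hot);
        return 0;
    }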
Last year they sold a bunch of the 60-core ones for $200. I got one; the problem is that they run hot and need a server that can support them. I've yet to acquire a server with BAR support, so it's still sitting. :-( Anyway, they are out there for a decent price - keep your eyes open and you will find a deal.
More than just "run hot", they were the "passive" models that require external cooling. You might be interested in these 3D printed designs for the cooling:
http://www.thingiverse.com/thing:997213
http://ssrb.github.io/hpc/2015/04/17/cooling-down-the-xeon-p...
As you mention, you'd still need a motherboard with 64-bit Base Address Register support, but at least you could keep it from burning up (or more likely, shutting down when it overheats).
Nice, thanks, I'll check those out. My plan was to run it only during the winter months outside, with massive fans.
"Figure 1: Even first in-silico models show the complexity and beauty of the brain"
Man the human brain is such a narcissist
I'm imagining a new test for artificial intelligence measuring a system's capability for narcissism – the true metric of real consciousness.
Favourite comment in ages, thank you!
especially given that it is a poorly designed (if designed at all) patchwork of new features, developed with a "throw spaghetti at the wall" approach and piled on top of the old ones.
I don't know if using Xeon Phi for rendering makes that much sense. It's sort of the problem it's least competitive at solving on a raw performance, performance-per-watt, or development-cost basis.
> However, ‘smaller’ is a relative term as current visualizations can occur on a machine that contains less than a terabyte of RAM. Traditional raster-based rendering would have greatly increased the memory consumption as the convoluted shape of each neuron would require a mesh containing approximately 100,000 triangles per neuron.
That sounds like a poor approach to this problem. You could write a shader that renders thick lines for the dendrites, and the rest of the geometry can be conventional meshes. The same shader could have a pass specially designed for lines and depth of field rendering. That's the one unusual shader. It's hard, but not super hard to write. [0]
Besides, unless you need this to run in real time (which the Xeon Phi doesn't do anyway), you could just raster-render and page in the mesh data from wherever. So what if it's slow.
I think highly technical platform decisions like Xeon Phi versus NVIDIA CUDA are really about the details. You have to educate the reader both on the differences that matter and on why they should choose one over the other. The comment in the article, "no GPU dependencies," is a very PR-esque, don't-mention-your-competitor dance around what they're actually trying to say: the CUDA ecosystem can be a pain, since you can't buy the MacBook Pro with the GTX 750M easily, installing all its drivers is error-prone, SIP gets in the way of everything, Xcode and CUDA updates tend to break each other, etc. etc.
I sound like I know what I'm talking about, right? Intel's just not getting it. Show a detailed application of where Xeon Phi really excels. NVIDIA's accelerated science examples go back a decade, and some, like the accelerated grid-based Navier-Stokes fluid solvers, are still state of the art.
The competition in rendering is intense. Several production-ready renderers like Arion, Octane and mental ray (specifically iRay, NVIDIA's GPU-accelerated renderer) perform best on or are exclusive to the CUDA platform. Conversely, you probably get the most flexibility from a platform like V-Ray or RenderMan, whose support for GPU acceleration is limited. Intel's Embree has a strong presence today in baked lighting for game engines, but I think NVIDIA's OptiX is a lot faster.
> That sounds like a poor approach to this problem. You could write a shader that renders thick lines for the dendrites, and the rest of the geometry can be conventional meshes. The same shader could have a pass specially designed for lines and depth of field rendering. That's the one unusual shader. It's hard, but not super hard to write. [0]
You would be surprised how far behind medical research visualization is compared to its gaming counterpart. Most medical researchers use 5-10-year-old technological approaches they learned in their PhD program.
On a side note, I have yet to see a Phi-vs-CUDA comparison. Intel is comparing Phi to Pentiums, which is utterly ridiculous.
Here is a comparison of the previous generation: https://www.xcelerit.com/computing-benchmarks/libor/intel-xe...
They hold their own against GPGPU, but are probably the inferior choice if your code already runs on a GPU (OpenCL/CUDA).
The real advantage of the Phi is of course combining this nearly-as-good-as-GPGPU parallelism with the x86_64 toolchain and infrastructure. x86 supports more languages with more libraries, and is easier to develop for.
That's not quite fair - some of the research into volumetric medium interaction and scattering is way ahead of the VFX / Gaming fields...
> It's sort of the problem it's least competitive at solving on a raw performance, performance-per-watt, or development-cost basis.
This is not true for anything beyond running compute shaders on large 1D, 2D, or 3D buffers. Just because something is 'graphics' doesn't mean that a GPU is automatically faster.
> Production-ready renderers like Arnold, Octane and mental ray (NVIDIA's renderer) perform best or are exclusive to the CUDA platform.
Arnold is a CPU renderer, Octane is FAR from what I would consider 'production ready' and mental ray is also a software renderer. Renderman does not use any GPU acceleration.
> I sound like I know what I'm talking about, right?
Not even slightly
My bad, I wrote Arnold instead of Arion; I mix them up when writing it out. iRay is sort of a feature of mental ray, I guess, if you're being pedantic. As for Octane not being production ready: I suppose if you're used to building render farms, it isn't. It's certainly production ready for someone paying for all those licenses.
> This is not true for anything beyond running compute shaders on large 1D, 2D, or 3D buffers.
Yes, but rendering is a shader on a bunch of those buffers right? That's what I wrote. I'm not 100% confident that you can efficiently render with conventional shaders what they showed in that frame. But I think you can. You could at least cull and tessellate tubes on the GPU, if you really don't want to write a shader.
> Yes, but rendering is a shader on a bunch of those buffers right?
No. Tracing rays is fundamentally a sorting problem when dealing with the acceleration structure. Rasterizing samples means accumulating values and weights, which means either atomics or separate buffers (and if you are using the GPU, creating a buffer for every core is out of the question). You could sort the samples into buckets and rasterize those separately, but you are again faced with GPU partitioning at the very least.
There are plenty of ways to use the GPU to do all aspects of rendering, but it is not even remotely as trivial as you are making it out to be.
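(To give a flavour of the per-node work that traversal is doing - a generic slab test of the kind every BVH-based ray tracer evaluates constantly. This is a sketch, not Embree's or anyone's production code.)

    #include <algorithm>
    #include <limits>
    #include <utility>

    struct Ray  { float o[3], d[3]; };           // origin and direction
    struct AABB { float lo[3], hi[3]; };         // a BVH node's bounding box

    // Classic slab test: intersect a ray with an axis-aligned box. The
    // "sorting" flavour shows up one level above this, where traversal
    // decides which child to descend first based on these entry distances.
    bool hit(const Ray& r, const AABB& b, float& tEntry) {
        float t0 = 0.0f, t1 = std::numeric_limits<float>::max();
        for (int i = 0; i < 3; ++i) {
            float inv = 1.0f / r.d[i];           // assumes a non-degenerate direction
            float tA  = (b.lo[i] - r.o[i]) * inv;
            float tB  = (b.hi[i] - r.o[i]) * inv;
            if (tA > tB) std::swap(tA, tB);
            t0 = std::max(t0, tA);
            t1 = std::min(t1, tB);
            if (t0 > t1) return false;           // slabs don't overlap: miss
        }
        tEntry = t0;
        return true;
    }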
Drawing lines is a stupid way of doing it, as:
1. With the mess of overlapping lines they've got, you'd suffer from severe overdraw (which is where raytracing really shines in terms of efficiency) as you can't efficiently cull lines (without clipping them)
2. You wouldn't get the ambient occlusion look where lines close to each other occlude / darken.
As someone who's previously compared Embree and OptiX (and we were given free hardware and support from Nvidia), Embree stacks up really well, and a dual Xeon can match a single top-of-the-line GPU fairly easily for pure ray-intersection performance.
Once you start putting complex shaders and layered materials on top, GPUs start to really suffer: there's a reason a lot of the GPU renders are mostly being used for clean renders like archvis / product design / car renders - they're simple to render. As soon as you stick dirt layers on top, their efficiency really starts to plummet.
> stupid way of doing it
I guess it really depends on what the objective is. I'm not talking speculatively; concretely, it seems like a reasonable way to achieve a few of the images they show in the press release. They show two relatively flatly rendered lots-of-tubes images. I know SSAO isn't the same, and I get that there's overdraw, but a lot hinges on the details of the particular objective they have. In one shot, they show a lot of emissive tubes with depth of field, which is harder to achieve. I suppose if they're happy, they're happy.
> interactive performance for all datasets on a regular Intel Xeon processor, which can render images at 20-25 frames per second (FPS)
There's a big difference between interactive performance and a production-quality render. Something tells me it's not producing 25 frames of noise-free render per second. There isn't enough information here.
> and a dual Xeon can match a single top-of-the-line GPU
At what, like 3x-5x the price? At how many watts? And at what I.T. complexity? A GTX 1080, at better performance than a Titan X, is really a phenomenally good deal. Especially considering I can drop it into an existing workstation with all of my existing software installed on it; especially considering I can rent out computation time on Amazon by the hour.
I guess what I'm reacting to is how forced of an example it seems.
I think the objective is rendering a ridiculous amount of stuff - the fact they talk about "lots and lots of RAM" indicates there's no way a GPU is going to be able to render it efficiently without an aggressive culling step: GDDR5 might be very fast, but you've got to get the data onto the GPU first and probably page data as well. This is very often a significant bottleneck for GPUs, and is another reason GPUs aren't used for VFX rendering (at high-end), as 16 GB isn't anywhere near enough.
Production-quality render implies decent lighting and materials - this stuff has neither, so shading is likely to be negligible, and then you're going to be generally constrained by ray / primitive intersection performance.
No, cheaper (for the CPUs): two ~$950 CPUs vs. a $3,300 GPU. Granted, you need a dual-socket system and twice the RAM to balance it, and it's easier to stick multiple GPUs in a system than to make the jump to 4 sockets, but GPUs aren't really that much of a win... Thermal output and power usage are often worse for GPUs as well.
One of the interesting things to keep in mind is that these new Xeon Phi cards can be used as standalone CPUs, not just as PCIe cards like a GPU. This is the "self-hosted mode" the article talks about. So one can now think about comparing a lone Xeon Phi doing both jobs versus a CPU plus an NVidia GPU.
This article is too fluffy; it sounds like it had help from Intel's PR dept.
I certainly hope Phi has more advantages than the write once run anywhere / portability angle they kept pushing.
Has anyone chosen Phi for a real project that was in no way funded or subsidized by Intel?
I'm excited for Xeon Phi even with the expense of it, but Intel needs to realize that even though they dominate in x86, they need to price competitively.