A Tiny Chip That Could Disrupt Exascale Computing (2015)
nextplatform.com

Founder of REX here, and surprised to see this posted here. Happy to answer any questions, and you can check my comment history for some of my prior posts on REX.
We've had some really great progress that we hope to share in the near future, so stay tuned.
EDIT: Since this article is over a year old, we have made a lot of progress, and have recently taped out our first chip. We haven't officially posted a job opening, but we are very shortly going to be looking for software engineers that would love to work on our architecture. Feel free to shoot me an email if you're interested!
I'm genuinely curious and really like that you're giving this a shot, but I am very sceptical, as hardly any idea in computer architecture is new: if you dig enough, you'll find it has been tried and failed. You'll have to understand why it failed. If it was timing, maybe you can succeed today; if it wasn't, you'll have to understand the real cause and avoid repeating the same mistakes. It would be great to see more comparisons.
First, your claim about virtual memory in general purpose CPUs is misleading: its purpose is memory virtualization, and I wouldn't want a system without it in the presence of multiple processes (how can you trust every process not to shoot down another by accidentally accessing a wrong memory location?).
Ultimately, our hardware will become more specialized/heterogeneous, and we'll have many accelerators for various tasks, but there will likely always be a general purpose CPU at the heart of the system (that will have virtual memory, caches, etc.); for an overview, I enjoyed [1]. I see what you're building as another accelerator for inherently parallel latency-insensitive workloads (like you find in HPC). In a way, GPUs (+ Xeon Phi) cater to these markets today (benchmarks against these would be useful).
Second, I remember the previous post [2], where you claimed the system you are building relies on a RISC ISA, but now you claim it has changed to VLIW. You said yourself before "[...] stick to RISC, instead of some crazy VLIW or very long pipeline scheme. In doing this, we limit compiler complexity while still having very simple/efficient core design, and thus hopefully keeping every core's pipeline full and without hazards [...]"
What is the rationale behind this? Do you think you'll be able to manage compiler complexity now?
Any response is much appreciated!
I would be just as skeptical as you, and everyone else should be skeptical of our claims too. While I have talked informally about our architecture to many people on and offline, we have not posted much about the actual architecture we have taken to silicon (which is very close to, but not exactly, what we will eventually bring to market). I honestly don't expect any random person to take us seriously based on a couple of online postings, but I will say that most people are decently convinced (or at least intrigued enough to withhold immediate doubt) after a rundown of the architecture.
As for why we think "this time is different": it comes down to a combination of good ideas and timing. I 100% agree with you that in the 50 years of von Neumann derivatives, basically all the low hanging fruit (and much higher up) has been attempted, and thankfully I can say I've learned from a lot of it. Rather than being an entirely new concept, I think we have gone back to some fairly old ideas, from the time before hardware managed caches, and thought about simplicity in terms of what it actually takes to accomplish computational goals. A lot of the hardware complexity that started to be added in the mid/late 80's around the memory system (our big focus at REX) predates much of the attention later put into compilers. While I am proud of what we have done on the hardware side, I think most of the credit will go to the compiler and software tools if we are successful, as they are what enable such a powerful and efficient architecture. Ergo, we have the advantage of ~30 years of compiler advancements (plus a good amount of our own), giving us the luxury of remaking the old decision in favor of software complexity over hardware complexity... plus 30 years of fabrication improvement. Couple that with Intel's declining revenues, the end of easy CMOS scaling, and established portability tools (e.g. LLVM, which we have used as the basis for our toolchain), and I think this is the best time possible for us.
When it comes to virtual memory: why would you need your memory space virtualized (which requires address translation) in order to have segmentation? We use physical addresses since that saves a lot of time and energy at the hardware level, but it doesn't mean software can't implement the same features and benefits that virtual memory, garbage collection, etc. provide. The way our memory system as a whole (and in particular our Network On Chip) behaves, and its system of constraints, plays a very large role in this, but I can't/don't want to go into the details publicly right now. It may seem a bit hand wavy, but we do not see this as a limitation/real concern for us, and unless you want to write everything in assembly, the toolchain will make this no different than C/C++ code running on today's machines.
In the case of GPGPUs for HPC, we have the advantage of being truly MIMD rather than SIMD, plus a big improvement in power efficiency, programmability, and cost. We'd win in the same areas (I guess tying on programmability) against the Xeon Phi for benchmarks like LINPACK and STREAM, but the one benchmark I am especially looking forward to is HPCG (and anything else that stresses the memory system along with compute). While NVIDIA and Intel systems on the TOP500 list struggle to get 2% of their LINPACK score on HPCG[0], we should be performing 25x+ better. Based on our simulations, we should perform roughly equally across all 3 BLAS levels, which has been unheard of in HPC since the days of the original (Seymour Cray designed) Cray machines.
Of course, my naivety from 2 years ago haunts me now ;) When the linked comment was written, I had yet to "see the light". Only once I understood (through my co-founder, the brilliant Paul Sebexen) the 'magic' that is possible when a toolchain has enough information to make good compilation decisions did I realize that the simplicity of a VLIW decoding scheme made the most sense (and gave us a lot of extra abilities). It was about ~3 months after I made that comment that we started down this path, with early prototyping applied to existing VLIW and scratchpad based systems leading to our DARPA and later seed funding. It is only because our hardware is so simple (and mathematically elegant in its organization) that the compiler can efficiently schedule instructions and memory movement. While I've only lived through a small fraction of the last 50 years of computer architecture, I think of myself as a very avid historian of it, and it really shocks me that no one has thought about the memory system quite like we have. I totally agree with my younger self on long pipelines though.
TL;DR: We think we'll succeed because we are combining old hardware ideas with new software ideas to make (in our opinion) the best architecture, plus this is the best time for a new fabless semiconductor startup. We have actually built the mythical "sufficiently smart compiler" due to some very clever (but simple) hardware that enables people to actually effectively program for this. We think we will be more energy efficient, performant, and easier to program for than our competition in our target areas (HPC, high end DSP).
[0] http://www.hpcg-benchmark.org/downloads/sc15/hpcg_sc15_updat...
I wish you and your project all the best. Hardware, and especially CPUs and the like, is tough and rare. We haven't seen many new competitors (if any) in that area, especially relevant ones.
When you say you rest your high hopes on the toolchain, aren't you a bit scared of what happened to Itanium? Intel had the toolchain under their own R&D, and it failed because they couldn't deliver. I'm interested to hear more about the mythical "sufficiently smart compiler" and how it relates to your architecture.
Based on our software results so far, I wouldn't say I'm scared, but I am definitely anxious. Since our main focus up to this point has been building the first test chip along with prototyping the software tools, our progress in compiling "real" libraries and small applications is fairly early, but we're happy with the results. Now that we've taped out, we can devote more resources, and once we have real hardware, we will be able to test our applications ~1000x faster than with the cycle accurate software simulation we have right now.
All that being said, we have good reason to believe that our approach is valid and won't suffer the flaws of "Itanic" that I've mentioned on this page and many times elsewhere. Unlike any prior VLIW (Intel called their bastardized version implemented in Itanium "EPIC"), our hardware was built with an emphasis on hard real time guarantees and strict determinism at every level of the design, which allows for a level of optimization that is impossible on any other architecture.
Basically, if the compiler has to make worst-case assumptions almost all the time to prevent control and data hazards (as it did on Itanium, due to a very convoluted design), how can you expect any compiler generated programs to be at all performant/efficient?
Does this mean that users have to recompile the world for every CPU generation because of microarchitectural changes? I.e., is the pipeline exposed? Are you planning a Mill-like intermediate bytecode?
Yes, and in certain cases even within the same generation of chip (e.g. same microarchitecture but fewer cores and/or less memory per core; no problem if you compiled for fewer cores/less memory and it is run on a "bigger" chip), as the compiler would need to remap the program and data locations based on the global address map.
It is a very simple pipeline, and we expose the exact latencies required for all operations, along with things like branches with delay slots. As I have mentioned ad infinitum, determinism is a key part of our architecture, and a fixed pipeline is necessary for it. Plus, we want anyone crazy and skilled enough to hand write assembly to have the freedom to be crazy ;)
For the applications we are targeting (HPC and DSP-like workloads), source code is always available, there are very long periods between recompiles forced by source code changes, and optimization is a key factor. Our customers aren't just accepting of recompiling for every new generation of hardware; they expect it, and want to take advantage of any new improvements the compiler can make.
Did you stick with the parallel, SERDES-less interfaces for your interchip I/O? 48 GB/s implies a pretty high signalling rate to not have a CTLE, DFE, etc.
Why 3 interchip links? What network topology are you planning to use to scale to large numbers of chips? If you're still using parallel I/O, how are you planning to communicate beyond a single PCB?
What memory interface are you using? The article seems to confuse your interchip links with your memory controller.
We have partnered with a startup (we'll announce who soon enough) who shared a lot of their ideas about chip-to-chip I/O with us. While they call it a SerDes, it is in fact a source synchronous (clock forwarded) link that sends 5 bits over 6 wires. It is silicon proven, and is capable of up to 125Gb/s over 12mm while being a little over 10x more energy efficient (in terms of pJ/bit) than other available VSR SerDes. Obviously it is short reach over PCB, but we imagine (yet to be tested) we can extend that reach a bit using a more exotic PCB laminate (Megtron, Rogers, etc.), or going over wire (tested over 6 inches using a HuberSuhner SMA cable). Right now, we are only using it to go between chips in a Multi Chip Module, or under 12mm on a PCB. A big bonus: as of a month ago, it is a JEDEC standard!
Most of the information in the linked article is very outdated (~16 months old), so: we have decided to ditch the idea of having separate DRAM and "External I/O" interfaces and just have our chip-to-chip links on all four sides of the chip. The chip-to-chip interface uses the same protocol as our Network On Chip and expands the same 2D mesh. We are also looking into (with a sketched-out plan) how to directly interface this I/O with HBM dies in the same MCM package. As far as supporting other memories/IOs, we are leaning towards "adapter chips" that would convert our chip-to-chip interface to DDR4, Ethernet, InfiniBand, etc.
As far as bandwidth numbers go, the aggregate bandwidth for the test chip we have just taped out (16 cores + 2 chip-to-chip I/O macros on TSMC 28nm, 12mm^2 in size) is 60GB/s, though for the planned production chip, we will be over 256GB/s. I have a good feeling we will be a fair margin higher than that, but I would rather under promise and over deliver.
25 Gb/s for a very-short-reach interconnect sounds possible, although having to go through an adapter chip is going to kill your latency from a system perspective. If you haven't already, you should check out the D. E. Shaw Research Anton 2 chip. It is on an older process, but it has 66 4-way processor cores running at 1.65 GHz and a roughly comparable network (although 6-way rather than 4-way), in addition to all of the MD-specific hardware. It uses a similar memory hierarchy (although it does use non-coherent caches). Getting good performance out of software managed caches is very difficult in practice, even if you know your problem extremely well. With very carefully written software (and a sufficiently friendly problem) good performance is possible, but it definitely isn't easy.
Would it be possible to interface with HyperTransport or QPI? Can you name the JEDEC standard?
I highly doubt that a direct interface would be possible with either of them, though if you really wanted it, you could make an adapter (though fat chance Intel would open up QPI enough to allow for it). We haven't officially announced the partnership, though I can point you at JESD247.
Have you published any white papers detailing any of the following: architecture, instruction set, software availability, benchmarking / application porting and performance, etc.?
I read a couple of times that you got funding from various government agencies. Most of these funding agencies publish RFP responses or slide decks unless you insisted on an NDA and it was approved. I couldn't find any documents talking in depth about your work.
I am in the HPC space (academic, research) and am genuinely interested in learning more about your work.
We'll be releasing a whitepaper by September covering the architectural basics, which will coincide with a public release of an SDK. We did have a paper[0] at last year's MEMSYS conference that goes over some of the basic ideas of our compiler, though it is pretty vague (due to our reluctance to share before we had patent protection at that time).
Hey Thomas, I actually applied for a hardware position a couple of days ago, but I'm interested in software as well (and a mix of the two)!
Think you're doing some very exciting work with REX and would love to be a part of the team :).
Is it possible to get a Developer Kit for it?
It would be great if there were some Raspberry Pi-like distribution with the chip included. I think this could speed up adoption.
We'll send an announcement on the mailing list when tools (software based and FPGA based simulation, along with actual silicon) will be available. We will only be getting 200 chips back from this initial test run, so we have to be fairly stringent in who will be getting hardware eval units in the coming months, but if you have a compelling application idea, feel free to send me an email (in my HN profile) and let me know.
This sounds good. One of the big problems with the Mill CPU is that they don't have working silicon yet. I would say getting as many people as possible to play with it is crucial for a new architecture to gain traction.
Even better than that would be an open architecture like RISC-V. Though an open architecture has its own drawbacks.
Also, as a side note: what do you think about the possibility of using genetic algorithms and machine learning to generate more efficient interconnect architectures?
I am looking forward to hearing how this goes.
I'd be happy to give you an update. Shoot me an email if you want to catch up.
Also: Looking forward to see more in the self driving car racing ;)
From what I understood, a lot of the software stack would require rewriting. As it is, it doesn't look like it would be friendly to a Linux environment running natively on it, but could be more amenable to a coprocessor-like environment where the host would load programs and the Neo would run them.
Did I get it right?
In the near term, yes, though that is primarily for business reasons. Supporting Linux is technically possible (old projects such as uClinux were built around running on MMU-less systems like ours, and mainline Linux 4.2 started to have limited support for a couple of MMU-less systems), but our target areas (HPC and DSP-like tasks) don't necessarily need anything more than a microkernel/RTOS. A full OS like Linux just gets in the way once the basics like memory allocation, garbage collection, and job scheduling are handled separately (by our software tools). Since we are a small startup focusing on a small area, we want to bite off a part of the problem we can easily chew rather than immediately jumping at Linux, so we chose our target applications/market accordingly.
I am perfectly fine with the idea to have a supercomputer running a specialized OS and a front-end machine running the sysadmin-friendly OS. It feels like a Connection Machine with fewer blinking lights.
From the article: “Caches and virtual memory as they are currently implemented are some of the worst design decisions that have ever been made,” Sohmers boldly told a room of HPC-focused attendees at the Open Compute Summit this week.
As a lay processor designer, I couldn't agree more. I don't like VLIW, but this architecture makes a lot of sense. I think it took up to this point for compiler technology to catch up with what is possible in hardware.
Almost all the good ideas in computing were mined out long ago, the trick I think is to get the computing world to give up on those which are holding things back (cold dead hands if necessary).
This is a 2015 story that I remember reading back then. A Google News search shows only a couple of articles this year about REX Computing and only one tiny bit of news: that they're at tapeout. That's probably par for the course for a startup creating product (or prototype) number one. http://semiengineering.com/power-centric-chip-architectures/
also a speaking engagement: http://insidehpc.com/2016/01/call-for-papers-supercomputing-...
and a comment elsewhere that mentions another approach: the "Mill CPU of Mill Computing"
As I recollect (perhaps quite wrongly), Itanium (VLIW) failed because compiler writers couldn't really be bothered, or couldn't mount the learning curve. So I'm most curious about what progress is being made on the compiler side.
You are correct that we have already taped out. We haven't made any announcements yet, but will be talking publicly about it in the future, with a big focus on the "magic" on the software side.
You can read my comments on the Mill architecture elsewhere on HN (not a fan of stack machines), but my biggest disappointment with them is that they have been working on the Mill for ~10 years with a team ranging from 5 to 20 people (from what I have heard) and have yet to get to silicon, while we have gone from a completely custom architectural idea to tapeout in ~11 months from closing our first seed funding.
The big technical failure point for Itanium (in my opinion) is the fact that Intel took the relatively pure VLIW research by Josh Fisher @ HP Labs and tried to add a ridiculous number of features (and attempted x86 compatibility) that impacted the ability to statically schedule instructions. The resulting bastard architecture Intel called "EPIC" (rather than VLIW) had a very difficult job in getting the compiler to generate instruction parallel code since Intel added a huge amount of indeterminism into the architecture that goes against the original VLIW tenets. If your compiler has to assume the worst case latency for all instructions and memory operations, you are going to have a bad time.
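The scheduling argument above can be sketched with a toy model. The numbers are purely illustrative assumptions, not measurements of any real Itanium or REX pipeline:

```python
# Toy model (illustrative numbers only) of why nondeterministic latencies
# cripple a static VLIW schedule: if the compiler knows the exact latency
# of every operation, it can pack a dependence chain tightly; if it must
# assume a worst case for each operation, the schedule balloons.

def schedule_length(chain_latencies, worst_case=None):
    """Cycles to execute a chain of dependent ops, given exact per-op
    latencies, or one pessimistic latency assumed for every op."""
    if worst_case is not None:
        return worst_case * len(chain_latencies)
    return sum(chain_latencies)

# A dependent chain: load, multiply, add. With a deterministic scratchpad,
# the load is a fixed 2 cycles; with a cache that may miss, the compiler
# would have to assume something far worse for every memory operation.
exact = schedule_length([2, 3, 1])
pessimistic = schedule_length([2, 3, 1], worst_case=20)
print(exact, pessimistic)  # 6 60
```

A 10x gap like this is artificial, but it shows the shape of the problem: any uncertainty the hardware introduces gets multiplied across every dependence edge the static scheduler must cover.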
> while we have gone from a complete custom architectural idea to tapeout in ~11 months from closing our first seed funding.
To my understanding, the Mill project is not financed. They're enthusiasts working for sweat equity, and are likely going to seek (non-controlling?) investment to finally hit silicon when they're ready.
For the scope of what they're doing, I think it's a defensible enough approach. It's not something that can be created in evolutionary stages; all designs of all parts need to be working together properly for there to be benefit from any part, and it's quite complex while also trying out tons of novel designs.
(And the Mill isn't stack-based or stack-related. It's basically a crossbar of recent ALU/load results being fed into further ALU/store inputs in parallel; the belt is just a way to represent the set of recent results.)
Itanium failed for the same reason every other VLIW failed as a general purpose CPU: there just isn't enough information at compile time to model the dynamic properties of a program. In fact, many of Itanium's additions (strange instruction packing, alias disambiguation hardware) were attempts at overcoming this issue.
The only moderately successful general purpose VLIWs are Transmeta's Crusoe and the related NVIDIA Denver, and they use a runtime translation layer to collect the required dynamic information.
The vast majority of the dynamic parts of a program that matter for scheduling (both for ILP/avoiding hazards within a core and for handling memory management in our scratchpad based memory system) come down to indeterminate latencies for memory accesses and instruction execution (due to variable length pipelines). Throw in horrible (for determinism) things like out of order execution and branch prediction, and no wonder a compiler can't determine things statically! While we are not really targeting general purpose (though I would say we have the capability to evolve toward it in the future), it seems painfully obvious to me where these issues have been in any general-leaning VLIW attempts in the past, and I can't understand the clinging to bad architectural decisions made by hardware folks 30 years ago who could not imagine the capability of software in the future. </rant>
Targeting general purpose from the get go is a bad idea, but it is NOT impossible to do efficiently and without sacrificing performance. You just need a well defined and constrained architecture, and a clean way to describe it.
You have your causality relations reversed: branch prediction and dynamic caches exist precisely because jump targets and working sets are hard or impossible to compute statically.
Even in the restricted world of HPC, GPGPUs have been moving from statically scheduled, exposed pipeline VLIW machines to more conventional SIMD with caches, virtual memory, and branch prediction (no meaningful OoO yet, as the large amount of thread parallelism can hide the memory latency).
Also, GPGPUs have the benefit of the large, lucrative GPU gaming market paying for their development. How can a pure HPC machine be competitive in this market? Even for Intel, the Xeon Phi is more of a prestige project than something actually meant to make money.
I've spent a long time debating with VLIW haters (whom I presume you are among), but I'd love to see any citations for your claim that my causality is reversed, as I have a ton of evidence (to be fair, not published yet) going for my side. While not as generally applicable as our architecture, you can take a look at basically any DSP from the past 15 years and see that VLIW works great from a performance and efficiency standpoint when your data is in a constrained form. We're showing that a compiler can structure a lot of different types of data (and the code required to actually operate on it) effectively if there are enough constraints on the hardware. It is fairly pointless to try to convince you without documentation on hand for all parties, but I hope you'll take a look in a couple of months.
As far as the market goes, we are going after a decent sized market where the customers care most about efficiency and performance, and are not only willing but eager to switch from their current solutions to whatever is best. As the typical startup claims: we are able to do it for a fraction of the cost and in a fraction of the time of one of the big guys, with a solution that is 10x better than what is out there. NVIDIA boasts that they spent $1 billion developing the Pascal architecture, and they sell the Tesla series GPUs built on it for $5,000+ a unit. We've shown we can prototype something that can theoretically beat it for under $2 million, and our hope/bet is that we can take it to market (and actually beat it by an order of magnitude) for less than $25 million. And that's just HPC, which doesn't include the very interesting high end DSP area that currently uses very expensive and power hungry FPGAs for wireless baseband solutions, which we think are a very good fit for us.
Just to clarify: are you trying to compete with Nvidia, or with Intel? If you're going against GPUs, is your chip something that can run neural networks (better than Nvidia)?
Short answer: If we were to implement SIMD FP16 support similarly to how we have a planned dual FP32 in our FP64 FPU, we would be able to easily match GPU performance by throwing more cores at the problem, while still being more efficient. While neural nets/machine learning is interesting, and we could potentially enable it in new forms as we can provide a desktop GPU's capability in a much smaller/lower power form factor, it is not our main focus. As the other commenter noted, there are ASICs that do a good job at that, though since we are more generally programmable than those sort of ASICs, we would be able to handle changes in algorithms over time while some may not be able to.
The more interesting problems for us are things that GPUs can't do well, such as level 1 (vector) and level 2 (matrix-vector) BLAS operations. While most GPUs (and CPUs when utilizing SIMD instructions) only get a couple of percent the performance on level 1 and level 2 BLAS compared to level 3 (matrix-matrix), we are equally performant across all three (and at a very high percentage of theoretical peak).
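To make the level 1/2/3 distinction concrete, here is a standard back-of-the-envelope arithmetic-intensity model (textbook estimates, not REX data) showing why BLAS levels 1 and 2 leave cache-based machines starved for memory bandwidth while level 3 can reuse data:

```python
# Rough arithmetic intensity (flops per word of memory traffic) for the
# three BLAS levels. Counts assume each operand array is streamed once;
# n is the problem dimension.

def axpy_intensity(n):
    # Level 1: y = a*x + y -> 2n flops, ~3n words (read x, read y, write y)
    return (2 * n) / (3 * n)

def gemv_intensity(n):
    # Level 2: y = A@x + y -> 2n^2 flops, ~n^2 + 3n words (A dominates)
    return (2 * n * n) / (n * n + 3 * n)

def gemm_intensity(n):
    # Level 3: C = A@B + C -> 2n^3 flops, ~4n^2 words (A, B, read/write C)
    return (2 * n ** 3) / (4 * n * n)

n = 1024
print(round(axpy_intensity(n), 2))  # ~0.67 flops/word, independent of n
print(round(gemv_intensity(n), 2))  # ~2 flops/word, independent of n
print(round(gemm_intensity(n), 2))  # ~n/2 flops/word: reuse grows with n
```

On a machine whose memory system delivers far fewer words per second than its FPUs deliver flops, anything stuck at a few flops/word is bandwidth-bound regardless of peak compute, which is the usual explanation for the few-percent-of-peak level 1/2 numbers quoted above.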
Interesting. Which applications require vector-vector or matrix-vector operations as opposed to matrix-matrix?
Also, custom ASICs are the current state of the art for NN.
edit: missing word
Which custom ASICs are you talking about?
I'm referring to Google's TPU.
I couldn't find any details about that chip. How do you know it's state of the art?
I only know what is publicly known. It was discussed a while ago on HN. I think Google claims the best performance per watt, achieved with specialized 8-bit ALUs. I don't think it has been made publicly available yet, so the claims lack third party verification.
VLIWs have been used very successfully as DSPs for a long time; I do not think anybody is debating that. It is outside that niche that they have repeatedly been found lacking.
I'm sure your architecture would work fine for a subset of HPC problems like those that are currently run on a traditional GPGPU, but even in the HPC world many problems are ill suited for a GPU (think particle transport).
Yeah, something like this is very much needed, but the hardware is not the hard part. The software is the hard part. The software is the reason we have the multiple levels of cache we have now. Without solving the software challenges, there can be no challenger to the existing architectures.
It's interesting to note that convolutional neural nets (CNNs) are one solution to the software challenge. It's an imperfect solution, in the sense that CNNs are not as general purpose (at the same efficiency) and have strict data requirements for training, but it is a solution, and the big N are investing heavily to the point of designing ASICs.
Eventually, though, we need to solve the software problem. That will require rethinking programming languages.
Having written programs for this iteration of the REX Neo architecture, I can say it is not so dramatically different that programming languages will have to be rewritten. I'm not the smartest programmer in the world, and I was able to figure out the assembly language fairly easily.
Some concepts, like how to manage concurrent data processing and thread communications, need to be handled carefully, but that's more at the level of 'standard library' than the compiler. There is a clear pathway to getting C working on the architecture, and a reasonable direction (that will need some fleshing out) to getting performance-enhancing optimization of something like LLVM IR.
I wouldn't expect the assembly language level to be too far off from the common paradigms. Where I'd expect the software challenges to be would be in managing large amounts of memory, if the application programmer must manage shuffling data between the local scratchpad, specific locations in foreign scratchpads that must be (manually?) DMA'd around, and DRAMs.
Our whole goal, as discussed in the software section of our website (and the ACM paper linked there), is to have the scratchpads managed entirely automatically by our toolchain. While we want to give especially adventurous programmers full freedom with the scratchpads, existing and future programs written in C/C++ (and other languages supported in the future) will handle memory allocation identically, from the programmer's perspective, to existing architectures.
One other thing to point out is that the addressing of a core's local scratchpad, as well as the "foreign" scratchpads of other cores on the same chip and/or any other attached chip, is handled exactly the same way. All memory operations go through the exact same load/store instructions as part of a global flat address map that is the same for every core in a system (one chip or multiple chips interconnected).
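As a rough illustration of what such a flat global address map could look like, here is a sketch over a 2D mesh of chips and cores. The field widths and bit layout are invented for the example, not REX's actual encoding:

```python
# Hypothetical flat global address map for a 2D mesh of chips, each with a
# 2D mesh of cores that own private scratchpads. All field widths below are
# assumptions for illustration only.

OFFSET_BITS = 17   # e.g. 128 KB scratchpad per core (assumption)
CORE_BITS   = 4    # up to a 16x16 core mesh per chip (assumption)
CHIP_BITS   = 8    # mesh coordinate of the chip (assumption)

def global_addr(chip_x, chip_y, core_x, core_y, offset):
    """Every core computes the same flat address for a given scratchpad
    word, so one load/store instruction targets local or foreign memory
    identically -- no translation, just bit packing."""
    addr = chip_y
    for width, field in ((CHIP_BITS, chip_x), (CORE_BITS, core_y),
                         (CORE_BITS, core_x), (OFFSET_BITS, offset)):
        addr = (addr << width) | field
    return addr

def decode(addr):
    offset = addr & ((1 << OFFSET_BITS) - 1); addr >>= OFFSET_BITS
    core_x = addr & ((1 << CORE_BITS) - 1);  addr >>= CORE_BITS
    core_y = addr & ((1 << CORE_BITS) - 1);  addr >>= CORE_BITS
    chip_x = addr & ((1 << CHIP_BITS) - 1);  addr >>= CHIP_BITS
    chip_y = addr
    return chip_y, chip_x, core_y, core_x, offset

a = global_addr(chip_x=2, chip_y=1, core_x=5, core_y=3, offset=0x100)
print(decode(a))  # (1, 2, 3, 5, 256)
```

The point of the sketch is that the network can route on the upper address bits directly, which is what lets physical addressing skip the translation step entirely.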
True. I shoulda added "so to speak", since this is a still more extreme approach and might simply break any compiler/language combination we have, as you say.
While we have been exploring some ideas for better programming approaches to address the unique features of our architecture, we have thought from the beginning that we would need some level of portability for existing applications. As of right now, we support standard C/C++ through our Clang+LLVM backend, with the ability to support any language that has an LLVM frontend.
Personally, I find the actor model to be the easiest existing way to take advantage of things like our network on chip and the hard real-time guarantees on memory movement. That being said, right now our focus is on C and C++ along with our API and custom library ports.
I recall an interview with someone formerly in upper management on the Itanium development project who acknowledged that the most significant factor in Itanium's demise was the exclusionary pricing structure Intel imposed on it.
> there is no virtual memory translation happening, which in theory, will significantly cut latency (and hence boost performance and efficiency). This means that there is one cycle to address the SRAM, so “this saves half the power right off the bat just by getting rid of address translation from virtual memory.”
In protected mode (i.e., what the kernel is using), will an Intel processor not also disable virtual memory lookup? Couldn't we just recompile scientific software to a protected mode environment to get those same benefits?
Also, I think it is more useful and fair to compare against a GPU than a general purpose CPU.
(As an aside, I don't see where the reduced latency gives such a big advantage. There will be latency anyway, so in any case your software has to deal with waiting in an efficient way (doing useful stuff in the mean time). Shaving off some latency will only help if your software design was bad to begin with.)
It would just be great to get a decent chip that does not have a built-in, unblockable back door, like those on Intel, AMD, and probably ARM chips.
I see a good opportunity for government to make this a reality. Not per se a fan of gov regulation for many things but I don't see this moving forward very fast. There are initiatives left and right (e.g. Talos) but if a significantly large government body (EU?) would make it a requirement, that might change the game. Lobbyists would probably convince them otherwise (you need closed HW to catch terrorists... etc).
I'm curious about the thermal issues.
From the article, the power density is (4 W)/ (0.1mm^2), or 40W/mm^2. Intel's Haswell chip has a TDP of ~ 65W, an area of 14.7mm^2, for a power density of 4.4W/mm^2.
Is this power density a cooling challenge?
First note: the article is ~16 months old, so it is outdated on some measures. I've corrected the numbers below, but in either case, you seem to have confused the size of a core with the size (and power) of an entire chip consisting of multiple cores.
After tapeout of our first test chip, the final size for one of our cores is 0.27mm^2 (including the SRAM that makes up the scratchpad memory) on TSMC's 28nm process. We actually came in using fewer gates than originally anticipated, and our size without SRAM is a little less than 0.01mm^2.
Now, just going by what is in the linked article: the diagram comparing sizes is for single cores (a 0.1mm^2 estimate back then for a Neo core, 14.5mm^2 for a single Intel Haswell core). The power numbers in the table below that are for entire chips. You are quoting 65W for a single core, which is incorrect... The 65W Haswell chip I believe you are referring to is the 4770S, which is 4 cores @ 65 watts and looks like it has a die size of 177mm^2.
Calculating this out using our current numbers: our planned full 256 core chip has changed a bit (doubled performance since last year, doubled power due to adding more stuff), and we estimate the TDP to now be 8 watts at ~100mm^2, which gives us a power density of 0.08W/mm^2. Intel would then be 65W / 177mm^2 = 0.367W/mm^2.
As would make sense in the case where we are claiming lower power operation, our power density is also lower.
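The arithmetic is simple enough to sanity-check directly (using only the figures quoted in this thread):

```python
# Sanity check of the power-density figures quoted in this thread.

def power_density(tdp_watts, area_mm2):
    # Power density in W/mm^2, given a chip's TDP and die area.
    return tdp_watts / area_mm2

rex = power_density(8, 100)       # planned 256-core REX chip: 8 W, ~100 mm^2
haswell = power_density(65, 177)  # 4-core Haswell 4770S: 65 W, 177 mm^2
print(round(rex, 3), round(haswell, 3))  # 0.08 0.367
```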
Thanks very much for clarifying. That the 4W didn't apply to a single core fell through a cognitive crack.
The power density is impressively low, indeed. Looking forward to more info in Sept.
This chip was discussed on RealWorldTech a while ago: http://www.realworldtech.com/forum/?threadid=151566
Let's say it wasn't well received.
There is nothing to disrupt. Exascale computing is a hoax perpetrated on the US government by unscrupulous hardware vendors. Kudos to REX for grabbing a piece of that action.