The Future of Hardware Is Software

octoml.ai

60 points by blueplastic 4 years ago · 34 comments

fxtentacle 4 years ago

"The Future of Hardware Is Software" ... says everyone that wants to sell you AI software on commoditized hardware.

Yet here I am, I recently won an AI competition by reviving an old 2005 algorithm and just using the fact that compute power has 7000x-ed since then (from 5 GFLOPS on a P4 to 35 TFLOPS on a 3090). No AI was needed.

And now I'm building custom electronics for a new type of 3D camera because even after years of AI and deep learning research, structured light and/or stereoscopic 3D depth estimation is still unusable. Try training a NERF from "only" 3 UHD images and you know what I mean.

  • metadat 4 years ago

    You're such a tease!

    Seriously, please share the deets on this 2005 "junker" algo.

    It was super unsatisfying to read your post with so much interesting information glossed over.

    • metadat 4 years ago

      FYI fxtentacle, people keep upvoting this comment; there seems to be enormous interest. It's already at +17.

      Will be sad if we never hear from you.

    • fxtentacle 4 years ago

      Sorry for the late reply. I got frustrated trying to figure out how to improve our Bomberland AI and decided to spend the rest of the day building lamps and shadow caster shapes in Lego ^_^

      By now, I'm down to 5th place on the Sintel Clean rankings: http://sintel.is.tue.mpg.de/quant?metric_id=6&selected_pass=... but my entry H-v3 was 1st place when I submitted it. The algorithm is

      Mota C., Stuke I., Aach T., Barth E. Divide-and-Conquer Strategies for Estimating Multiple Transparent Motions. In: Jähne B., Mester R., Barth E., Scharr H. (eds) Complex Motion. IWCM 2004.

      https://doi.org/10.1007/978-3-540-69866-1_6

      (So I misremembered the year; it was the end of 2004 instead of 2005.)

      I did tweak it in a few details, such as using a 5x5px ica instead of the constant brightness assumption, but mainly I replaced the Gauss-Seidel iteration (12) with brute forcing (10), so in effect I'm approximating the c* with Monte Carlo sampling on the GPU. Then, as the last step, I use LUTs to fill in gaps in the prediction with their maximum-likelihood prior, as memorized from a large collection of real-world flow maps.
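
      To make the brute-forcing step concrete, here is a minimal sketch of the idea (my illustration under simplifying assumptions, not fxtentacle's actual code): instead of iterating a Gauss-Seidel-style solver towards c*, draw many random candidates, score them all in one vectorized pass (which is what makes it GPU-friendly), and keep the best. The quadratic cost function and the sampling range are made-up stand-ins.

          import numpy as np

          # Illustrative stand-in cost: sum of squared residuals per candidate.
          def cost(candidates, data):
              return np.sum((data - candidates) ** 2, axis=-1)

          def monte_carlo_argmin(data, n_samples=100_000, lo=-1.0, hi=1.0, seed=0):
              """Approximate the argmin of cost() by brute-force sampling instead of iteration."""
              rng = np.random.default_rng(seed)
              candidates = rng.uniform(lo, hi, size=(n_samples, data.shape[-1]))
              scores = cost(candidates, data)  # one big data-parallel evaluation
              return candidates[np.argmin(scores)]

          print(monte_carlo_argmin(np.array([0.2, 0.25, 0.18])))  # best sample lands near the data itself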

      BTW, as luck would have it, we are currently leading Bomberland (team CloudGamepad) with a deep-learning AI trained for more than 200 million simulation steps. Yet JFB (the 2nd-ranked team) uses handcrafted C++ rules, and they beat us every time. It's just that against other opponents our probabilistic AI is random enough to confuse them, which is why we're still barely in 1st place. But unless we can significantly improve things soon, I expect us to lose the tournament later this month, because we will not be able to beat JFB in a fair duel. I bet on deep learning here and I'm already regretting it.

      I'll reply about the camera to TaylorAlexander.

      • metadat 4 years ago

        Thanks for following up! It got up to 23 points from interested people, which in my experience is actually a huge number of upvotes for HN.

  • TaylorAlexander 4 years ago

    > I'm building custom electronics for a new type of 3D camera

    I would love to know more. I am working on an open source farming robot and vision is an important component. Are you able to share more?

    • fxtentacle 4 years ago

      We're using the camera for an autonomous toy car racer, so I need reliable and real-time depth estimates. Existing cameras such as the Stereolabs ZED max out at 1080p @ 30 fps and they use rolling shutter which isn't even perfectly hardware-synchronized. Plus those sensors are tiny and, hence, as noisy as a laptop webcam.

      The result is that the Stereolabs AI needs to be extremely lenient when doing the stereo matching, because objects will almost never look exactly the same in both images, be it due to the noise or the rolling-shutter skew. If I see a pattern repeat in both images at 5% RGB intensity, then on the Stereolabs ZED I need to ignore it, because it's most likely just sensor noise. If the image were almost noise-free, I could treat that pattern as a reliable correspondence and triangulate depth from it.
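
      As a toy illustration of that trade-off (my own sketch under assumed noise statistics, not Stereolabs' or the commenter's actual pipeline): a scanline block matcher whose acceptance threshold scales with the sensor noise level, so a noisy camera has to discard weak correspondences that a clean camera could keep and triangulate.

          import numpy as np

          # Toy scanline block matcher. `noise_sigma` is the assumed per-pixel noise;
          # a match is only trusted if its residual is small compared to what pure
          # sensor noise in both patches would already produce (~2 * sigma^2 MSE).
          def match_patch(left, right, y, x, patch=5, max_disp=32, noise_sigma=0.01):
              h = patch // 2
              ref = left[y - h:y + h + 1, x - h:x + h + 1]
              best_d, best_err = None, np.inf
              for d in range(max_disp):
                  if x - d - h < 0:
                      break
                  cand = right[y - h:y + h + 1, x - d - h:x - d + h + 1]
                  err = np.mean((ref - cand) ** 2)
                  if err < best_err:
                      best_d, best_err = d, err
              if best_d is not None and best_err < 9 * 2 * noise_sigma ** 2:
                  return best_d   # confident correspondence -> triangulate depth
              return None         # could just be noise; better to report "unknown"

      Lowering noise_sigma (a quieter sensor) lets much weaker patterns clear that threshold instead of being thrown away.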

      Also, tracking fast movements at 30 fps is really difficult, due to the large movement offsets. If you scan for them, you need lots of compute power and you risk recognizing repetitive patterns as fast movement.

      If you increase the hardware from 1080p to 4K, from 30 FPS to 120 FPS, from "really noisy" to "practically noise-free", and from "rolling shutter" to "hardware-synchronized global shutter", then suddenly you have 4x the data to make a decision on, all your offsets are 4x smaller due to higher FPS, and you can treat much weaker patterns as reliable.

      And all of that together means that surfaces like a reflective wooden floor are now doable, whereas before most of the visible patterns would drown in sensor noise.

      EDIT: And maybe one more comment: our camera uses USB 3 at 10 Gbit/s with a high-speed FPGA, and it was completely designed in the excellent open-source KiCad. I even forked it to make things look nicer and more like Altium: https://forum.kicad.info/t/kicad-schematics-font-is-a-deal-b...

      • TaylorAlexander 4 years ago

        Very cool! Thanks for the details. I love Kicad. I presume your designs are not open source? But I am curious which cameras/sensors you are using, and any other chip specifics you can share.

        • fxtentacle 4 years ago

          Sadly, I can't share much there. NDAed Sony sensor with an NDAed Infineon FPGA. Manufacturing was interesting because I didn't get export permissions for sending the FPGAs to China for PCB assembly. Eurocircuits got the job done but was kinda slow to work with.

  • imperialdrive 4 years ago

    I appreciate this train of thought and enjoy seeing it in action by others. Kudos!

aidenn0 4 years ago

> A classic example is Itanium: it is only a historical footnote today, but Itanium’s explicit parallelism and focus on scalability once made it look like the future of CPUs. The problem was never the hardware itself—it was difficult compilation and backward compatibility with the x86 software ecosystem that doomed Itanium.

The problem with the Itanium was the hardware itself. Finding sufficient ILP in general-purpose loads for a VLIW like Itanium is an unsolved problem in compiler design. Saying the problem with Itanium was software would be like entering a drag racer in Formula 1 and saying the problem was that the drivers weren't good enough at steering.

  • xscott 4 years ago

    There are a lot of important numerical algorithms which would have really benefited if Itanium had gone through iteration and growth. A mainstream VLIW could've had its place, and it's trivial to find parallelism in FFTs, SVDs, matrix multiplies, and so on.
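
    (As a tiny illustration of what "trivial to find parallelism" means here, and nothing Itanium-specific: every output row of a matrix multiply is an independent dot-product job, so the work can be split across SIMD lanes, threads, or wide-issue slots. Sketch in Python/NumPy.)

        import numpy as np

        # Each output row depends only on one row of A plus all of B, so the loop
        # iterations are independent and could all run at the same time.
        def matmul_rows(A, B):
            return np.stack([A[i] @ B for i in range(A.shape[0])])

        A, B = np.random.rand(4, 3), np.random.rand(3, 5)
        assert np.allclose(matmul_rows(A, B), A @ B)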

    To me, there is a spectrum of parallelism on the desktop:

        multi-server,
        multi-process,
        multi-threaded (shared mem),
        <Itanium would go here>,
        SIMD instructions
    
    Yeah, Itanium might have required assembly to exercise that niche, and maybe new programming languages would've come about. There has to be some middle ground between Verilog/VHDL and C, right? Maybe a CUDA-like language could've done the trick (it certainly works for GPUs).

    I think it's a shame Itanium failed, and I think it failed for the wrong reasons. At the time, I remember everyone criticizing it for not running legacy x86 applications very well. As though word processor, spreadsheet, and presentation software wasn't fast enough. Saying legacy apps in existing languages don't make it easy to find the ILP seems like a slight generalization of that.

    The AMD64 ISA (which is what really killed Itanium) was a blessing and a curse. It made x86 just better enough to not be awful, but it killed desktop/server alternatives for at least 25 years. Maybe ARM will make inroads, but it isn't that much better either.

    • aidenn0 4 years ago

      > There are a lot of important numerical algorithms which would have really benefited if Itanium had gone through iteration and growth. A mainstream VLIW could've had its place, and it's trivial to find parallelism in FFTs, SVDs, matrix multiplies, and so on.

      DSPs (which have great perf/watt for the numerical algorithms you mention) have used VLIW for decades, so of course there is a place for it. GPUs have moved in for all of those operations at this point though. The bet with Itanium was that compilers could be made sufficiently smart to make VLIW work for non-numeric workloads, and that bet failed to pay off. Intel and HP had hundreds of smart people trying to solve the "software problem" of Itanium and they did not succeed.

      > I think it's a shame Itanium failed, and I think it failed for the wrong reasons. At the time, I remember everyone criticizing it for not running legacy x86 applications very well. As though word processor, spreadsheet, and presentation software wasn't fast enough. Saying legacy apps in existing languages don't make it easy to find the ILP seems like a slight generalization of that.

      Desktop applications are a red herring given that Itanium was targeted primarily at the workstation and server market. There was also a bad-timing issue, as this was about the same time that PC hardware was displacing dedicated workstation and server hardware.

      • xscott 4 years ago

        Yeah, there's a whole other universe of specialized chips for special purposes, and I used PCI-style DSP cards when I could. I just think a standard VLIW on the desktop/server would've been useful for the stuff I'm interested in.

        GPUs can definitely carry that load, but I avoided them in my career because I could rarely guarantee that my customers' computers would have a sufficient GPU. In the world where I worked, x86 and AMD64 became standard - I could always count on that. It had to be a pretty special project for my customers to let me dictate that a dedicated rack of specific hardware was required.

        > Intel and HP had hundreds of smart people trying to solve the "software problem" of Itanium and they did not succeed.

        Yeah, but that's tied up in the market too. A big-name customer screaming, "But I don't want to retrain my programmers, it has to work with Java/C++" would certainly sway them away from a Verilog- or CUDA-style language. Hell, even OpenCL and CUDA have to look like C++. Double hell, the FPGA folks have been trying to make a C++-like language for decades so that they can increase their market. That doesn't mean another possibility couldn't exist for Itanium.

        It's very clear that Itanium is dead. Maybe I'm just saying the market was foolish, and you're saying Intel/HP couldn't satisfy the market.

      • wahern 4 years ago

        > Intel and HP had hundreds of smart people trying to solve the "software problem" of Itanium and they did not succeed.

        I've also heard a contrary story that Intel and HP simply assumed the compilers would show up, or at least failed to put in sufficient effort to advance the industry. I'm curious if you have any sources; I've always wondered what the true story was, though the two stories needn't be mutually exclusive.

        It would seem foolhardy for Intel and HP not to heavily invest in compiler research given the stakes. OTOH, the norm seems to be for hardware vendors to suck at deliberately building and evolving software ecosystems around their hardware, especially as commodity hardware and open source software became ubiquitous. And "sufficient effort" is definitely a matter of opinion.

        By way of example, early examples of polyhedral compilation go back to the 1990s, but it wasn't until the 2010s that implementations shipped in GCC and clang, long after Itanium failed. I doubt it would have saved Itanium, but I would have expected to see such contributions earlier and coming directly from Intel and HP. But maybe my expectations are too high.

        • aidenn0 4 years ago

          The Intel C compiler was already well established as a top IA-32 compiler by the late 90s (prior to any IA-64 release). This article[1] from 1999 assumes Intel is responsible for the compiler. My recollection is that the primary focus on 3rd party software was getting systems software ported.

          I don't think Intel was banking on 3rd parties making compilers. A lot of 32-bit architectures not named "68000" from the 80s/early 90s suffered from poor first-party compilers and a lack of good 3rd-party compiler support; in 1980 an optimizing compiler was not considered an important part of a microprocessor's ecosystem, but by the time IA-64 came around the importance was fairly well understood by hardware vendors. Given the quality of the first-party IA-32 compilers, I think Intel (and everyone else) expected that the first-party IA-64 compilers would be good.

          Certainly by the release of Merced (and likely well before), compiler engineers internal to Intel were aware of how hard it was to codegen for IA-64. Certainly during the time period that Intel was pushing IA-64, they had an insatiable desire for compiler developers with advanced degrees.

          1: https://www.cnet.com/news/intels-merced-chip-may-slip-furthe...

      • kaba0 4 years ago

        I am way out of my depth here, but wouldn't a machine-code-to-machine-code JIT compiler solve the problem of underutilization on Itanium? (I remember reading a paper on an x86-to-x86 JIT compiler as well that could provide some speedup.)

        If such complex branch prediction and pipelining can be done in hardware alone, then much cleverer (and patchable!) optimizations can be done in software, or so I would think. So while mainstream languages may not be able to use Itanium's architecture efficiently at compile time, a separate program could reorder instructions to make use of some instruction-level parallelism, couldn't it?
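
        (For concreteness, here is a rough sketch of the scheduling problem such a tool would face, whether it runs ahead of time or as a JIT; the instruction format and the 3-wide issue width are invented for illustration, and it ignores the hard parts like cache misses, mispredicted branches, and register pressure, which is the part other commenters argue dynamic out-of-order hardware handles better.)

            from dataclasses import dataclass

            @dataclass
            class Instr:
                name: str
                dst: str
                srcs: tuple

            def schedule(program, width=3):
                """Greedy list scheduling: each cycle, issue up to `width` instructions
                whose inputs are already available (live-in values count as available).
                Ignores anti-dependencies and structural hazards; a real scheduler can't."""
                defined = {i.dst for i in program}
                available, pending, bundles = set(), list(program), []
                while pending:
                    ready = [i for i in pending
                             if all(s in available or s not in defined for s in i.srcs)]
                    issued = ready[:width]
                    bundles.append(issued)
                    for i in issued:
                        available.add(i.dst)
                        pending.remove(i)
                return bundles

            program = [
                Instr("load", "a", ("mem0",)),
                Instr("load", "b", ("mem1",)),
                Instr("add",  "c", ("a", "b")),
                Instr("load", "d", ("mem2",)),
                Instr("mul",  "e", ("c", "d")),
            ]
            for cycle, bundle in enumerate(schedule(program)):
                print(cycle, [i.name for i in bundle])
            # cycle 0: the three loads issue together; the add and mul must wait.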

    • deepnotderp 4 years ago

      Also, it's worth noting that OoO brings more than just ILP/scheduling; it also brings MLP and dynamism. Take, for instance, hiding the latency of a cache miss or a mispredicted branch. Stuff like this is impossible to know in advance, no matter how much you redesign your language to expose ILP.

    • jdsully 4 years ago

      > A mainstream VLIW could've had its place, and it's trivial to find parallelism in FFTs, SVDs, matrix multiplies, and so on.

      There are already DSPs for this purpose, but typical server workloads don't generally use those algorithms. Perhaps Itanium would have made a good DSP but it wasn't really aimed at that market.

      • xscott 4 years ago

        > There are already DSPs for this purpose

        I should've been more clear: Most open-source projects, or my projects for the customers I used to have, can't/couldn't rely on a DSP chip or card being installed. If Itanium had gone mainstream, I could've counted on its VLIW instructions.

        We can /almost/ count on a GPU nowadays, but programming in CUDA ties you to Nvidia, and OpenCL doesn't seem to have taken off the same way.

        > Perhaps Itanium would have made a good DSP but it wasn't really aimed at that market

        I suspect there are a lot of FFTs, SVDs, and large matrix multiplies in software now. Deep learning, convolutional nets, image and audio algorithms, TikTok "filters", and so on. Of course there was almost none of that on desktops in the late 90s.

        • aidenn0 4 years ago

          > I should've been more clear: Most open-source projects, or my projects for the customers I used to have, can't/couldn't rely on a DSP chip or card being installed. If Itanium had gone mainstream, I could've counted on its VLIW instructions.

          So to sum up: you can't convince customers to buy special hardware and neither could HP/Intel?

          • xscott 4 years ago

            > So to sum up: <snarky shit reply>

            I wish it was possible to have a discussion that wasn't about who could get the best zinger in to burn the other person. This isn't Reddit, and you aren't in high school.

            • aidenn0 4 years ago

              I agree my response was snarky, I disagree it was shit. I wasn't looking for a "sick burn" sort of reaction.

              Here's a slightly longer and more boring version of what I posted:

              Itanium came out at the tail end of a long movement from special-purpose to commodity hardware; servers and workstations were moving from 68k/MIPS/SPARC to PC-based hardware. It was a DSP that ran general-purpose loads "okay" when most people were looking for a general-purpose CPU that ran DSP-type loads "okay" (i.e. the various SIMD extensions to x86 and POWER).

              Anything that starts with "If Itanium had gone mainstream" is a counterfactual. Maybe it would have delayed GPGPU, as the performance advantage of programmable shaders over running on the CPU would have been smaller, and maybe, without AMD's competition, it would have allowed Intel to keep bus speeds lower for longer.

              My original point stands: Itanium was a failure to deliver the hardware people wanted, rather than a failure of software to appear on said hardware.

    • deepnotderp 4 years ago

      Itanium got maximum penetration in HPC, so people were aware of this. The challenge is that GPUs and DSPs (many of which are VLIW) are even better at parallel workloads.

msandford 4 years ago

And the future of software is hardware. Which way is the pendulum swinging now? From which perspective? Why?

If you've been around for a swing or two this is nothing new. If not, it's earth shattering.

Anyone remember thick clients, then thin clients, and now thick clients again? Anyone want to guess when mobile-first starts becoming web-first?

  • blueplastic (OP) 4 years ago

    Agreed. I don't think the point of the article is to say that this is a never-before-seen type of event, but rather that the landscape is shifting again in the hardware space, as it does maybe once every couple of decades... and that software (compilers specifically) is going to be needed to enable and accelerate the shift of many workloads to ML accelerators.

karmakaze 4 years ago

I don't buy this inevitable future where the best software runs on the best hardware because that's obviously optimal. How many times has history played out that way, versus large entities staying entrenched in their moat of hardware, software, or a mix?

I mean, what's the closest thing we have to a good example: maybe ARM hardware running Linux? What about mobile? We have the Android Open Source Project, but it's a bit early to see what it will amount to. I still hope and wait, but wouldn't bet on it.

tw04 4 years ago

Ahh, yes, back to 2010 when everyone told companies like Hitachi they were doing storage wrong by relying on custom ASICs.

Meanwhile, Google, Facebook, and Amazon are making hardware offload engines because they've figured out there's a limit to the performance of general-purpose CPUs, and that they waste a lot of power.

You can't have it both ways: efficiency and speed, or flexibility. Choose one.

  • AtlasBarfed 4 years ago

    I was going to write this in the rant article about React and how everything is rewrapped bloat.

    Yeah I think it will be the opposite in the medium term future.

    Moore's law can't last forever; the slowdown has already occurred, and then you'll need two things for a couple of generations to keep getting better:

    1) code optimization / stack reduction / api efficiency / less abstraction

    2) moving software to hardware to get that sweet speedup and efficiency

mikesabbagh 4 years ago

Remember the modular phone, where you could replace parts instead of replacing everything when you wanted to upgrade? That hardware's failure tells you that you need to use the latest hardware to run the latest software.

dusted 4 years ago

Nah, the future of software is hardware.

synergy20 4 years ago

It already is: every hardware/chip designer is pretty much a software engineer coding in Verilog (C-like), all testbench/verification work is also pure software, and designers normally use powerful CAD software daily; the hardware part for many designers is minimal.

The true hardware work is done by those who design the boards (PCBs), which, before COVID, was mostly outsourced to China. I'm unsure if this "hardware" will ever move back.
