Google AI Cloud TPUs (google.ai)
180 teraflops per TPU seems great.
For reference, the latest Titan X offers 12 TFLOPS [1] and the upcoming AMD card aimed at deep learning [2] offers 13. Though it's not clear whether the TPU figure is quoted at fp16 or fp32 [2]; a quick sketch of why that distinction matters follows the links below. The best GPUs currently available on AWS offer a mere 2 TFLOPS per GPU [3].
[1] https://blogs.nvidia.com/blog/2017/04/06/titan-xp/
[2] http://pro.radeon.com/en-us/vega-frontier-edition/
[3] http://images.nvidia.com/content/pdf/tesla/NVIDIA-Kepler-GK1...
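To illustrate the fp16 vs. fp32 point above, here's a rough NumPy sketch (matrix sizes and values are arbitrary, picked just for the demo) comparing the same product computed in half and in single precision; which of the two a vendor counts when quoting TFLOPS makes a real difference.

    import numpy as np

    # Minimal sketch: the same matrix product in fp32 and in fp16.
    rng = np.random.default_rng(0)
    a = rng.standard_normal((256, 256)).astype(np.float32)
    b = rng.standard_normal((256, 256)).astype(np.float32)

    full = a @ b                                               # fp32 throughout
    half = (a.astype(np.float16) @ b.astype(np.float16)).astype(np.float32)

    # The fp16 result drifts noticeably from the fp32 one.
    print("max abs deviation of fp16 result:", np.abs(full - half).max())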
https://www.nvidia.com/en-us/data-center/tesla-v100/
Tesla V100 is the thing to compare against, as it's the first chip optimized for training with the Tensor Core operation (a 4x4 matrix multiply-accumulate with mixed fp16/fp32 precision: the inputs being multiplied are fp16, the accumulation is fp32). The V100 delivers roughly 120 TFLOPS this way.
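For anyone who hasn't seen that operation spelled out, here's a rough NumPy sketch of the numerics only (not the hardware; the 4x4 values are made up): fp16 inputs multiplied, with the result accumulated into fp32.

    import numpy as np

    # Sketch of one Tensor Core style step, D = A*B + C:
    # fp16 inputs A and B, fp32 accumulator C, fp32 result D.
    rng = np.random.default_rng(1)
    A = rng.standard_normal((4, 4)).astype(np.float16)
    B = rng.standard_normal((4, 4)).astype(np.float16)
    C = rng.standard_normal((4, 4)).astype(np.float32)

    # Multiply the fp16 inputs but carry products and sums in fp32.
    D = A.astype(np.float32) @ B.astype(np.float32) + C
    print(D.dtype)  # float32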
Tesla V100 is apparently ridiculously expensive at $650,000.
In fact, to an extent even NVIDIA has realized that there is more money in building a GPU cloud from scratch than in selling GPUs.
I think the net losers are Apple, Amazon/AWS (I believe NVIDIA is responsible for their lackluster GPU offerings), and Intel (who are still hoping for multi-core to work, and are on track to be disappointed, just as they lost the mobile market to ARM while hoping Atom would eventually be adopted).
Hmm, not quite _that_ expensive... you can buy the "DGX Station" with four NVIDIA Tesla V100 GPUs for $69,000. http://wccftech.com/nvidia-volta-tesla-v100-dgx-1-hgx-1-supe...
A single chip is only 45 TFLOPS; the 180 TFLOPS figure is per module (4 TPU chips): https://arstechnica.com/information-technology/2017/05/googl...
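The arithmetic behind the two figures, as a trivial sketch using the numbers from the thread:

    # Per-module figure is just chips per module times the per-chip figure.
    tflops_per_chip = 45
    chips_per_module = 4
    print(tflops_per_chip * chips_per_module)  # 180 TFLOPS per module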
> To solve this problem, we’ve has designed an all-new ML accelerator from scratch
I feel like that should be "we have designed" or "we've designed". It seems like someone was in the middle of rewriting it and only got halfway there.