Supercharging TensorFlow.js with SIMD and multi-threading
(blog.tensorflow.org)

Unfortunately, this feature is (still) stuck behind an origin trial and requires serving three different WebAssembly binaries to get correct fallback behavior across different browsers.
Feature detection for WebAssembly[0] is stuck in spec discussions, and SIMD general availability is blocked on either that or its own mechanism for backwards compatibility[1].
The issue is that a WebAssembly binary that contains instructions unknown to the engine (e.g. SIMD instructions not supported by a particular engine) won't validate, even if the functions aren't used at runtime. The only way to work around this is to compile your binary NxMx... times and detect which feature set is supported before loading a binary. It's a real pain in the tail when trying to support new WebAssembly features.
For example, check out this snippet from canvas.apps.chrome, which uses WebAssembly threads on Chrome with a non-threaded fallback for mobile and Firefox:
var X;
try {
  // Feature-detect threads: creating a shared WebAssembly.Memory only
  // succeeds (and yields a SharedArrayBuffer) when wasm threads are supported.
  X = (new WebAssembly.Memory({
    initial: 1,
    maximum: 1,
    shared: !0
  })).buffer instanceof SharedArrayBuffer ? !0 : !1
} catch (a) {
  X = !1
}
// Pick the threaded or non-threaded build accordingly.
var ua = r(X ? ["js/threads/ink.js", "defines_threads.js"] : ["js/nothreads/ink.js", "defines.js"])
  , va = ua.next().value
  , wa = ua.next().value;
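These days you can do the same kind of probing with the wasm-feature-detect library before deciding which binary to fetch. A minimal sketch (not from the article; the .wasm file names are placeholders):

import { simd, threads } from 'wasm-feature-detect';

// Probe the engine up front, then fetch only the variant it can validate.
const [hasSimd, hasThreads] = await Promise.all([simd(), threads()]);
const binary =
    hasSimd && hasThreads ? 'backend-threaded-simd.wasm'
  : hasSimd               ? 'backend-simd.wasm'
  : hasThreads            ? 'backend-threaded.wasm'
  :                         'backend.wasm';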
[0]: https://github.com/WebAssembly/conditional-sections
[1]: https://github.com/WebAssembly/simd/issues/356

If I read this right, this is much faster than the WebGL backend on the devices tested.
If the CPU really is faster than the GPU here, that demonstrates just how inefficient the WebGL backend is compared to something like CUDA.
Note that these are light models designed to run quickly on a CPU with batch size 1. It's not that uncommon to see multithreaded CPU code beat the GPU in that setting with other backends as well.
One of the advantages of using the CPU rather than GPU for inference (especially with batch size 1) is that it doesn't need data transfer from host to device, which is a notoriously slow, asynchronous process. This could also explain the difference in total run time, if measured correctly.
Especially since WebGL doesn't have mapped buffers[0]. There's no way to do asynchronous texture (aka data) uploads. At best, you can read back asynchronously but even that's not guaranteed by the spec[1]. Async data transfer gives much higher throughput for sending data and retrieving results.
This is especially painful on mobile where GPU and CPU memory are the same physical RAM, and the "map buffer" operation corresponds to an actual instruction to the memory controller rather than synchronizing memory across PCIe lanes.
[0]: https://www.khronos.org/registry/webgl/specs/latest/2.0/#5.1...
[1]: https://www.khronos.org/registry/webgl/specs/latest/2.0/#3.7... (note the "non-normative" block describing the potential to bypass the specified blocking behavior for getBufferSubData)
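For reference, the non-normative path mentioned above amounts to copying into a pixel-pack buffer, inserting a fence, and polling it before calling getBufferSubData. A rough sketch (not TF.js's actual code), assuming a WebGL2 context with a float-readable framebuffer bound:

async function readPixelsAsync(gl, x, y, width, height) {
  // Kick off a read into a pixel-pack buffer instead of client memory.
  const buf = gl.createBuffer();
  gl.bindBuffer(gl.PIXEL_PACK_BUFFER, buf);
  gl.bufferData(gl.PIXEL_PACK_BUFFER, width * height * 4 * 4, gl.STREAM_READ);
  gl.readPixels(x, y, width, height, gl.RGBA, gl.FLOAT, 0);
  gl.bindBuffer(gl.PIXEL_PACK_BUFFER, null);

  // Poll a fence so getBufferSubData (hopefully) doesn't block the main thread.
  const sync = gl.fenceSync(gl.SYNC_GPU_COMMANDS_COMPLETE, 0);
  gl.flush();
  while (true) {
    const status = gl.clientWaitSync(sync, 0, 0);
    if (status === gl.ALREADY_SIGNALED || status === gl.CONDITION_SATISFIED) break;
    if (status === gl.WAIT_FAILED) throw new Error('clientWaitSync failed');
    await new Promise(requestAnimationFrame);
  }
  gl.deleteSync(sync);

  const out = new Float32Array(width * height * 4);
  gl.bindBuffer(gl.PIXEL_PACK_BUFFER, buf);
  gl.getBufferSubData(gl.PIXEL_PACK_BUFFER, 0, out);
  gl.bindBuffer(gl.PIXEL_PACK_BUFFER, null);
  gl.deleteBuffer(buf);
  return out;
}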
Even WebGL2 doesn't expose compute shaders, so any NN computation has to abuse the graphics pipeline, with many inefficiencies involved: shader dispatch is more expensive, there's no access to local/shared memory, and no control over dispatch (workgroup) sizes. Hopefully the upcoming WebGPU specification will close these efficiency gaps.
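To make the "abusing the graphics pipeline" point concrete, here is a hypothetical fragment shader that computes one output element of a matrix multiply per texel. Every texel redoes a full dot product, with no shared-memory tiling the way a CUDA or compute-shader kernel could:

// Hypothetical GLSL ES 3.00 fragment shader for C = A * B, one texel per output element.
const matmulFrag = `#version 300 es
precision highp float;
uniform sampler2D A;   // M x K matrix, one float per texel
uniform sampler2D B;   // K x N matrix
uniform int K;
out vec4 outColor;
void main() {
  ivec2 coord = ivec2(gl_FragCoord.xy);  // (col, row) of the output texel
  float acc = 0.0;
  for (int k = 0; k < K; ++k) {
    acc += texelFetch(A, ivec2(k, coord.y), 0).r *
           texelFetch(B, ivec2(coord.x, k), 0).r;
  }
  outColor = vec4(acc, 0.0, 0.0, 0.0);
}`;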
As for traditional TensorFlow, the easiest way we found to improve performance (easily 2x) was to find/create builds tailored to our machines. With Python we were using the prebuilt wheels, which (understandably) target a low common denominator of CPU features. If you find or build your own (e.g. if you have AVX-512), you can easily get pretty decent performance gains.
(Yes, there are unofficial wheels for various CPUs, but, not sure if that passes your security requirements.)
Looks a lot like https://github.com/microsoft/onnxjs, except onnx.js adds multithreading via web workers, since threads will take a long time to be available in wasm.
28 ms on a 2018 iPhone without threads or SIMD; 24 ms in Chrome on a 2019 MacBook Pro with threads but no SIMD; 11 ms with SIMD.
What's the use case for TensorFlow on the web / mobile web? I thought TensorFlow was mostly for training models, and my assumption would be that this is mostly relevant in a server/workstation context, but maybe I'm missing something.
Some people care to deliver ML-enhanced services without peeking over the shoulder of their users for every keypress. Client inference performance matters.
> I thought tensorflow was mostly for training models
You need tensorflow to actually use the models trained with tensorflow.
Why? You could export your weights and everything into any other framework, no?
Probably, but looking at that chain of comments I think the emphasis was that you need _something_ like tensorflow to do inference client-side using pre-trained models, not that you need _tensorflow_ specifically to operate on tensorflow models.
Wouldn't you need the topology as well?
TensorFlow.js is the framework you export to from TensorFlow. TensorFlow.js is not TensorFlow; it just shares the brand.
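In practice the export/inference round trip looks roughly like this. A sketch assuming a model already converted with the tensorflowjs_converter CLI (which writes the topology as model.json plus weight shards); the URL, input shape, and single-output assumption are placeholders:

import * as tf from '@tensorflow/tfjs';
import '@tensorflow/tfjs-backend-wasm';  // registers the wasm backend

// Load the converted graph model and run inference on the wasm backend.
await tf.setBackend('wasm');
await tf.ready();
const model = await tf.loadGraphModel('https://example.com/model/model.json');
const output = model.predict(tf.zeros([1, 224, 224, 3]));
output.print();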
I was curious about iPhone performance; looks like the latest one will outperform my MacBook again...
Awesome work Marat.
Couldn't TensorFlow leverage WebGL / WebGPU? Also, it's really sad that there's no WebCL adoption yet.