DLVM: A modern compiler framework for neural network DSLs
dlvm.org

Current tally of high-performance, deep-learning-oriented DSLs/IRs/compilers, in no particular order:
- TensorComprehensions (Facebook): https://github.com/facebookresearch/TensorComprehensions
- XLA (Google): https://www.tensorflow.org/performance/xla/
- taco (MIT): http://tensor-compiler.org/
- DLVM (UIUC): http://dlvm.org/
- nGraph (Intel): http://ngraph.nervanasys.com/docs/cpp/
- TVM (DMLC): https://github.com/dmlc/tvm
Honorable mention to Julia (http://julialang.org) as well.
As far as I know, Tile/PlaidML (Vertex.AI) is the only DSL+compiler that's usable for real workloads across a variety of hardware. https://github.com/plaidml/plaidml
Tensorflow + XLA seems pretty usable. Also, it's generally good practice to note that you're a cofounder of Vertex.AI in discussions like this.
Yes, I'm cofounder and I pretty much live and breathe the company. I see how my comment reads as soulless shilling so I'll lay out my perspective and you can make of it what you will. This is all my personal opinion and not necessarily related to our product or company.
At a basic level I think making new powerful technology accessible to more people is on average strongly positive. There are various efforts making good progress to address different parts of deep learning accessibility such as Keras (developer-friendly Python API), OpenAI (open basic research & safe AI), fast.ai (practical training for developers), etc. I'm a fan of all of that work. PlaidML is the company's contribution to making adoption easier.
For the purposes of proliferation and democratization, making deep learning work on the most readily available hardware helps people get started with less friction. PlaidML is a step in that direction. It's fully open source, and you can 'pip install' it right now on Mac/Win/Linux with Intel/AMD/NVIDIA GPUs and have a Keras net running in a couple of minutes. There are certainly warts and some missing features, but as far as I know it's the only one an ordinary practitioner can use right now.
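Roughly, getting started looks like this (a minimal sketch based on the current README; exact package names and APIs may drift over time):

    # one-time setup from a shell:
    #   pip install plaidml-keras
    #   plaidml-setup          # pick which GPU to use
    #
    # then, in Python, install the PlaidML backend before importing Keras:
    import plaidml.keras
    plaidml.keras.install_backend()

    import numpy as np
    from keras.applications import MobileNet

    model = MobileNet()                               # weights download on first run
    x = np.random.rand(1, 224, 224, 3).astype("float32")
    print(model.predict(x).shape)                     # (1, 1000), executed on the selected GPU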
From a "what problem does this solve" standpoint PlaidML is most similar to Tensor Comprehensions and TVM. Each makes different tradeoffs but might eventually be able to share components like code generation for OpenCL, LLVM, etc. Layers like XLA, nGraph, ONNX, NNVM, etc, you can mostly think of as being stacked on top (they are ways to talk to lower layer runtimes like PlaidML). For example it would be reasonable for a future version of PlaidML to support TensorFlow integration via XLA or deployment of ONNX models on OpenCL-capable GPUs.
Anyway, I personally care most about what people can use. There's a cute demo that will run the pre-trained Keras examples against images from your webcam on your local GPU. It's quick to try and can serve as the basis for prototyping a real application: https://github.com/plaidml/plaidvision
Why are all the neural network DSLs JIT-obsessed?
Lots of modern models have very late-binding variables which are hard to precompile for (sentence length in NMT, for example). That means you're going to need to do some form of specialization at runtime, so a JIT makes sense.
Just treat it as an unbounded loop; there's no need to JIT-compile an optimized version that late.
One of the core operations of the transformer network[1] is an (L x L) x (L x E) matrix multiply (where L is the sentence length and E is the network width). Can you be more specific about how you would get good performance without specializing on L?
You use a loop-based GEMM kernel and inject the input sizes as the loop bounds.
L can be as small as 1 or larger than 512. For small L it makes sense to do different optimizations than for large L. A loop-based GEMM doesn't help with that.
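To make the shapes concrete, here's a toy numpy sketch of the product under discussion (illustration only, not code from any of these frameworks):

    import numpy as np

    E = 512                                    # network width (fixed at model definition)

    def attention_matmul(scores, values):
        # scores: (L, L) attention weights, values: (L, E) activations;
        # L (the sentence length) is only known at runtime
        return scores @ values                 # (L, L) x (L, E) -> (L, E)

    for L in (1, 16, 512):                     # L varies wildly between inputs
        out = attention_matmul(np.random.rand(L, L), np.random.rand(L, E))
        print(L, out.shape)

    # a generic loop-based GEMM uses one schedule for every L above;
    # a JIT can specialize the kernel (tiling, vectorization) per observed L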
Well, the success of neural nets over the past few years has come through harnessing massive processing power.
The problem is a lot of the programming can be low level and ad-hoc. I think the idea of the various DSLs is to allow the model to be compactly specified while having the programs go as fast as possible. A JIT may be one way to accomplish this.
Relying on a JIT often invites constructs that preclude static compilation, which means that accelerated hardware, like ours, cannot be used efficiently.
The kind of optimizations a static compiler might apply can be done by a JIT as well, with the added benefit of actually knowing what kind of workload is going to run. Most of deep learning is applying comparatively small computation graphs to very large arrays of numbers in parallel, so the overhead of compilation is only a small portion of the overall computation time. A smart JIT that decides on the optimal tiling pattern for the array dimensions observed at runtime and rewrites loops accordingly can easily pay for itself.
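A crude illustration of that runtime-tiling idea, in pure Python with made-up heuristics (not taken from any particular JIT):

    import numpy as np

    def pick_tile(n):
        # hypothetical heuristic: tiny dimensions get no tiling,
        # larger ones get a cache-friendly block size
        if n <= 64:
            return n
        return 128 if n % 128 == 0 else 64

    def tiled_matmul(a, b):
        # choose block sizes once the actual dimensions are known,
        # i.e. the information a JIT has and an ahead-of-time compile must guess at
        m, k = a.shape
        _, n = b.shape
        tm, tn = pick_tile(m), pick_tile(n)
        out = np.zeros((m, n))
        for i in range(0, m, tm):
            for j in range(0, n, tn):
                out[i:i+tm, j:j+tn] = a[i:i+tm, :] @ b[:, j:j+tn]
        return out

    a, b = np.random.rand(256, 512), np.random.rand(512, 384)
    assert np.allclose(tiled_matmul(a, b), a @ b)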
As a counterpoint (and not necessarily the one the GP is referring to), if you were compiling for, say, an FPGA, the overhead of compilation would be very significant.
Our processor is analogous to a CGRA, so compilation to it would indeed be hindered by a JIT-based compiler.
That's "course-grained reconfigurable architecture", for anyone else who didn't know.
What does it mean by "modern"?