Deep Neural Networks from Scratch in Zig
monadmonkey.com

I've been working on a mix of ML performance and abstraction for a while, and this is a breath of fresh air. Zig is truly a fascinating language, and the semantics revealed in this post are surprisingly powerful.
"By hard coding the forward and backward methods at comptime we have some more comfort with the general correctness and expected errors we would receive if we passed in incorrectly shaped data at runtime."
This solves the issue of shape checking incredibly cleanly. Python libraries have been struggling with this for a decade. It seems like you could also extend comptime usage to calculate allocations ahead of time.
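As a rough illustration of that comptime shape checking (my own minimal sketch, not code from the post; it assumes Zig 0.11+ for the multi-item for syntax):

    // Hypothetical sketch: layer sizes are comptime parameters, so
    // wrongly shaped data fails to compile.
    fn DenseLayer(comptime in: usize, comptime out: usize) type {
        return struct {
            weights: [in * out]f32 = [_]f32{0} ** (in * out),

            pub fn forward(self: *@This(), x: *const [in]f32, y: *[out]f32) void {
                for (y, 0..) |*yj, j| {
                    var acc: f32 = 0;
                    for (x, 0..) |xi, i| acc += xi * self.weights[j * in + i];
                    yj.* = acc;
                }
            }
        };
    }

    test "shapes are checked at compile time" {
        var layer = DenseLayer(784, 10){};
        const x = [_]f32{0} ** 784;
        var y: [10]f32 = undefined;
        layer.forward(&x, &y); // passing a &[100]f32 here would not compile
    }

Instantiating DenseLayer(784, 10) and then passing a [100]f32 input is rejected at compile time, which is the comfort the quoted passage is describing.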
Honestly, this whole thing makes me want to invest quite a bit of time into using Zig. Great post!
What does this offer that basic dependent typing doesn't?
nothing really, that's effectively what's achieved by comptime.
it's funny to see it written as "basic" dependent typing, considering it's an extremely complex type system to implement.
This is interesting, thanks. I know this is just a demo, but I'm convinced we're going to see a swing back to simple, purpose-written NNs for many simple applications, when the alternative is bringing in a couple GB of Python and CUDA libraries, which is serious overkill for something on the scale of MNIST (which many real problems are).
That said, I'm curious how well the compiler can optimize matrix operations, in Zig or in other languages like C or Rust, and when it's worth linking in BLAS, MKL, or some other library. I wonder if there's a sweet spot where it's worth doing.
I didn't investigate this more deeply, but I think if you used an arena allocator on each iteration of the loop, and simply freed the whole allocator after each iteration, you might get much better performance.
EDIT: the MNIST link requires authentication: http://yann.lecun.com/exdb/mnist/ How can we get access to the test run at the end?
Why would any DNN ever need to deallocate other than at program termination? (And allocate other than at startup)
That's the point. With an arena allocator, you don't "really" de-allocate the OS memory; you just free the bytes to be re-used again in the next iteration (by using a FixedBufferAllocator as the backing allocator, if you know how much memory will be needed in advance). This is much better than trying to re-use the structs manually as the OP mentioned at the end.
EDIT: FixedBufferAllocator has a reset method, so maybe you just need to use that to "free" all memory on each iteration? Perhaps using that directly is enough if you know the max memory usage at compile time... I was just thinking that it should be possible to use an arena allocator to basically tell the backing allocator to make all memory available again without freeing it back to the OS?! If anyone knows more about allocators I would be happy to know. https://ziglang.org/documentation/master/std/#A;std:heap.Fix...
FixedBufferAllocator is just a "bump allocator", i.e., just a size and a pointer that moves forward. You allocate the memory it uses from somewhere else (stack, a big alloc from a general purpose allocator (think malloc), some pages from the OS via mmap or VirtualAlloc, etc. -- zig has simple cross platform functions for all of these).
Individual "allocations" just bump the pointer forward, and it succeeds unless the result is past the end of the fixed size given to it initially. You don't free any individual allocations (except maybe the most recent one by just moving the pointer back again), you just reset the pointer to the very start when you're done with all the allocations.
> the MNIST link requires authentication
doesn't ask for auth for me?
It's working now! I guess someone opened access after I checked.
I just tried and they downloaded for me...
The lack of a linear algebra / tensor library is what kills performance for DNNs. For example, this kind of element-by-element manipulation must be avoided:
    while (i < I * O) : (i += 1) {
        self.weights[i] -= 0.01 * grads[i];
    }
What are the options for vector / matrix / tensor operations in Zig?

1. The compiler will vectorize simple operations like that pretty well.
2. Zig has built-in @Vector types that are fixed-size data types designed to compile down to things like SIMD as efficiently as possible, given that you might be asking it to do 16-wide operations on a CPU that only supports 8-wide SIMD. You'd often write your high-level code as a runtime-known iteration count over those comptime-known vector widths (there's a sketch of this at the end of this comment).
2a. Inline assembly or inspecting the architecture before choosing the @Vector width are both options, so you can write your high-level code with that information in mind if necessary (e.g., to make Bolt vector quantization work well in Zig I'm pretty sure you need to inline-assembly one of the swizzling operations).
3. You can always link to LAPACK and friends. Zig has a great C interface.
4. Matrix/tensor ops aren't built-in. That doesn't matter for a lot of what this demo shows since it'll be RAM/cache bandwidth bound, but you'd definitely need to link in or hand-code inner product routines to have better asymptotics and cache friendliness if you were doing too many large matrix multiplies.
5. Wrapping any of the above into a library would be pretty easy. That sort of code is easy to write, so I haven't looked to see what other people have made in that space, but I'm sure there's something.
And their async/await is being redesigned IIRC, but at least the last version would make for a low overhead (in both developer time and runtime) way to parallelize any of the above. In the old version it was totally trivial to write a high performance parallel variant (knight's move and whatnot) sudoku solver, and I doubt the new version will be any worse. Not that the interior of a linear algebra operation is often the best place to apply parallelization, but if you had a good reason to do so it wouldn't be hard.
I'm not aware of anything in particular that would make multi-machine computations even slightly less painful than other languages, but maybe someone can chime in here with ideas.
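To make point 2 concrete, here's a rough @Vector version of the weight-update loop quoted earlier (my own sketch, not from the post: the 8-lane width is a placeholder you'd tune per target, and the @splat-with-inferred-result-type form assumes Zig 0.11 or newer):

    // Vectorized SGD step: w -= lr * g, processed in comptime-known lanes.
    fn sgdStep(weights: []f32, grads: []const f32, lr: f32) void {
        const lanes = 8; // placeholder; pick per target CPU
        const V = @Vector(lanes, f32);
        const lr_v: V = @splat(lr);

        var i: usize = 0;
        while (i + lanes <= weights.len) : (i += lanes) {
            const w: V = weights[i..][0..lanes].*;
            const g: V = grads[i..][0..lanes].*;
            weights[i..][0..lanes].* = w - lr_v * g;
        }
        // Scalar tail for lengths that aren't a multiple of the lane count.
        while (i < weights.len) : (i += 1) {
            weights[i] -= lr * grads[i];
        }
    }

The pattern is the one described above: a runtime loop over comptime-known vector widths, plus a scalar tail.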
C++ / Eigen performs a lot of optimizations that you might expect from a Fortran compiler. For example, its ability to broadcast and perform reductions is pretty slick, e.g.:

    X.noalias() = (A - B).colwise().squaredNorm().mean();

Short of having tensor cores, this does a good job of static optimization for underlying vectorization support.
> 2a. Inline assembly or inspecting the architecture before choosing the @Vector width are both options, so you can write your high-level code with that information in mind if necessary (e.g., to make Bolt vector quantization work well in Zig I'm pretty sure you need to inline-assembly one of the swizzling operations).
Inline assembly is great but support for intrinsics would be really valuable for Zig IMO.
You can link in intrinsics too. Again, C compatibility is high.
I could write a `pub fn main()` that consisted only of a call to a C library that implemented world_peace(). But it wouldn't change the fact that "support for intrinsics would be really valuable for Zig IMO."
The goal of the post is to de-mystify, not to make something production-ready. Doing everything in plain code without any magic libraries helps people understand what's happening.
To echo the sibling, while this should be avoided in Python, in languages like Zig, C++, Julia, and Rust you can expect the compiler to SIMD-ify these expressions.
Somewhat, but I think people vastly over-estimate compilers' ability to do that.
A common example: if there's any accumulation/reduction, compilers will almost entirely fail to generate SIMD unless you use -funsafe-math-optimizations-type flags, because of the non-associativity of floating point. Sum of squares is the classic example (not saying that specific operation is used in NNs); there's a sketch of the explicit version at the end of this comment.
Explicit vectorization (e.g., using intrinsics) is almost always a relatively simple way to get orders of magnitude speedup compared to auto-vectorization, because of the above. Also because data layouts usually need to change as well (AoS vs SoA, etc.), though NN people seem to write decent data layouts.
I don't have any experience with `#pragma omp` type approaches which may be a middle ground.
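For what it's worth, here is what the explicit version of that reduction might look like in Zig (my own sketch rather than intrinsics; the lane count is a placeholder, and the @splat/@reduce forms used here assume Zig 0.11+). Keeping several partial sums fixes the summation order ourselves, which is exactly the reassociation the compiler isn't allowed to do without unsafe-math flags:

    // Explicit sum of squares: `lanes` independent partial sums, then a
    // single horizontal add at the end.
    fn sumOfSquares(xs: []const f32) f32 {
        const lanes = 8; // placeholder width
        const V = @Vector(lanes, f32);
        var acc: V = @splat(0.0);

        var i: usize = 0;
        while (i + lanes <= xs.len) : (i += lanes) {
            const v: V = xs[i..][0..lanes].*;
            acc += v * v; // each lane accumulates independently
        }
        var total = @reduce(.Add, acc); // horizontal add once at the end
        while (i < xs.len) : (i += 1) total += xs[i] * xs[i];
        return total;
    }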
You need to move these sections onto the GPU. Neither Zig, Rust, C++, nor C can do that alone; you'll need to use CUDA or Futhark. That's the real speedup for these types of things.
Optimization in this case largely concerns memory management on the GPU and keeping data transfer between CPU and GPU to a minimum.
Essentially a massively parallel API combined with a massively parallel processor. I'm thinking of doing an end-to-end tutorial about this.
It would be cool to see a port of llama.cpp and ggml to Zig.
I wonder how much of the heavy lifting on GGML can be done with `zig translate-c ggml.c`
GGML seems to be just one gigantic 10K LOC C file anyways...
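As a rough alternative to translating the whole file, you could also pull the header in directly via @cImport; a minimal sketch (it assumes ggml.h is on the include path and the compiled ggml object is linked in via build.zig, and the symbols named in the comment are just illustrative upstream ggml functions):

    const c = @cImport({
        @cInclude("ggml.h");
    });

    // From here the translated API (e.g. c.ggml_init, c.ggml_new_tensor_1d)
    // is callable as ordinary Zig functions.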
Awesome, thanks for the write up! Profiling would be a great addition to see.
I want DNN from scratch in Scratch.
> I would guess the constant memory allocation and frees in the training loop are the bottleneck
No, the bottleneck would be not utilizing the idling GPU.
Using the CPU with quantized weights on GPT models makes sense (llama.cpp is an example), because those models are constrained by memory bandwidth rather than compute (low arithmetic intensity).
LLaMA is an LLM and has very little to do with a model trying to learn MNIST. CNNs in particular gain from using a GPU (or ten), as they're optimizing weights for convolutional/spatial kernels.
Pls see the post from version_five nearby: https://news.ycombinator.com/item?id=35699260
> simple, purpose written NNs for many simple applications [... as opposed to] python and cuda libraries
You can go 100x faster using SIMD on the CPU instead of doing the linear algebra element by element, then another order of magnitude or two on the GPU.
Machine learning feels a lot like CPU fabrication in that a homegrown solution is almost certainly going to be inferior to just taking what's on the market and configuring it to fit your needs. If you aren't going to specialize in this field, is there a point in learning how to roll your own ML anymore?