Differentiable Programming – A Simple Introduction

assemblyai.com

159 points by dylanbfox 4 years ago · 49 comments

infogulch 4 years ago

The most interesting thing I've seen on AD is "The simple essence of automatic differentiation" (2018) [1]. See past discussion [2], and talk [3]. I think the main idea is that by compiling to categories and pairing up a function with its derivative, the pair becomes trivially composable in forward mode, and the whole structure is easily converted to reverse mode afterwards.

[1]: https://dl.acm.org/doi/10.1145/3236765

[2]: https://news.ycombinator.com/item?id=18306860

[3]: Talk at Microsoft Research: https://www.youtube.com/watch?v=ne99laPUxN4 Other presentations listed here: https://github.com/conal/essence-of-ad
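
As a rough sketch of that pairing idea (my own toy Python, not code from the paper, and only for scalar functions): bundle each function with its derivative, and composition via the chain rule gives forward mode for free.

    import math
    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class D:
        """A scalar function bundled with its derivative function."""
        f: Callable[[float], float]
        df: Callable[[float], float]

        def compose(self, inner: "D") -> "D":
            # (self ∘ inner), with its derivative supplied by the chain rule
            return D(
                f=lambda x: self.f(inner.f(x)),
                df=lambda x: self.df(inner.f(x)) * inner.df(x),
            )

    sin = D(math.sin, math.cos)
    square = D(lambda x: x * x, lambda x: 2 * x)

    h = sin.compose(square)          # h(x) = sin(x^2)
    print(h.f(1.5), h.df(1.5))       # value and derivative at x = 1.5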

  • tome 4 years ago

    > the whole structure is easily converted to reverse mode afterwards.

    Unfortunately it's not. Elliot never actually demonstrates in the paper how to implement such an algorithm, and it's very hard to write compiler transformations in "categorical form".

    (Disclosure: I'm the author of another paper on AD.)

    • orbifold 4 years ago

      I think JAX effectively demonstrates that this is indeed possible. The approach they use is to first linearise the JAXPR and then transpose it, pretty much in the same fashion as the Elliot paper did.
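
      A minimal sketch of that linearize-then-transpose pattern using JAX's public API (my own example; f here is just an arbitrary placeholder function):

          import jax
          import jax.numpy as jnp

          def f(x):
              return jnp.sin(x) * x             # any differentiable function

          x = jnp.array([1.0, 2.0, 3.0])

          # forward mode: linearize f at x into a linear map on tangents
          y, f_lin = jax.linearize(f, x)

          # reverse mode: transpose that linear map so it acts on cotangents
          f_lin_t = jax.linear_transpose(f_lin, x)

          print(f_lin(jnp.ones_like(x)))        # JVP in the direction (1, 1, 1)
          print(f_lin_t(jnp.ones_like(y)))      # VJP with cotangent (1, 1, 1)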

      • tome 4 years ago

        A JAXPR is a normal functional-style expression tree, with free variables, like

            let b = f a in g (a, b)
        
        Elliot's "compiling to categories" requires you to translate this to

            g ∘ (id × f) ∘ dup
        
        It's pretty baffling to work with terms like the latter in practice! The main ideas that JAX is based on were already published in "Lambda the ultimate backpropagator" in 2008. Elliot's work is a nice way of conceptualising the AD transformations and understanding how they all relate to each other, but it's not particularly practical.
    • amkkma 4 years ago

      which paper?

yauneyz 4 years ago

My professor has talked about this. He thinks that the real gem of the deep learning revolution is the ability to take the derivative of arbitrary code and use that to optimize. Deep learning is just one application of that, but there are tons more.
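
To make that concrete, here is a toy sketch of my own (using JAX; the closed-form projectile range just stands in for "arbitrary code"): differentiate a small program with respect to one of its inputs and run gradient descent on it.

    import jax
    import jax.numpy as jnp

    def landing_distance(angle, v0=30.0, g=9.81):
        # range of an ideal projectile; any differentiable program could go here
        return v0 ** 2 * jnp.sin(2 * angle) / g

    def loss(angle, target=80.0):
        return (landing_distance(angle) - target) ** 2

    loss_grad = jax.grad(loss)           # derivative of the whole program

    angle = jnp.array(0.3)
    for _ in range(500):                 # plain gradient descent on the launch angle
        angle = angle - 1e-5 * loss_grad(angle)

    print(angle, landing_distance(angle))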

  • SleekEagle 4 years ago

    That's part of why Julia is so exciting! Building it specifically to be a differentiable programming language opens so many doors ...

    • mountainriver 4 years ago

      Julia wasn't really built specifically to be differentiable; it was just built in a way that gives you access to the IR, which is what Zygote exploits. Enzyme AD is the most exciting to me because any LLVM language can be differentiated.

      • SleekEagle 4 years ago

        Ah I see, thank you for clarifying. And thank you for bringing Enzyme to my attention - I've never seen it before!

        • celrod 4 years ago

          Enzyme.jl works quite well (but the possibility of using it across languages is appealing).

  • melony 4 years ago

    I am just happy that the previously siloed fields of operations research and various control theory sub-disciplines are now incentivized to pool their research together thanks to the funding in ML. Also, a lot of expensive, proprietary optimization software in industry is finally getting some competition.

    • SleekEagle 4 years ago

      Hm, I didn't know different areas of control theory were siloed. Learning about control theory in graduate school was awesome, and it seems like a field that would benefit from ML a lot. I know RL agents are used for control tasks like e.g. cartpole, but I would've thought it would be more widespread! Do you think the development of differentiable programming (i.e. the observation that there is more generality beyond pure ML/DL) was really the missing piece?

      Also, just curious, what are your studies in?

      • melony 4 years ago

        Control theory has a very, very long parallel history alongside ML. ML, specifically probabilistic and reinforcement learning, uses a lot of dynamic programming ideas and Bellman equations in its theoretical modeling. Look up the term cybernetics; it is an old, pre-internet term for control theory and optimization. The Soviets even had a grand scheme to build networked factories that could be centrally optimized and resource-allocated. Their Slavic communist AWS-meets-Walmart efforts spawned a Nobel laureate; Kantorovich was given the award for inventing linear programming.

        Unfortunately, the CS field is only just rediscovering control theory, while it has been a staple of EE for years. However, there hadn't been many new innovations in the field until recently, when ML became the hottest new thing.

        • erwincoumans 4 years ago

          Interesting, I didn't know about cybernetics.

          For those interested, there is a book that discusses both: 'Reinforcement Learning and Optimal Control' by Dimitri P. Bertsekas. It covers exact and approximate dynamic programming, finite and infinite horizon problems, deterministic and stochastic models, and model-based and model-free optimization.

          Aside from this book, Ben Recht has an interesting blog series about optimal control and reinforcement learning: http://www.argmin.net/2018/06/25/outsider-rl

        • SleekEagle 4 years ago

          This is some insanely cool history! I had no idea the Soviets had such a technical vision, that's actually pretty amazing. I've heard the term "cybernetics" but honestly just thought it was some movie-tech term, lol.

          It seems really weird that control theory is in EE departments considering it's sooo much more mathematical than most EE subdisciplines except signals processing. I remember a math professor of mine telling us about optimization techniques that control systems practitioners would know more about than applied mathematicians because they were developed specifically for the field, can't remember what the techniques were though ...

          • melony 4 years ago

            There is an excellent HN-recommended novel called Red Plenty that dramatised those efforts on the other side of the Atlantic.

            https://news.ycombinator.com/item?id=8417882

            > It seems really weird that control theory is in EE departments considering it's sooo much more mathematical than most EE subdisciplines except signals processing.

            I agree. Apparently Bellman chose the name "dynamic programming" because he needed grant funding during the Cold War and was advised to give his mathematical work a more "interesting" name.

            https://en.m.wikipedia.org/wiki/Dynamic_programming#History

            The generalised form of the Bellman equation (co-formulated by Kalman, of Kalman filter fame) is in some ways to control theory and EE what the maximum likelihood function is to ML.

            https://en.m.wikipedia.org/wiki/Hamilton%E2%80%93Jacobi%E2%8...

            • SleekEagle 4 years ago

              Looks really cool, added to my amazon cart. Thanks for the rec!

              That's hilarious and sadly insightful. I remember thinking "what the hell is so 'dynamic' about this?" the first time I learned about dynamic programming. Although "memoitative programming" sounds pretty fancy too, lol

          • dreamcompiler 4 years ago

            Laplace transforms are one such trick. Given a linear differential equation describing your system, Laplace transforms let you solve it using basic algebra. Unfortunately this doesn't work on nonlinear systems.
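
            As a one-line worked example (standard textbook material, nothing specific to this thread):

                \dot{y} + a y = 0,\ y(0) = y_0
                  \;\xrightarrow{\ \mathcal{L}\ }\; sY(s) - y_0 + a Y(s) = 0
                  \;\Rightarrow\; Y(s) = \frac{y_0}{s + a}
                  \;\xrightarrow{\ \mathcal{L}^{-1}\ }\; y(t) = y_0 e^{-a t}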

  • potbelly83 4 years ago

    How do you differentiate a string? Enum?

    • 6gvONxR4sf7o 4 years ago

      The answer to that is a huge part of the NLP field. The current answer is that you break down the string into constituent parts and map each of them into a high dimensional space. “cat” becomes a large vector whose position is continuous and therefore differentiable. “the cat” probably becomes a pair of vectors.

      • nerdponx 4 years ago

        It's weirder than that. You typically are differentiating a loss function of strings and various opaque weights. You are optimizing the loss function over the weight space, so in some informal contrived sense you are actually differentiating with respect to the string.
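
        A minimal sketch of that setup (my own toy example, not any particular library's API): the string becomes integer token ids, the ids index an embedding table, and the gradient lands on the table's weights rather than on the string itself.

            import jax
            import jax.numpy as jnp

            vocab = {"the": 0, "cat": 1, "sat": 2}        # toy vocabulary
            table = jnp.full((len(vocab), 4), 0.1)        # toy embedding table

            def loss(table, token_ids):
                vectors = table[token_ids]                # "string" -> vectors
                return jnp.sum(vectors ** 2)              # some differentiable loss

            ids = jnp.array([vocab["the"], vocab["cat"]])
            grads = jax.grad(loss)(table, ids)            # d(loss) / d(embedding weights)
            print(grads)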

    • geysersam 4 years ago

      Not all functions are differentiable.

      Sometimes there are other better ways to describe "how does changing x affect y". Derivatives are powerful but they are not the only possible description of such relationships.

      I'm very excited for what other things future "compilers" will be able to do to programs besides differentiation. That's just the beginning.

    • titanomachy 4 years ago

      If you were dealing with e.g. English words rather than arbitrary strings, one approach would be to treat each word as a point in n-dimensional space. Then you can use continuous (and differentiable) functions to output into that space.

    • adgjlsfhk1 4 years ago

      generally you consider them to be piecewise constant.

choeger 4 years ago

Nice article, but the intro is a little lengthy.

I have one remark, though: If your language allows for automatic differentiation already, why do you bother with a neural network in the first place?

I think you should have a good reason why you choose a neural network for your approximation of the inverse function and why it has exactly that number of layers. For instance, why shouldn't a simple polynomial suffice? Could it be that your neural network ends up as an approximation of the Taylor expansion of your inverse function?

  • ChrisRackauckas 4 years ago

    It's a good question, which I answer in much more depth here: https://www.reddit.com/r/MachineLearning/comments/ryw53x/d_f... . Specifically, that answer is about NNs vs. Fourier series, but it's the same point: polynomials, Fourier series, and NNs are all universal function approximators, so why use a NN? If you write out what happens, though, you can easily see the curse of dimensionality in action and see why a 100-dimensional function shouldn't be approximated by tensor products of polynomials. See the rest there.

    So it's not really about NNs vs. polynomials/Fourier/Chebyshev, etc., but rather what the right thing to use is at a given time. In the universal differential equations paper (https://arxiv.org/abs/2001.04385), we demonstrate that some examples of automatic equation discovery are faster using Fourier series than NNs (specifically, the discovery of a semilinear partial differential equation). That doesn't mean that NNs are a bad way to do all of this; it just means that they are one trainable object that is good for high-dimensional approximations, but libraries should let you easily move between classical basis functions and NNs to achieve the best performance.
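
    To make the counting concrete, here is a back-of-the-envelope sketch of my own (the layer widths are arbitrary, not from the paper): a full tensor-product polynomial basis of degree d in n dimensions has (d + 1)^n coefficients, while a dense NN's parameter count grows only with the layer widths.

        # coefficient count for a full tensor-product polynomial basis of degree d
        def tensor_product_terms(n_dims, degree):
            return (degree + 1) ** n_dims

        # parameter count for a dense NN with the given layer widths (weights + biases)
        def mlp_params(widths):
            return sum(a * b + b for a, b in zip(widths[:-1], widths[1:]))

        print(tensor_product_terms(100, 4))       # 5**100, roughly 8e69 coefficients
        print(mlp_params([100, 256, 256, 1]))     # about 92,000 parameters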

  • SleekEagle 4 years ago

    I think for more complicated examples like RL control systems a neural network is the natural choice. If you can incorporate physics into your world model then you'd need differentiable programming + NNs, right? Or am I misunderstanding the question.

    If you're talking about the specific cannon problem, you don't need to do any learning at all; you can just solve the kinematics. So in some sense you could ask why you're using any approximation function at all.

  • simulate-me 4 years ago

    Requiring an exact justification for a specific NN architecture is not productive. Simply put, such justification almost never exists, yet NNs clearly outperform simple polynomials in a wide variety of tasks. Even if you know that a NN performs better than a simple polynomial on a task, you're very unlikely to get an exact explanation of why that specific NN was chosen over the infinite number of other possible NNs. If you require such an explanation, then you're going to miss out on a lot of performance.

  • potatoman22 4 years ago

    Most of the time we hear about neural networks because the linear models didn't work.

  • adgjlsfhk1 4 years ago

    In low-dimensional spaces there are a lot of good techniques, but once you go above 10 or so dimensions, NNs are generally the only option. Pretty much everything else degrades exponentially with more dimensions.

PartiallyTyped 4 years ago

The nice thing about differentiable programming is that we can use all sorts of optimizers beyond gradient descent, some of which offer quadratic convergence instead of linear!
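
For instance, here is a minimal sketch of my own (using JAX for exact gradients and Hessians; the objective is an arbitrary smooth test function): Newton's method reuses the same AD machinery and converges quadratically near the optimum.

    import jax
    import jax.numpy as jnp

    def f(x):
        return jnp.sum((x - 3.0) ** 2 + 0.1 * x ** 4)    # smooth, convex objective

    grad_f = jax.grad(f)
    hess_f = jax.hessian(f)

    x = jnp.array([10.0, -5.0])
    for _ in range(10):
        # Newton step: solve H dx = g instead of stepping along -g
        x = x - jnp.linalg.solve(hess_f(x), grad_f(x))

    print(x, f(x))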

fennecs 4 years ago

Does someone have an example where the ability to “differentiate” a program gets you something interesting?

I understand perfectly what it means for a neural network, but how about more abstract things?

I'm not even sure that, as currently presented, the implementation actually means something. What is the derivative of a function like List, Sort, or GroupBy? These articles all assume that it somehow just looks like the derivative from calculus.

Approximating everything as some non-smooth real function doesn't seem entirely morally correct. A program is more discrete, or synthetic. I think it should be a bit more algebraically flavoured, like differentials over a ring.

fghorow 4 years ago

At first glance, this approach appears to re-invent an applied mathematics approach to optimal control. There, one writes a generalized Hamiltonian, from which forward and backward-in-time paths can be iterated.

The Pontryagin maximum (or minimum, if you define your objective function with a minus sign) principle is the essence of that approach to optimal control.
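
For reference, one common statement of that setup (standard notation; sign and normalization conventions vary across textbooks):

    H(x, u, \lambda, t) = \lambda^{\top} f(x, u, t) + L(x, u, t), \qquad
    \dot{x} = \frac{\partial H}{\partial \lambda}, \qquad
    \dot{\lambda} = -\frac{\partial H}{\partial x}, \qquad
    u^{*}(t) = \arg\max_{u} H(x^{*}(t), u, \lambda(t), t)

Here L is the running objective being maximized; the state x is integrated forward in time and the costate \lambda backward, which is the forward/backward iteration mentioned above.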

  • SleekEagle 4 years ago

    I've never heard of Pontryagin's maximum principle, but thanks for bringing it to my attention. I think my knowledge of dynamical systems and control theory isn't quite up to snuff at the moment to fully understand it, but it's bookmarked for another day! Thanks again.

noobermin 4 years ago

The article is okay but it would have helped to have labelled the axes of the graphs.
