Mandala: Python programs that save, query and version themselves

amakelov.github.io

34 points by benkuhn 3 years ago · 4 comments

emptysongglass 3 years ago

I write Python (not well) and I don't understand any of this:

> mandala is a tool that implements this idea. It turns function calls into interlinked, queriable, content-versioned data that is automatically co-evolved with your codebase. By applying different semantics to this data, the same ordinary Python code, including control flow and collections, can be used to not only compute, but also save, load, query, batch-compute, and combinations of those.

I wanted to call it out because I think it's such an obvious example of where programmers need to find ways of more clearly expressing their programs in ways everyone can understand.

  • amakelov 3 years ago

    Hi, author here. Sorry about the confusion - this blog post's intention was to give a more programming-language-themed introduction to the project (discussion on r/programminglanguages is here: https://www.reddit.com/r/ProgrammingLanguages/comments/12im9...). As such, it doesn't really talk very much about what you'd actually use this for - tracking computational experiments (for example). As for the part you highlighted, I was looking for a two-sentence summary that conveys the main features of the project... well, I guess I succeeded with the two-sentence part! :)

    Let me try to unpack this in a hopefully saner format:

    - the whole thing is based on memoization. You put a memoization decorator on all the functions whose results you want to persist/reuse, and you compose entire programs by calling such functions on the outputs of other functions. In between calls, you can also mix in data structure operations - so if, say, a function returns a list, you can call another memoized function on just one element of that list.
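    This composition style can be sketched (hypothetically - this is not mandala's actual API) with a plain in-memory memoization decorator:

    ```python
    import functools

    _store = {}  # stand-in for persistent storage

    def memoized(f):
        # Persist results keyed by function name + inputs, so re-running
        # the same call reuses the saved value instead of recomputing.
        @functools.wraps(f)
        def wrapper(*args):
            key = (f.__name__, args)
            if key not in _store:
                _store[key] = f(*args)
            return _store[key]
        return wrapper

    @memoized
    def increment(a):
        return a + 1

    @memoized
    def add(a, b):
        return a + b

    # Compose memoized calls: the output of one feeds into another.
    x = increment(1)           # computed and saved
    y = add(1, x)              # computed and saved
    y2 = add(1, increment(1))  # both calls are served from the store
    ```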

    - as such programs execute, some metadata is passed around that links together the inputs and outputs of each call, and the elements of each data structure. This dynamically builds a computational graph of the program behind the scenes.
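    A hypothetical sketch of this linking (again, not the real implementation): wrap each value in a reference that records which call produced it, so composing calls builds a provenance graph as a side effect.

    ```python
    class Ref:
        """A value plus a record of the call that produced it."""
        def __init__(self, value, producer=None):
            self.value = value
            self.producer = producer  # (function name, input Refs), or None

    def traced(f):
        def wrapper(*args):
            refs = [a if isinstance(a, Ref) else Ref(a) for a in args]
            out = f(*(r.value for r in refs))
            return Ref(out, producer=(f.__name__, refs))
        return wrapper

    @traced
    def increment(a):
        return a + 1

    @traced
    def add(a, b):
        return a + b

    y = add(1, increment(1))
    # Walk the graph: y came from `add`, and its second input came
    # from a call to `increment`.
    producer_of_y = y.producer[0]
    producer_of_second_input = y.producer[1][1].producer[0]
    ```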

    - why would you need this graph? One reason is that there is a principled way to extract a SQL query from this graph that pattern-matches to all analogous computational graphs that have already been memoized. This gives you a flexible and natural interface to ask the queries you usually ask of your programs - "How do these outputs depend on these inputs across all the experiments of this kind?" - without writing extra code.

    - since memoized results can go stale when you change your code, there is also a versioning system that tracks the dependencies each memoized call accessed, and alerts you when a dependency changes. Tracking dependencies dynamically on a per-call basis, rather than statically on a per-function basis, gives you more opportunity to reuse computation automatically - for example, a function can have two branches that depend on different dependencies; if only one of them changes, you won't recompute the calls that used the other. The "content-versioned" part refers to how the system recognizes which version a function is at: by looking at its code (content), instead of by you explicitly providing some arbitrary version name/number. This means that, e.g., you can "go back in time" with respect to a given function by restoring its old code, and the storage will recognize that it's back in that world.
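    The content-versioning idea can be illustrated (hypothetical sketch, not mandala's code) by identifying a function's version with a hash of its source:

    ```python
    import hashlib

    def content_version(source: str) -> str:
        # The "version" is just a hash of the code itself - no manual
        # version names or numbers.
        return hashlib.sha256(source.encode()).hexdigest()[:12]

    old_src = "def increment(a):\n    return a + 1\n"
    new_src = "def increment(a):\n    return a + 2\n"

    v_old = content_version(old_src)
    v_new = content_version(new_src)       # editing the code -> a new version
    v_restored = content_version(old_src)  # restoring the old code -> the old version again
    ```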

    I hope that helps clarify things, and thanks a lot for bringing this up.

    • xcombelle 3 years ago

      Is the graph/SQL thing used only to manage dependencies?

      • amakelov 3 years ago

        The dependency tracking and the graph/SQL thing are largely independent pieces.

        The graph/SQL thing is used to query the memoization tables of the memoized functions by "joining them along a given computational graph". This roughly means that you can point to a computation, and ask: give me a table of the values of such-and-such variables in this computation across all analogous computations in the storage. Here, "analogous" could mean, for example, running the same composition of functions, but with different input parameters. The motivation is to be able to easily ask a broad class of natural queries of your storage (e.g., how given outputs depend on given inputs across all experiments of a given kind).

        For example, say you have two memoized functions, `increment` and `add`, and run this:

        with storage.run():
            for i in [1, 2, 3]:
                x = increment(i)
                y = add(i, x)
        Running this code will memoize a bunch of calls to `increment` and `add` in a way that recognizes that the output of the call to `increment` is the second input to the call to `add` (behind the scenes, they point to the same saved object). Then, if you call `storage.similar(x, y)`, you'll see a message like this:

        Pattern-matching to the following computational graph (all constraints apply):

            a0 = Q() # input to computation; can match anything
            a1 = Q() # input to computation; can match anything
            x = increment(a=a0)
            y = add(a=a1, b=x)
            result = storage.df(x, y)
        and a table like this:

            x  y
            2  3
            3  5
            4  7

        which tells you all the values `x, y` have taken in executions of this kind of program. Since you've only run this with `i = 1, 2, 3`, you get back just the result from this computation - but if you've been running similar things for weeks, you may not remember all the settings you've run this with so far - and this query will reveal them.

        How this works: the query looks for values `a0, a1, x, y` in the storage that satisfy both constraints, `x = increment(a=a0)` and `y = add(a=a1, b=x)`. Finding such values amounts to joining the memoization tables of `increment` and `add` in a certain way. Since this way is implicit in the computation itself, you can just point to the variables and get the result.
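        As a hypothetical sketch of that join, with memoization tables as plain lists of call records (one row per memoized call, mapping inputs to the output):

        ```python
        # Memoization "tables" for the runs above (illustrative data).
        increment_calls = [
            {"a": 1, "out": 2},
            {"a": 2, "out": 3},
            {"a": 3, "out": 4},
        ]
        add_calls = [
            {"a": 1, "b": 2, "out": 3},
            {"a": 2, "b": 3, "out": 5},
            {"a": 3, "b": 4, "out": 7},
        ]

        # Join on the shared variable x: the `b` input of `add` must equal
        # the output of `increment`.
        result = [
            (inc["out"], add_row["out"])
            for inc in increment_calls
            for add_row in add_calls
            if add_row["b"] == inc["out"]
        ]
        ```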

        Hope that helps! I realize that this is a fairly non-standard way to query, but I believe it's well worth the benefits in terms of reducing the boilerplate required to interact with computational data. Let me know if you think of ways to improve the presentation or have any other questions!
