Data wrangling in Elixir with Explorer, the power of Rust, the elegance of R

news.livebook.dev

183 points by hugobarauna 3 years ago · 55 comments

hugobarauna (OP) 3 years ago

Hi everyone, member of the Livebook team here.

We’ve been investing a lot in making Elixir great for data exploration.

Today we’re taking one step further in this journey by contributing to the Explorer library and integrating it with Livebook.

Explorer is an Elixir dataframe library built on top of Polars (from Rust) and inspired by dplyr (from R).

Its integration with Livebook (open-source code notebook for Elixir) makes it easier to explore and transform dataframes interactively.
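For readers who haven't seen Explorer's dplyr-flavored API, here is a minimal sketch (assuming Explorer ~> 0.5; the `year`/`sales` toy data is made up for illustration):

```elixir
# A dplyr-style pipeline with Explorer's filter/mutate/arrange verbs.
Mix.install([{:explorer, "~> 0.5"}])

require Explorer.DataFrame, as: DF

df = DF.new(%{year: [2020, 2021, 2022], sales: [10, 25, 40]})

result =
  df
  |> DF.filter(sales > 15)         # keep rows where sales > 15
  |> DF.mutate(doubled: sales * 2) # add a derived column
  |> DF.arrange(desc: year)        # sort newest first

DF.n_rows(result)
# => 2
```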

Let me know if you have questions about these new features or anything related to Livebook’s launch week. :)

  • cjf4 3 years ago

    Can you make a pitch to a Python/R user to give this a try?

    What you’ve built looks very nice, and I’ve heard nothing but good things about Elixir elsewhere, but it would take a lot to leave those much more robust ecosystems. Do you hope to grow into them over time? Is there enough in terms of viz, statistical models, and ML to survive?

    • josevalim 3 years ago

      José from the Livebook team. I don't think I can make a pitch because I have limited Python/R experience to use as reference.

      My suggestion is for you to give it a try for a day or two and see what you think. I am pretty sure you will find weak spots and I would be very happy to hear any feedback you may have. You can find my email on my GitHub profile (same username).

      In general we have grown a lot since the Numerical Elixir effort started two years ago. Here are the main building blocks:

      * Nx (https://github.com/elixir-nx/nx/tree/main/nx#readme): equivalent to NumPy, deeply inspired by JAX. Runs on both CPU and GPU via Google XLA (also used by JAX/TensorFlow) and supports tensor serving out of the box

      * Axon (https://github.com/elixir-nx/axon): Nx-powered neural networks

      * Bumblebee (https://github.com/elixir-nx/bumblebee): Equivalent to HuggingFace Transformers. We have implemented several models and that's what powers the Machine Learning integration in Livebook (see the announcement for more info: https://news.livebook.dev/announcing-bumblebee-gpt2-stable-d...)

      * Explorer (https://github.com/elixir-nx/explorer): Series and DataFrames, as per this thread.

      * Scholar (https://github.com/elixir-nx/scholar): Nx-based traditional Machine Learning. This one is the most recent effort of them all. We are treading the same path as scikit-learn but are quite early on. However, because we are built on Nx, everything is derivable, GPU-ready, distributable, etc.
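      As a small taste of the Nx building block above, here is a minimal sketch (assuming Nx ~> 0.5 on the default CPU backend; no XLA required for this):

```elixir
# Elementwise ops and reductions, NumPy-style, on an Nx tensor.
Mix.install([{:nx, "~> 0.5"}])

t = Nx.tensor([[1, 2], [3, 4]])

total =
  t
  |> Nx.multiply(2)   # broadcasts the scalar: [[2, 4], [6, 8]]
  |> Nx.sum()         # reduces to a scalar tensor
  |> Nx.to_number()   # converts to a plain Elixir number

total
# => 20
```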

      Regarding visualization, we have "smart cells" for VegaLite and MapLibre, similar to how we did "Data Transformations" in the video above. They help you get started with your visualizations and you can jump deep into the code if necessary.

      I hope this helps!

    • gangstead 3 years ago

      Jose's reply suggests the basics have Elixir equivalents. I can't really speak to that side but I can say the usability story is much much better.

      The last time I gave Jupyter notebooks a go, it was a full session of installing and updating various Python tools (pip, conda, Jupyter), then struggling with Python versions. You end up piecing together your own bespoke setup based on other people's outdated bespoke setups you find while searching for your error messages. Maybe that's better now; this was a few years ago.

      For Livebook it's "download the app and run it." Other options exist and are well documented and straightforward. I set up a Livebook server on our k8s dev cluster with a pretty simple Deployment I wrote just from looking at the Livebook README notes on Docker. We've made livebooks that connect to the Elixir app running in a different namespace on the cluster. Very cool.

      Once you have Livebook going, the `.livemd` file is both version-control friendly AND a very readable Markdown file, rather than the big JSON objects used in `.ipynb`.
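      To illustrate, here is a hypothetical minimal `.livemd` (Livebook stores sections as Markdown headings and code cells as `elixir` fences; the file name `sales.csv` is made up):

````markdown
# Sales analysis

## Setup

```elixir
Mix.install([{:explorer, "~> 0.5"}])
```

## Load data

```elixir
Explorer.DataFrame.from_csv!("sales.csv")
```
````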

      With Livebook, rebuilding cells is a lot more repeatable. It also does a good job of determining whether a cell re-execution is necessary if a previous cell is modified, which can save you a lot of time. Likewise, the dependencies installed are captured at the top, so I've never had a problem when sharing a livebook; the other person always gets the same results that I had. I don't remember how it worked for Jupyter, but it's really cool to collaborate with someone by both going to the same notebook session. It's like working on the same Google Doc, but you are writing and executing code.

      Now with the Publish functionality I can see using a livebook to throw together some functionality and share it with non-technical users in your org, while having it backed up to git for posterity.

      I avoided Smart Cells for a while because I didn't like the "magic-ness" of the UI hiding what the code was doing, but as Jose has shown in the launch videos this week you can easily see the code they are backed with and replace the cell with the code if you want to take full control. Maybe it was always like that but I didn't realize it at first. They really make setting up stuff very easy without limiting you later on.

    • peoplefromibiza 3 years ago

      I'm pitching it to the data science department of the company I work for (a huge insurance company in my country) next week.

      They do a lot of prototyping from CSV/parquet sources in Python and R.

      I've waited to show them Livebook because Elixir syntax is somewhat alien to many, but now that the Livebook team has integrated ML models and dataframes (Explorer, through Polars) as smart cells, I think they have a killer feature on their hands, much like LiveView was for the Phoenix framework.

      Let's see how it goes, I'm fairly optimistic about it.

  • chasers 3 years ago

    Happy Livebook user here!

    Congrats on all the new stuff this week. Really great work.

peregrine 3 years ago

When José first launched Livebook I thought it was ambitious to take on the Jupyter project, but here it is, getting better and more usable every single day.

Huge props; the Livebook project is an incredible example of what is possible with Elixir.

peoplefromibiza 3 years ago

Once again, the Elixir community takes something and brings it to the next level.

In my opinion José Valim is the Linus Torvalds of programming languages, but unlike Linus, not only is he a 100x engineer, he's also one of the humblest and kindest people I've ever met (don't get me wrong, I love Linus, but he can be "too honest" at times and come across as harsh or rude).

From Elixir came Phoenix, from Phoenix came LiveView, from LiveView we got Livebook, iterating from good ideas to quality products like it's an easy task.

Can't wait to see the next trick up their sleeve.

  • faitswulff 3 years ago

    IIRC he was behind the very popular Rails authentication gem Devise, as well. Really unbelievable how much of a boon he’s been to the open source community.

  • rozap 3 years ago

    The dude is superhuman. An absolute machine in terms of programming output. Very engaged with the community. And extremely patient with people who have wrong opinions :)

cardosof 3 years ago

I'm excited about this project, it's very refreshing to see that the source of inspiration is R/tidyverse instead of Python/Pandas.

clircle 3 years ago

Whoa, you know this is hacker news right? You don't get to call R elegant around these parts.

  • chubot 3 years ago

    R isn't elegant, but tidyverse is

    If you learn tidyverse, then you're going to cringe whenever you use Pandas or most things in the Python data science ecosystem

    https://www.rstudio.com/wp-content/uploads/2015/02/data-wran...

  • bryanrasmussen 3 years ago

    I think R is elegant, what's inelegant in it (other than the code lots of non-programmers write with it)?

    • nerdponx 3 years ago

      Core language design? Definitely elegant in my opinion. F-expressions, clever "formula" syntax, everything is an array, great C/C++/Fortran interop.

      Standard library? Absolute chaos. Some elegant subsets, but mostly a mess that you learn to live with.

    • fny 3 years ago

      It has 4 ways to do object orientation, IIRC, none of which are compatible.

      • bbertelsen 3 years ago

        S3, S4, R6, and reference classes. To be fair they are situational and not one size fits all. The stricter ones are mainly used in biostats where significant metadata makes more sense in OO. S3 is nice and easy, primarily just a list with dispatches. Everything else is less so.

    • isoprophlex 3 years ago

      The type system is so abhorrent, it makes me wonder if it's actually proper to call R a real programming language

      • kgwgk 3 years ago

        "There are only two kinds of languages: the ones people complain about and the ones nobody uses."

        • dunefox 3 years ago

          This crappy quote is always used to justify bad design.

        • isoprophlex 3 years ago

          I'm sorry, did you mean

              [1] "There are only two kinds of languages: the ones people complain about and the ones nobody uses."
          
          :^)
      • goatlover 3 years ago

        And from a different point of view, real programming languages have built-in vectorization and 1-based arrays ;)

      • mike_ivanov 3 years ago

        what specifically is abhorrent about it?

        • isoprophlex 3 years ago

          The coercion always gets on my nerves. JavaScript gets a bad rap, but R is pretty damn warty too: weird-ass data types ('ordered factor', anyone?) that just seem so very far away from design choices in other languages, without being particularly ergonomic or aesthetically appealing.

          • kyllo 3 years ago

            The data types make sense to statisticians. Ordered factors are great when you need to fit, say, an ordered logit regression model.

          • mike_ivanov 3 years ago

            Which coercion specifically are you talking about? Could you give an example?

            Weird data types: R had been designed for and by statisticians with their specific needs in mind, which indeed could look weird to regular people.

  • adonig 3 years ago

    I remember when we first used R in a stochastic class. The professor (a mathematician) was in love with the language and the students (computer science) considered the language to be the PHP of science.

    • peoplefromibiza 3 years ago

      As a computer scientist and programming languages nerd, I think R is a much better language than Python (comparing the two only because Python is leading in the data science field).

      I also believe that the tools available are superior, RStudio is very good IMO.

      I wonder why R has such a bad reputation.

      • bart_spoon 3 years ago

        Because it’s built around a very specialized set of needs (data manipulation, visualization, and statistical modeling), and it is essentially best in class at it, but it has quirks as a result. Anyone coming to R from a background in another language will feel those quirks intensely and assume it’s bad.

      • kyllo 3 years ago

        A lot of programmers dislike R for the same reason they dislike PHP: weak typing.

        If you can get over that, though, there's a lot to like about R. It's sort of like a hybrid between a Lisp and an array language.

        • goatlover 3 years ago

          JavaScript and C also have weak typing. And Python isn't statically typed; it's just that, like Ruby, it doesn't allow implicit conversions, except with numbers.

        • mike_ivanov 3 years ago

          Weak typing? No, it's not. R is a strongly typed language:

              > 1 + "string"
              Error in 1 + "string" : non-numeric argument to binary operator

        • chaxor 3 years ago

          IIRC the scoping is also very strange. Certainly not as bad as Mathematica, but bad nonetheless.

tao_oat 3 years ago

Extremely cool! This is the first time I'm learning about Livebook's smart cells, and seeing how easily you can toggle between the UI and the underlying chain of dataframe operations in code is pretty mind-blowing.

kyllo 3 years ago

Elixir runs on the Erlang VM, which is designed for distributed multi-process concurrency, right? Why would this be advantageous for interactive data analysis work, which is typically done on a single node? I don't quite understand the use case, hoping someone can explain.

  • nestorD 3 years ago

    Use cases that come to mind are data distributed across nodes and things like distributed training of machine learning models (which is getting more and more focus as models get bigger).

  • josevalim 3 years ago

    Hi, maybe I can try to answer your question.

    First, to make sure we are on the same page: distribution in Erlang happens across nodes, and concurrency happens within a single operating-system process. Erlang calls its concurrency primitive "processes" (because they are also isolated and preemptive), but that can cause some confusion (hence this comment).

    From now on, when I mention a process, I mean an Erlang VM process; they are very lightweight and you can create millions of them.

    I can think of a few different ways where concurrency can help interactive data analysis:

    1. Livebook supports rich outputs where each output is a process. This means your notebook can communicate with outputs as it executes. For example, it is very easy for you to train a neural network and push data to the graph as it comes, or to process data and plot it as you go.

    2. You can use concurrency to run several experiments at once within the same notebook. We support this in Livebook via "Branched sections". You can prepare the data and then start several branches/processes to digest the data in different ways without a need to start several notebooks.

    3. It is also very easy to build applications where multiple users can collaborate and interact with it, which we showed yesterday: https://news.livebook.dev/build-and-deploy-a-whisper-chat-ap...
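    A tiny sketch of the primitive behind point 2, using only the standard library's Task module (the two "experiments" here are trivial stand-ins for real data work):

```elixir
# Two independent computations run as separate lightweight BEAM processes.
data = Enum.to_list(1..1_000)

tasks = [
  Task.async(fn -> Enum.sum(data) end),
  Task.async(fn -> Enum.max(data) end)
]

# Collect both results once they finish.
[sum, max] = Task.await_many(tasks)
{sum, max}
# => {500500, 1000}
```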

    When it comes to distribution, it is quite similar to above, because the concurrency and distribution primitives in the Erlang VM are the same. Here is an example of how easy it is to take a ML model from concurrent to distributed: https://news.livebook.dev/distributed2-machine-learning-note...

    Generally speaking, I think we should start from the opposite side: we should try to make everything concurrent by default and fall back to serial only when we cannot. Especially for data analysis, where moving data is expensive, we may end up incurring a lot of overhead if the only form of concurrency is via the network or inter-process communication.

    One last note: perhaps the most important bit of the Erlang VM for data analysis is that it favors a functional style. Livebook notebooks are strongly reproducible. I expand on this in this video: https://www.youtube.com/watch?v=EhSNXWkji6o

    I hope this helps (and feel free to tell me if I missed the mark!).

    PS: I know many of the videos above are machine-learning related, and that's because we have started our data journey only now, although the principles should generally apply. Hopefully more data videos will come soon! :)

bgorman 3 years ago

What is the rationale for using a programming language runtime designed for distributed computing for local data analysis?

  • josevalim 3 years ago

    Languages and runtimes often grow beyond their original scope. And since the introduction of Dirty NIFs to the Erlang VM five years (or so) ago, integrating with native code (which is what powers a lot of data analysis and machine learning tools in high-level languages) has become a real possibility. There is a similar-ish discussion here: https://news.ycombinator.com/item?id=35572128

losvedir 3 years ago

The video overview says the journey in data is "just starting". That's exciting! Any ideas or vision for where it's going in the future?

Also, I noticed in the demo that installing polars was very fast! Was the video trimmed, or was it cached or something? I remember when I last tried out polars in Elixir a while ago, it had to build the rust library and everything and took like 10 minutes.

  • josevalim 3 years ago

    Things that are on our roadmap in relation to our data vision:

    * Data management within your notebook (https://github.com/livebook-dev/livebook/issues/1604) - we want you to be able to link files, URLs, and object storage to your notebook and have Livebook automatically manage/download them

    * We want to make it easier to build visualizations (even easier than the current Chart smart cell) and also be able to filter a dataframe by selecting a visualization: https://github.com/livebook-dev/livebook/issues/1545

    We have other ideas, such as making SQL a more prominent citizen in Livebook and being able to build a custom canvas as you work on your notebook, but those will likely take longer to realize.

    • jasonpbecker 3 years ago

      A vote here for a SQL cell. I want folks to use Livebook and Explorer more, and a very easy win for data folks who are not familiar with Ecto and are mostly writing complex SELECT statements would be a SQL code block that can easily reference a connection.

      That would let people who are getting into Elixir for data work run a query, get an Explorer.DataFrame, and interact further that way.

  • cigrainger 3 years ago

    We use the excellent Rustler Precompiled [1] library now so prebuilt binaries ship with the Elixir package. No Rust toolchain needed. :)

    [1] https://github.com/philss/rustler_precompiled

mgdev 3 years ago

Jose, you are prolific. Love it.

lbrito 3 years ago

"The elegance of R" is a phrase I never expected to read :)

yewenjie 3 years ago

Are there any plans for making Livebook embeddable in regular JS apps?

  • PKop 3 years ago

    You need an Elixir server process to run Livebook, so what is meant by "regular JS apps" that you intend to embed into?

b800h 3 years ago

R, elegant??
