A Hybrid Apache Arrow/Numpy DataFrame with Vaex Version 4.0
Why this hybrid dataframe? [1]:
> [Arrow] adoption will take time, and most people are probably more comfortable seeing NumPy arrays. Therefore, in Vaex version 4, a DataFrame can hold both NumPy arrays and Apache Arrow arrays to make the transition period easier.
There seems to be agreement that Apache Arrow is the future of dataframes across ML ecosystems. I didn't realize this transition impacted NumPy arrays in addition to Pandas dataframes in Python.
[1] https://vaex.io/blog/a-hybrid-apache-arrow-numpy-dataframe-w...
> There seems to be agreement that Apache Arrow is the future of dataframes across ML ecosystems
Not until there is proper multidimensional array support.
It seems like it has some nice advanced features that the data engineering team might appreciate once an application gets large.
But as the person who needs to load up some data and do some transformations, this article gives me very little information about why I should switch from pandas.
But I am excited to hear about new solutions in the data frame space!
If you're able to comfortably do your processing in Pandas, I don't think there is any justification to switch to Vaex. But Pandas begins to strain in the GB territory. If you switch to Vaex at that point, it'll be night and day. Working from the REPL, no more half-second pauses for results. And of course the payoff only grows with more data.
Vaex is stupid fast at all the data operations it supports, to the point where I've used it in place of a database for an API.
Thanks, glad you find Vaex useful.
Indeed, for small data there is not much to gain, and that is not the focus of this article. Although even with small amounts of data, the automatic pipelines are useful https://vaex.io/blog/ml-impossible-train-a-1-billion-sample-...
I was more concerned about its api / methods.
Does it make things hard that were easy in pandas, or does it make things that are hard in pandas easy?
I'm coming from a Pandas dominated codebase. Working with Vaex, I felt the interface was almost 1 for 1. I have a note from then about joins being more awkward than with Pandas. If I recall that is more that Pandas' joins have more flexibility, but that most of the functionality was there.
At the time, I had issues with some string operations, though it appears with v4.0 that may no longer be the case.
How does Vaex compare with Modin[1]?
I'm one of the maintainers of Modin, so I can chime in here. Dataframes are the focus of my PhD thesis, and Modin started as my PhD project. Most of the differences come down to functionality and support. Truthfully, the goals of the projects are quite different, so it's a bit of apples-to-oranges.
As part of developing Modin, we identified a low-level algebra and data model that both generalizes and encompasses all of the pandas and R dataframe functionality. Modin is an implementation of this data model and algebra[1]. Based on our studies, Vaex's architecture can support somewhere in the range of 35-40% of the pandas DataFrame API, notably excluding support for row indexes. Compare this to Dask, currently at 44% of the pandas API, and Modin, currently at 90%.
Vaex is great if you're already working with a compatible memory-mapped file format; it'll be exceptionally fast in that case. That is the use case I believe they are (successfully) targeting.
Got it, that's really helpful. Thank you for clarifying and all your hard work on Modin!
AFAIK Modin tries to be API-compatible with Pandas, but faster/distributed. Vaex tries to be a DataFrame library that is as fast as possible on a single machine, to keep things simple and fast (although distributed is on the horizon, we don't need it currently). We're not afraid to break compatibility with pandas, because we care about performance. Both libraries try to hide the laziness from the user.
Got it, that makes sense! Will give it a try in the future for sure!
A killer use-case here might be interop with Delta Lake; you could allow data scientists to work with the data there (parquet in S3) in a local-like manner, using an API that might be preferable to Spark's!
Has anyone tried this, or know if it's possible?
My guess is that should be possible, feel free to hop onto https://github.com/vaexio/vaex/discussions !
Did something important change in Arrow 3.0 in terms of working with text data? Is Arrow good for text data in general? I think I'm missing some context needed in order to be impressed by this post
Yes, Arrow 3.0 has many more string kernels https://arrow.apache.org/docs/cpp/compute.html