Apache Arrow – Powering Columnar In-Memory Analytics

arrow.apache.org

48 points by bertzzie 9 years ago · 10 comments

PDoyle 9 years ago

Oops... The first sentence in the "Fast" section says "SIMD (Single input multiple data)". SIMD stands for "single instruction, multiple data".

filereaper 9 years ago

Asking the stupid question here, but why create a new Apache project for this?

Apache Arrow seems to be targeting the use of SIMD, which is a very JVM/runtime-dependent feature. If the runtime can't detect this out of the box, then create a recognized method or some sort of intrinsic to coax the runtime into SIMD-izing the operation.

I understand the performance gains of this, but why not add this functionality to existing projects like Parquet, HTable, etc.?

This just comes to mind: https://xkcd.com/927/

  • infinite8s 9 years ago

    The idea behind Apache Arrow (you can see this in the list of people supporting it) is to provide a common serialization/exchange format among different data science tools/languages/platforms (Hadoop, Spark, pandas, R's datatable). Typically, data scientists cobble together a pipeline across various tools to leverage their strengths (for example, using Spark to clean up data and then pandas for time-series analysis), and this often involves an expensive serialization/deserialization step at each boundary. The goal of Arrow is to provide a near-zero-cost format that all tools can support.
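To make the boundary cost described above concrete, here is a minimal sketch using only the Python standard library. It is not Arrow's API: `pickle` stands in for a serialize/deserialize step between two tools, and `memoryview` stands in for two tools agreeing on one in-memory layout. The variable names (`column`, `copied`, `shared`) are illustrative assumptions.

```python
import pickle
from array import array

# One column of 64-bit floats, stored contiguously in memory.
column = array("d", [1.0, 2.0, 3.0, 4.0])

# Serialization boundary: the producer encodes the column to bytes and the
# consumer decodes those bytes into a brand-new, independent object.
copied = pickle.loads(pickle.dumps(column))

# Shared-layout boundary: the consumer reads the producer's buffer directly.
# No bytes are encoded, decoded, or copied.
shared = memoryview(column)

# Changing the original shows which consumer holds a copy and which a view.
column[0] = 42.0
print(copied[0])  # the copy still holds the old value
print(shared[0])  # the view sees the update
```

The view only works because both sides agree on the layout (here, packed 64-bit floats); a common format is what lets that agreement hold across different tools.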

  • rz2k 9 years ago

    I don't know the answer, but in this case does columnar store imply that it is a collection of arrays, perhaps for a scientific database, and a bit different than HBase?

    Here's someone else's blog post from 2010 on different categories of columnar store DBs:

    http://dbmsmusings.blogspot.com/2010/03/distinguishing-two-m...

    • infinite8s 9 years ago

      That "someone else" is Daniel Abadi, one of the researchers who re-popularized the idea of column stores during his graduate work at MIT (in addition to researchers at CWI).

ljoshua 9 years ago

Is this similar to how QlikView's in-memory engine works?

threeseed 9 years ago

It really is a confusing title for the project. It's more of a high-speed interchange format, e.g. for sending data to Cassandra from Spark or Storm.

Nothing that end users will ever really have to know anything about.

axman6 9 years ago

I'm confused: is this just Structure of Arrays as a service for columnar data? It's not clear to me what this actually does.
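For readers unfamiliar with the term, the "structure of arrays" layout mentioned above can be sketched as follows. This is plain standard-library Python with made-up field names (`id`, `price`) and hypothetical helper functions; it illustrates the general layout idea only and makes no claim about how Arrow itself is implemented.

```python
from array import array

# Row-oriented ("array of structures"): one record object per row.
rows = [
    {"id": 1, "price": 9.5},
    {"id": 2, "price": 3.0},
    {"id": 3, "price": 7.25},
]

# Column-oriented ("structure of arrays"): one contiguous typed array per field.
columns = {
    "id": array("q", [1, 2, 3]),            # 64-bit signed integers
    "price": array("d", [9.5, 3.0, 7.25]),  # 64-bit floats
}

def total_price_rows(rows):
    # Visits every record and pulls one field out of each.
    return sum(row["price"] for row in rows)

def total_price_columns(columns):
    # Scans a single contiguous array; the "id" column is never touched.
    return sum(columns["price"])

print(total_price_rows(rows), total_price_columns(columns))
```

Both functions compute the same total. The difference is that the columnar version reads one densely packed array, which is the access pattern that analytic scans and SIMD instructions favor.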
