RDDs are the new bytecode of Apache Spark
ogirardot.wordpress.com
The API is great and very useful for Python coders. The Scala version uses strings to identify fields. I haven't run the example code, but it seems to essentially encode a form of runtime type checks instead of compile-time type checks, which is an odd API choice for Scala.
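To illustrate the point (a toy sketch in plain Python, not Spark's actual API): when columns are identified by strings, a misspelled name passes compilation and only surfaces when the query actually runs.

```python
# Toy sketch, not Spark's real API: a "DataFrame" as a list of dicts,
# with string-based column selection similar in spirit to df.select("name").
def select(rows, *cols):
    # Column names are plain strings, so a typo is only detected here,
    # at runtime, when the rows are actually processed.
    return [{c: row[c] for c in cols} for row in rows]

rows = [{"name": "Ada", "age": 36}, {"name": "Alan", "age": 41}]

print(select(rows, "name"))   # [{'name': 'Ada'}, {'name': 'Alan'}]
try:
    select(rows, "nmae")      # the typo is accepted, then fails at runtime
except KeyError as e:
    print("runtime error:", e)
```

In a typed API the equivalent typo would be rejected before the job ever runs, which is the trade-off the comment is pointing at.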
It is actually not possible to enforce type checks at compile time, due to the dynamic nature of data (e.g. you can generate a DataFrame from JSON files whose schemas are automatically inferred by Spark, or generate a DataFrame by loading a table in Hive). There is simply not enough type information available at compile time, unless we rule out all these cool use cases.
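The JSON case makes this concrete: the field names and types only exist once the data has been read. A minimal sketch of that kind of inference (a hypothetical `infer_schema` helper, not Spark's implementation):

```python
import json

def infer_schema(json_lines):
    # Hypothetical sketch, not Spark's implementation: map each field
    # to the name of its value's type, merging fields across records.
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, type(value).__name__)
    return schema

data = [
    '{"name": "Ada", "age": 36}',
    '{"name": "Alan", "city": "London"}',
]
print(infer_schema(data))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

None of this information is available to the compiler, which is why the checks end up at runtime.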
There was a past attempt at making it more type-safe using macros, but there are a lot of caveats to using it in practice. https://github.com/marmbrus/sql-typed
It's software, everything is possible :) It is fairly straightforward to add compile-time information via a simple case class definition. There is still a dynamic type check that the JSON matches the case class, but the spec of the structure of the data is done in one place, not spread out arbitrarily over the query code base.
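In Scala that spec would be a case class; as a rough Python analogue (hypothetical names, not any Spark API), a dataclass can play the same role: one declaration of the structure, with a single dynamic check at load time, after which the rest of the code works with typed objects.

```python
import json
from dataclasses import dataclass, fields

@dataclass
class Person:
    name: str
    age: int

def load(json_line, cls):
    # The one dynamic check: verify the JSON record matches the declared
    # spec. Everything downstream can rely on the dataclass's fields.
    record = json.loads(json_line)
    for f in fields(cls):
        if f.name not in record:
            raise ValueError(f"missing field: {f.name}")
        if not isinstance(record[f.name], f.type):
            raise TypeError(f"bad type for field: {f.name}")
    return cls(**{f.name: record[f.name] for f in fields(cls)})

p = load('{"name": "Ada", "age": 36}', Person)
print(p.age)  # 36
```

The check still happens at runtime, but it happens once, at the boundary, rather than being scattered through every query.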
That part is actually coming: https://github.com/apache/spark/pull/5713
Dynamic type checks are actually pretty damn useful.
For example, the MongoDB adapter samples the data at runtime to determine the DataFrame definition, i.e. its column types.
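A rough sketch of that sampling idea (hypothetical code, not the actual MongoDB connector): inspect only the first N documents, take the union of the fields seen, and widen to a string type when documents disagree.

```python
def infer_by_sampling(documents, sample_size=100):
    # Hypothetical sketch of sampling-based inference, not the real
    # MongoDB connector: only the first `sample_size` docs are examined.
    schema = {}
    for doc in documents[:sample_size]:
        for field, value in doc.items():
            t = type(value).__name__
            if field not in schema:
                schema[field] = t
            elif schema[field] != t:
                schema[field] = "string"  # widen on conflict, a common fallback
    return schema

docs = [{"_id": 1, "price": 9.99}, {"_id": 2, "price": "9.99"}]
print(infer_by_sampling(docs))  # {'_id': 'int', 'price': 'string'}
```

The flip side is that a document outside the sample can still violate the inferred schema, which is exactly why the type checks have to remain dynamic.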