RDDs are the new bytecode of Apache Spark

ogirardot.wordpress.com

24 points by ssaboum 11 years ago · 5 comments

pacala 11 years ago

The API is great and very useful for Python coders. The Scala version, however, uses strings to identify fields. I haven't run the example code, but it seems to essentially encode run-time type checks instead of compile-time type checks, which is an odd API choice for Scala.

  • rxin 11 years ago

    It is actually not possible to enforce type checks at compile time, due to the dynamic nature of the data (e.g. you can generate a DataFrame from JSON files whose schemas are automatically inferred by Spark, or generate a DataFrame by loading a table in Hive). There is simply not enough type information available at compile time, unless we rule out all these cool use cases.

    There's one past attempt at making it more type-safe by using macros. However, there are a lot of caveats to using that in practice. https://github.com/marmbrus/sql-typed
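
    The point above can be sketched without Spark at all. This is a hypothetical, plain-Scala illustration (the `records` data and `inferSchema` helper are invented for this sketch, not Spark API): when data arrives as semi-structured records, its field names and types only exist as run-time values, so the compiler has nothing to check against.

    ```scala
    // Hypothetical sketch (plain Scala, no Spark): the schema of
    // semi-structured data exists only at run time, so the compiler
    // cannot verify field names or field types ahead of time.
    object SchemaInference {
      // Records as they might arrive from JSON: field names and value
      // types are known only once the data is read.
      val records: Seq[Map[String, Any]] = Seq(
        Map("name" -> "alice", "age" -> 30),
        Map("name" -> "bob", "age" -> 25)
      )

      // "Infer" a schema (field name -> runtime type name) by
      // inspecting the first record -- analogous in spirit to what
      // Spark does when loading JSON.
      def inferSchema(rec: Map[String, Any]): Map[String, String] =
        rec.map { case (k, v) => k -> v.getClass.getSimpleName }

      def main(args: Array[String]): Unit =
        println(inferSchema(records.head))
    }
    ```

    Any query written against such data necessarily refers to fields by run-time strings, because that is all the type system can see.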

    • pacala 11 years ago

      It's software, everything is possible :) It is fairly straightforward to add compile-time information via a simple case class definition. There is still a dynamic check that the JSON matches the case class, but the spec of the structure of the data is done in one place, not spread out arbitrarily over the query code base.
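
      A minimal sketch of this idea in plain Scala (the `Person` case class and `decode` helper are invented for illustration; Spark itself later offered a comparable pattern with `Dataset[T]`): the structure is declared once as a case class, a single dynamic check happens at the boundary, and everything downstream is compile-time checked.

      ```scala
      // Hypothetical sketch: declare the expected structure once as a
      // case class; validate the dynamic record against it one time,
      // then all downstream field access is compile-time checked.
      case class Person(name: String, age: Int)

      object TypedDecode {
        // The single run-time check: does this raw record match Person?
        def decode(rec: Map[String, Any]): Either[String, Person] =
          (rec.get("name"), rec.get("age")) match {
            case (Some(n: String), Some(a: Int)) => Right(Person(n, a))
            case _ => Left(s"record does not match Person: $rec")
          }

        def main(args: Array[String]): Unit = {
          val raw = Map[String, Any]("name" -> "alice", "age" -> 30)
          decode(raw) match {
            // p.name is a String here: the compiler, not the query
            // code, enforces the field's name and type.
            case Right(p)  => println(p.name.toUpperCase)
            case Left(err) => println(err)
          }
        }
      }
      ```

      The trade-off is that this rules out the fully dynamic use cases from the parent comment, where no case class can be written down in advance.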
