RDDs are the new bytecode of Apache Spark
ogirardot.wordpress.com
The API is great and very useful for Python coders. The Scala version uses strings to identify fields. I haven't run the example code, but it seems to essentially encode a form of runtime type checks instead of compile-time type checks, which is an odd API choice for Scala.
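To illustrate the point (a toy sketch in plain Python, not Spark's actual API): when columns are identified by strings, a misspelled name passes compilation and only surfaces when the query actually runs.

```python
# Toy sketch, not Spark's real API: a "DataFrame" as a list of dicts,
# with string-based column selection similar in spirit to df.select("name").
def select(rows, *cols):
    # Column names are plain strings, so a typo is only detected here,
    # at runtime, when the rows are actually processed.
    return [{c: row[c] for c in cols} for row in rows]

rows = [{"name": "Ada", "age": 36}, {"name": "Alan", "age": 41}]

print(select(rows, "name"))   # [{'name': 'Ada'}, {'name': 'Alan'}]
try:
    select(rows, "nmae")      # the typo is accepted, then fails at runtime
except KeyError as e:
    print("runtime error:", e)
```

In a typed API the equivalent typo would be rejected before the job ever runs, which is the trade-off the comment is pointing at.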
It is actually not possible to enforce type checks at compile time, due to the dynamic nature of data (e.g. you can generate a DataFrame from JSON files whose schemas are automatically inferred by Spark, or generate a DataFrame by loading a table in Hive). There is simply not enough type information available at compile time, unless we rule out all these cool use cases.
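The JSON case makes this concrete: the field names and types only exist once the data has been read. A minimal sketch of that kind of inference (a hypothetical `infer_schema` helper, not Spark's implementation):

```python
import json

def infer_schema(json_lines):
    # Hypothetical sketch, not Spark's implementation: map each field
    # to the name of its value's type, merging fields across records.
    schema = {}
    for line in json_lines:
        record = json.loads(line)
        for field, value in record.items():
            schema.setdefault(field, type(value).__name__)
    return schema

data = [
    '{"name": "Ada", "age": 36}',
    '{"name": "Alan", "city": "London"}',
]
print(infer_schema(data))  # {'name': 'str', 'age': 'int', 'city': 'str'}
```

None of this information is available to the compiler, which is why the checks end up at runtime.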
There was a past attempt at making it more type-safe using macros, but there are a lot of caveats to using it in practice. https://github.com/marmbrus/sql-typed
It's software, everything is possible :) It is fairly straightforward to add compile-time information via a simple case class definition. There is still a dynamic type check that the JSON matches the case class, but the spec of the structure of the data is done in one place, not spread out arbitrarily over the query code base.
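In Scala that spec would be a case class; as a rough Python analogue (hypothetical names, not any Spark API), a dataclass can play the same role: one declaration of the structure, with a single dynamic check at load time, after which the rest of the code works with typed objects.

```python
import json
from dataclasses import dataclass, fields

@dataclass
class Person:
    name: str
    age: int

def load(json_line, cls):
    # The one dynamic check: verify the JSON record matches the declared
    # spec. Everything downstream can rely on the dataclass's fields.
    record = json.loads(json_line)
    for f in fields(cls):
        if f.name not in record:
            raise ValueError(f"missing field: {f.name}")
        if not isinstance(record[f.name], f.type):
            raise TypeError(f"bad type for field: {f.name}")
    return cls(**{f.name: record[f.name] for f in fields(cls)})

p = load('{"name": "Ada", "age": 36}', Person)
print(p.age)  # 36
```

The check still happens at runtime, but it happens once, at the boundary, rather than being scattered through every query.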
That part is actually coming: https://github.com/apache/spark/pull/5713
Dynamic type checks are actually pretty damn useful.
For example, the MongoDB adapter samples the data at runtime to determine the DataFrame definition, i.e. its column types.
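A rough sketch of that sampling idea (hypothetical code, not the actual MongoDB connector): inspect only the first N documents, take the union of the fields seen, and widen to a string type when documents disagree.

```python
def infer_by_sampling(documents, sample_size=100):
    # Hypothetical sketch of sampling-based inference, not the real
    # MongoDB connector: only the first `sample_size` docs are examined.
    schema = {}
    for doc in documents[:sample_size]:
        for field, value in doc.items():
            t = type(value).__name__
            if field not in schema:
                schema[field] = t
            elif schema[field] != t:
                schema[field] = "string"  # widen on conflict, a common fallback
    return schema

docs = [{"_id": 1, "price": 9.99}, {"_id": 2, "price": "9.99"}]
print(infer_by_sampling(docs))  # {'_id': 'int', 'price': 'string'}
```

The flip side is that a document outside the sample can still violate the inferred schema, which is exactly why the type checks have to remain dynamic.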