PySpark Style Guide
Here's the Scala Spark style guide: https://github.com/MrPowers/spark-style-guide
The chispa README also provides a lot of useful info on how to properly write PySpark code: https://github.com/MrPowers/chispa
Scala is easier than Python for Spark because it allows functions with multiple argument lists and isn't whitespace sensitive. Both are great & Spark is a lot of fun.
Some specific notes:
> Doing a select at the beginning of a PySpark transform, or before returning, is considered good practice
Manual selects make the pruned columns explicit (column pruning only works for columnar file formats like Parquet), but Spark performs this pruning automatically, so always selecting manually may not be practical. Explicitly pruning columns is required for Pandas and Dask.
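For illustration, a minimal sketch of the pattern the guide is recommending, with made-up column names, assuming a DataFrame with order data:

    from pyspark.sql import DataFrame, functions as F

    def with_revenue(df: DataFrame) -> DataFrame:
        # the explicit select up front documents exactly which columns the
        # transform depends on, whether or not Spark prunes them automatically
        return (
            df.select("order_id", "quantity", "unit_price")
              .withColumn("revenue", F.col("quantity") * F.col("unit_price"))
        )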
> Be careful with joins! If you perform a left join, and the right side has multiple matches for a key, that row will be duplicated as many times as there are matches
When performing joins, the first thing to think about is whether a broadcast join is possible. Joins on clusters are hard. After that, it's good to think about using a data store that supports predicate pushdown and aggregation pushdown.
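For reference, the broadcast hint is a one-liner in PySpark. In this sketch, `orders` is a hypothetical large table and `countries` a small lookup table:

    from pyspark.sql import functions as F

    # the broadcast hint ships the small table to every executor, avoiding a shuffle
    result = orders.join(F.broadcast(countries), on="country_code", how="left")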
Agree that they missed broadcast joins, which can greatly impact how you’d go about a query versus plain SQL for big data. One of the best parts about Spark is how it supports rapid iteration: you can use it to discover what joins are computationally infeasible.
It’s notable that in Spark 3.x, Koalas, which adopts the Pandas API, is standard. Yet this style guide uses the Spark DataFrame API, so the guide might be a little stale anyway.
In my experience, it’s helpful to write queries in plain portable (or mostly portable) SQL, because once a Spark job becomes useful it often gets translated or refactored into something else. Definitely depends on the team / context, but plain SQL is often more widely accessible. For fast-moving data science stuff, it’s important to think about extensibility in terms of not just code (style & syntax) but people (who is going to remix this?).
I'd argue Koalas is an anti-pattern but will have to justify that in a blog post ;)
I've written some popular Scala Spark (https://github.com/MrPowers/spark-daria) and PySpark (https://github.com/MrPowers/quinn) libraries that have been adopted by a variety of teams. Not sure how to make a reusable library with pure SQL, but it sounds like it's possible. Send me a code snippet or link if you have anything I can take a look at to learn more about your approach.
I’m also not happy with Koalas but at least it’s a step towards API unification.
Pure SQL vs DataFrame: just write any typical join, group-by and count OLAP query as SQL and again using the DataFrame API. I’m saying the SQL query is more accessible to non-Spark users (e.g. a DBA who might need to approve your code) and can be thrown as-is into Hive/Presto or any RDBMS pretty easily. The DataFrame version is definitely more extensible, but in my experience Spark is more often used to inform the design of a larger data pipeline than to serve as the pipeline year after year. Appreciate there are places where the opposite is true.
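To make that concrete, here is a made-up "orders per country" query both ways, assuming a SparkSession named spark and that `orders` and `customers` exist as DataFrames registered as temp views:

    # plain SQL: portable to Hive, Presto, or most RDBMSs with little change
    sql_version = spark.sql("""
        SELECT c.country, COUNT(*) AS order_count
        FROM orders o
        JOIN customers c ON o.customer_id = c.customer_id
        GROUP BY c.country
    """)

    # DataFrame API: more composable and programmable, but Spark-specific
    df_version = (
        orders.join(customers, on="customer_id")
              .groupBy("country")
              .count()
              .withColumnRenamed("count", "order_count")
    )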
I wonder whether I should read about best practices from Palantir.
Fun fact: Palantir developers wrote what was, for a time, the most starred TypeScript linter (until it was decided to unify it with ESLint). They were working on what was supposed to be the world's largest TS monorepo (at least according to the MS rep).
They'll know whether you read it or not...
There is also a blog post on this: https://medium.com/palantir/a-pyspark-style-guide-for-real-w...
I worked quite a lot in pandas, dplyr, data.table and pyspark for a few years. And even occasionally some scala spark and sparkR. But after getting a bit fed up with F.lit()-this, F.col()-that, and the umpteenth variation on SQL, nowadays I pretty much just stick with plain SQL. I believe I've found my Enlightenment.
I have the opposite experience. After trying PySpark functional pipelines (so many handy functions), plain SQL seems so hard to read/understand. The main problem is that the order of execution is not the same as the order of the code lines. https://i.stack.imgur.com/6YuwE.jpg
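A small sketch of what I mean, with made-up names: the DataFrame pipeline reads top to bottom in the order it runs, while SQL's written order (SELECT ... FROM ... WHERE ... GROUP BY) differs from its execution order.

    from pyspark.sql import functions as F

    result = (
        orders
        .where(F.col("status") == "shipped")         # filter first
        .groupBy("country")                          # then group
        .agg(F.sum("amount").alias("total_amount"))  # then aggregate
        .orderBy(F.col("total_amount").desc())       # then sort
    )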
Another thing is that Python is so cool for data processing; when working with plain SQL I miss being able to reach for:
.rdd.map(my_python_processing_function)
Same for me. Python and Scala let users break up the logic into DataFrame transformations that can be unit tested, packaged into Wheel / JAR files, and easily reused in multiple contexts. Maintaining big, complex SQL codebases isn't easy.
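For example, a hypothetical pair of transformations that can each be unit tested and then composed (DataFrame.transform is available in PySpark 3.0+; the column names are made up):

    from pyspark.sql import DataFrame, functions as F

    def with_full_name(df: DataFrame) -> DataFrame:
        return df.withColumn(
            "full_name", F.concat_ws(" ", F.col("first_name"), F.col("last_name"))
        )

    def active_only(df: DataFrame) -> DataFrame:
        return df.filter(F.col("is_active"))

    # each function can be tested in isolation (e.g. with chispa) and reused
    result = customers.transform(with_full_name).transform(active_only)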
Could not agree more. Similar to the Principle of Least Privilege [1], I prefer to use SQL over PySpark if possible. [1] https://us-cert.cisa.gov/bsi/articles/knowledge/principles/l...
I'm so confused.
These examples are all using the SQL-like features of Spark. Not a map() or flatMap() in sight.
So... why not just write SQL?
df.registerTempTable('some_name')
new_df = spark.sql("""select ... from some_name ...""")
All of this F.col(...) and .alias(...) and .withColumn(...) nonsense is a million times harder to read than proper SQL. I just don't understand what any of this is intended to accomplish.
The spark.sql("""select ... from some_name ...""") bit is how to write pure SQL from the Scala or Python execution context. If you're in the SQL execution context, this syntax isn't required. I never write Spark code like this.
At first glance, the F.col(), .withColumn() syntax isn't as intuitive as pure SQL, but it has a lot of advantages when you get used to it. You can make abstractions, use programming language features like loops, and use IDEs.
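A small example of the kind of abstraction that's awkward to express in hand-written SQL, with hypothetical column names:

    from pyspark.sql import functions as F

    # clean a whole list of string columns in a loop instead of repeating
    # the same expression for every column in a SELECT clause
    string_cols = ["first_name", "last_name", "city"]
    for c in string_cols:
        df = df.withColumn(c, F.lower(F.trim(F.col(c))))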
I find the PySpark syntax to be uglier than Scala. Lots of teams are terrified to use Scala and that's the reason PySpark is so popular.
Because the examples use the Dataframe API and not the RDD API? At least as of Spark 2.4.7, both `Dataset.map()` and `Dataset.flatMap()` are still marked experimental.
> The preferred option is more complicated, longer, and polluted - and correct.
This is the definition of bad design.
So much eye rolling. It's a mixture of some common sense, some bad suggestions (such as this one), and a couple of blanket mandates, such as "It is highly recommended to avoid UDFs in all situations", offered without any real guidance.
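For what it's worth, a minimal sketch of the kind of guidance I'd have liked there (column names hypothetical): the UDF version round-trips every row through the Python interpreter, while the built-in version stays in the JVM and can be optimized by Catalyst.

    from pyspark.sql import functions as F
    from pyspark.sql.types import StringType

    # Python UDF: opaque to the optimizer, serializes rows to Python and back
    initial_udf = F.udf(lambda name: name[0].upper() if name else None, StringType())
    df_udf = df.withColumn("initial", initial_udf(F.col("first_name")))

    # built-in functions: same result, no Python round trip
    df_builtin = df.withColumn("initial", F.upper(F.substring(F.col("first_name"), 1, 1)))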
Was also rolling my eyes at that one. Furthermore, if you are this concerned with the refactoring limitations of your IDE, then don't use Python; use the Scala or Java API instead.
I think this guide mostly dates from 2017, when Palantir was rolling out Spark and Spark SQL code authoring in their Foundry data platform. It mostly targets their untrained "delta" and "echo" employees, most of whose jobs revolved around writing ETL code for customers. I have no idea why this glorified Quip doc was open-sourced.
Looking at the list of contributors on Github, I think I remember that the main author was actually James Thompson (UK), and not anyone on the contributor list. JTUK was called that because the company had another employee, based in the US, who had the same name. James Thompson (US) is now at Facebook and is a pretty cool designer. His astronomer.io media project from 2011 comes up on HN periodically.
Of the people listed on Github, Andrew Ash (now at Google) is the original evangelist for Spark on Palantir Foundry, and Francisco is the PM for SkyWise, Palantir's lucrative but ill-fated effort to save the Airbus A380.
Why not just use Scala and frameless? No style guide needed :)
Spark is a cancer. Sooner or later, 99.9% of the people using Spark will wake up to the fact that "hey, I got 1TB of RAM, why do I need this?"
Spark and PySpark are just a PITA to the max.
People who take this attitude against distributed processing usually have never had to process more data than fits on one machine, or have always had the budget to fit everything on one very expensive large machine. It's lack of experience masquerading as cleverness. The only ones who have a right to make this argument are the people who spread all their processing out as streams on one machine or use distributed streams, and even that has serious limitations.
If there were something better than Spark for distributed processing, we would be using it. The rest of your comment is a straw man argument, assuming everybody uses it for datasets that fit in the memory of a single node.
What do you recommend for distributed data processing?
Dask is a great alternative for distributed computing as well: https://github.com/dask/dask
IMO, Spark is better for some tasks and Dask is better for others.
The first step is to decide whether you really need distributed data processing. I think that's the point the author is making. I've seen GB-sized data treated as "BIG DATA", and the architectural patterns used to support this "BIG DATA" are unbelievable.