Unified End-to-End Data Workflows on JupyterHub
At Tubi, we’ve standardized on Jupyter notebooks as the unified platform for doing analytics and data science work. Whether we’re exploring user behavior, doing time series forecasting for company KPIs, or building deep learning object detection models, we’re doing it on a customized JupyterHub installation we call Tubi Data Runtime (TDR).
We mentioned TDR in a previous post, and in this post we go into further detail. We will describe optimized data connectors, custom Jupyter cell magic and extensions, and domain-specific visualization tools. In a follow-up post, we will describe our experience running JupyterHub itself on Kubernetes.
Optimized data connectors
For most Jupyter users, when they open a notebook, the first step is always getting data. At Tubi, the majority of our data sources are from AWS Redshift and S3. By analyzing our Redshift query metrics, we found that 95% of our queries returned a result set under 1GB in size, comfortably fitting in RAM. However, before TDR, even at that size it took a long time to retrieve the data. Many of our workflows involved spinning up a Spark cluster, executing the Redshift query, then converting the Spark DataFrame into a Pandas DataFrame. The remainder of the workflow would continue only in pandas. This was incredibly wasteful and slow.
df = query_redshift(sc, sql).toPandas()  # 95% of Redshift queries

The main reason we were forced into this kind of workflow was that the native pandas.read_sql is really only suitable for small amounts of data. At the same time, the pandas API was more familiar and lighter weight than PySpark. Koalas may help adoption, but it’s still young and does not solve the resource-waste issue. To have our cake and eat it too, we created a custom Redshift connector that takes advantage of Redshift’s UNLOAD command, which writes partial results from each node to S3 as CSV files. For added efficiency, our connector uses Python’s multiprocessing module to read the partial results in parallel and then combines them with pandas.concat.
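The core of the connector looks roughly like the sketch below. Details such as the staging bucket, IAM role, function signatures, and column handling are placeholders rather than the real TDR implementation, and we assume a DB-API connection to Redshift (e.g. via psycopg2); the user-facing wrapper described later hides the connection handling.

import json
import multiprocessing as mp
from urllib.parse import urlparse

import boto3
import pandas as pd

# Placeholders for illustration; the real connector reads these from TDR config.
UNLOAD_BUCKET = "tdr-unload-staging"
DELIMITER, QUOTECHAR = "|", '"'


def _read_part(bucket, key):
    # Deserialize one partial UNLOAD result from S3 into a DataFrame, using the
    # same delimiter and quote character that were passed to UNLOAD. Column
    # names come from the query metadata in the real connector.
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"]
    return pd.read_csv(body, sep=DELIMITER, quotechar=QUOTECHAR, header=None)


def create_executor(part_urls, num_procs=8):
    # Read every partial result in parallel, then combine into one DataFrame.
    args = [(urlparse(u).netloc, urlparse(u).path.lstrip("/")) for u in part_urls]
    with mp.Pool(num_procs) as pool:
        parts = pool.starmap(_read_part, args)
    return pd.concat(parts, ignore_index=True)


def query_redshift(conn, sql, prefix="unload/latest"):
    # Wrap the user's query in UNLOAD so each Redshift node writes its slice of
    # the result set to S3 as CSV files.
    unload = f"""
        UNLOAD ($${sql}$$)
        TO 's3://{UNLOAD_BUCKET}/{prefix}/part_'
        IAM_ROLE 'arn:aws:iam::000000000000:role/tdr-unload'
        DELIMITER '{DELIMITER}' ADDQUOTES ALLOWOVERWRITE MANIFEST
    """
    with conn.cursor() as cur:
        cur.execute(unload)
    # The MANIFEST option writes a JSON listing of every part file; we also use
    # it later for row-count validation and data type conversions.
    manifest = boto3.client("s3").get_object(
        Bucket=UNLOAD_BUCKET, Key=f"{prefix}/part_manifest"
    )
    entries = json.loads(manifest["Body"].read())["entries"]
    return create_executor([e["url"] for e in entries])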
The create_executor method above uses the starmap method from multiprocessing.Pool, and the _read_part helper calls pd.read_csv to deserialize each partial query result CSV file.
When reading CSVs, it is very easy to fail due to mismatched quote characters, NA values, delimiters, and so on. But because our connector supplies those settings both to the UNLOAD command and to pd.read_csv, we can guarantee that they are consistent with each other and avoid these pitfalls. Redshift also creates a manifest file, which we use to add additional validation on row counts and to help with data type conversions.
At the end of the day, users simply run df = tdr.query_redshift(query).to_df() and all of the plumbing is handled under the hood. We find that this approach gives us comparable or faster “time-to-df”, does not require working with a Spark context, and uses far fewer compute resources.
There are two main gotchas with the UNLOAD command. First, UNLOAD does not support LIMIT clauses in the innermost query. We can solve this with a common table expression:
WITH result_table AS (SELECT * FROM raw_table LIMIT 100)
SELECT * FROM result_table

UNLOAD also does not respect ORDER BY clauses. We solve this by doing the sorting in pandas after we get the results.
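For example, instead of relying on ORDER BY in the query itself, we sort the dataframe once it is loaded (the column name here is illustrative):

df = tdr.query_redshift(query).to_df()
df = df.sort_values("view_count", ascending=False)  # ordering applied in pandas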
Low-code data exploration
While Jupyter makes it easy for data scientists to analyze data, we want to lower the barrier to entry for analysts who are familiar with SQL but are just starting out with Python. For this purpose, we created a cell magic and a custom extension.
SQL cell magic
Many of us forget our own hurdles from when we first started learning to code, but for beginners, even “basic” concepts like working with variables, function calls, and string manipulation do not come automatically. To make it easier for them to analyze data using SQL, we created a custom cell magic named “%%sql”. The core logic of this cell magic is quite simple:
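A minimal sketch of such a cell magic, reusing the query_redshift helper described above (the variable-name override on the magic line is an illustrative extra, not necessarily how TDR does it):

from IPython.core.magic import Magics, cell_magic, magics_class

import tdr  # provides the query_redshift helper described above


@magics_class
class SQLMagics(Magics):
    @cell_magic
    def sql(self, line, cell):
        # Execute the cell body against Redshift and bind the result to `df`
        # (or to a name given on the magic line, e.g. `%%sql retention_df`).
        var_name = line.strip() or "df"
        df = tdr.query_redshift(cell).to_df()
        self.shell.user_ns[var_name] = df
        return df


def load_ipython_extension(ipython):
    ipython.register_magics(SQLMagics)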
When the magic is invoked in a cell, the user can enter a SQL query directly into the cell; TDR executes it using the query_redshift functionality described previously and sets the output as the user variable df.
You’ll notice there is a {{ott_app}} clause which isn’t actually valid SQL. This is a companion feature to the SQL magic called…well…snippets. Using Jinja2, we associate commonly used SQL snippets with aliases that can be referenced directly in SQL queries. This saves us a lot of repetitive typing, prevents careless mistakes, and helps us unify definitions. For example, the {{ott_app}} snippet expands to a WHERE-clause condition checking that the field “app” is any one of our many OTT platforms:
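Something along these lines, where the platform list and table name are illustrative rather than our actual definitions:

from jinja2 import Template

# Illustrative snippet definitions; the real list of OTT platforms is longer.
SNIPPETS = {
    "ott_app": "app IN ('roku', 'firetv', 'appletv', 'androidtv', 'xbox')",
}

sql = """
SELECT ds, COUNT(DISTINCT device_id) AS devices
FROM fact_playback  -- hypothetical table
WHERE {{ ott_app }}
GROUP BY 1
"""

rendered = Template(sql).render(**SNIPPETS)  # snippets are expanded before the query runs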
Integrating nteract data explorer
Once we have our dataframe, we almost always need to visualize the data at the beginning of our exploratory data analysis process. The nteract data explorer from Netflix automatically visualizes dataframes. However, it is part of a different notebook project, and the default Jupyter extensions require a separate tab, making the user experience very awkward. Instead, we decided to create a simple JupyterLab extension that can directly render the explorer as the output of display(df). In addition, the nteract data explorer did not store state, which meant that whenever you refreshed your notebook it would go back to the default state, making both analysis and collaboration difficult. Using the same extension, we store the data explorer state in the notebook metadata itself, so that when you share the URL with others your visualization choices are preserved.
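On the kernel side, one common way to make display(df) emit something the data explorer can render is pandas’ Table Schema output (the application/vnd.dataresource+json media type). We’re sketching that mechanism here as an assumption; the JupyterLab extension that does the actual rendering and metadata persistence isn’t shown in this post.

import pandas as pd
from IPython.display import display

# Publish dataframes as Table Schema (application/vnd.dataresource+json), the
# media type the nteract data explorer renders; the custom JupyterLab extension
# picks up this output and draws the explorer in place of the static table.
pd.set_option("display.html.table_schema", True)

df = pd.DataFrame({"title": ["Movie A", "Movie B"], "views": [120, 340]})
display(df)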
When we put together the cell magic and this extension, the only line of code required to run a SQL query and visualize the results is display(df). This drastically reduces the barrier to entry for Python beginners.
Jovyan collaboration
Shareable links
By default, the URLs in JupyterHub are specific to the user’s JupyterLab server (e.g., https://<jupyterhub host>/user/cshe/). This means that the default “shareable” links from JupyterLab aren’t actually shareable. Instead, JupyterHub recommends using the /user-redirect/ route, which will then redirect the URL to the current user. We implemented our own plugin to rewrite the URL using /user-redirect/ and we’ll look to contribute this back to JupyterHub (if a kind soul has not already done so).
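The rewrite itself is just a path substitution. Here is the idea expressed in Python for clarity, even though the actual plugin does this in the JupyterLab frontend (the hostname below is made up):

import re

def shareable(url: str) -> str:
    # Replace the per-user segment (/user/<name>/) with /user-redirect/ so the
    # link resolves to whoever opens it.
    return re.sub(r"/user/[^/]+/", "/user-redirect/", url, count=1)

shareable("https://hub.example.com/user/cshe/lab/tree/report.ipynb")
# -> "https://hub.example.com/user-redirect/lab/tree/report.ipynb"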
Deep Copy / Cut / Paste
Another minor annoyance we found was that the “Copy” and “Paste” menu items in the file browser do not work on directories. As a temporary workaround, we created “Deep Copy” (also cut and paste) extensions to deal with this particular issue.
We’ve open-sourced our JupyterLab extensions for public use, and we hope to contribute them back to the main JupyterHub and JupyterLab feature set.
Additional features
TDR is chock-full of features that we only have room to summarize here. The JupyterLab in TDR comes with pandas_profiling preinstalled. This allows us to easily check data quality by invoking the profile_report function that’s added to pandas dataframes.
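For example (the toy dataframe here is ours; importing pandas_profiling is what attaches profile_report to DataFrames):

import pandas as pd
import pandas_profiling  # noqa: F401 -- importing attaches .profile_report() to DataFrames

df = pd.DataFrame({"views": [120, 340, None], "country": ["US", "US", "MX"]})
report = df.profile_report(title="Data quality check")
report  # renders the interactive report inline in the notebook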
TDR also comes with a data catalog written in Python that combines the technical metadata from Redshift with the user metadata (like descriptions) created as part of our data warehouse. Using the data catalog, we can easily browse data objects, search for data using keywords or regular expressions, and rely on tab-completion, making data discovery much easier without ever leaving the notebook.
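Hypothetically, a session with the catalog might look like this; the names below are invented for illustration and are not the real TDR API:

from tdr import catalog  # hypothetical import path

catalog.search("retention")           # keyword search across tables and columns
catalog.search(r"fact_.*_daily")      # regular-expression search
catalog.tables.fact_views.describe()  # tab-completion makes objects discoverable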
TDR also uses plotly to create an interactive 3-D visualization that helps everyone explore our tens of thousands of movies and TV series. We call this the “TDR content explorer”.
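As a rough illustration of the kind of plot involved (the coordinates and titles below are made up, and the real content explorer is far richer):

import pandas as pd
import plotly.express as px

# Toy data standing in for per-title coordinates (e.g., embedding projections).
titles = pd.DataFrame({
    "x": [0.1, 0.5, -0.3], "y": [1.2, -0.7, 0.4], "z": [0.3, 0.9, -1.1],
    "title": ["Movie A", "Movie B", "Series C"],
    "genre": ["Drama", "Comedy", "Horror"],
})
fig = px.scatter_3d(titles, x="x", y="y", z="z", color="genre", hover_name="title")
fig.show()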
Specifically for machine learning, we created visualization tools to compare different versions of our ranking algorithms so that the results are easy to understand for non-ML experts. This greatly reduces the occurrence of bugs and helps to make the models more interpretable.
The road ahead
Adoption of Tubi Data Runtime started in data science and has since expanded across product, engineering, QA, and operations teams. We’ve moved TDR onto Kubernetes, which we continue to improve and will describe in a follow-up post. We’ve even launched a “Tubi Data Certification” course across the company that uses TDR as an interactive training environment to help SQL beginners learn how to work with our data.
Currently we’re integrating TensorBoard and other tools specifically for deep learning workflows. In the future, we also plan to let TDR connect to remote Spark or TensorFlow clusters to enable truly large-scale data processing. If building data tools to democratize analytics and data science sounds exciting to you, come join your friendly neighborhood Tubi data engineering team.