Introducing Tubi Data Runtime


Chang She

Unified End-to-End Data Workflows on JupyterHub

At Tubi, we’ve standardized on Jupyter notebooks as the unified platform for doing analytics and data science work. Whether we’re exploring user behavior, doing time series forecasting for company KPIs, or building deep learning object detection models, we’re doing it on a customized JupyterHub installation we call Tubi Data Runtime (TDR).

We mentioned TDR in a previous post, and in this post we go into further detail. We will describe optimized data connectors, custom Jupyter cell magic and extensions, and domain-specific visualization tools. In a follow-up post, we will describe our experience running JupyterHub itself on Kubernetes.

Optimized data connectors

For most Jupyter users, when they open a notebook, the first step is always getting data. At Tubi, the majority of our data sources are from AWS Redshift and S3. By analyzing our Redshift query metrics, we found that 95% of our queries returned a result set under 1GB in size, comfortably fitting in RAM. However, before TDR, even at that size it took a long time to retrieve the data. Many of our workflows involved spinning up a Spark cluster, executing the Redshift query, then converting the Spark DataFrame into a Pandas DataFrame. The remainder of the workflow would continue only in pandas. This was incredibly wasteful and slow.

df = query_redshift(sc, sql).toPandas() # 95% of Redshift queries

The main reason we were forced into this kind of workflow was that the native pandas.read_sql is really only good for small amounts of data. At the same time, the pandas API was more familiar and lighter weight than PySpark. Koalas may help adoption, but it’s still young and does not solve the resource-waste issue. In order to have our cake and eat it too, we created a custom Redshift connector that takes advantage of Redshift’s UNLOAD command, which writes partial results from each node to S3 as CSV files. For added efficiency, our connector uses Python’s multiprocessing module to read the partial results in parallel and then combines them with pandas.concat.

Read partial results in parallel. Put them together using pd.concat. Orchestrate using the UNLOAD manifest.

The create_executor method above uses the starmap method from multiprocessing.Pool. The _read_part function (code not shown) will call pd.read_csv to deserialize each query result CSV file.

Read partial CSV results using starmap from multiprocessing.Pool
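To make the idea concrete, here is a simplified, flattened sketch of the connector. The _read_part signature, the unload_to_s3 helper, and the CSV options are assumptions for illustration (the real connector derives them from the UNLOAD manifest), and reading s3:// paths with pd.read_csv assumes s3fs is installed.

import multiprocessing as mp

import pandas as pd

def _read_part(bucket: str, key: str) -> pd.DataFrame:
    # Deserialize one partial result file written by UNLOAD. The quote
    # character, delimiter, and NULL markers must match the options we
    # passed to the UNLOAD command itself.
    return pd.read_csv(
        f"s3://{bucket}/{key}",
        delimiter="|",
        quotechar='"',
        na_values=[""],
    )

def query_redshift_to_df(sql: str, bucket: str, prefix: str) -> pd.DataFrame:
    # 1. Ask Redshift to UNLOAD the query results to S3 as CSV parts.
    #    unload_to_s3 is a hypothetical helper that issues the UNLOAD
    #    statement and returns the part keys listed in the manifest.
    part_keys = unload_to_s3(sql, bucket, prefix)

    # 2. Read the parts in parallel with multiprocessing.Pool.starmap.
    with mp.Pool() as pool:
        frames = pool.starmap(_read_part, [(bucket, key) for key in part_keys])

    # 3. Combine the partial DataFrames into a single result.
    return pd.concat(frames, ignore_index=True)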

When reading CSVs, it is very easy to fail due to mismatched quote characters, NAs, delimiters, etc. But because our connector supplies those options to both the UNLOAD command and pd.read_csv, we can guarantee that they’re consistent with each other and avoid these pitfalls. Redshift also creates a manifest file that we use to add additional validation on row counts and to help with data type conversions.


How our custom pandas-redshift connector works end-to-end

At the end of the day, users simply run df = tdr.query_redshift(query).to_df() and all of the plumbing is handled under the hood. We find that this approach gives us a comparable or faster “time-to-df”, does not require working with a Spark context, and uses far fewer compute resources.

Very non-scientific performance comparison between PySpark and TDR. Ran the same query (20 columns) with different LIMIT parameters on an 8-node Databricks cluster vs in a TDR Jupyter notebook (2 CPU + 8GB RAM). Cluster startup time and resizing time excluded from PySpark numbers.

There are two main gotchas with the UNLOAD command. First, UNLOAD does not support LIMIT clauses in the innermost query. This we can solve with a common table expression:

WITH result_table AS (SELECT * FROM raw_table LIMIT 100)
SELECT * FROM result_table

UNLOAD also does not respect ORDER BY clauses. This we solve by doing the sorting in pandas after we get the results.
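As a rough sketch, the connector can apply both workarounds automatically; wrap_for_unload and the sort column below are illustrative rather than the actual TDR implementation.

def wrap_for_unload(user_sql: str) -> str:
    # UNLOAD rejects LIMIT in the innermost query, so wrap the user's
    # query in a common table expression and select from that instead.
    return f"WITH result_table AS ({user_sql}) SELECT * FROM result_table"

# UNLOAD also ignores ORDER BY, so re-apply any ordering in pandas once
# the partial results have been combined.
df = tdr.query_redshift(wrap_for_unload("SELECT * FROM raw_table LIMIT 100")).to_df()
df = df.sort_values("view_count", ascending=False)  # illustrative sort column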

Low code data exploration

While Jupyter makes it easy for data scientists to analyze data, we want to lower the barrier to entry for analysts who are familiar with SQL but are just starting out with Python. For this purpose, we created a cell magic and a custom extension.

SQL cell magic

Many of us forget our own hurdles from when we first learned to code, but for beginners, even “basic” concepts like working with variables, function calls, and string manipulation do not come automatically. To make it easier for them to analyze data using SQL, we created a custom cell magic named “%%sql”. The core logic of this cell magic is quite simple:

Evaluate the cell using `query_redshift` and set the output as the user variable `df`

When the magic is invoked in a cell, the user can enter a SQL query directly into the cell which TDR will execute using the query_redshift functionality described previously.
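A stripped-down sketch of the core logic looks something like the following; the real magic adds the debugging options and output described below, and registering it this way assumes we are inside an IPython session.

from IPython import get_ipython
from IPython.core.magic import register_cell_magic

import tdr  # Tubi's internal connector package described above

@register_cell_magic("sql")
def sql_magic(line, cell):
    # Run the cell body as a Redshift query and expose the result to the
    # user as the variable `df`.
    df = tdr.query_redshift(cell).to_df()
    get_ipython().push({"df": df})
    return df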

We also add extra options and output to help with debugging

You’ll notice there is a {{ott_app}} clause, which isn’t actually valid SQL. This is a companion feature to the SQL magic called…well…snippets. Using Jinja2, we associate commonly used SQL snippets with aliases that can be referenced directly in SQL queries. This saves us lots of repetitive typing, prevents careless mistakes, and helps us unify definitions. For example, the {{ott_app}} snippet can be used as a WHERE condition requiring that the field “app” be any one of our many OTT platforms:

Embed composable snippets into SQL statements using Jinja2
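A minimal sketch of the snippet expansion with Jinja2; the snippet contents, platform values, and table name are illustrative.

from jinja2 import Template

# Commonly used SQL fragments keyed by alias.
SNIPPETS = {
    "ott_app": "app IN ('roku', 'firetv', 'appletv', 'androidtv')",
}

def render_snippets(raw_sql: str) -> str:
    # Expand {{alias}} references before sending the query to Redshift.
    return Template(raw_sql).render(**SNIPPETS)

sql = render_snippets("""
    SELECT date, COUNT(*) AS views
    FROM video_views
    WHERE {{ott_app}}
    GROUP BY 1
""")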

Integrating the nteract data explorer

Once we have our dataframe, we almost always need to visualize the data at the beginning of our exploratory data analysis process. The nteract data explorer from Netflix automatically visualizes dataframes. However, it is part of a different notebook project, and the default Jupyter extensions require a separate tab, making the user experience very awkward. Instead, we decided to create a simple JupyterLab extension that can directly render the explorer as the output of display(df). In addition, the nteract data explorer did not store state, which meant that whenever you refreshed your notebook, it would go back to the default state, making both analysis and collaboration difficult. Using the same extension, we store the data explorer state in the notebook metadata itself so that when you share the URL with others, your visualization choices are preserved.

Simplified extension code to save data explorer state

When we put together the cell magic and this extension, the only line of code required to run a SQL query and visualize the results is display(df). This drastically reduces the barrier to entry for Python beginners.


Cell magic and Jupyter extension help beginners write only SQL

Jovyan collaboration

Shareable links

By default, the URLs in JupyterHub are specific to the user’s JupyterLab server (e.g., https://<jupyterhub host>/user/cshe/). This means that the default “shareable” links from JupyterLab aren’t actually shareable. Instead, JupyterHub recommends using the /user-redirect/ route, which will then redirect the URL to the current user. We implemented our own plugin to rewrite the URL using /user-redirect/ and we’ll look to contribute this back to JupyterHub (if a kind soul has not already done so).

Deep Copy / Cut / Paste

Another minor annoyance we found was that the “Copy” and “Paste” menu items in the file browser do not work on directories. As a temporary workaround, we created “Deep Copy” (also cut and paste) extensions to deal with this particular issue.

Extensions modify existing menu items or add new menu items

We’ve open-sourced our JupyterLab extensions for public use, and we hope to contribute these back to the main JupyterHub and JupyterLab feature sets.

Additional features

TDR is chock-full of features that we only have room to summarize here. The JupyterLab in TDR comes with pandas_profiling preinstalled. This allows us to easily check data quality by invoking the profile_report function that’s added to pandas dataframes.


TDR comes with pandas_profiling pre-installed for easy data quality checks
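Importing pandas_profiling registers profile_report on pandas DataFrames, so a quick data quality check looks roughly like this (the table name is illustrative):

import pandas_profiling  # noqa: F401 -- registers DataFrame.profile_report

import tdr

df = tdr.query_redshift("SELECT * FROM video_views LIMIT 100000").to_df()
df.profile_report(title="video_views data quality check")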

TDR also comes with a data catalog written in Python that combines the technical metadata from Redshift with the user metadata (like descriptions) created as part of our data warehouse. Using the data catalog, we can easily browse data objects, search for data using keywords or regular expressions, and use tab-completion to make data discovery much easier without jumping out of the notebook at all.

Work with the data catalog without leaving your keyboard. Tab-completion available as well.
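The catalog itself is internal, but a toy sketch shows the general pattern for tab-completion and regex search; all names and fields here are hypothetical.

import re

class Catalog:
    def __init__(self, tables):
        # tables: mapping of table name -> description, built from Redshift
        # technical metadata plus user-supplied descriptions.
        self._tables = tables

    def __dir__(self):
        # Expose table names so Jupyter tab-completion can discover them.
        return list(self._tables) + ["search"]

    def __getattr__(self, name):
        try:
            return self._tables[name]
        except KeyError:
            raise AttributeError(name)

    def search(self, pattern):
        # Keyword or regular-expression search over names and descriptions.
        rx = re.compile(pattern, re.IGNORECASE)
        return {name: desc for name, desc in self._tables.items()
                if rx.search(name) or rx.search(desc)}

catalog = Catalog({"video_views": "One row per playback event"})
catalog.search("playback")  # -> {'video_views': 'One row per playback event'}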

TDR also uses plotly to create an interactive 3-D visualization that helps everyone explore our tens of thousands of movies and TV series. We call this the “TDR content explorer”.


Explore our content neighborhoods
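We can’t reproduce the interactive explorer here, but the basic pattern is a Plotly Express 3-D scatter over precomputed embedding coordinates; the dataframe and column names below are assumptions.

import plotly.express as px

# titles_df is assumed to already hold 3-D embedding coordinates (x, y, z)
# plus a title and genre for each movie or series.
fig = px.scatter_3d(
    titles_df,
    x="x", y="y", z="z",
    color="genre",
    hover_name="title",
    title="TDR content explorer (illustrative)",
)
fig.show()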

Specifically for machine learning, we created visualization tools to compare different versions of our ranking algorithms so that the results are easy to understand for non-ML experts. This greatly reduces the occurrence of bugs and helps to make the models more interpretable.


Visually compare the results of changes to personalization

The road ahead

Adoption of Tubi Data Runtime started out in data science and has expanded across product, engineering, QA, and operations teams. We’ve moved TDR onto Kubernetes, which we continue to improve and will describe in a follow-up post. We’ve even launched a “Tubi Data Certification” course across the company that uses TDR as an interactive training environment to help SQL beginners learn how to work with our data.

Currently we’re integrating TensorBoard and other tools specifically for deep learning workflows. In the future, we also plan to allow TDR to connect to remote Spark or TensorFlow clusters to enable truly large-scale data processing. If building data tools to democratize analytics and data science sounds exciting to you, come join your friendly neighborhood Tubi data engineering team.