Data development using Jupyter and Airflow


Data Products come in many shapes and sizes - tables, metrics, dashboards and ML models. But they have 2 things in common:

  1. They need to be created

  2. They need to be updated on a schedule (hourly, daily, weekly, monthly)

This tutorial shows how to set up Jupyter and Airflow to build data products and run them on a schedule. Both tools are open source, free, and market leaders in their domains, so you can't go wrong.

Docker is a great tool for testing locally; install Docker Desktop before you start.

Open the Terminal (or PowerShell on Windows) and follow along to build a sample data product using crypto data.

Create a python3 virtual environment for data tasks and activate it

python3 -m venv ~/.venvs/dpenv
source ~/.venvs/dpenv/bin/activate

Phidata converts data tools into plug-n-play apps. Enable an app and run it with 1 command. It’s the fastest way to run data tools locally.

Install phidata in your virtual env and follow the steps to initialize

pip install phidata
phi init -l

Workspace is a directory that contains the code for your data products. Create a new workspace using

phi ws init

Press Enter to select the default workspace name and template

Your workspace comes pre-configured with a Jupyter notebook; start the workspace to run it

phi ws up

Press Enter to confirm, then allow a few minutes for the image to download and the container to start.

Verify the container is running using the Docker dashboard or docker ps

Open localhost:8888 in a new tab to view the JupyterLab UI.

  • Password: admin

Open notebooks/examples/crypto_nb.ipynb and run all cells using Run → Run All Cells

This will download crypto prices and store them in a CSV table at storage/tables/crypto_prices.
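
The notebook ships with the workspace, but if you're curious what the task boils down to, here's a minimal sketch, assuming the public CoinGecko simple/price endpoint and pandas. The coin list, column names, and file name are illustrative; the actual notebook may differ.

# Illustrative sketch only -- the crypto_nb.ipynb notebook in the workspace
# may use a different API and schema.
from pathlib import Path

import pandas as pd
import requests

COINS = ["bitcoin", "ethereum"]
URL = "https://api.coingecko.com/api/v3/simple/price"  # assumed public endpoint

response = requests.get(URL, params={"ids": ",".join(COINS), "vs_currencies": "usd"})
response.raise_for_status()

# Flatten {"bitcoin": {"usd": ...}, ...} into rows and add a timestamp
rows = [{"ticker": coin, "usd": data["usd"]} for coin, data in response.json().items()]
df = pd.DataFrame(rows)
df["ds"] = pd.Timestamp.now(tz="UTC")

# Write to the same directory the tutorial uses for the CSV table
out_dir = Path("storage/tables/crypto_prices")
out_dir.mkdir(parents=True, exist_ok=True)
df.to_csv(out_dir / "crypto_prices.csv", index=False)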

Open the workspace/settings.py file and uncomment dev_airflow_enabled=True (line 19). Start the workspace using

phi ws up

Press Enter to confirm. Allow about 5 minutes for the containers to start and the database to initialize.

Check progress using: docker logs -f airflow-scheduler-container

Open localhost:8310 in a new tab to view the Airflow UI.

  • User: admin

  • Pass: admin

Switch ON the crypto_prices DAG, which contains the same task as the crypto_nb.ipynb notebook, but as a daily workflow.

Check out the workflows/crypto/prices.py file for the full code. The table is written to the storage/tables/crypto_prices directory.
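
prices.py uses phidata's own workflow definitions, so the file won't match this exactly, but as a rough sketch of what the same daily job looks like in plain Airflow (2.4+ TaskFlow API), with the schedule and task body purely illustrative:

# Minimal plain-Airflow sketch of a daily crypto prices job.
# Illustrative only -- workflows/crypto/prices.py uses phidata's
# workflow syntax, not this exact code.
import pendulum
from airflow.decorators import dag, task


@dag(
    dag_id="crypto_prices",
    schedule="@daily",  # run once a day
    start_date=pendulum.datetime(2023, 1, 1, tz="UTC"),
    catchup=False,
)
def crypto_prices():
    @task
    def download_prices() -> None:
        # Same logic as the notebook: fetch prices and write them to the
        # CSV table under storage/tables/crypto_prices.
        ...

    download_prices()


crypto_prices()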

Play around, create notebooks and DAGs, and read more about phidata

Stop the workspace using

phi ws down

This tutorial showed how to run Jupyter and Airflow to set up a local data development environment. In the next tutorial, we'll run this in production on AWS. Leave a comment to let me know if you finished this in under 30 minutes :)

Love to all,
Ashpreet
