In this post, we introduce Viewflow, a newly released open-source framework that allows data scientists to create data models without writing Airflow code.
To provide DataCamp with business insights, our data scientists build their data models by creating materialized views — the results of complex data manipulations. These materialized views are created based on raw data and are often dependent on other materialized views. We use Apache Airflow, a framework that allows users to create and schedule data workflows programmatically. Programming Apache Airflow, however, can be complex and time-consuming. The more views we have in a data workflow, the more Airflow code we must write and maintain.
At DataCamp, we strongly believe that our data scientists should focus on writing views (i.e., writing SQL, R, or Python code that provides actionable insights), and not writing Airflow code.
To enable our data scientists to avoid writing Airflow code, we built Viewflow. Viewflow is a framework that automatically transforms a series of SQL, R markdown, and Python files into an Apache Airflow workflow, i.e., a directed acyclic graph (DAG) made up of tasks. A DAG is the Airflow implementation of a logical data model, and a task is the Airflow implementation of a view.
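If you have never used Airflow, here is a minimal sketch of what such a DAG looks like in code. The task names, connection id, and SQL file paths below are placeholders, and it assumes Airflow 2 with the Postgres provider package installed:

```python
# A minimal, illustrative Airflow DAG: each task materializes one view.
# All names are placeholders, not DataCamp's actual pipeline.
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

with DAG(
    dag_id="demo",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
) as dag:
    user_revenue = PostgresOperator(
        task_id="user_revenue",       # the task that builds the view
        postgres_conn_id="warehouse", # hypothetical warehouse connection
        sql="sql/user_revenue.sql",   # the view's SQL
    )
    monthly_arr = PostgresOperator(
        task_id="monthly_arr",
        postgres_conn_id="warehouse",
        sql="sql/monthly_arr.sql",
    )

    # monthly_arr reads from user_revenue, so it must run after it.
    user_revenue >> monthly_arr
```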
Today, we are excited to announce our contribution to the open-source community by releasing the source code of Viewflow! With Viewflow, not only will your data scientists be able to focus on what they do best (providing business insights), but they will also do so at a much faster pace, because they won’t need to worry about data engineering tasks.
Data-Driven Business Decisions at DataCamp
At DataCamp, not only do we teach data science, but we also do data science. We have a great team of data scientists who conduct financial analyses (ARR, revenue share, subscriptions, churn, …), content analyses (course ratings, project completions, user time spent), and marketing analyses (acquisitions, conversions, ad spend), and who also implement advanced analytics such as course recommendations or skill assessments.
To perform these analyses, our data scientists create materialized views. A materialized view results from (potentially complex) data transformations, such as aggregations, joins, or the implementation of business rules.
Depending on a view’s complexity, our data scientists create it in SQL, R, or Python. They are free to use whichever language they believe is best suited to build their views.
Once the views are created or updated, they are used to power our dashboards, and they are accessible to every DataCamp employee. DataCamp employees can query these views in SQL (for instance, via Metabase), in R, or in Python. Once again, we provide users with the language that best suits their use case.
Curious to find out how we allow our users to easily connect to our data warehouse in R or Python? We created two internal packages, datacampr and datacampy, that handle authentication to our data warehouse and provide our data scientists with utility methods to query and manipulate the data. These packages will be the topic of a future blog post!
These views are considered the source of truth for every business decision we make in the company, be it a product manager implementing new product features or our VP of Finance making critical financial decisions.
Airflow, DAGs, and the Complexity of Adding Views
We use Apache Airflow to schedule the SQL queries’ execution and the R and Python scripts that create our views. In Airflow, a view is created by a task, and we organize these tasks in different DAGs — one (or multiple) DAGs per business unit (e.g., marketing, finance, content). This organization allows us to have clear and readable DAGs, where it is easy to find a task we are looking for (for debugging purposes, for instance, or to manually trigger it after code changes).
We have a total of 19 DAGs that create 500+ views in our Redshift data warehouse. Here is our finance DAG:
[Figure: DataCamp’s finance DAG in the Airflow UI]
As you can see in the Finance DAG, many views have dependencies, that is, they use data from previously created views. These dependencies can be either internal or external. An internal dependency is an upstream view in the same DAG (i.e., the views are connected within the DAG). An external dependency is a view that belongs to a different DAG. In the image above, internal dependencies are depicted by two tasks being connected. External dependencies are depicted by a task with a dark background (i.e., an external task sensor in Airflow).
As depicted in the image above, our DAGs can be pretty complicated, with many internal and external dependencies. To write a new view without Viewflow, our data scientists would need to:
- Write the SQL query or the Rmd or Python script that creates the view.
- Update the DAG’s Python script to add the new task, using the Airflow API and the correct Airflow operators.
- Write Python code to define the task’s internal and external dependencies using the Airflow API.
Only the first step creates data science value. The last two add no scientific value, but they are required to materialize the new view.
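To make this concrete, here is roughly what steps 2 and 3 look like in plain Airflow. Everything below is an invented sketch (DAG id, task names, connection id), not our actual finance DAG:

```python
# Steps 2 and 3 by hand (illustrative names; assumes Airflow 2
# with the Postgres provider package installed).
from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator
from airflow.sensors.external_task import ExternalTaskSensor

with DAG(
    dag_id="finance",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
) as dag:
    # Step 2: declare one task per view with the right operator.
    subscriptions = PostgresOperator(
        task_id="subscriptions",
        postgres_conn_id="warehouse",
        sql="sql/subscriptions.sql",
    )
    churn_by_plan = PostgresOperator(
        task_id="churn_by_plan",
        postgres_conn_id="warehouse",
        sql="sql/churn_by_plan.sql",
    )

    # Step 3: wire the internal dependency by hand...
    subscriptions >> churn_by_plan

    # ...and the external one, as a sensor that waits for a task
    # living in another DAG (the dark-background tasks in the image above).
    wait_for_users = ExternalTaskSensor(
        task_id="wait_for_users",
        external_dag_id="marketing",
        external_task_id="users",
    )
    wait_for_users >> churn_by_plan
```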
Moreover, the last two steps can be problematic for several reasons. First, they require specific knowledge of Airflow. Second, they are prone to bugs, because they force a context switch between data science and data engineering and require the dependency logic to be written by hand. Third, a logical error may not raise any programming error or exception: the code can pass every test and still produce bad end-user data that is difficult to identify and debug.
For example, suppose view C depends on data from views A and B (see the left-most side of the figure below), but A was never set as a dependency of C (see the right-most side of the figure below). Then C might run before A has been updated (depending on how the scheduler picks up the tasks). In this case, view C will be updated but not up to date: it will contain stale data because of the error in setting the logical dependencies. If that view is used for high-stakes business decisions, the consequences of that logical error can be large.
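In code, the bug boils down to one missing line. The sketch below uses placeholder tasks to show it:

```python
# Placeholder DAG reproducing the figure: C reads from both A and B.
from datetime import datetime

from airflow import DAG
from airflow.operators.empty import EmptyOperator  # Airflow >= 2.3

with DAG(dag_id="views", start_date=datetime(2021, 1, 1)) as dag:
    view_a = EmptyOperator(task_id="view_a")
    view_b = EmptyOperator(task_id="view_b")
    view_c = EmptyOperator(task_id="view_c")

    view_b >> view_c
    # Bug: "view_a >> view_c" was never written. Nothing fails and no
    # test complains, but the scheduler may now run view_c before
    # view_a, so view_c is silently rebuilt on stale data.
```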
[Figure: views A and B feeding view C, with the A → C dependency present (left) and missing (right)]
Viewflow to the Rescue
Viewflow is a framework built on top of Airflow that handles the creation of Airflow DAGs and tasks and automatically manages the task dependencies, be they internal or external. Viewflow enables data scientists to focus on the view-creation logic in their preferred tool (SQL, R, or Python). This significantly reduces the data engineering burden on data scientists.
To write a new view with Viewflow, our data scientists would need to:
- Write the SQL query or the Rmd or Python script that creates the view, and provide some metadata (e.g., the view’s description and the columns’ descriptions; see the code snippet below).
And that’s it! Viewflow takes care of the remaining steps:
- Automatically creates and adds the new task to its DAG
- Automatically sets the task’s internal and external dependencies
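For example, a Python view in Viewflow is a plain script whose metadata lives in the module docstring. The snippet below is a sketch modeled on the project’s demo; the exact metadata keys and the expected function signature may differ, so treat the repository’s README as the reference:

```python
"""
---
owner: data@datacamp.com
description: Number of courses completed per user
fields:
  user_id: The id of the user
  courses_completed: How many courses the user completed
schema: analytics
connection_id: warehouse
---
"""
import pandas as pd


def python_view(db_engine):
    # Read an upstream view; Viewflow uses the view's source to set
    # the task's dependencies automatically, so nothing else to declare.
    completions = pd.read_sql_table("course_completions", db_engine)
    return (
        completions.groupby("user_id", as_index=False)
        .size()
        .rename(columns={"size": "courses_completed"})
    )
```

The returned DataFrame is what gets materialized as the view; the dependency on course_completions never has to be written down anywhere.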
By taking care of creating the task and setting the task’s dependencies, Viewflow provides:
- Quality of life improvements for data scientists. Data scientists choose their preferred language to solve the problem and focus their contribution on writing data science code.
- Reduced context switching. Since data scientists do not need to write Airflow code and manually set logical dependencies, data science and data engineering concerns are separated. Adding a view does not require both contexts.
- Increased velocity. By simplifying data scientists' contribution flow, new views can be added faster, and the team can be more responsive to requests.
DataCamp’s mission is to democratize data science. This is true for our users, but this is also true for ourselves. We want to allow every DataCamper (and everyone, really) to become data-driven.
Viewflow is a step towards that goal: it effectively lowers the barrier to entry. We have seen DataCamp employees without data science backgrounds (for example, engineers with SQL knowledge or product managers familiar with R) create new views with Viewflow without writing a single line of Airflow code.
With Viewflow, whether you are a data scientist or not, all you need to do is write SQL, Python, or R code to produce new insights.
Open-sourcing Viewflow
Viewflow has been extremely useful for us and has kept its promise: our data scientists write views, not Airflow code. Because of the benefits Viewflow has brought to DataCamp, we believe it can help other data organizations become more efficient, and we are proud to contribute it to the open-source community.
Want to see Viewflow in action? Run our demo on your local machine! The demo creates an Airflow Docker container shipped with Viewflow and a Postgres Docker container used as a data warehouse in which the views are created. Make sure to follow the README instructions and judge for yourself whether Viewflow is made for you!
We decided to release a subset of the features Viewflow currently supports (i.e., SQL and Python views). Some components of the version we use internally were written for DataCamp-specific use cases; for instance, Rmd views are executed on Amazon Fargate, so in its current state it would be complicated for an outside user to run all of Viewflow’s features. However, we strongly believe that the SQL and Python views alone can be extremely useful, and we will continue to add new view types (e.g., Rmd, Jupyter notebooks) in the coming months. Of course, if you see value in Viewflow and would like to contribute, don’t hesitate to open a pull request or create a feature request!
We Are Hiring
If you would like to be a part of a talented team working to democratize data science, check out our open positions!