5 Interesting datasets from the Data Engineering Zoomcamp

The DataTalks.Club Data Engineering Zoomcamp recently wrapped up. It gave data folks a way to learn and hone their skills through a series of free modules that cover the whole process of building a data pipeline, from data ingestion to dashboard. There’s no reason you can’t still take part, by the way. The modules are all online, and you can still join the DTC Slack for help.

The ultimate aim of the course was to put these skills into action by building your own pipeline. There were no limits on what you could create, and students could use any dataset they wanted. This resulted in some really interesting projects and surfaced some datasets that I didn’t know existed. Datasets are like gold for data engineers, especially when you want to try out new tech, so having hundreds of engineers all showing off what they found is really valuable.

Here are 5 projects with interesting datasets that you might use for your own projects. In principle, each pipeline should be reproducible if you clone the repository and set it up, but at the very least they are a source of inspiration. If you like the dataset and project, don’t forget to give the repository a star to show your support.

1. Daily data on Berlin bike thefts

Lisa Reiber made a project that analyzes bike theft data for Berlin. The dataset (CSV) is interesting because it is updated daily and includes data such as:

location of theft
time of theft
bicycle stats

The daily updates make it a good fit for projects that analyze changing trends, and for those that focus on data quality, since the dataset is always changing. The dataset was sourced from Berlin Open Data, which has lots of other datasets for you to explore.

Looker dashboard by Lisa Reiber on Berlin bike theft

Check the GitHub repository for more information, and the Looker dashboard to explore the data.

2. Capital Bikeshare

Another bike-related project comes from Muhammad Irfan Fadhlurrahman, who built his pipeline around the Capital Bikeshare dataset. The dataset includes around 6 million trips from January 2021 to January 2023.

Muhammad’s approach was to start with the questions that the data and dashboard would need to answer, much as you would in a business setting:

Which hour of the day has the most active members using the bikes?
What are the most popular start and end station pairings?
What is the number of rides and the average duration by day of week?
Which bikes have been ridden the most?
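
As an illustration of how one of these questions might be answered outside the dashboard, here is a minimal Python sketch that counts rides and averages their duration by day of week. It is not Muhammad’s implementation (his project uses dbt and Metabase); the field names `started_at` and `ended_at` and the sample rows are assumptions made up for the example.

```python
from collections import defaultdict
from datetime import datetime

# Hypothetical sample rows; real trip data has many more columns.
# The field names started_at / ended_at are assumed for this sketch.
SAMPLE_TRIPS = [
    {"started_at": "2023-01-02 08:00:00", "ended_at": "2023-01-02 08:10:00"},
    {"started_at": "2023-01-02 17:30:00", "ended_at": "2023-01-02 17:50:00"},
    {"started_at": "2023-01-07 11:00:00", "ended_at": "2023-01-07 11:45:00"},
]

TIME_FORMAT = "%Y-%m-%d %H:%M:%S"
# datetime.weekday() returns 0 for Monday through 6 for Sunday.
WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def rides_by_weekday(trips):
    """Return {weekday name: (ride count, average duration in minutes)}."""
    durations = defaultdict(list)
    for trip in trips:
        start = datetime.strptime(trip["started_at"], TIME_FORMAT)
        end = datetime.strptime(trip["ended_at"], TIME_FORMAT)
        minutes = (end - start).total_seconds() / 60
        # Group each ride under the weekday on which it started.
        durations[WEEKDAYS[start.weekday()]].append(minutes)
    return {
        day: (len(values), sum(values) / len(values))
        for day, values in durations.items()
    }


if __name__ == "__main__":
    for day, (count, avg) in rides_by_weekday(SAMPLE_TRIPS).items():
        print(f"{day}: {count} rides, {avg:.1f} min on average")
```

In a real pipeline the same grouping would more likely live in a SQL or dbt model, but the logic is the same: derive the weekday from the start time, then aggregate.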

Between the dbt modeling stage and the resulting Metabase dashboard, PipeRider was used to profile the dataset and check data quality. The data profiling reports generated by PipeRider provide interesting insights into the data, and help you ensure data quality and adjust your dbt models before you lift a finger to create a dashboard.

Using PipeRider to compare data before and after cleaning

The above image from a PipeRider comparison report shows the before and after stats following data cleaning.

Check out Muhammad’s GitHub repository for more information.

3. AIS Data pipeline

Lars Skaret’s project uses AIS (Automatic Identification System) data from the Danish Maritime Authority. The dataset is frequently updated and goes all the way back to 2006.

Lars used the data to answer questions about maritime traffic in Danish waters, and created a Looker dashboard with heat maps and vessel tracking.

Vessel tracking heatmap by Lars Skaret

Check out Lars’ GitHub repository for more information.

4. Air Quality

Grzegorz Gatkowski used data from OpenWeather’s Air Pollution API to create an Air Quality dashboard for Poland’s cities. The data includes historical, current, and forecast data for cities, and the OpenWeather API allows 1,000 API calls per day for free. The variety of data available across the collection of OpenWeather APIs makes this an interesting data source.
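
To get a feel for what the free tier allows, the arithmetic can be sketched in a few lines of Python. The 1,000 calls per day figure comes from the paragraph above; the number of cities and the assumption of one API call per city per poll are hypothetical, and none of this is taken from Grzegorz’s project.

```python
FREE_CALLS_PER_DAY = 1000  # free-tier limit quoted in the text
MINUTES_PER_DAY = 24 * 60


def polling_budget(n_cities, daily_limit=FREE_CALLS_PER_DAY):
    """Return (polls per city per day, minimum minutes between polls).

    Assumes one API call per city per poll, which is a simplification.
    """
    if n_cities <= 0:
        raise ValueError("n_cities must be positive")
    polls_per_city = daily_limit // n_cities
    if polls_per_city == 0:
        raise ValueError("too many cities for the daily limit")
    min_interval = MINUTES_PER_DAY / polls_per_city
    return polls_per_city, min_interval


if __name__ == "__main__":
    # 20 cities is a made-up figure for illustration.
    polls, interval = polling_budget(n_cities=20)
    print(f"{polls} polls per city per day, one every {interval:.1f} minutes")
```

With these assumptions, 20 cities could each be polled 50 times a day, or roughly once every 29 minutes, which is plenty for a daily or hourly batch pipeline.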

Air Quality dashboard by Grzegorz Gatkowski

Check out Grzegorz’s GitHub repository and Looker dashboard.

5. San Francisco eviction data

Sanyassyed used a dataset from DataSF on eviction notices served. The Eviction Notices dataset seems to be regularly updated and contains data going back to December 2014.

The project answers questions about the data such as:

trends in evictions
most/least common reasons for evictions
neighborhoods with most evictions

SF eviction dashboard by Sanyassyed

Check out Sanyassyed’s GitHub repository and dashboard for more information.

Other datasets

Here are some other datasets that I saw used that you might find interesting:

Minneapolis 311 data, used by Mike Cole
NYC Restaurant Inspection data, used by kawczy83
Steam Reviews data, used by Alicia Escontrela
Magic: The Gathering data, used by Vincenzo Galante

Conclusion

I chose these projects based on the dataset used. You might have particular topics or types of data that interest you, so my advice would be to search for dezoomcamp on GitHub and check through the repos for yourself. Not only can you find good datasets, but each engineer’s techniques differ, so you can learn a lot by checking the README and code.
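
If you would rather script that search than click through the site, here is a minimal Python sketch that builds a request URL for GitHub’s repository search endpoint. Sorting by stars and the page size are my own assumptions, and the sketch only builds the URL; fetching it and handling rate limits are left out.

```python
from urllib.parse import urlencode

GITHUB_SEARCH_URL = "https://api.github.com/search/repositories"


def build_search_url(keyword="dezoomcamp", per_page=10):
    """Build a repository search URL, sorted by stars (an assumed preference)."""
    params = {
        "q": keyword,
        "sort": "stars",
        "order": "desc",
        "per_page": per_page,
    }
    return f"{GITHUB_SEARCH_URL}?{urlencode(params)}"


if __name__ == "__main__":
    print(build_search_url())
```

Opening the printed URL in a browser, or passing it to an HTTP client, returns the matching repositories as JSON. Unauthenticated requests are rate limited, so a script that pages through many results would need a token.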

If you find any other interesting repos, please share them in the comments.
