The DataTalks.Club Data Engineering Zoomcamp recently wrapped up. The course gave data folks a way to learn and hone their skills through a series of free modules that cover the whole process of building a data pipeline, from data ingestion to dashboard. There’s no reason you can’t still take part, by the way. The modules are all online, and you can still join the DTC Slack for help.
The ultimate aim of the course was to put these skills in action by building your own pipeline. There were no limits on what you could create, and students could select and use any dataset they wanted. This resulted in some really interesting projects and surfaced some datasets that I didn’t know existed. Datasets are like gold for data engineers, especially when you want to try out new tech, so having hundreds of engineers all showing off what they found is really valuable.
Here are five projects I found whose datasets you might want to use in your own work. Technically, these pipelines should all be reproducible if you clone the repository and set it up, but at the very least they are a source of inspiration for your own projects. If you like the dataset and project, don’t forget to give the repository a star to show your support.
1. Daily data on Berlin bike thefts
Lisa Reiber made a project that analyzes bike theft data for Berlin. The dataset (CSV) is interesting because it is updated daily and includes data such as:
- location of theft
- time of theft
- bicycle stats
The daily update frequency makes it a good fit for projects that analyze changing trends, and for those that focus on data quality, since the dataset is ever-changing. The dataset was sourced from Berlin Open Data, which also has lots of other datasets for you to explore.
Check the GitHub repository for more information, and the Looker dashboard to explore the data.
2. Capital Bikeshare
Another bike-related project, this one from Muhammad Irfan Fadhlurrahman, is based on the Capital Bikeshare dataset. The dataset includes data on around 6 million trips from January 2021 to January 2023.
Muhammad’s approach was to start with the questions that the data and dashboard would need to answer, much as you would in a business setting:
- Which hour of the day has the most active members using the bikes?
- What are the most popular start and end station pairings?
- What is the number of rides and the average duration by day of the week?
- Which bikes have been ridden the most?
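To make one of these questions concrete, here is a minimal sketch of how "rides and average duration by day of the week" could be computed in plain Python. It is not taken from Muhammad’s project, which does its modeling in dbt. The column names `started_at` and `ended_at`, the timestamp format, and the sample rows are all assumptions made for illustration.

```python
from collections import defaultdict
from datetime import datetime

# Assumed timestamp format and column names; adjust to the real dataset.
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M:%S"
DAY_NAMES = ["Monday", "Tuesday", "Wednesday", "Thursday",
             "Friday", "Saturday", "Sunday"]

# Hypothetical sample rows standing in for the real trip data.
SAMPLE_TRIPS = [
    {"started_at": "2023-01-02 08:00:00", "ended_at": "2023-01-02 08:30:00"},
    {"started_at": "2023-01-02 17:15:00", "ended_at": "2023-01-02 17:25:00"},
    {"started_at": "2023-01-07 10:00:00", "ended_at": "2023-01-07 11:00:00"},
]


def rides_by_weekday(trips):
    """Return ride count and average duration in minutes per weekday."""
    totals = defaultdict(lambda: [0, 0.0])  # day -> [ride count, total minutes]
    for trip in trips:
        start = datetime.strptime(trip["started_at"], TIMESTAMP_FORMAT)
        end = datetime.strptime(trip["ended_at"], TIMESTAMP_FORMAT)
        minutes = (end - start).total_seconds() / 60
        day = DAY_NAMES[start.weekday()]  # weekday() returns 0 for Monday
        totals[day][0] += 1
        totals[day][1] += minutes
    return {
        day: {"rides": count, "avg_minutes": total / count}
        for day, (count, total) in totals.items()
    }


summary = rides_by_weekday(SAMPLE_TRIPS)
for day, stats in summary.items():
    print(day, stats["rides"], round(stats["avg_minutes"], 1))
```

In a real pipeline this aggregation would more likely live in a SQL or dbt model, but writing it out once makes it easier to decide which columns the dashboard actually needs.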
Between the dbt modeling stage and the resulting Metabase dashboard, PipeRider was used to profile the dataset and check data quality. The data profiling reports generated by PipeRider provide interesting insights into the data, and help you ensure data quality and adjust your dbt models before you lift a finger to create a dashboard.
The above image from a PipeRider comparison report shows the before and after stats following data cleaning.
Check out Muhammad’s GitHub repository for more information.
3. AIS Data pipeline
Lars Skaret’s project uses AIS (Automatic Identification System) data from the Danish Maritime Authority. The dataset is frequently updated and goes all the way back to 2006.
Lars used the data to answer questions about maritime traffic in Danish waters, and created a Looker dashboard with heat maps and vessel tracking.
Check out Lars’ GitHub repository for more information.
4. Air Quality
Grzegorz Gatkowski used data from Open Weather’s Air Pollution API to create an Air Quality dashboard for Poland’s cities. The data includes historical, current, and forecast data for cities, and the Open Weather API allows 1000 API calls per day for free. The variety of data available from the collection of Open Weather APIs makes this an interesting data source.
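If you plan a similar project, it helps to check how far the free quota stretches before you design the ingestion schedule. The sketch below is simple arithmetic based on the 1000 calls per day mentioned above. The function name `max_cities` and the assumption that each poll costs one call per city are mine, not part of the Open Weather API or Grzegorz’s project.

```python
DAILY_FREE_CALLS = 1000  # free-tier figure quoted in this article


def max_cities(polls_per_day, calls_per_poll=1, daily_limit=DAILY_FREE_CALLS):
    """Largest number of cities that fits in the daily call quota.

    Assumes every city costs `calls_per_poll` API calls each time it is polled.
    """
    if polls_per_day <= 0 or calls_per_poll <= 0:
        raise ValueError("polls_per_day and calls_per_poll must be positive")
    return daily_limit // (polls_per_day * calls_per_poll)


hourly = max_cities(polls_per_day=24)
hourly_with_forecast = max_cities(polls_per_day=24, calls_per_poll=2)
print(hourly, hourly_with_forecast)
```

Under these assumptions, hourly polling with one call per city leaves room for 41 cities, and the number halves (rounded down) if each poll also fetches a second endpoint such as the forecast.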
Check out Grzegorz’s GitHub repository and Looker dashboard.
5. San Francisco eviction data
Sanyassyed used a dataset made available from DataSF on eviction notices served. The Eviction Notices dataset seems to be regularly updated and contains data going back to December 2014.
Sanyassyed answers questions relating to the data such as:
- trends in evictions
- most/least common reasons for evictions
- neighborhoods with most evictions
Check out Sanyassyed’s GitHub repository and dashboard for more information.
Other datasets
Here are some other datasets that I saw used that you might find interesting:
- Minneapolis 311 data, used by Mike Cole
- NYC Restaurant Inspection data, used by kawczy83
- Steam Reviews data, used by Alicia Escontrela
- Magic: The Gathering data, used by Vincenzo Galante
Conclusion
I chose these projects based on the dataset used. You might have particular topics or types of data that interest you, so my advice would be to search for dezoomcamp on GitHub and check through the repos for yourself. Not only can you find good datasets, but each engineer’s techniques differ, so you can learn a lot by reading the README and code.
If you find any other interesting repos, please share them in the comments.