Wikidata Extraction
A toolkit for extracting and querying structured data from Wikidata. It converts Wikidata's N-Triples dump into Parquet files for efficient querying with DuckDB, and provides data extraction pipelines for specific entity types (Spotify artists, YouTube channels, Letterboxd films, etc.).
The triplets Parquet dataset and all data extractions are available at huggingface.co/datasets/piebro/wikidata-extraction.
There is also a website at piebro.github.io/wikidata-extraction/random.html that shows random entities, using DuckDB WASM to query the .parquet files on Hugging Face. It can be a lot of fun to just explore random artists, YouTube channels, and other entities.
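You can also query the hosted .parquet files directly from Python without downloading them first, via DuckDB's httpfs/hf:// support. This is a minimal sketch, assuming the triplets live under triplets_all/ in the dataset repo (the same layout the local download below uses):
uv run --with duckdb python << 'EOF'
import duckdb

# Read the hosted Parquet files directly over HTTP (no local download).
# The triplets_all/ path is an assumption about the dataset repo layout.
con = duckdb.connect()
con.execute("INSTALL httpfs")
con.execute("LOAD httpfs")
print(con.sql("""
    SELECT *
    FROM 'hf://datasets/piebro/wikidata-extraction/triplets_all/*.parquet'
    LIMIT 5
""").df())
EOF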
Usage
Install uv and run the following command to download a dataset from huggingface.co/datasets/piebro/wikidata-extraction.
uv run --with huggingface-hub hf download piebro/wikidata-extraction --repo-type=dataset --local-dir=.
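If you would rather trigger the download from Python than from the CLI, huggingface_hub's snapshot_download does the same thing. A short sketch of the equivalent call:
uv run --with huggingface-hub python << 'EOF'
from huggingface_hub import snapshot_download

# Same as the CLI command above: download the whole dataset repo
# into the current directory.
snapshot_download(
    repo_id="piebro/wikidata-extraction",
    repo_type="dataset",
    local_dir=".",
)
EOF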
You can then query the data, for example with DuckDB in Python, to get GitHub usernames stored in Wikidata:
uv run --with duckdb python << 'EOF'
import duckdb

print(duckdb.sql("""
    SELECT object AS github_username
    FROM 'triplets_all/*.parquet'
    WHERE predicate = 'http://www.wikidata.org/prop/direct/P2037'
    LIMIT 10
""").df())
EOF
Look at AGENTS.md for how to use the dataset and some common code patterns.
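Another handy pattern is a simple aggregation over the whole dump, e.g. to see which properties occur most often. This sketch only relies on the predicate column used above; scanning all ~8 billion rows takes a while:
uv run --with duckdb python << 'EOF'
import duckdb

# Count how often each predicate occurs across the whole dump.
# This scans all triplet files, so expect it to take a while on triplets_all.
print(duckdb.sql("""
    SELECT predicate, COUNT(*) AS n
    FROM 'triplets_all/*.parquet'
    GROUP BY predicate
    ORDER BY n DESC
    LIMIT 20
""").df())
EOF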
Development Setup
Install uv and git and run the following commands to download the code and set everything up.
git clone https://github.com/piebro/wikidata-extraction.git
cd wikidata-extraction
uv sync --python 3.13
uv run pre-commit install
Create wikidata parquet files
# Download: https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2 (~8 billion rows, ~40GB)
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2

# Create triplet parquet files
uv run scripts/wikidata_parquet/parse_all_triples.py latest-truthy.nt.bz2 triplets_50M --num-of-lines 50_000_000 # this should take just a few seconds
uv run scripts/wikidata_parquet/parse_all_triples.py latest-truthy.nt.bz2 triplets_1000M --num-of-lines 1_000_000_000
uv run scripts/wikidata_parquet/parse_all_triples.py latest-truthy.nt.bz2 triplets_all
# Creating triplets_all (~60GB) on my laptop running Ubuntu 24.04 takes ~5:30h on my `Intel Core Ultra 7 155U × 14` using less than 4GB of RAM

# Create the predicate_labels.parquet files
uv run scripts/wikidata_parquet/create_predicate_labels_data.py triplets_50m data_50m
uv run scripts/wikidata_parquet/create_predicate_labels_data.py triplets_1000m data_1000m
uv run scripts/wikidata_parquet/create_predicate_labels_data.py triplets_all data_all

# Create the entity_labels.parquet files
uv run scripts/wikidata_parquet/create_entity_labels_data.py triplets_50m data_50m
uv run scripts/wikidata_parquet/create_entity_labels_data.py triplets_1000m data_1000m
uv run scripts/wikidata_parquet/create_entity_labels_data.py triplets_all data_all
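If you are curious what the conversion step does conceptually: each N-Triples line is split into subject, predicate, and object, and the rows are written out as Parquet. The following is only a rough sketch under that assumption, not the repo's parse_all_triples.py (the real script handles escaping, streams the full dump, and writes many Parquet files):
uv run --with pyarrow python << 'EOF'
# Rough sketch of the N-Triples -> Parquet idea (NOT the repo's parse_all_triples.py).
import bz2

import pyarrow as pa
import pyarrow.parquet as pq

def clean(term: str) -> str:
    # URIs are wrapped in <...>; literals are kept as-is.
    return term[1:-1] if term.startswith("<") and term.endswith(">") else term

rows = {"subject": [], "predicate": [], "object": []}
with bz2.open("latest-truthy.nt.bz2", "rt", encoding="utf-8") as f:
    for i, line in enumerate(f):
        if i >= 1_000_000:  # only a small sample for this sketch
            break
        # Split on the first two spaces so literals containing spaces stay intact.
        s, p, rest = line.split(" ", 2)
        o = rest.rstrip().removesuffix(" .")
        rows["subject"].append(clean(s))
        rows["predicate"].append(clean(p))
        rows["object"].append(clean(o))

pq.write_table(pa.table(rows), "triplets_sample.parquet")
EOF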
Run data extraction pipelines
uv run scripts/music/spotify_artist/create_spotify_artist_data.py triplets_all data_all
uv run scripts/music/spotify_album/create_spotify_album_data.py triplets_all data_all
uv run scripts/music/spotify_track/create_spotify_track_data.py triplets_all data_all
uv run scripts/video/youtube_channel/create_youtube_channel_data.py triplets_all data_all
uv run scripts/video/letterboxd_film/create_letterboxd_film_data.py triplets_all data_all
uv run scripts/video/youtube_video/create_youtube_video_data.py triplets_all data_all
uv run scripts/social/bluesky/create_bluesky_data.py triplets_all data_all
uv run scripts/social/subreddit/create_subreddit_data.py triplets_all data_all
uv run scripts/social/patreon/create_patreon_data.py triplets_all data_all
uv run scripts/other/github/create_github_data.py triplets_all data_all
uv run scripts/other/website/create_website_data.py triplets_all data_all
uv run scripts/other/non_profit_organization/create_non_profit_organization_data.py triplets_all data_all
uv run scripts/book/goodread_book/create_goodread_book_data.py triplets_all data_all
uv run scripts/book/gutenberg_book/create_gutenberg_book_data.py triplets_all data_all
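At its core, each of these pipelines filters the triplets for one external-ID property and writes the result as a small topic Parquet file, plus joins for labels and extra properties. The following is a minimal sketch of that core step, reusing the GitHub username property (P2037) from the usage example above; the `subject` column name and output filename are assumptions, not what the actual github script produces:
uv run --with duckdb python << 'EOF'
import duckdb

# Core idea of an extraction pipeline (sketch, not the repo's scripts):
# keep only the triples of one external-ID property and write a topic Parquet file.
duckdb.sql("""
    COPY (
        SELECT subject AS entity, object AS github_username
        FROM 'triplets_all/*.parquet'
        WHERE predicate = 'http://www.wikidata.org/prop/direct/P2037'
    ) TO 'github_usernames.parquet' (FORMAT parquet)
""")
EOF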
Create a new data extraction
The instructions at skills/add-data-pipeline.md are a good starting point for adding new data pipelines.
They also work well as a starting point for LLM agents like Claude Code.
An example prompt looks like this:
Read skills/add-data-pipeline.md and create a new topic called "letterboxd_film". It should contain all entities with a Letterboxd film ID. The topic should be saved in "video".
Ideas
A collection of ideas I might do in the future or that others can try to do:
- Create a website with example queries to check which data is missing (e.g. do all non-profit organizations have a donation link?). It could return a list of entities sorted by number of properties as a popularity metric (see the sketch after this list).
- Use `latest-all.nt.bz2` instead of `latest-truthy.nt.bz2` to get qualifiers, references, deprecated statements, and rank information (3-4x larger, but provides the full RDF structure, which would for example allow extracting social media follower counts).
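As a rough sketch of the first idea, the underlying query is an anti-join over the triplets. P31 (instance of) and Q163740 (nonprofit organization) are the intended Wikidata IDs, but the donation-link property below is only a placeholder, and the `subject` column name is an assumption about the Parquet schema:
uv run --with duckdb python << 'EOF'
import duckdb

# Sketch: non-profit organizations without a donation link.
# P31 = instance of, Q163740 = nonprofit organization (double-check the IDs on Wikidata);
# the donation-link property is a placeholder, and `subject` is an assumed column name.
DONATION_PROP = "http://www.wikidata.org/prop/direct/Pxxxx"  # placeholder, look up the real ID

print(duckdb.sql(f"""
    SELECT subject AS entity
    FROM 'triplets_all/*.parquet'
    WHERE predicate = 'http://www.wikidata.org/prop/direct/P31'
      AND object = 'http://www.wikidata.org/entity/Q163740'
      AND subject NOT IN (
          SELECT subject
          FROM 'triplets_all/*.parquet'
          WHERE predicate = '{DONATION_PROP}'
      )
    LIMIT 10
""").df())
EOF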
Ideas for more Datasets
- Use the Billboard dataset to further enrich the artist/album/song data: https://www.kaggle.com/datasets/ludmin/billboard
- Enrich the artist/album/song data using the MusicBrainz dataset or other datasets like Last.fm.
Contributing
Contributions are welcome. Open an issue if you want to report a bug, have an idea for another data pipeline or want to propose a change.
License
All code in this project is licensed under the MIT License.
Wikidata content is available under CC0 1.0 Universal.