
Wikidata Extraction

A toolkit for extracting and querying structured data from Wikidata. It converts Wikidata's N-Triples dump into Parquet files for efficient querying with DuckDB, and provides data extraction pipelines for specific entity types (Spotify artists, YouTube channels, Letterboxd films, etc.).

The triplet Parquet dataset and all extracted datasets are available at huggingface.co/datasets/piebro/wikidata-extraction.

There is also a website at piebro.github.io/wikidata-extraction/random.html that shows random entities by using DuckDB-WASM to query the .parquet files on Hugging Face. It can be a lot of fun to just explore random artists, YouTube channels, or other stuff.

Usage

Install uv and run the following command to download the dataset from huggingface.co/datasets/piebro/wikidata-extraction.

uv run --with huggingface-hub hf download piebro/wikidata-extraction --repo-type=dataset --local-dir=.
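
If you prefer the Python API over the hf CLI, here is a rough equivalent using huggingface_hub. The allow_patterns filter is only an example (it assumes the repo stores the triplet files under triplets_all/); drop it to download everything, including the extracted topic datasets.

uv run --with huggingface-hub python << 'EOF'
from huggingface_hub import snapshot_download

# Download only the triplet parquet files; remove allow_patterns to mirror the
# CLI command above and fetch the whole dataset.
snapshot_download(
    repo_id="piebro/wikidata-extraction",
    repo_type="dataset",
    local_dir=".",
    allow_patterns=["triplets_all/*"],
)
EOF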

You can then query the data, for example with DuckDB from Python to get all GitHub usernames stored in Wikidata:

uv run --with duckdb --with pandas python << 'EOF'
import duckdb
print(duckdb.sql("""
    SELECT object AS github_username
    FROM 'triplets_all/*.parquet'
    WHERE predicate = 'http://www.wikidata.org/prop/direct/P2037'
    LIMIT 10
""").df())
EOF
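
Two follow-up queries in the same style: inspect the schema with DESCRIBE, and pull the QID out of the entity URI so the result is easier to read. This is a sketch; it assumes the files have a subject column next to predicate and object, so check the DESCRIBE output if the names differ.

uv run --with duckdb python << 'EOF'
import duckdb

# Show column names and types before writing bigger queries.
duckdb.sql("DESCRIBE SELECT * FROM 'triplets_all/*.parquet'").show()

# Extract the QID from the entity URI (assumed to live in a `subject` column).
duckdb.sql("""
    SELECT regexp_extract(subject, 'Q[0-9]+$') AS qid,
           object AS github_username
    FROM 'triplets_all/*.parquet'
    WHERE predicate = 'http://www.wikidata.org/prop/direct/P2037'
    LIMIT 10
""").show()
EOF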

Look at AGENTS.md for how to use the dataset and some common code patterns.
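
You can also skip the download entirely and query the files straight from Hugging Face. A rough sketch, assuming your DuckDB version supports hf:// paths via the httpfs extension and that the dataset repo uses the same triplets_all/*.parquet layout as the local copy:

uv run --with duckdb python << 'EOF'
import duckdb

con = duckdb.connect()
con.execute("INSTALL httpfs")  # needed for remote hf:// reads
con.execute("LOAD httpfs")

# count(*) is answered from parquet metadata, so this stays cheap.
con.sql("""
    SELECT count(*) AS total_triples
    FROM 'hf://datasets/piebro/wikidata-extraction/triplets_all/*.parquet'
""").show()

# Peek at a few rows to see what the data looks like.
con.sql("""
    SELECT *
    FROM 'hf://datasets/piebro/wikidata-extraction/triplets_all/*.parquet'
    LIMIT 5
""").show()
EOF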

Development Setup

Install uv and git and run the following commands to download the code and set everything up.

git clone https://github.com/piebro/wikidata-extraction.git
cd wikidata-extraction
uv sync --python 3.13
uv run pre-commit install

Create Wikidata parquet files

# Download the latest truthy N-Triples dump (~8 billion triples, ~40 GB compressed)
wget https://dumps.wikimedia.org/wikidatawiki/entities/latest-truthy.nt.bz2

# Create triplet parquet files
uv run scripts/wikidata_parquet/parse_all_triples.py latest-truthy.nt.bz2 triplets_50M --num-of-lines 50_000_000 # this should take just a few seconds
uv run scripts/wikidata_parquet/parse_all_triples.py latest-truthy.nt.bz2 triplets_1000M --num-of-lines 1_000_000_000
uv run scripts/wikidata_parquet/parse_all_triples.py latest-truthy.nt.bz2 triplets_all
# Creating triplets_all (~60 GB) on my laptop (Ubuntu 24.04, Intel Core Ultra 7 155U × 14) takes ~5.5 hours and uses less than 4 GB of RAM

# Create the predicate_labels.parquet files
uv run scripts/wikidata_parquet/create_predicate_labels_data.py triplets_50M data_50m
uv run scripts/wikidata_parquet/create_predicate_labels_data.py triplets_1000M data_1000m
uv run scripts/wikidata_parquet/create_predicate_labels_data.py triplets_all data_all

# Create the entity_labels.parquet files
uv run scripts/wikidata_parquet/create_entity_labels_data.py triplets_50M data_50m
uv run scripts/wikidata_parquet/create_entity_labels_data.py triplets_1000M data_1000m
uv run scripts/wikidata_parquet/create_entity_labels_data.py triplets_all data_all
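
A quick sanity check after building a triplet directory, here against the small 50M sample (a sketch; it assumes the subject/predicate/object columns shown in the Usage example above):

uv run --with duckdb python << 'EOF'
import duckdb

# count(*) is answered from parquet metadata, so this is cheap even for triplets_all.
duckdb.sql("SELECT count(*) AS triples FROM 'triplets_50M/*.parquet'").show()

# Most frequent predicates in the sample, as a rough plausibility check.
duckdb.sql("""
    SELECT predicate, count(*) AS n
    FROM 'triplets_50M/*.parquet'
    GROUP BY predicate
    ORDER BY n DESC
    LIMIT 10
""").show()
EOF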

Run data extraction pipelines

uv run scripts/music/spotify_artist/create_spotify_artist_data.py triplets_all data_all
uv run scripts/music/spotify_album/create_spotify_album_data.py triplets_all data_all
uv run scripts/music/spotify_track/create_spotify_track_data.py triplets_all data_all
uv run scripts/video/youtube_channel/create_youtube_channel_data.py triplets_all data_all
uv run scripts/video/letterboxd_film/create_letterboxd_film_data.py triplets_all data_all
uv run scripts/video/youtube_video/create_youtube_video_data.py triplets_all data_all
uv run scripts/social/bluesky/create_bluesky_data.py triplets_all data_all
uv run scripts/social/subreddit/create_subreddit_data.py triplets_all data_all
uv run scripts/social/patreon/create_patreon_data.py triplets_all data_all
uv run scripts/other/github/create_github_data.py triplets_all data_all
uv run scripts/other/website/create_website_data.py triplets_all data_all
uv run scripts/other/non_profit_organization/create_non_profit_organization_data.py triplets_all data_all
uv run scripts/book/goodread_book/create_goodread_book_data.py triplets_all data_all
uv run scripts/book/gutenberg_book/create_gutenberg_book_data.py triplets_all data_all
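
All of these pipelines follow the same rough pattern: filter the triples for one identifier property, attach labels, and write the result to a parquet file. Below is a sketch of that pattern, not the actual scripts; it assumes a subject column and that English labels appear as rdfs:label triples whose object still carries the @en language tag, so check the real scripts and AGENTS.md for the exact logic. Running it against triplets_50M instead of triplets_all is a much faster way to try it out.

uv run --with duckdb python << 'EOF'
import duckdb

# Sketch of the general extraction pattern (not the actual script):
# 1. collect all entities with a GitHub username (P2037)
# 2. attach the English label via rdfs:label (assumes the object column keeps
#    the raw N-Triples literal, e.g. "Linus Torvalds"@en)
# 3. write the result to a parquet file
duckdb.sql("""
    COPY (
        SELECT
            gh.subject AS entity,
            gh.object  AS github_username,
            lbl.object AS label
        FROM 'triplets_all/*.parquet' AS gh
        LEFT JOIN 'triplets_all/*.parquet' AS lbl
               ON lbl.subject = gh.subject
              AND lbl.predicate = 'http://www.w3.org/2000/01/rdf-schema#label'
              AND lbl.object LIKE '%@en'
        WHERE gh.predicate = 'http://www.wikidata.org/prop/direct/P2037'
    ) TO 'github_sketch.parquet' (FORMAT parquet)
""")
EOF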

Create a new data extraction

There are instructions at skills/add-data-pipeline.md that are a good starting point for adding new data pipelines. They also work well as a starting point for LLM agents like Claude Code. An example prompt looks like this:

Read skills/add-data-pipeline.md and create a new topic called "letterboxd_film". It should contain all entities with a Letterboxd film ID. The topic should be saved in "video".

Ideas

A collection of ideas I might work on in the future, or that others can try:

  • Create a website with example queries that shows which data is missing (e.g. do all non-profit organizations have a donation link?). The result could be a list sorted by number of properties as a rough popularity metric (see the query sketch after this list).
  • Use latest-all.nt.bz2 instead of latest-truthy.nt.bz2 to get qualifiers, references, deprecated statements, and rank information (3-4x larger, but the full RDF structure would for example make social media follower counts available).
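
For the first idea, the core query is an anti-join: entities of a given class that are missing a given property. A minimal sketch, with the class QID and the checked property left as placeholders:

uv run --with duckdb python << 'EOF'
import duckdb

# Entities that are an instance (P31) of some class but lack a property.
# Q000000 and P000000 are placeholders; replace them with the class you care
# about (e.g. non-profit organization) and the property to check (e.g. a donation URL).
duckdb.sql("""
    SELECT inst.subject AS entity
    FROM 'triplets_all/*.parquet' AS inst
    WHERE inst.predicate = 'http://www.wikidata.org/prop/direct/P31'
      AND inst.object = 'http://www.wikidata.org/entity/Q000000'
      AND NOT EXISTS (
          SELECT 1
          FROM 'triplets_all/*.parquet' AS p
          WHERE p.subject = inst.subject
            AND p.predicate = 'http://www.wikidata.org/prop/direct/P000000'
      )
    LIMIT 20
""").show()
EOF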

Ideas for more Datasets

Contributing

Contributions are welcome. Open an issue if you want to report a bug, have an idea for another data pipeline, or want to propose a change.

License

All code in this project is licensed under the MIT License.

Wikidata content is available under CC0 1.0 Universal.