MusicBrainz Canonical Metadata – MetaBrainz Blog

The MusicBrainz project is proud to announce the release of our latest dataset: MusicBrainz Canonical Metadata. This geeky-sounding dataset packs an intense punch! It solves a number of problems around matching a piece of music metadata to the correct entry in the massive MusicBrainz database.

The MusicBrainz database aims to collect metadata for all releases (albums) that have ever been published. For popular albums, there can be many different releases, which raises the question “which one is the main (canonical) release?”. If you want to identify a piece of metadata, and you only have an artist and recording (track) name, how do you choose the correct database release?

The same problem exists at the recording level: a recording (song) often appears on many releases, so which one should be used?

The MusicBrainz Canonical Metadata dataset solves this problem by allowing users to look up canonical releases and canonical recordings. Given any release MBID, the Canonical Release Mapping (canonical_release_redirect.csv) lets you find the release that we consider “canonical”. The same is true for recordings: given any recording MBID, the Canonical Recording Mapping (canonical_recording_redirect.csv) lets you find the corresponding canonical recording MBID.
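To make the redirect idea concrete, here is a minimal Python sketch of such a lookup. It assumes the mapping file has a header row with a source column and a canonical column; the column names release_mbid and canonical_release_mbid, the helper functions, and the placeholder MBIDs are illustrative assumptions, so check the dataset documentation for the real layout. It also assumes that an MBID with no redirect row can be treated as already canonical.

```python
import csv
import io


def load_redirects(csv_file, source_col, target_col):
    """Build a dict mapping each source MBID to its canonical MBID."""
    reader = csv.DictReader(csv_file)
    return {row[source_col]: row[target_col] for row in reader}


def canonical_mbid(redirects, mbid):
    # Assumption: an MBID without a redirect row is already canonical.
    return redirects.get(mbid, mbid)


# Tiny in-memory stand-in for canonical_release_redirect.csv.
# Column names and MBIDs are placeholders, not real data.
sample = io.StringIO(
    "release_mbid,canonical_release_mbid\n"
    "aaaa-1111,cccc-3333\n"
    "bbbb-2222,cccc-3333\n"
)

redirects = load_redirects(sample, "release_mbid", "canonical_release_mbid")
print(canonical_mbid(redirects, "aaaa-1111"))  # redirected to the canonical release
print(canonical_mbid(redirects, "cccc-3333"))  # no redirect row, returned unchanged
```

The same two helpers would work for canonical_recording_redirect.csv by passing the recording column names instead. For the full files, open the CSV with `open(path, newline="", encoding="utf-8")` rather than an in-memory string.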

These datasets solve some tricky problems for our data users, but the last table gets really interesting: Canonical Metadata (canonical_musicbrainz_data.csv). This table contains all the string metadata necessary to make effective use of the two datasets above. Artist names, release names and recording names are all present in this table, indexed against artist_credit_id, release_mbid and recording_mbid.

The real power of this table comes from its ability to provide an easy and compact way to identify your own music files, or to match/correct music metadata. Looking up your music tracks on MusicBrainz can be a challenge, since you need to understand the intricate schema underneath. Not so with this dataset – import the dataset into your favorite datastore and start looking up tracks to clean up. (That is, if our Picard tagging application isn’t sufficient for your needs!)
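As a rough illustration of "import the dataset into your favorite datastore", the sketch below loads a few rows into an in-memory SQLite database and looks up a track by artist and recording name. The columns artist_credit_id, release_mbid and recording_mbid come from the description above; the name columns (artist_credit_name, release_name, recording_name), the table name, the helper functions and the sample row are assumptions made for this example, not the documented schema.

```python
import csv
import io
import sqlite3

# Assumed subset of columns in canonical_musicbrainz_data.csv.
COLUMNS = [
    "artist_credit_id", "artist_credit_name", "release_mbid",
    "release_name", "recording_mbid", "recording_name",
]


def import_metadata(conn, csv_file):
    """Load the metadata CSV into a SQLite table and index it for name lookups."""
    conn.execute(
        "CREATE TABLE canonical_data ("
        "artist_credit_id INTEGER, artist_credit_name TEXT, release_mbid TEXT, "
        "release_name TEXT, recording_mbid TEXT, recording_name TEXT)"
    )
    reader = csv.DictReader(csv_file)
    rows = ([row[col] for col in COLUMNS] for row in reader)
    conn.executemany("INSERT INTO canonical_data VALUES (?, ?, ?, ?, ?, ?)", rows)
    # Case-insensitive index so lookups by name do not scan the whole table.
    conn.execute(
        "CREATE INDEX idx_name_lookup ON canonical_data "
        "(artist_credit_name COLLATE NOCASE, recording_name COLLATE NOCASE)"
    )
    conn.commit()


def lookup(conn, artist, recording):
    """Return (release_mbid, release_name, recording_mbid) rows for a track."""
    cur = conn.execute(
        "SELECT release_mbid, release_name, recording_mbid FROM canonical_data "
        "WHERE artist_credit_name = ? COLLATE NOCASE "
        "AND recording_name = ? COLLATE NOCASE",
        (artist, recording),
    )
    return cur.fetchall()


# Placeholder row standing in for the real CSV contents.
sample = io.StringIO(
    "artist_credit_id,artist_credit_name,release_mbid,release_name,"
    "recording_mbid,recording_name\n"
    "1,Example Artist,rel-1,Example Album,rec-1,Example Track\n"
)

conn = sqlite3.connect(":memory:")
import_metadata(conn, sample)
matches = lookup(conn, "example artist", "EXAMPLE TRACK")
print(matches)
```

An exact, case-insensitive match like this is only a starting point; real-world tags usually need some normalization (punctuation, featured artists, whitespace) before they match.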

In a follow-up blog post, I will walk you through the process of using this dataset to identify music tracks with Python. Update: Learn how to build your own music tagger in this post!

Find out more about this dataset on the canonical dataset documentation page, and browse all of our datasets from our new datasets documentation page.