Building What Michelin Wouldn’t: Its Awards History


A few years back, I decided to gather Michelin restaurant data for fun. Over the years, something unexpected happened. My DMs and inbox started flooding with requests I kept having to turn down. Email after email asking the same thing: "Hey, I love your project! Do you have historical Michelin data? Can you tell me when Restaurant Z lost its star?"

Peek into my Gmail

My response was always the same awkward deflection: "Um, no... no way. You can try to get it from Wayback Machine or the old Michelin Guide books from Internet Archive". I must have copy-pasted some variation of that reply dozens of times over the years.

But here's the thing — every time I hit reply on one of those emails, it bugged me a little more. The more I thought about it, the more I realized I was sitting on an interesting problem that apparently no one else was solving.

I started looking into it

I spent way too much time Googling, scouring Reddit threads, Wikipedia, and food blogs. I was looking (and hoping) for someone, anyone, who was systematically tracking Michelin star changes over time.

The results were... not great. Even Michelin themselves don't have this old data published. Sure, you can find scattered blog posts about individual restaurants, but comprehensive historical tracking? Nothing.

Hey, I thought, why not? I’ve got a bit of free time before I start my new job, so I decided to go for it. Honestly, I was really curious to see how restaurants earn and lose their stars over the years.

Using git-history (...and failing at it)

Getting historical Michelin data turned out to be way more complex than I initially thought. My first idea was straightforward: I had been collecting data and committing it to GitHub since July 2022, so I figured I could just use git-history — a tool that reads through the entire commit history of a file and generates an SQLite database reflecting changes over time.

However, it wasn't as simple as it seemed. When I tried it out, things didn't go as smoothly as I hoped. Over the years, some snapshots of my own generated data had faulty entries — restaurants with missing star counts, pricing data that said null for months, and addresses that were just empty strings. I had to manually exclude commits from the history, which made the entire process even more complicated than it needed to be.
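A rough sketch of that validation step in Python — filtering out the faulty rows before feeding a snapshot into the history. The column names here are guesses for illustration, not the real schema:

```python
# Hypothetical sketch: reject rows with missing star counts,
# "null" pricing, or empty addresses before processing a snapshot.
# Column names are invented; the real CSV schema differs.
import csv
import io

def is_valid_row(row: dict) -> bool:
    """Return False for rows with the faults described above."""
    if not row.get("stars", "").strip():
        return False
    if row.get("price", "").strip().lower() in ("", "null"):
        return False
    if not row.get("address", "").strip():
        return False
    return True

snapshot = io.StringIO(
    "name,stars,price,address\n"
    "Good Place,3,$$$$,1 Main St\n"
    "Broken Place,,null,\n"
)
rows = [r for r in csv.DictReader(snapshot) if is_valid_row(r)]
```

In practice, filtering rows is easier than excluding whole commits, but it only helps if you control the pipeline end to end — which is part of why git-history wasn't a good fit here.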

One night, after the tool crashed for the fifth time, I realized I was fighting a losing battle. I did spend days trying to patch and fix the tool for my own use, but it was too much work, and I decided this wasn't worth my time.

Trying to fix things for my own use

Wayback Machine to the Rescue!

Clearly, I needed a completely different approach. So I pivoted to Plan B: use the Wayback Machine API to get snapshots of restaurant pages, then visit each available snapshot to extract the data. Sounds simple, right?
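Listing snapshots is the easy half: the Wayback Machine exposes a CDX API for exactly this. A minimal sketch of building such a query (the restaurant URL is just an example):

```python
# Sketch of querying the Wayback Machine CDX API for snapshots
# of a page. The endpoint and parameters are the public CDX API;
# the restaurant URL is a made-up example.
from urllib.parse import urlencode

CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx"

def cdx_query_url(page_url: str, year_from: int, year_to: int) -> str:
    """Build a CDX query returning at most one capture per month."""
    params = {
        "url": page_url,
        "from": str(year_from),
        "to": str(year_to),
        "output": "json",
        "collapse": "timestamp:6",   # dedupe to one capture per YYYYMM
        "filter": "statuscode:200",  # skip redirects and errors
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

query = cdx_query_url(
    "guide.michelin.com/en/restaurant/some-restaurant", 2019, 2025
)
```

Each result row gives a timestamp you can splice into a `web.archive.org/web/<timestamp>/<url>` fetch for the actual HTML.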

Wrong.

The ever-changing website layout problem

The first problem hit me immediately — Michelin's website structure changes almost every year. My existing HTML parsing code that worked perfectly? Completely useless.

Each year's snapshot had different CSS/XPath selectors, different div structures. Different everything. What started as a simple data extraction job turned into building a time-traveling HTML parser.
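The "time-traveling parser" boils down to trying year-specific extractors in order until one succeeds. A minimal sketch of the idea — the patterns below are invented for illustration; the real per-year selectors differ:

```python
# Sketch: one extractor per site layout, tried newest-first.
# The regexes here are made up; real snapshots need real selectors.
import re
from typing import Callable, Optional

def stars_2020_layout(html: str) -> Optional[int]:
    m = re.search(r'class="award-stars">(\d) Star', html)
    return int(m.group(1)) if m else None

def stars_2023_layout(html: str) -> Optional[int]:
    m = re.search(r'data-stars="(\d)"', html)
    return int(m.group(1)) if m else None

EXTRACTORS: list[Callable[[str], Optional[int]]] = [
    stars_2023_layout,
    stars_2020_layout,
]

def extract_stars(html: str) -> Optional[int]:
    """Return the first successful extraction, or None."""
    for extract in EXTRACTORS:
        stars = extract(html)
        if stars is not None:
            return stars
    return None
```

The nice property is that each new layout is one more function in the list, rather than another branch threaded through a single monolithic parser.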

Missing publication dates

Left: snapshot from 2020; Right: snapshot from 2025

Here's where things get a bit tricky. Take a look at the 2020 snapshot (the picture on the left) — it clearly says "MICHELIN GUIDE 2020," so you know the award is for that year. You'd think every year's guide would include that info, right? But surprisingly, after 2021, the Michelin Guide website stopped mentioning the year on the restaurant pages altogether, and honestly, I’m not sure why they made that change.

Now you're probably thinking, "Oh, simple! If the snapshot is from 2025, then the 3-star award must be for 2025."

Nope.

For example, this snapshot was taken on February 7, 2025; the Michelin Guide for the year hadn’t even been released yet (it was only announced on February 10, 2025).

Luckily, when I inspected the HTML behind the page, I found the publication date tucked away deep inside one of the script tags (though the older snapshots do not have this):

See the webpage source for the 2025 snapshot for yourself

So, even if the page looks like it’s from 2025, the 3-star award might actually be for 2024! The website just… assumes you know.

The final extraction logic looks something like:

```mermaid
flowchart TD
    A[Start: HTML Snapshot] --> B{JSON-LD Script Present?}
    B -->|Yes| C[Parse JSON-LD for review.datePublished]
    C -->|Found| F[Return Published Date]
    C -->|Not Found| D
    B -->|No| D["Try another text selector"]
    D --> E{Found any regex match for date?}
    E -->|Yes| F
    E -->|No| D
    E -->|Still No| G["Give up"]
```
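In code, that chain looks roughly like this — first look for `review.datePublished` inside a JSON-LD script tag, then fall back to a regex over the page text. The JSON structure is an assumption based on the snapshot source, not a guaranteed schema:

```python
# Sketch of the publication-date fallback chain. The JSON-LD shape
# ({"review": {"datePublished": ...}}) is assumed from inspecting
# snapshots; older pages won't have it, hence the regex fallback.
import json
import re
from typing import Optional

JSONLD_RE = re.compile(
    r'<script type="application/ld\+json">(.*?)</script>', re.DOTALL
)

def published_date(html: str) -> Optional[str]:
    # Preferred path: JSON-LD metadata
    for block in JSONLD_RE.findall(html):
        try:
            data = json.loads(block)
        except json.JSONDecodeError:
            continue
        if isinstance(data, dict):
            date = (data.get("review") or {}).get("datePublished")
            if date:
                return date
    # Fallback: any ISO-like date in the page text
    m = re.search(r"\b(\d{4}-\d{2}-\d{2})\b", html)
    return m.group(1) if m else None
```

When both paths fail, the snapshot gets skipped rather than guessed at — a wrong year is worse than a missing one.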

Pricing data nightmare

The pricing data was also all over the place.

Back in 2019, a restaurant might just list prices like "125 - 280 USD" in plain text:

Example snapshot from 2019

By 2023, the same range was represented with a bunch of "$$$$" symbols and totally different HTML markup:

Example snapshot from 2023

I spent hours trying to clean and normalize all those prices across different years, only to realize that even Michelin's own categories kept changing and weren’t consistent. In the end, I decided just to store everything as a string and leave the prices as they were:
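One small compromise that still helps downstream analysis: keep the raw string untouched, but tag which format it looks like. A hedged sketch (the formats are the two shown above; the function name is mine):

```python
# Sketch: classify a scraped price string without altering it, so
# queries can filter by format. Covers the two formats shown above;
# anything else falls through to "other".
import re

def price_format(raw: str) -> str:
    """Return "symbols", "range", or "other" for a raw price string."""
    if re.fullmatch(r"[$€£¥₩]+", raw.strip()):
        return "symbols"   # e.g. "$$$$" (2023-style pages)
    if re.search(r"\d+\s*-\s*\d+\s*[A-Z]{3}", raw):
        return "range"     # e.g. "125 - 280 USD" (2019-style pages)
    return "other"
```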

A screenshot of the SQLite table (browsed using TablePlus)

The Infrastructure

The last step of this entire project was to make sure it wouldn't break existing users who depend on the generated CSV (which my Kaggle dataset links to directly). That means the existing GitHub workflow that generates and publishes the latest dataset had to keep working untouched.

Secondly, the architecture had to be deliberately simple. The entire system must be easy to maintain so that I won't hate myself after 6 months of running it.

Basically, here’s what the final thing looks like:

```mermaid
flowchart TD
    %% Data Sources
    M[guide.michelin.com] -.->|Scrapes restaurant data| A
    W[Wayback Machine] -.->|Historical data| A

    %% Main Pipeline
    A[Scraper] -->|Uploads michelin.db| B[MinIO]
    B -->|michelin.db| C[Datasette]
    A -.->|Triggers redeploy via API| C
    C -->|HTTP| D[Users]

    %% Infrastructure
    subgraph Railway
        A
        B
        C
    end
    subgraph "Data Sources"
        M
        W
    end
    subgraph GitHub
        E[Source Control]
    end
    A -->|Publishes CSV| E
```

Backfilling marathon

First of all, I let the backfilling process run for about 3 days. Once it was done, I just uploaded the SQLite database directly to my MinIO instance via the console.

Oh, before that, I had to refactor my existing scraper to accommodate the new database schema, so that new historical entries are always appended to the existing restaurants instead of overwriting them.
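The append-only idea can be sketched as one row per (restaurant URL, guide year), with re-scrapes upserting rather than duplicating. The table and column names here are illustrative assumptions, not the real schema:

```python
# Hypothetical sketch of an append-only awards table keyed on
# (url, year): re-scraping a year updates that row in place,
# while new years accumulate as history.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE awards (
        url   TEXT NOT NULL,     -- stable Michelin Guide URL
        year  INTEGER NOT NULL,  -- guide year the award applies to
        stars INTEGER,
        PRIMARY KEY (url, year)
    )
""")

upsert = """
    INSERT INTO awards (url, year, stars) VALUES (?, ?, ?)
    ON CONFLICT(url, year) DO UPDATE SET stars = excluded.stars
"""
conn.execute(upsert, ("guide.michelin.com/x", 2024, 2))
conn.execute(upsert, ("guide.michelin.com/x", 2025, 3))

rows = conn.execute(
    "SELECT year, stars FROM awards ORDER BY year"
).fetchall()
```

The composite primary key is what makes the monthly scrapes idempotent: running the same month twice can never create a duplicate history entry.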

Serving the data

Once everything was done, I deployed it all on Railway.app.

Now, I can browse my SQLite database through a Datasette instance, which serves the data directly via a web interface.

And of course, like I said, I make sure to keep the freshest Michelin data updated and available as CSV on GitHub as part of the flow.

The Result

Now that I have all this historical data, it's time for some fun stuff. For example, I could quickly look up the restaurants in Spain with the longest 3-star streaks since 2019:

Well, it’s definitely not the full story. Michelin already started including Spain in their guides back in 1910, so there’s that
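A simplified version of that query, assuming an awards table with `url`/`year`/`stars` columns (this counts 3-star years since 2019; detecting strictly consecutive streaks would need a gaps-and-islands query on top):

```python
# Sketch: count 3-star years per restaurant since 2019.
# Schema and data are invented for illustration.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE awards (url TEXT, year INTEGER, stars INTEGER)")
conn.executemany(
    "INSERT INTO awards VALUES (?, ?, ?)",
    [("restaurant-a", y, 3) for y in range(2019, 2026)]
    + [("restaurant-b", y, 3) for y in range(2022, 2026)],
)

query = """
    SELECT url, COUNT(*) AS three_star_years
    FROM awards
    WHERE stars = 3 AND year >= 2019
    GROUP BY url
    ORDER BY three_star_years DESC
"""
leaders = conn.execute(query).fetchall()
```

With Datasette in front of the database, this is exactly the kind of query anyone can now run from the browser.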

The rise and fall

Take GästeHaus Klaus Erfort: it used to have 3 stars back in 2020, but has consistently stayed at 2 stars since 2021 (which is pretty impressive nevertheless):

Green stars follow the money

Another cool finding: Green Stars (for sustainability) are way more common in higher price brackets (“€€€€”), especially in Europe:

Makes sense when you think about it - sustainable sourcing and practices cost money, and higher-end restaurants have the margins to support it

On the flip side, Bib Gourmand (good value) is concentrated in “$$” and “€€” (mid-price brackets), rarely in the highest. Which is literally the point of the award, but it's satisfying to see the data confirm what we'd expect.

Top cuisines among the restaurants

In terms of cuisine, starred restaurants are mostly dominated by Creative and Japanese cuisine:

Was anyone expecting differently?

Clearly, there is much more to be discovered, but I think these will do for now.

Flaws

Finally, I must say that the project isn’t perfect. The Wayback Machine only has snapshots starting from 2019, and even then, not every restaurant and year is fully archived.

Secondly, closed restaurants, like Julemont in Wittem, are missing from the current database. It was a two-star place in June 2024, but after closing, it had disappeared from the guide by February 2025. It’d be great to keep their info in the records, so we have a more complete history.

💬

Update: the second flaw is addressed by PR#125

Lastly, what we have here doesn’t consider if a restaurant changes its name, switches owners, or even moves to a different location over the years. This is particularly tricky to track, especially if the Michelin guide URL changes for the restaurant.

Rest assured that the award history will still be accurate as long as the Michelin guide URL stays consistent over the years!

Accepting imperfection

At some point, I had to make a choice: spend more weeks trying to achieve perfect historical coverage, or accept that 2019 onwards is good enough.

I went with the latter.

Sure, I'm missing some earlier data, but 2019 gives me a solid base. More importantly, from now on, I have the infrastructure to capture proper historical data for the future. Every month, I'm building tomorrow's historical dataset.

Having said that, I'll probably work on addressing some of these issues in the near future (fingers crossed).

🙏

Thanks so much to everyone who emailed me and took the time to share your feedback so quickly — it truly means a lot to me!

What's Next

Now that I have all this historical data flowing, the real fun is just beginning.

I really hope to keep this project going for a while. Fingers crossed it won’t end up costing me too much. Who knows, maybe in 10 or 20 years, we’ll look back at what we’ve done with even more data — now that would be pretty fun!