"Solving" Wordle from Shared Scores at 100%* accuracy. – Marcos Huerta


The Premise: Solve Wordle in one guess.

Back in 2022, I was inspired by Ben Hamner’s Kaggle project in which he tried to find the answer to the daily Wordle based entirely on the shared scores on Twitter that look like this:

Wordle 1,575 5/6

🟩🟨⬜⬜🟨
🟩🟩🟩⬜⬜
⬜⬜⬜⬜⬜
🟨⬜⬜⬜🟩
🟩🟩🟩🟩🟩

This was before I had this Quarto-based web site, so after coming up with my own approach to this problem (I was nerd sniped again), I put some Jupyter notebooks in this repository that collected my efforts, code, etc. I was very confident in the superiority of my approach including my “100% accuracy” on the corpus of available Wordles at the time.

I wrote about various Wordle-related coding projects and ideas of mine and others in this 2022 post.

How did it work?

The basic premise is that any particular score line is more likely for some solutions than others, based on the guesses that can produce that pattern; some patterns are outright impossible for certain solutions. I used the same word frequency data as the Kaggle project and essentially kept a running sum of the frequencies of the words that could produce each found pattern (i.e., how common are the words that could make 🟨⬜🟩🟩⬜, etc.) for a given solution.
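As a sketch of that scoring idea (the function names and frequency numbers here are mine, not the repository's), assuming standard Wordle tile rules:

```python
from collections import Counter

def score_pattern(guess: str, solution: str) -> str:
    """Emoji pattern Wordle would show for this guess against this solution."""
    tiles = ["⬜"] * 5
    leftover = Counter()
    # First pass: greens consume their letters of the solution.
    for i, (g, s) in enumerate(zip(guess, solution)):
        if g == s:
            tiles[i] = "🟩"
        else:
            leftover[s] += 1
    # Second pass: yellows, capped by how many unmatched copies remain.
    for i, g in enumerate(guess):
        if tiles[i] == "⬜" and leftover[g] > 0:
            tiles[i] = "🟨"
            leftover[g] -= 1
    return "".join(tiles)

def frequency_score(candidate, observed_patterns, guess_freqs):
    """Sum the frequencies of guesses that could have produced the patterns
    seen in shared scores, assuming `candidate` is the solution."""
    return sum(freq for guess, freq in guess_freqs.items()
               if score_pattern(guess, candidate) in observed_patterns)
```

Candidates whose plausible guesses frequently produce the observed patterns accumulate a higher score, with common words contributing more than rare ones.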

I don’t just add; I also subtract. For example, only one word ends in uery, and that is query. It is impossible to make a guess and get ⬜🟩🟩🟩🟩 as a result if query is the solution. Therefore, if ⬜🟩🟩🟩🟩 occurred in a post, the code would subtract a penalty term for query as a candidate, and likewise for any other candidate with “impossible patterns.”

Rewriting the code.

There was nothing particularly important about Twitter for solving; for the code to work I just needed social media shared scores. With the advent of Bluesky and easy API access, I decided to try again to solve Wordles, now with shared posts from Bluesky. The full source code is here on GitHub.

I also wanted to modernize the code and maybe make it easier to turn the solutions into something I could visualize and share as some sort of blog post or web app. I started over, making the code faster, especially in how it iterates over the penalty term.

A significant change in using Bluesky is that I can get maybe 1,000 shared scores, not the >5,000 I was retrieving from Twitter. For older Wordles, I’m lucky to get a few hundred. This means that by default I’m not using any minimum count for the patterns: if 🟨⬜⬜⬜🟩 appears only once, it still goes into the analysis.

What I thought was going to happen.

I thought my system would work beautifully and I’d make some cool visualizations of how my code iterates and solves the puzzles. Here is an animation showing the code iterating over patterns and computing the frequency score - the sum of the frequencies of the words that make the patterns minus a penalty term that was found by iterating:

Code
import os

# BSWordle lives in the BlueSkyWordle repository directory
os.chdir('BlueSkyWordle')
from BSWordle import BlueskyWordle

bs = BlueskyWordle()

# Solve once, then again with build_snapshot=True to record each
# iteration of the pattern scoring for the animation
_ = bs.solve(1001)
bs.solve(1001, build_snapshot=True)
fig = bs.snapshot_plot()

The reality: my scoring system quickly failed.

Suffice it to say, with fewer posts and every pattern counted, the same approach I took in 2022 did not work 100% of the time. While the score usually produced the right result, there are ~23 (and counting) Wordles since Wordle 800 where my beloved scoring metric returns the incorrect result.

Let’s look at the Wordle with the solution buddy from some time ago. My scoring system thought the solution was paddy. Both had zero impossible patterns, so the penalty term did nothing. Wordle had defeated me.

I’m trying not to spoil any Wordles, today’s or the archive’s, by not mentioning the date or Wordle number in the body of this post. However, if you look at the code you may see some Wordle numbers, which presumably won’t be too spoilery because the NYT archive organizes by date, not number.

Code
bs = BlueskyWordle()
bs.solve(1107, resort_poor_score=False)
tbl = bs.best_df.head(2).select([
    'word', 'norm_score', 'impossible_pattern_count', 'fraction_found',
    'is_solution', 'kstatistic'
]).style.fmt_number(['norm_score', 'fraction_found', 'kstatistic'], decimals=3)
word   norm_score  impossible_pattern_count  fraction_found  is_solution  kstatistic
paddy  1.057       0                         0.459           0            0.468
buddy  1.051       0                         0.521           1            0.017

As it happens, this is something like a classification problem and it turns out more features are better than one. You can see some of the new features in the table.

The New Features

Fraction of possible patterns found.

What fraction of the candidate word’s total possible patterns were found in the social media shares? (Note: this was upwards of 90% in the Twitter data when I had 5,000 or more posts; now it’s usually in the 50-60% range.)
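In set terms, the metric is just an intersection ratio. A sketch with made-up patterns standing in for the real precomputed sets:

```python
# Hypothetical sets: `possible` is every pattern any allowed guess can make
# against this candidate; `observed` is what showed up in shared scores.
possible = {"🟩🟩🟩🟩🟩", "⬜🟨⬜⬜🟨", "⬜🟩⬜⬜🟩", "🟨⬜⬜⬜⬜"}
observed = {"🟩🟩🟩🟩🟩", "⬜🟨⬜⬜🟨", "🟨🟨⬜⬜⬜"}

# 2 of the candidate's 4 possible patterns were actually observed
fraction_found = len(possible & observed) / len(possible)  # 0.5
```

Note that observed patterns outside the possible set (like 🟨🟨⬜⬜⬜ here) don't lower this fraction; those are handled by the impossible-pattern count below.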

KS Statistic based on Opener Frequency

This is the two-sample Kolmogorov-Smirnov test statistic from SciPy. “The null hypothesis is that the two distributions are identical.” So, I’m comparing the prevalence of the score patterns in the first guess against the NYT’s data on what people actually guess first.

As you may have heard, certain openers are consistently popular (adieu, radio, crane, etc.). This data is shown on the NYT WordleBot and can be extracted to get the counts of each first guess. I generate the frequency of the score patterns each of these guesses would produce for a given candidate, then compare to the first-guess patterns I observe.

This tells me: “is this solution consistent with the patterns of the first guesses, knowing that certain first guesses are consistently popular?”
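A sketch of that comparison with scipy.stats.ks_2samp, using made-up counts and treating pattern ids as an ordered code (a simplification; the real code derives its expected counts from the WordleBot opener data):

```python
import numpy as np
from scipy.stats import ks_2samp

# Pattern ids 0..4 stand in for the five most common first-guess patterns
# a candidate solution could produce.
expected_counts = np.array([550, 200, 120, 80, 50])  # implied by popular openers
observed_counts = np.array([310, 120, 45, 30, 12])   # seen in shared first guesses

# ks_2samp takes raw samples, so expand the counts into per-post pattern ids.
expected_sample = np.repeat(np.arange(5), expected_counts)
observed_sample = np.repeat(np.arange(5), observed_counts)

stat, pvalue = ks_2samp(observed_sample, expected_sample)
# A small statistic means the shared first guesses look consistent with what
# the popular openers would produce for this candidate solution.
```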

One flaw is that I just extracted the opener frequencies for a single, relatively recent day; it may not work perfectly in the past, nor is it guaranteed to work in the future if opener patterns change. I should collect at least a few weeks of this data to get better statistics, since the percentage of ‘adieu’ openers seems to have a reasonable amount of daily variation.

Impossible Pattern Count

Certain patterns just can’t occur for particular solutions. For example, if the solution was purse, the pattern can never be 🟨⬛🟩🟨🟩. I use the allowable word list to figure out whether a pattern is possible, and I keep count of how many impossible patterns there are for each candidate as I iterate through all possible guesses and found patterns.
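For the query example, the check reduces to set membership. The possible set here is hypothetical and tiny; the real code builds it by scoring the whole allowed-guess list against the candidate:

```python
# Hypothetical: every pattern the allowed guess list can produce when the
# solution is "query" (the real set is much larger).
possible_for_query = {"🟩🟩🟩🟩🟩", "⬜🟨⬜⬜🟨", "⬜🟩⬜⬜🟩"}

# Patterns seen in shared scores; the first one can't happen for "query",
# since no other allowed word ends in "uery".
observed = ["⬜🟩🟩🟩🟩", "🟩🟩🟩🟩🟩", "⬜🟨⬜⬜🟨"]

impossible_count = sum(1 for p in observed if p not in possible_for_query)  # 1
```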

This should be zero for the actual solution, but in practice an errant score share sometimes shows an impossible pattern. Because I am not using any minimum count, I had to be more rigorous about filtering out random non-NYT Wordle shares (from wordle.at and other sites); that filtering is another change I made to get this code working. But sometimes a totally normal-looking share has an impossible pattern for the actual solution. Presumably someone playing an old version of the game?

Also, sometimes one or more incorrect solutions have an impossible pattern count of 0.

Combining the score and the new metrics

The current strategy is: use the normalized score, but if it isn’t a strong signal compared to the runner-up, rerank with the sum of the ranks of fraction_found, the KS statistic, and the original normalized score, plus (you’ll love this) the square of (impossible_pattern_count + 1). It will also rerank if the impossible pattern count of the top candidate is > 0. I don’t even remember which Wordle made me add that random and clearly-not-hacky condition.
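A minimal pandas sketch of that rerank, using the paddy/buddy numbers from the table above plus a made-up third candidate (whether the squared impossible-pattern term is ranked or added directly is my reading, not necessarily the code's):

```python
import pandas as pd

df = pd.DataFrame({
    "word": ["paddy", "buddy", "muddy"],
    "norm_score": [1.057, 1.051, 0.900],
    "fraction_found": [0.459, 0.521, 0.400],
    "kstatistic": [0.468, 0.017, 0.300],
    "impossible_pattern_count": [0, 0, 2],
})

# Rank so that 1 is best: higher norm_score and fraction_found are better,
# a lower KS statistic is better, and impossible patterns are punished hard.
df["rerank"] = (
    df["norm_score"].rank(ascending=False)
    + df["fraction_found"].rank(ascending=False)
    + df["kstatistic"].rank(ascending=True)
    + (df["impossible_pattern_count"] + 1) ** 2
)
best = df.loc[df["rerank"].idxmin(), "word"]  # buddy wins despite lower norm_score
```

With these numbers, buddy's strong fraction_found and KS ranks overcome paddy's slightly higher normalized score.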

My 100% accuracy is a sham

So, yes, with all this post-hoc combination of four parameters, my code now “solves” 100% of the Wordles with one guess based on the shared scores on Bluesky. But this is only because I have been constantly tweaking the heuristics to account for all the edge cases where my prior methods failed.

Initially I only reranked solutions with a poor score, using the “fraction found” metric. But then that didn’t always work. I was sorting by impossible pattern count, but then for at least one puzzle the actual solution had an impossible pattern count of 1 while the false solution had 0. I had initially reduced my candidates to the top 20 via my “normalized score” metric before reranking … until that failed on Wordle 821 and I had to expand to the top 50, because the norm_score of the true answer was so poor.

In any event, it is only a matter of time before the latest version of my “100% accurate” algorithm fails again.

You can view my 100% (ha!) success rate and explore the data and solutions via this dash app.

Note: What about a classifier?

You would think a machine-learning classifier trained on the metrics I describe above would be able to accurately predict the most likely solution, without hacky post-hoc equations. Well, the classifier probability for flown was 0.931, but for the incorrect solution blown it was 0.962, and that was after training on everything but the data for the flown puzzle.

If I train on Wordles 800-1350 and predict on > 1350, an XGBoost classifier misfires on two puzzles, both of which have rank 1 in the normalized score. 😕 Obviously, I have not optimized the hyperparameters nor implemented early stopping to avoid overfitting, but a quick approach here did not work.
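A rough sketch of that train/test split. The feature table is synthetic (ten candidates per puzzle, rank 1 always the solution), not my real metrics, and scikit-learn's GradientBoostingClassifier stands in for XGBoost so the snippet has no extra dependency:

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic stand-in data: one row per (puzzle, candidate) pair.
rows = []
for puzzle in range(800, 1400, 10):
    for rank in range(1, 11):
        rows.append({
            "wordle_number": puzzle,
            "norm_score_rank": rank,
            "fraction_found": 0.60 - 0.02 * rank,
            "kstatistic": 0.05 * rank,
            "impossible_pattern_count": 0 if rank <= 2 else rank % 3,
            "is_solution": int(rank == 1),
        })
df = pd.DataFrame(rows)

features = ["norm_score_rank", "fraction_found", "kstatistic",
            "impossible_pattern_count"]
train = df[df["wordle_number"] <= 1350]   # Wordles 800-1350
test = df[df["wordle_number"] > 1350]     # everything after

clf = GradientBoostingClassifier()        # stand-in for the XGBoost classifier
clf.fit(train[features], train["is_solution"])
proba = clf.predict_proba(test[features])[:, 1]  # P(candidate is the solution)
```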