Solving Connections: ChatGPT's 60-Day Journey
ChatGPT has emerged as a remarkable language model capable of understanding and generating human-like text. So we wondered how good it is at making logical connections, especially in tricky puzzles that combine culture, trivia, and sometimes puns with incomplete words or phrases. To find out, we ran a 60-day experiment on the new NYT Connections puzzle, which premiered on June 12. We had one mission: to test the AI's puzzle-solving abilities and track its accuracy across the puzzle's first 60 days in beta, from June 12 to August 10.
For those not yet familiar with NYT Connections, it's a daily word puzzle. Players must sort 16 words into four groups of four by finding the common thread that links each group.
In this article, we present a comprehensive log of ChatGPT's performance during the 60-day challenge, highlighting its successes and occasional missteps. We invite you to join us as we explore the potential of AI language models and uncover fascinating insights from this unique experiment, all to answer one question: Can ChatGPT solve NYT's Connections?
Playing with Puzzles: Inside the 60-Day Methodology
Curious and determined, we embarked on this thrilling 60-day puzzle-solving adventure by crafting the perfect AI-friendly templated prompt for the ever-capable ChatGPT.
Step 1: Engineering The Daily Templated Prompt
It took us a few iterations to craft a reliable daily prompt that gave enough W Questions (what, why, how, and where) to guide the language model.
Here is the daily prompt we used:
Step 2: The Daily Routine
Every morning (the NYT Connections puzzle updates at midnight), we gave ChatGPT enough juice by adding a fresh set of 16 words to the templated prompt.
After ChatGPT attempted each puzzle, we meticulously recorded the accuracy of its responses.
Our Assessment of ChatGPT's Responses
We considered two factors when assessing whether ChatGPT solved the Connections puzzle that day: first, whether it grouped the words correctly according to what they had in common, and second, whether it guessed (more or less) the correct category name for each group.
On June 12th, for example, ChatGPT sorted all 16 words into the correct four groups, but one of its category names differed slightly from the official one: it labeled rain, hail, sleet, and snow as Weather Phenomena, while Connections called the category WET WEATHER. Our experiment still counted an answer as correct if the groupings were right, even when the category name was not.
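To make that rule concrete, here is a minimal Python sketch of how such a check could be automated. It is an illustration only, not the tool we used: the function names `normalize` and `score_puzzle` are hypothetical, and every word group other than the June 12th WET WEATHER group is made up for the example. Category names are ignored on purpose, since only the groupings decided whether a day counted as solved.

```python
def normalize(group):
    """Turn a group of words into a case-insensitive, order-free set."""
    return frozenset(word.strip().upper() for word in group)


def score_puzzle(model_groups, answer_groups):
    """Return (number of correct groups, whether the whole puzzle was solved).

    A group counts as correct only if it contains exactly the same words
    as one of the official groups. Category names are not compared.
    """
    answers = {normalize(g) for g in answer_groups}
    guesses = {normalize(g) for g in model_groups}
    correct = len(guesses & answers)
    return correct, correct == len(answers)


# Official groups: the first is the real June 12th WET WEATHER group;
# the other three are hypothetical placeholders, not from the actual puzzle.
official = [
    ["RAIN", "HAIL", "SLEET", "SNOW"],
    ["APPLE", "PEAR", "PLUM", "FIG"],
    ["RED", "BLUE", "GREEN", "PINK"],
    ["OAK", "ELM", "ASH", "PINE"],
]

# A hypothetical model answer that swaps one word between two groups.
model = [
    ["snow", "rain", "sleet", "hail"],
    ["apple", "pear", "plum", "ash"],
    ["red", "blue", "green", "pink"],
    ["oak", "elm", "fig", "pine"],
]

print(score_puzzle(model, official))  # (2, False)
```

Comparing sets rather than lists means word order and capitalization do not matter, which matches how a human grader would read a chat response.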
Step 3: Creating Interactive Visuals to Reveal Key Findings
In the spirit of sharing ChatGPT's performance, we created these interactive visuals; here, you'll see its triumphs and blunders.
Our findings are ordered by date, starting with the puzzle's debut on June 12th. We then documented how well the AI language model grouped the words into their respective categories according to the NYT Connections difficulty levels. In the last column, we gave ChatGPT a daily solving score out of 100%, based on how many of that day's 16 words it grouped correctly. You can find the percentage of correct answers by hovering over the purple bar.
If you're curious, use the search box to find out how well ChatGPT did at solving a particular group, word, or puzzle date.
So, on how many days did ChatGPT solve the puzzle? Here's a bar graph showing the number of days over the 60-day period on which ChatGPT found all four groups and the number of days on which it missed one or more.
39 Victories Vs. 21 Challenges
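For readers who want the headline figure as a percentage, the arithmetic is sketched below in Python. The variable names are our own choice for this illustration; the two counts are the ones reported above.

```python
solved_days = 39   # days on which ChatGPT found all four groups
missed_days = 21   # days on which it missed one or more groups

total_days = solved_days + missed_days
solve_rate = solved_days / total_days

print(f"{solved_days}/{total_days} puzzles fully solved = {solve_rate:.0%}")
# 39/60 puzzles fully solved = 65%
```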
Decoding Mistakes: Exploring ChatGPT's Missteps with an Interactive Treemap
So, where did ChatGPT falter, and on which colored categories? This interactive map shows ChatGPT's errors grouped by the four difficulty levels of NYT Connections. The purple and blue categories are the hardest, and green and yellow are the easiest.
As you can see, the harder the difficulty level, the more ChatGPT struggled to find a common thread to solve the category. Hover over each category, and you'll see the date of the puzzle and the four words that make it up.
If you prefer to look at each color individually, simply use the filter in the left-hand corner and use the dropdown menu or click directly on the color you want to see.
ChatGPT's Struggles with Connections
While the AI language model usually managed to find commonalities between the yellow and green groups, we noticed it struggled a lot more with the purple and blue categories on average. These were some of the trickiest categories for ChatGPT to solve:
- Homophones
- Categories with ____, like NAKED____ or _____STICK
- Things with..., like THINGS WITH WINGS
- Words with..., like WORDS WITH! or WORDS WITH TWO PRONUNCIATIONS
- Hyper-specific trivia like MTV SHOWS, SPICE GIRLS, or SLANG FOR MONEY
- Things that are..., like THINGS THAT ARE RED
- Groups that contained a proper noun like someone's name, a TV show, a brand, or a magazine
The majority of these were from the purple and blue groups; however, sometimes, a sneaky green or yellow tripped ChatGPT up, too, especially if it was trivia or a brand name.
Conclusion
Did ChatGPT showcase its linguistic brilliance in solving the NYT Connections puzzle? We will leave that for you to decide.
Regardless, we can all agree that it was an interesting experiment. From triumphant victories to intriguing challenges, this 60-day journey revealed the remarkable potential of AI-driven puzzle-solving and offered a glimpse of what language and artificial intelligence may be capable of in the future.