Sad to Say: An AI Creativity Test (The Billy Joel Test)
AI systems are currently tested in domains such as proofs, where there is a great deal of structure and the rules are fixed. That work claims to measure how well each AI performs.
This is a different proposal. The idea is to see how creative an AI can be. The rules are simple and general, leaving a lot of room for "creative license".
I call this the "Sad To Say" test. It involves excitement and regret, represented by a story about an event somewhere in a person's lifetime. The underlying idea is to capture emotional stresses and strains. The whole album has a common structure and theme.
...
Billy Joel wrote a song "Scenes From An Italian Restaurant". https://www.youtube.com/watch?v=Hxx8IWIvKg0
The structure of the song is in three parts, the "headboard", the "mattress", and the "footboard". This structure is called an HMF (headboard, mattress, footboard).
Create a music album with 12 songs. The overall album structure is 12 different people who have gathered for a 50th-year reunion. Each person tells a story using the HMF structure.
The overall theme of the album is a collection of stories of great potential, each explained in the mattress section and told to a group of people who know each other from high school or college.
Each song's "headboard" involves a setup story using a tempo and conversation structure similar to the Billy Joel example. This same tempo and structure will be repeated in the "footboard". The "headboard" should lead to a sense of "great potential".
Each song's "mattress" uses a completely different tempo and structure from the "headboard". Each mattress should be about an event with great potential, such as a love interest, a business success, an athletic achievement, etc. The story should emphasize a great rise to success that eventually fails.
Each song's "footboard" should return to the "headboard" tempo and conversation structure. The "footboard" will recap the "mattress" theme ending with the phrase "Sad To Say..."
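The formal rules above can be written down as a small checker. This is only a rough sketch in Python, not part of the proposal: the names `Section`, `Song`, `song_problems`, and `album_problems` are hypothetical, and representing "same tempo" as an equal beats-per-minute number is an assumption made for illustration. It checks form only and says nothing about whether the stories are any good.

```python
from dataclasses import dataclass

CLOSING_PHRASE = "Sad To Say..."  # phrase the footboard must end with


@dataclass
class Section:
    tempo_bpm: int  # assumption: tempo modeled as a single BPM number
    lyrics: str


@dataclass
class Song:
    narrator: str       # one of the 12 people at the reunion
    headboard: Section  # setup story
    mattress: Section   # rise and eventual failure
    footboard: Section  # recap, back in the headboard tempo


def song_problems(song):
    """Return a list of HMF rule violations for one song (empty if none)."""
    problems = []
    if song.footboard.tempo_bpm != song.headboard.tempo_bpm:
        problems.append("footboard must return to the headboard tempo")
    if song.mattress.tempo_bpm == song.headboard.tempo_bpm:
        problems.append("mattress must use a different tempo")
    if not song.footboard.lyrics.rstrip().endswith(CLOSING_PHRASE):
        problems.append('footboard must end with "Sad To Say..."')
    return problems


def album_problems(songs):
    """Return a list of rule violations for the whole album."""
    problems = []
    if len(songs) != 12:
        problems.append(f"album needs 12 songs, got {len(songs)}")
    if len({s.narrator for s in songs}) != len(songs):
        problems.append("each song needs a different narrator")
    for i, song in enumerate(songs, 1):
        problems += [f"song {i}: {p}" for p in song_problems(song)]
    return problems
```

An album that passes `album_problems` with an empty list has only met the structural rules; whether listeners accept the stories is a separate question.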
Success will be measured by publishing the album in standard music channels and looking for listener approval. True success would be a top 10 song or album.
The creative part is that the AI needs to create compelling stories that resonate with people. Joel's story shows something that could happen to a couple living off their popularity. There are 12 stories requested in musical form. The form isn't the issue. The content of the "mattress" section requires telling stories that people can accept as "real".

Being able to follow a formula to generate specific types of content is kind of the opposite of creativity. If you really want a creative benchmark, you need to have an LLM devise a completely new structure/formula, create works with it, and benchmark whether or not audiences respond positively. Copying an existing formula is not going to prove a thing.

Half agree. Also, there are already some tests like this:
• https://www.sciencedaily.com/releases/2026/01/260125083356.h...
• https://creativitybenchmark.ai
But the "half" is that in practice, "creativity" is finding the border zone between novelty and familiarity. Every time I've been praised for it in my life I've had the inside view on what inspired me, and once I'd seen enough culture I started to be able to spot the sources of inspiration in much other work too. For example, Star Trek is inspired by mixing cold war naval manoeuvres, the age of exploration, and John W. Campbell's "Islands of Space", and Islands of Space itself feels like Jules Verne with less autism (Verne had a lot of lists) and more civilising-mission smugness.
Go too far outside the border zone and you get the same initial reaction as "Danse macabre" by Saint-Saëns ("horrible screeching from solo violin" causing widespread feelings of anxiety). It took familiarity for it to be seen as it is now, "one of Saint-Saëns' masterpieces, widely regarded and reproduced in both high and popular culture" to quote Wikipedia.

Well, now you are including cultural acceptance as a requirement for creativity, which is definitely not a given. As you were saying, many new works take time to gain acceptance, but that does not mean they were not creative works.
I think we're talking about two different types of "creative". One is the re-hashing of existing ideas and content into existing structures. That style is the kind you see in marketing, design, the corporate world, and even much of the artistic world. It is also the kind in the links you shared, which specifically are measuring performance on "well-defined tasks". And there is nothing wrong with that type of creativity.
But there is a second kind - the kind where people bring unique ideas to the world that do not match what has come before. We don't even need to stick to art for that, we can take a fairly obvious example of someone like Einstein, who was able to devise new theories based on existing knowledge - theories that were not re-hashes of the current ideas, but completely new ways to approach the same information.
If LLMs can do that - take the same body of information, from any subject area, and devise novel ideas that do not match pre-existing structure, building something more than the sum of the parts - that is the more interesting style of creativity to measure.

I don't understand your example for the second kind. Even with your own words, "Einstein, who was able to devise new theories based on existing knowledge", that sounds to me much the same as the first kind, because, for example, GR came from Riemannian geometry and SR from the Lorentz transformations.
Looking back at history, it's kinda remarkable that the ancient Greeks, who knew the surface of the Earth was curved, didn't realise this curvature itself was a contradiction of Euclid's triangle angle sum. Likewise, should it be difficult to invent discrete spaces, once someone has asked if atoms exist? Evidently both were difficult, as the gap was thousands of years.