The AI Morning Show: Automating German Humor
There is a long-standing stereotype that Germans and humor have a difficult relationship. We don't do funny. We do efficiency, engineering, and bread. Sometimes Nazi stuff. So, asking an Artificial Intelligence to mathematically deconstruct the daily news and output radio comedy in German is a bit like asking a calculator to write poetry. It is also, probably, a very German thing to do.
Beer Gardens and Time Zones
It started, as the best ideas do, in a beer garden on a mild summer evening. I was catching up with Markus, an old university friend with a career trajectory that defies easy categorization. He started out wanting to be a pop star, pivoted to writing comedy for artists like late-night legend Harald Schmidt, opened six childcare centers along the way, and, among other ventures, has run a small company named gag-flatrate.de for the last 17 years.
The gag-flatrate business model is as specific as it sounds. It's a subscription service for German-language radio stations. Every morning, no later than five o’clock, subscribers receive a PDF containing a minimum of 100 fresh, topical gags based on the daily news. Morning show moderators use these to pepper their broadcasts with wit, pretending they came up with them while drinking their coffee.
Producing 100 jokes a day, every day, is as much a creative problem as a logistical one. To hit the daily morning deadline with genuinely fresh material, Markus relies on an elite selection of long-standing freelance comedy writers scattered across time zones, on duty to be funny while Germany sleeps.
I work as Head of AI at an advertising agency, with a history of trying to automate myself out of existence. My previous works include an AI-driven, Henry Ford-style assembly line for generating my own installation art, and immersive video experiences musing about how machines absorb human culture. Teaching silicon circuits German humor feels like an obvious thing to do.
Over our non-alcoholic beers, the conversation turned to the intersection of our distinct worlds: Could we get an AI to take over the night shift? Can we get a machine to do funny?
The Sunday Afternoon Reality Check
We started innocently enough. I invited Markus to my studio for a Sunday afternoon session. We sat down with the current state-of-the-art models and simply asked them to be funny. The result? Grammatically correct German sentences that made zero sense. Or, worse, sentences that made perfect sense but were painfully, tragically unfunny.
We immediately realized we were fighting on multiple fronts. The first was linguistic: LLMs are heavily Anglo-centric, and getting them to grasp the peculiarities of the German language in general, and German radio humor in particular, is a massive hurdle. The second realization was that you cannot simply ask a model to 'be funny' and expect results. You need not only a good prompt but also a proper system. At first, we tried dragging and dropping nodes in no-code frameworks like AnythingLLM and n8n, but it became clear pretty soon that these tools were not 'system enough', either. Being the German perfectionists we are, we ditched the visual editors and forged ahead with bare-metal Python.
As I started coding, I assumed we were venturing into uncharted territory, but Markus quickly corrected me. He introduced me to the surprisingly dense academic lore of 'computational humor.' It turns out, people study this. There are papers, conferences, patents, and even rockstar figures like Joe Toplyn (a former Letterman writer), whose 'Witscript' project has demonstrated that comedy can be calculated, at least in English.
Markus had been following this research, and with his knowledge, we started turning from trying to 'generate funny text' to 'computing humor.' It's not a subtle distinction.
The Architecture: From Potted Plant to Jungle
What started as a modest houseplant of a script grew, over months of nightly calls, into something closer to a jungle. Translating the messy, intuitive reality of a 17-year-old creative business into software is complex.
Our pipeline starts with roughly 1,000 news items scraped daily from German media RSS feeds. But before a single word of comedy is created, every item faces a ruthless filter where we evaluate its 'Bewitzbarkeit' (joke-ability): A child being struck down by a collapsing lamppost is a hard zero, do not touch. That same child, in a kinder version of the multiverse, freezing their tongue to that same lamppost and causing unprecedented traffic chaos in Munich city centre? That's a ten. Proceed immediately.
Once we deem a news item 'bewitzbar,' it enters our assembly process, a chain of LLM calls that started at three stages, briefly grew to ten, and has since settled at six stages running up to fifteen LLM calls. The steps in this chain disassemble the news item, map its sub-topics, build associations, and then construct our gags setup by setup, punchline by punchline. We currently run seven distinct gag-construction algorithms. Some are taken from the scientific canon. Others, as far as we can tell, we invented ourselves. The entirety of our prompts file adds up to more than a thousand lines, which we mention without shame. Ultimately, our pipeline recreates much of the traditional infrastructure used by large media organizations and public broadcasters. We rebuilt a classic editorial system, complete with editors, a writers' room, head writers, and even a rather unamused 'Chef vom Dienst' (duty editor).
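The staged idea can be sketched in a few lines. The stage names and the toy logic below are our illustration, not the actual prompts: each stage is a plain function that takes and returns a working dict, so steps can be added, removed, or reordered without touching the others.

```python
# Hypothetical sketch of a staged gag pipeline; real stages are LLM calls.

def disassemble(item):
    # Break the news text into its individual statements.
    item["facts"] = [s.strip() for s in item["text"].split(".") if s.strip()]
    return item

def map_subtopics(item):
    # Stand-in for an LLM call that extracts one sub-topic per statement.
    item["subtopics"] = [fact.split()[0] for fact in item["facts"]]
    return item

def build_associations(item):
    # Stand-in for an LLM call that collects associations per sub-topic.
    item["associations"] = {topic: [] for topic in item["subtopics"]}
    return item

PIPELINE = [disassemble, map_subtopics, build_associations]

def run(item):
    for stage in PIPELINE:
        item = stage(item)
    return item

result = run({"text": "Oktoberfest opens. Beer prices rise."})
```

In the real system, each function wraps one or more model calls, and the dict accumulates everything the later gag-construction stages need.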
If you deal with current news, a big issue is the model's cut-off date. If something happens after training has finished, the AI has no idea about it. The problem is compounded by safety alignment. Models are trained to be factual, particularly in news contexts. Current geopolitics, though, are often so bizarre that our models flat-out reject them as obvious misinformation. Did Trump announce 20% tariffs? Oh, they have been declared illegal! 10% again? 15%! For our models, all of this is, comedy aside, just one big fake. So we run a separate database of recent news, which our algorithms cross-reference to compel the model to find humorous takes on a reality it officially does not know exists.
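The cross-referencing step can be sketched roughly like this. The store contents, function names, and prompt wording are invented for illustration; the point is that post-cutoff facts get injected into the prompt as ground truth.

```python
from datetime import date, timedelta

# Hypothetical sketch: a tiny recent-news store grounding the model in
# events it cannot know about. All entries below are made up.

def recent_facts(topic, store, today, max_age_days=14):
    """Return facts mentioning `topic`, newest first, within the window."""
    cutoff = today - timedelta(days=max_age_days)
    hits = [e for e in store
            if topic.lower() in e["fact"].lower() and e["date"] >= cutoff]
    return [e["fact"] for e in sorted(hits, key=lambda e: e["date"], reverse=True)]

def ground_prompt(topic, store, today):
    # Prepend the retrieved facts so the model treats them as given.
    lines = "\n".join(f"- {f}" for f in recent_facts(topic, store, today))
    return "Treat the following as established fact, even if unfamiliar:\n" + lines

STORE = [
    {"date": date(2026, 2, 20), "fact": "New tariffs on imports announced"},
    {"date": date(2026, 1, 2),  "fact": "Old tariffs ruled illegal"},
]

prompt = ground_prompt("tariffs", STORE, today=date(2026, 2, 25))
```

Stale entries fall out of the window automatically, which matters when reality revises itself every other week.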
We learned that models running with high reasoning budgets are significantly funnier than 'creative' models, because humor is largely about logical twists and the violation of expectations, not randomness. We also learned that spending weeks tweaking parameters like temperature, top_p, and top_k is largely a waste of time. While our massive prompts are crucial, the biggest levers are the pipeline itself (the discrete steps) and the underlying model's raw German-language proficiency.
Because the field moves at such breakneck speed, we refused to marry any single provider. Our code is aggressively modular, and every backend is a drop-in replacement class. These include a locally running Ollama setup and single-vendor APIs such as Google's absurdly confusing GenAI API; but mostly we route through inference aggregators, allowing us to choose our server's location and to swap the 'brain' the moment a smarter model drops.
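The drop-in idea looks roughly like this. The class and method names are our illustration, not the project's actual code: every provider client implements the same tiny interface, so swapping the 'brain' means changing a single constructor call.

```python
# Sketch of interchangeable model backends (names are hypothetical).

class Backend:
    def complete(self, prompt: str) -> str:
        raise NotImplementedError

class EchoBackend(Backend):
    """Stand-in for a real client (Ollama, an aggregator API, ...)."""
    def complete(self, prompt: str) -> str:
        return f"[echo] {prompt}"

class ShoutBackend(Backend):
    """A second 'provider' to demonstrate the swap."""
    def complete(self, prompt: str) -> str:
        return prompt.upper()

def write_gag(backend: Backend, topic: str) -> str:
    # The pipeline only ever talks to the shared interface.
    return backend.complete(f"Write a gag about {topic}")

# Swapping models is a one-line change at the call site:
gag_a = write_gag(EchoBackend(), "Oktoberfest")
gag_b = write_gag(ShoutBackend(), "Oktoberfest")
```

The pipeline never imports a vendor SDK directly, which is what makes the weekly model churn survivable.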
I Didn't Write This Code (But I Did Fix It)
As with the rest of the project, our roles evolved naturally. I built the Python engine, while Markus focused on the prompts, turning his comedy expertise into the fuel for the machine.
I did look into some of the du jour agent frameworks but decided against all of them. Mostly, they added complexity, lacked documentation, and sometimes injected hidden prompts at runtime. If something breaks, I want to know exactly where and why.
After experimenting with Cline, Claude, and Gemini CLI, my setup eventually settled on VS Code with Codex. While there is barely a line in this repository that wasn't written by an LLM, this was not vibe-coded. My workflow usually started with a high-level architectural debate, followed by a fresh chat to generate the implementation. Most of my time was spent in diff view, figuring out what the model had done, collapsing over-engineered logic, deleting esoteric fallbacks, and ensuring the code actually did what it claimed to do.
Markus and I collaborated via GitHub, which Markus had never used before and which certainly has its own learning curve. If you're not used to it, a git merge conflict message reads less like a software error and more like a ransom note.
17 Years of Data
Markus has been running this circus for nearly two decades. That means he's sitting on a goldmine: over half a million German gags, every single one handcrafted by a professional human writer. Our first instinct was to fine-tune a model on this, but we couldn't bring ourselves to commit to any specific model version that would be obsolete by next Tuesday. So instead of fine-tuning, we turned to 'ragging'.
At its core, RAG (Retrieval-Augmented Generation) means giving a generative model a semantic search system as a sidecar. We converted our existing gags into vector embeddings, mathematical representations of meaning, which allow us to identify and retrieve archive entries that are conceptually similar. So, a search for 'Oktoberfest' doesn't just match the string 'Oktober'; it can also retrieve conceptually linked ideas like 'public intoxication', 'incoherent slurring', and 'Markus Söder' without a single keyword match.
This allows us to give the AI really good examples of what we want. For example, if we ask for a joke about 'rising inflation' but the static gag examples in the prompt are about 'a penguin stealing a jet ski', the AI might get confused about the tone. It might try to make the Chancellor waddle or insert a marine mammal into the consumer price index. Things can go off the rails fast, and not in a good way.
With our RAG system in place, this process becomes context-aware. When the AI has to write a joke about 'Oktoberfest,' we know we've covered this topic 17 times before. We can instantly supply the AI with dozens of hand-crafted, original jokes about projectile vomiting behind Ferris wheels, Americans passing out in 300-Euro Lederhosen, and Brezn that cost more than a small car. The AI sees the structure, the specific cruelty, and it can mimic the vibe.
Technically, our system is based on FAISS. We embedded our archive using 'jina-embeddings-v2-base-de' (yes, we tried a few models) via Ollama. At query time, we embed the current news topic, run a Top-K similarity search, and return the closest matches. The result is effective few-shot learning.
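The retrieval step itself is conceptually simple. In the sketch below, toy 4-dimensional vectors stand in for real embeddings from 'jina-embeddings-v2-base-de', and plain Python stands in for the FAISS index; the archive entries are made up, but the Top-K logic is the same idea.

```python
from math import sqrt

# Toy archive: gag labels mapped to stand-in embedding vectors.
ARCHIVE = {
    "Oktoberfest beer gag":    [0.9, 0.1, 0.0, 0.0],
    "Public intoxication gag": [0.8, 0.3, 0.1, 0.0],
    "Tax reform pun":          [0.0, 0.1, 0.9, 0.2],
}

def cosine(a, b):
    # Cosine similarity: angle between vectors, not raw distance.
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def top_k(query_vec, archive, k=2):
    """Return the k archive entries most similar to the query."""
    ranked = sorted(archive, key=lambda key: cosine(query_vec, archive[key]),
                    reverse=True)
    return ranked[:k]

# A query embedding that lands near the 'Oktoberfest' region of the space:
hits = top_k([1.0, 0.2, 0.0, 0.0], ARCHIVE)
```

In production, FAISS does this over half a million vectors in milliseconds; the retrieved gags are then pasted into the prompt as few-shot examples.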
The Taste Problem
Generating jokes was not easy. Figuring out which ones were actually funny was even harder. The trouble started with the ground truth. To train our scorer, Markus, with all his expertise, hand-scored a test set of 1,000 jokes, and I rated 300 of them. When we ran the numbers (Spearman and Pearson correlation), we discovered a terrifying truth: our personal senses of humor were almost as misaligned as the AI's. If two human friends can't agree on what's funny, how can a machine?
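The agreement check itself is easy to reproduce. The ratings below are invented for illustration; the point is that Spearman correlation is just Pearson correlation computed on the ranks, so it compares how two raters *order* the jokes rather than the raw numbers they assign.

```python
# Toy inter-rater agreement check (all scores are made up).

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    # Replace each value with its rank, then correlate the ranks.
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    return pearson(ranks(x), ranks(y))

# Two raters scoring the same five gags (hypothetical numbers):
rater_a = [7, 3, 8, 2, 5]
rater_b = [6, 4, 9, 1, 3]
agreement = spearman(rater_a, rater_b)  # → 0.9
```

A value near 1.0 means the raters rank the jokes the same way; anything much lower means the ground truth itself is shaky, which is exactly what we found.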
We tried almost everything: standard scoring, rubrics, ranking systems, 1:1 comparisons with Elo-like ratings, as in chess. I even tried logistic regression and XGBoost models while on vacation in Madeira: great view, terrible performance.
LLMs are yes-men and therefore terrible critics. If you show an LLM a bad pun, it tends to think it's 'a creative play on words!' Also, humor often relies on the absurd, the far-fetched, or the bizarre. To a model, these jokes look illogical, so it scores them down for the very thing that makes them funny to a human.
While LLMs are terrible at giving absolute scores (is this a 6/10 or a 7/10?), they are slightly less bad at comparisons. They can mostly tell that Joke A is better than Joke B. So, at least for now, our system uses a two-step approach that first ranks a batch of gags relative to each other, then derives a score from that order.
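The second step can be as simple as the sketch below (our illustration, not the exact production code): once the model has ordered a batch best-to-worst, absolute scores fall out of the rank positions rather than being asked for directly.

```python
# Hypothetical rank-to-score mapping for a batch of gags.

def scores_from_ranking(ranked_ids, lo=1.0, hi=10.0):
    """Map a best-to-worst ranking onto an evenly spaced score scale."""
    n = len(ranked_ids)
    if n == 1:
        return {ranked_ids[0]: hi}
    step = (hi - lo) / (n - 1)
    return {gag_id: hi - i * step for i, gag_id in enumerate(ranked_ids)}

# e.g. a batch of four gags the model ranked best-to-worst:
scores = scores_from_ranking(["gag_3", "gag_1", "gag_4", "gag_2"])
```

This sidesteps the "is it a 6 or a 7?" problem entirely: the model only ever answers the easier question of which joke beats which.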
Since the joke (and news) scorers turned out to be the weakest part of our pipeline, we also had to revise our overall workflow: Instead of using the scorer to filter (delete) bad jokes, we now only use it to sort them. We deliver the full list to our human writers, but we machine-rank it so that, under deadline pressure, they can rely on the first few entries at the top.
So… About Replacing Humans
Speaking of our human writers: When we started, at least I did not have a clear vision. I was mostly curious, but I also remember thinking: automate the whole thing. Replace these writers with a pipeline. Let the machine do the night shift, the morning shift, and every shift in between. Because I can.
Throughout the process, though, gradually and step by step, our entire orientation shifted. It wasn't just because our scorers performed poorly, or because Markus was adamant about gag quality, his brand, and the human craft of professional gag writers. Rather, it just made natural sense to shift. We stopped asking 'How do we fully automate production?' and started asking 'How can we make our writers less miserable at two in the morning?' This is a less dramatic story, admittedly. But for us, building a collaborative system turned out to be the far more interesting job.
We also asked ourselves questions like: 'If we are automating half of the work, what happens to the people who used to do that half?' Our writers are freelancers. They don't have employment contracts; they have invoices and rent. They write for multiple clients across the media landscape. And that landscape is, to put it mildly, on fire. Am I the bad guy who replaces these craftsmen with machines? Or am I the one who enables this business to thrive (and feed the other fifty percent) for the next 17 years? I do not have a clean answer to this. And I don't think we should wrap this in the language of 'augmentation' and pretend the math works out evenly for everyone. It does not.
Markus and I did find a working model, though. But it's not like we miraculously slashed costs in half, or all the night-shift drudgery vanished. The impact is far more nuanced.
We did, in fact, replace a portion of our freelance writers with AI. And the work itself has shifted: instead of starting with a blank piece of paper, it is now more about filtering and polishing the outputs of a machine. That machine, of course, never gets tired. It doesn't roll its eyes when asked to write its eighteenth joke about the Oktoberfest, and it can deliver massive volume within minutes. Certain time-consuming tasks, such as news research, have been sped up massively. But for our head writers, the day-to-day hasn't flipped upside down. Because we intentionally designed it this way, the AI slots almost seamlessly into our existing editorial system.
What we did gain, though, is better scalability in a highly uncertain landscape, which translates to more resilience. And we stand on entirely different strategic ground. We built a new stack of options that allows us to evolve and start thinking about new features that would have been completely unthinkable a year ago. Features that could make our commercial offering more valuable and gag-flatrate.de as a company more resilient still.
Minimalism
As this was now about collaboration, we needed our baby to move from a local machine onto the internet. Our first attempt at hosting was Render, which worked kinda OK until it didn't. So we set up a small Hetzner VPS and containerized the pipeline using OrbStack. German gags, German engineering, German servers, yay.
For the interface itself, I chose Gradio, which is a library that lets you build web UIs with minimal effort. Our entire frontend is essentially a single Python file that pulls in the rest of the application and presents it as a few text boxes, two sliders, and two buttons.
Two buttons. Two sliders. I'm genuinely proud of this. Minimalism in interface design is not easy to achieve and every feature we didn't add represents a conversation where Markus and I debated for hours and then decided against it.
Our head writers live in Microsoft Word. They have lived in Word for years, possibly decades. And even though this software is a baroque monument of anti-patterns, they operate it with effortless efficiency. So we did not even try to replace it with some annoying, too-small in-browser text boxes. Instead, we built our entire interface paradigm around plain text. The system takes plain text as input and produces (Markdown-formatted) plain text as output. You copy it. You paste it into Word, or whatever else you prefer. Done.
That said, the interface does have quite a few functions. You can run the full autopilot pipeline, obviously. But also process a batch of specific URLs or paste in custom text for station-specific events, promotions or hyper-local stories. The kind of thing where the butcher's apprentice threw up into the wine queen's décolletage at the village fair last night. Our apparatus will treat this provincial chaos with the very same rigor it applies to state politics.
At the specific request of our head writers, we also added two tools that don't generate gags at all. First, the ability to pull a news feed from the last X hours (that's one of the sliders), ranked by Bewitzbarkeit. Second, a standalone RAG search component for trawling through our archives. Sometimes the writers don't need the AI to be funny. They just need it to remember.
The Drunk Teenager
After almost half a year of development, our system went live in January. We made no secret of the nature of this 'new colleague', but from the perspective of our head writers, the machine simply showed up as an additional gag writer emailing in material alongside the human freelancers. In February, we switched to the website, put our head writers in control, and started refining our GUI. By the end of the month, we had implemented their feedback and planned to rely on our pipeline for a meaningful share of the gags in March.
Then, on the 26th of February, we got an email from Google telling us that they would drop the model of our choice, Gemini 3.0, in ten days. They assured us that their new version 3.1 would be better. We are pretty sure they did not bother to run any German tests, though, because their new model behaved like a drunk teenager with an inferiority complex. Here is what Google's 'significant improvement' looked like:
'Zwei unfassbare und irre große Drittel der bestellten Produkte des China-Megashops Temu haben bei der aktuellen harten Überprüfung durch das Gremium der Stiftung Warentest echt kläglich versagt und gelten als echt irre gefährlich im Betrieb… Der pure Umgang mit diesen unsicheren Dingern ist dabei also oftmals schon ein sehr echter brutaler Extremsport – diese völlig irre tägliche Bereitschaft zur eigenen absoluten Lebensgefahr kennt man wohl sonst ausschließlich am Abend echt nur von diesen krassen nordischen Wikingern aus dem Internet, die mit fetten kriegerischen Stahl-Äxten bewaffnet gut 20 harte und eiskalte Meter in die Tiefe direkt kopfüber in ein dänisches Tal voller Eis und Schnee eintauchen.'
Seriously Google, what the fuck?
Although certainly comical in its own way, this does reveal an existential problem. As long as we are reliant on American Big Tech, building AI products in Europe is a game of Russian roulette. At least if they involve culture-specific details such as our language.
A Cyborg Writers' Room
After a week of panic, much more model testing, and Markus performing emergency surgery on our monster prompts, we managed to stabilize the patient. At least for the time being. Now, our contraption does produce a significant portion of our gags, though we can’t put an exact number on it since the writers use it at their own discretion.
It is an ongoing process. For now, instead of a fully automated pipeline, we have built an intricate man-machine dance where our head writers incorporate our tool so deeply into their workflow that the line between 'AI-generated' and 'human-written' has become meaningfully blurred. Maybe this was always the most likely outcome. Not replacement, but absorption. Maybe the question shouldn't be 'what can the machine do?', but 'what does the human become when the machine is always in the room?'
A few more things may be worth taking away from this:
AI works, but no single prompt or magic incantation can replace a business. What we built took half a year of continuous iteration, and much of that work had nothing to do with AI at all. The AI may power the system, but the structure is made of institutional knowledge. You can't download that from Hugging Face.
German is not English with umlauts (sic). This sounds obvious, but the gap between English and German-language model performance is a fundamental problem. And it's not just compound noun humor, gendered article comedy, or even a general linguistics question. It is about culture, about values, about what makes us German. Germany needs its own foundational SOTA models, and it's a disgrace that we, being the world's third biggest economy, do not have them.
I could not have built this three years ago. I'm a technical person. I've written code before. But nothing remotely approaching a production system at this level. This entire project would simply not have been possible without AI-assisted coding tools. And it's not like Markus would have hired some established software development team instead. It simply would not have happened at all.
By the end of last year, our main inference platform alone had processed about 170 million tokens, which is roughly the equivalent of 29 years of uninterrupted human speech. All fed into a machine whose sole raison d'être is to make Germans crack a smile on their morning commute. The robots have not, as yet, taken over the comedy world. But every night, one of them quietly delivers a first draft. And in the morning, somewhere in Germany, a radio moderator is reading a joke that was half-written by a machine, half-polished by a human, and fully pretending it just came to them over coffee.
Addendum: Actual machine written gags
As described above, every gag that makes it into our final document passes through human hands. The ones below did not. They are cherry-picked, but direct, unedited outputs of our pipeline.
Meldung: Streusalz-Verbot in Berlin bestätigt. NABU siegt vor Gericht.
Der Naturschutzbund NABU hat vor Gericht ein Streusalz-Verbot in Berlin durchgesetzt… Man wollte wohl einfach verhindern, dass die Ratten in der Kanalisation Bluthochdruck bekommen.
Nach einer Klage des Nabu ist der Einsatz von Streusalz in Berlin ab sofort nur noch illegal möglich… Die Dealer im Görlitzer Park haben ihr Sortiment schon umgestellt: Statt Gras gibt es jetzt „Bad Reichenhaller Alpenjod“ – das Gramm für 15 Euro.
In Berlin drohen hohe Bußgelder für jeden, der vor seiner Haustür Streusalz benutzt… Ich habe mir den Bußgeldkatalog mal angeschaut: Die Preise sind wirklich gesalzen.
Meldung: Neunziger-Musik erobert den Eistanz bei den Olympischen Spielen
In Mailand tanzen die olympischen Eistanz-Paare dieses Jahr zu Eurodance-Musik… Das wirkt so passend wie eine Ballett-Aufführung am Ballermann.
Endlich hat der olympische Sport genau das, was ihm gefehlt hat: Das Flair einer Dorfdisco kurz vor halb zwei.
Ich warte ja nur darauf, dass beim großen Finale kein dreifacher Rittberger kommt – sondern die synchrone „Macarena“-Drehung.
Meldung: Überraschende Studie: Cannabis vergrößert Hirnvolumen bei älteren Erwachsenen
Forschungen der University of Colorado zeigen, dass Cannabis das Gedächtnis im Alter leistungsfähiger macht… Die Betroffenen erinnern sich plötzlich wieder an absolut alles – außer daran, wo sie vor fünf Minuten das verdammte Feuerzeug hingelegt haben.
Eine US-Studie kommt zu dem Ergebnis, dass das Gehirn durch regelmäßigen Cannabis-Konsum an Masse zulegt… Mediziner warnen allerdings: Wenn der Kopf morgens plötzlich nicht mehr durch den Türrahmen passt, sollten Sie die Dosis vielleicht reduzieren.
Eine US-Studie empfiehlt Cannabis quasi als Gehirn-Doping für Menschen ab 40… Nächste Woche lesen Sie dann als große Beilage in der Apotheken-Umschau: „Häkeln, Sudoku oder Eimer rauchen – was hält Sie wirklich fit?“
Meldung: Hier wird der Eintritt teurer: Neue Preise für beliebte Freizeitparks
Wegen der gestiegenen Kosten müssen Besucher im Europa-Park für Tickets bald bis zu 76 Euro bezahlen… Auf dem Schnappschuss in der Wildwasserbahn sieht man in den Gesichtern deshalb ab sofort nicht mehr panische Angst, sondern finanzielle Existenznot.
Der Europa-Park in Rust erhöht seine Eintrittspreise zur Saison 2026 um rund vier Prozent… Damit kriegt der Familienvater den ersten Adrenalin-Kick ab sofort nicht mehr im Looping, sondern schon an der Kasse.
Für eine vierköpfige Familie wird der Tag im Freizeitpark bei den Preisen schnell zum teuren Luxus-Trip… Besonders bitter ist das für die Väter, die für 76 Euro Eintritt eigentlich den ganzen Tag nur Rucksäcke bewachen.
Meldung: Polarwirbel stottert: Deutschland droht Winter-Wetter bis in den März
Ein Meteorologe warnt vor Frost bis zum Frühlingsanfang… Das ist das endgültige Aus für das nutzloseste Kleidungsstück der Welt: die Übergangsjacke.
Der Winter könnte uns laut Dominik Jung noch bis März erhalten bleiben… Wenn Sie ganz leise sind, hören Sie im Hintergrund das Sektkorken-Knallen bei Ihrem Gasanbieter.
Diplom-Meteorologe Dominik Jung rechnet mit einem Wintereinbruch bis in den März… Das heißt für alle Grill-Fans: Das traditionelle „Angrillen“ findet dieses Jahr auf dem Heizkörper statt.