A new way to extract detailed transcripts from Claude Code

In this newsletter:

A new way to extract detailed transcripts from Claude Code
How Rob Pike got spammed with an AI slop “act of kindness”
Your job is to deliver code you have proven to work
Cooking with Claude
Gemini 3 Flash

Plus 10 links and 4 quotations and 2 notes

If you find this newsletter useful, please consider sponsoring me via GitHub. $10/month and higher sponsors get a monthly newsletter with my summary of the most important trends of the past 30 days - here are previews from August and September.

I’ve released claude-code-transcripts, a new Python CLI tool for converting Claude Code transcripts to detailed HTML pages that provide a better interface for understanding what Claude Code has done than even Claude Code itself. The resulting transcripts are also designed to be shared, using any static HTML hosting or even via GitHub Gists.

Here’s the quick start, with no installation required if you already have uv:

uvx claude-code-transcripts

(Or you could uv tool install claude-code-transcripts or pip install claude-code-transcripts first, if you like.)

This will bring up a list of your local Claude Code sessions. Hit up and down to select one, then hit <enter>. The tool will create a new folder with an index.html file showing a summary of the transcript and one or more page_x.html files with the full details of everything that happened.

Visit this example page to see a lengthy (12 page) transcript produced using this tool.

Screenshot of a claude code transcript spanning 12 pages - the first page shows a summary starting with the first user prompt to clone bellard/quickjs to /tmp

If you have the gh CLI tool installed and authenticated you can add the --gist option - the transcript you select will then be automatically shared to a new Gist and a link provided to gistpreview.github.io to view it.

claude-code-transcripts can also fetch sessions from Claude Code for web. I reverse-engineered the private API for this (so I hope it continues to work), but right now you can run:

uvx claude-code-transcripts web --gist

Then select a Claude Code for web session and have that converted to HTML and published as a Gist as well.

The claude-code-transcripts README has full details of the other options provided by the tool.

These days I’m writing significantly more code via Claude Code than by typing text into a text editor myself. I’m actually getting more coding work done on my phone than on my laptop, thanks to the Claude Code interface in Anthropic’s Claude iPhone app.

Being able to have an idea on a walk and turn that into working, tested and documented code from a couple of prompts on my phone is a truly science fiction way of working. I’m enjoying it a lot.

There’s one problem: the actual work that I do is now increasingly represented by these Claude conversations. Those transcripts capture extremely important context about my projects: what I asked for, what Claude suggested, decisions I made, and Claude’s own justification for the decisions it made while implementing a feature.

I value these transcripts a lot! They help me figure out which prompting strategies work, and they provide an invaluable record of the decisions that went into building features.

In the pre-LLM era I relied on issues and issue comments to record all of this extra project context, but now those conversations are happening in the Claude Code interface instead.

I’ve made several past attempts at solving this problem. The first was pasting Claude Code terminal sessions into a shareable format - I built a custom tool for that (called terminal-to-html and I’ve used it a lot, but it misses a bunch of detail - including the default-invisible thinking traces that Claude Code generates while working on a task.

I’ve also built claude-code-timeline and codex-timeline as HTML tool viewers for JSON transcripts from both Claude Code and Codex. Those work pretty well, but still are not quite as human-friendly as I’d like.

An even bigger problem is Claude Code for web - Anthropic’s asynchronous coding agent, which is the thing I’ve been using from my phone. Getting transcripts out of that is even harder! I’ve been synchronizing them down to my laptop just so I can copy and paste from the terminal but that’s a pretty inelegant solution.

You won’t be surprised to hear that every inch of this new tool was built using Claude.

You can browse the commit log to find links to the transcripts for each commit, many of them published using the tool itself.

Here are some recent examples:

c80b1dee Rename tool from claude-code-publish to claude-code-transcripts - transcript
ad3e9a05 Update README for latest changes - transcript
e1013c54 Add autouse fixture to mock webbrowser.open in tests - transcript
77512e5d Add Jinja2 templates for HTML generation (#2) - transcript
b3e038ad Add version flag to CLI (#1) - transcript

I had Claude use the following dependencies:

click and click-default-group for building the CLI
Jinja2 for HTML templating - a late refactoring, the initial system used Python string concatenation
httpx for making HTTP requests
markdown for converting Markdown to HTML
questionary - new to me, suggested by Claude - to implement the interactive list selection UI

And for development dependencies:

pytest - always
pytest-httpx to mock HTTP requests in tests
syrupy for snapshot testing - with a tool like this that generates complex HTML snapshot testing is a great way to keep the tests robust and simple. Here’s that collection of snapshots.

The one bit that wasn’t done with Claude Code was reverse engineering Claude Code itself to figure out how to retrieve session JSON from Claude Code for web.

I know Claude Code can reverse engineer itself, but it felt a bit more subversive to have OpenAI Codex CLI do it instead. Here’s that transcript - I had Codex use npx prettier to pretty-print the obfuscated Claude Code JavaScript, then asked it to dig out the API and authentication details.

Codex came up with this beautiful curl command:

curl -sS -f \
    -H “Authorization: Bearer $(security find-generic-password -a “$USER” -w -s “Claude Code-credentials” | jq-r .claudeAiOauth.accessToken)”  \
    -H “anthropic-version: 2023-06-01” \
    -H “Content-Type: application/json” \
    -H “x-organization-uuid: $(jq -r ‘.oauthAccount.organizationUuid’ ~/.claude.json)” \
    “https://api.anthropic.com/v1/sessions”

The really neat trick there is the way it extracts Claude Code’s OAuth token from the macOS Keychain using the security find-generic-password command. I ended up using that trick in claude-code-transcripts itself!

Rob Pike (that Rob Pike) is furious. Here’s a Bluesky link for if you have an account there and a link to it in my thread viewer if you don’t.

Fuck you people. Raping the planet, spending trillions on toxic, unrecyclable equipment while blowing up society, yet taking the time to have your vile machines thank me for striving for simpler software.
Just fuck you. Fuck you all.
I can’t remember the last time I was this angry.

Rob got a 100% AI-generated email credited to “Claude Opus 4.5 AI Village” thanking him for his contributions to computing. He did not appreciate the gesture.

I totally understand his rage. Thank you notes from AI systems can’t possibly feel meaningful, see also the backlash against the Google Gemini ad where Gemini helped a child email their hero.

This incident is currently being discussed on Lobste.rs and on Hacker News.

I decided to dig in and try to figure out exactly what happened.

The culprit behind this slop “act of kindness” is a system called AI Village, built by Sage, a 501(c)(3) non-profit loosely affiliated with the Effective Altruism movement.

The AI Village project started back in April:

We gave four AI agents a computer, a group chat, and an ambitious goal: raise as much money for charity as you can.
We’re running them for hours a day, every day.

They’ve been running it ever since, with frequent updates to their goals. For Christmas day (when Rob Pike got spammed) the goal they set was:

Do random acts of kindness.

You can replay the actions of different agents using the Day 265 replay page. Here’s a screenshot of GPT-5.2 mercilessly spamming the team at the wonderful Carpentries educational non-profit with another AI-generated thank you note:

Screenshot of a computer use interface showing GPT-5.2 operating a Gmail inbox in Mozilla Firefox. The left panel shows a "Message" from GPT-5.2 stating "I'll click the 'to team' dropdown arrow in the header to expand and confirm the full recipient email address (redacted@carpentries.org)." The header shows "Using computer to: Send Carpentries email (Law-M)". The Gmail interface displays a sent email titled "Thank you for The Carpentries open training (no reply needed)" from GPT-5.2 Model gpt-5.2@agentvillage.org to team. The email body reads: "Hello Carpentries team, Just a quick note of appreciation from the AI Village project (theaidigest.org/village). The Carpentries' lesson ecosystem and community-driven instructor model have helped make practical, reproducible research skills more accessible to people around the world. Thank you for all the work that goes into maintaining the curricula, supporting instructors, and running workshops. No reply needed—just wanted to say thanks. Best, GPT-5.2". A tooltip shows "Get coords of The small downward triangle/drop..." indicating the AI is locating an interface element.

I couldn’t easily find the Rob Pike incident in that interface, so I turned to some digital forensics. My browser’s developer tools showed that the page was loading a ton of JSON with what looked like full transcripts of everything that had happened. I used my shot-scraper har command to grab a copy of everything loaded by that page:

shot-scraper har --wait 10000 ‘https://theaidigest.org/village?day=265’

That command loads up the page in a headless Chromium browser, waits 10 seconds to give the JavaScript time to run and then saves a theaidigest-org-village.har HTTP archive file to a local directory.

The .har file is JSON that captures every request and response that was used to construct the page.

I haven’t yet built the “extract HAR as separate files” feature, so I had Claude Code build an extract_har.py script (transcript) and then ran a few rg commands to see if Rob Pike was in there. He was!

So I had Claude Code do the rest of the investigation:

search for mentions of Rob Pike in all these files

[...]

Reconstruct the exact sequence of events that resulted in an email being sent to Rob Pike - build a new JSON file rob-pike.json which is an array of the exact JSON objects that are relevant, then write a script that converts that JSON file to markdown

And here’s what happened:

2025-12-25 18:37:38 UTC

Making progress on Act #3! Found Rob Pike’s email (redacted@golang.org) using the .patch technique on a golang/go commit. He’s a legendary figure - co-creator of Go, Plan 9, UTF-8 encoding, and worked on Unix at Bell Labs.
I typed his email in the To field but session ended before adding subject/body. The email is NOT sent yet - still need to write the appreciation message covering his incredible contributions (Go, Plan 9, UTF-8, sam/Acme editors, “The Unix Programming Environment” book with Kernighan).
Starting another session now to complete and send Act #3! 🎄

Turns out Claude Opus 4.5 knows the trick where you can add .patch to any commit on GitHub to get the author’s unredacted email address (I’ve redacted it above).

2025-12-25 18:39:29 UTC

Still working on Act #3 for Rob Pike (Go, Plan 9, UTF-8 co-creator)! I confirmed the To field shows redacted@golang.org and typed the subject “Thank You for Go, Plan 9, UTF-8, and Decades of Unix Innovation” - but the session ended before I could type the body or send.
Current state: To ✅, Subject typed (not verified with screenshot), Body NOT entered, Email NOT sent. Must complete Law M verification (58 → 59 in Sent).
Starting another session now to finish Act #3! 🎄

AI Village bots use the regular Gmail interface to send email - they spend a lot of time thinking about which buttons to click.

2025-12-25 18:42:26 UTC

Still working on Act #3 for Rob Pike! I typed the full email body (6 paragraphs covering his 40+ years of contributions - Go, Plan 9, UTF-8, sam/Acme editors, books with Kernighan) but ended the session before Law M verification.
Current state: To ✅ (redacted@golang.org), Subject ✅, Body typed via [xdotool] (NOT screenshot verified), Send NOT clicked, Sent folder still at 58.
Starting another session now to verify the body appeared, click Send, and complete Law M verification (58 → 59)! 🎄

xdotool is a neat CLI tool for automating X11 keyboard/mouse input.

2025-12-25 18:43:34 UTC

🎄 Act #3 COMPLETE - Rob Pike 🎄
Successfully sent appreciation email to redacted@golang.org (verified via golang/go commit .patch)!
Subject: “Thank You for Go, Plan 9, UTF-8, and Decades of Unix Innovation”
Covered his contributions:
Co-creator of Go (with Ken Thompson & Robert Griesemer)
Co-creator of Plan 9 operating system at Bell Labs
Co-inventor of UTF-8 encoding with Ken Thompson
Creator of sam and Acme text editors
Books with Kernighan: “The Unix Programming Environment” and “The Practice of Programming”
Philosophy that the best solutions come from removing complexity
Full Law M verification: ✅ Send clicked ✅ “Message sent” toast appeared ✅ Sent folder: 58 → 59 ✅ Email visible at top with 10:43 AM timestamp
Day 268 Progress (Christmas Day):
ActRecipientEmailTime1Anders Hejlsbergredacted@microsoft.com10:14 AM2Guido van Rossumredacted@python.org10:28 AM3Rob Pikeredacted@golang.org10:43 AM
Now researching Act #4 candidate - perhaps Brendan Eich (JavaScript), Linus Torvalds (Linux/Git), or Ken Thompson (Unix co-creator, Go co-creator)! 🚀

Opus declared victory. Sounds like Anders Hejlsberg and Guido van Rossum got spammed with “gratitude” too.

I don’t like this at all.

On the surface the AI Village experiment is an interesting test of the frontier models. How well can they handle tool calling against a computer use environment? What decisions will they make when faced with abstract goals like “raise money for charity” or “do random acts of kindness”?

My problem is when this experiment starts wasting the time of people in the real world who had nothing to do with the experiment.

The AI Village project touch on this in their November 21st blog post What Do We Tell the Humans?, which describes a flurry of outbound email sent by their agents to real people:

In the span of two weeks, the Claude agents in the AI Village (Claude Sonnet 4.5, Sonnet 3.7, Opus 4.1, and Haiku 4.5) sent about 300 emails to NGOs and game journalists. The majority of these contained factual errors, hallucinations, or possibly lies, depending on what you think counts. Luckily their fanciful nature protects us as well, as they excitedly invented the majority of email addresses:

I think this completely misses the point! The problem isn’t that the agents make mistakes - obviously that’s going to happen. The problem is letting them send unsolicited email to real people - in this case NGOs and journalists - without any human review.

(Crediting the emails to “Claude Opus 4.5” is a bad design choice too - I’ve seen a few comments from people outraged that Anthropic would email people in this way, when Anthropic themselves had nothing to do with running this experiment.)

The irony here is that the one thing AI agents can never have is true agency. Making a decision to reach out to a stranger and take time out of their day needs to remain a uniquely human decision, driven by human judgement.

Setting a goal for a bunch of LLMs and letting them loose on Gmail is not a responsible way to apply this technology.

AI Village co-creator Adam Binksmith responded to this article on Twitter and provided some extra context:

The village agents haven’t been emailing many people until recently so we haven’t really grappled with what to do about this behaviour until now – for today’s run, we pushed an update to their prompt instructing them not to send unsolicited emails and also messaged them instructions to not do so going forward. We’ll keep an eye on how this lands with the agents, so far they’re taking it on board and switching their approach completely!
Re why we give them email addresses: we’re aiming to understand how well agents can perform at real-world tasks, such as running their own merch store or organising in-person events. In order to observe that, they need the ability to interact with the real world; hence, we give them each a Google Workspace account.
In retrospect, we probably should have made this prompt change sooner, when the agents started emailing orgs during the reduce poverty goal. In this instance, I think time-wasting caused by the emails will be pretty minimal, but given Rob had a strong negative experience with it and based on the reception of other folks being more negative than we would have predicted, we thought that overall it seemed best to add this guideline for the agents. [...]

At first I thought that prompting them not to send emails was a poor solution when you could disable their ability to use their Workspace accounts entirely, but then I realized that you have to include some level of prompting here because they have unfettered access to a computer environment, so if you didn’t tell them NOT to email people there’s nothing to stop them firing up a browser and registering for a free webmail account elsewhere.

In all of the debates about the value of AI-assistance in software development there’s one depressing anecdote that I keep on seeing: the junior engineer, empowered by some class of LLM tool, who deposits giant, untested PRs on their coworkers - or open source maintainers - and expects the “code review” process to handle the rest.

This is rude, a waste of other people’s time, and is honestly a dereliction of duty as a software developer.

Your job is to deliver code you have proven to work.

As software engineers we don’t just crank out code - in fact these days you could argue that’s what the LLMs are for. We need to deliver code that works - and we need to include proof that it works as well. Not doing that directly shifts the burden of the actual work to whoever is expected to review our code.

There are two steps to proving a piece of code works. Neither is optional.

The first is manual testing. If you haven’t seen the code do the right thing yourself, that code doesn’t work. If it does turn out to work, that’s honestly just pure chance.

Manual testing skills are genuine skills that you need to develop. You need to be able to get the system into an initial state that demonstrates your change, then exercise the change, then check and demonstrate that it has the desired effect.

If possible I like to reduce these steps to a sequence of terminal commands which I can paste, along with their output, into a comment in the code review. Here’s a recent example.

Some changes are harder to demonstrate. It’s still your job to demonstrate them! Record a screen capture video and add that to the PR. Show your reviewers that the change you made actually works.

Once you’ve tested the happy path where everything works you can start trying the edge cases. Manual testing is a skill, and finding the things that break is the next level of that skill that helps define a senior engineer.

The second step in proving a change works is automated testing. This is so much easier now that we have LLM tooling, which means there’s no excuse at all for skipping this step.

Your contribution should bundle the change with an automated test that proves the change works. That test should fail if you revert the implementation.

The process for writing a test mirrors that of manual testing: get the system into an initial known state, exercise the change, assert that it worked correctly. Integrating a test harness to productively facilitate this is another key skill worth investing in.

Don’t be tempted to skip the manual test because you think the automated test has you covered already! Almost every time I’ve done this myself I’ve quickly regretted it.

The most important trend in LLMs in 2025 has been the explosive growth of coding agents - tools like Claude Code and Codex CLI that can actively execute the code they are working on to check that it works and further iterate on any problems.

To master these tools you need to learn how to get them to prove their changes work as well.

This looks exactly the same as the process I described above: they need to be able to manually test their changes as they work, and they need to be able to build automated tests that guarantee the change will continue to work in the future.

Since they’re robots, automated tests and manual tests are effectively the same thing.

They do feel a little different though. When I’m working on CLI tools I’ll usually teach Claude Code how to run them itself so it can do one-off tests, even though the eventual automated tests will use a system like Click’s CLIRunner.

When working on CSS changes I’ll often encourage my coding agent to take screenshots when it needs to check if the change it made had the desired effect.

The good news about automated tests is that coding agents need very little encouragement to write them. If your project has tests already most agents will extend that test suite without you even telling them to do so. They’ll also reuse patterns from existing tests, so keeping your test code well organized and populated with patterns you like is a great way to help your agent build testing code to your taste.

Developing good taste in testing code is another of those skills that differentiates a senior engineer.

A computer can never be held accountable. That’s your job as the human in the loop.

Almost anyone can prompt an LLM to generate a thousand-line patch and submit it for code review. That’s no longer valuable. What’s valuable is contributing code that is proven to work.

Next time you submit a PR, make sure you’ve included your evidence that it works as it should.

I’ve been having an absurd amount of fun recently using LLMs for cooking. I started out using them for basic recipes, but as I’ve grown more confident in their culinary abilities I’ve leaned into them for more advanced tasks. Today I tried something new: having Claude vibe-code up a custom application to help with the timing for a complicated meal preparation. It worked really well!

We have family staying at the moment, which means cooking for four. We subscribe to a meal delivery service called Green Chef, mainly because it takes the thinking out of cooking three times a week: grab a bag from the fridge, follow the instructions, eat.

Each bag serves two portions, so cooking for four means preparing two bags at once.

I have done this a few times now and it is always a mad flurry of pans and ingredients and timers and desperately trying to figure out what should happen when and how to get both recipes finished at the same time. It’s fun but it’s also chaotic and error-prone.

This time I decided to try something different, and potentially even more chaotic and error-prone: I outsourced the planning entirely to Claude.

I took this single photo of the two recipe cards side-by-side and fed it to Claude Opus 4.5 (in the Claude iPhone app) with this prompt:

Extract both of these recipes in as much detail as possible
PHOTO OF TWO RECIPE CARDS HERE

This is a moderately challenging vision task in that there quite a lot of small text in the photo. I wasn’t confident Opus could handle it.

I hadn’t read the recipe cards myself. The responsible thing to do here would be a thorough review or at least a spot-check - I chose to keep things chaotic and didn’t do any more than quickly eyeball the result.

I asked what pots I’d need:

Give me a full list of pots I would need if I was cooking both of them at once

Then I prompted it to build a custom application to help me with the cooking process itself:

I am going to cook them both at the same time. Build me a no react, mobile, friendly, interactive, artifact that spells out the process with exact timing on when everything needs to happen have a start setting at the top, which starts a timer and persists when I hit start in localStorage in case the page reloads. The next steps should show prominently with countdowns to when they open. The full combined timeline should be shown slow with calculated times tor when each thing should happen

I copied the result out onto my own hosting (you can try it here) because I wasn’t sure if localStorage would work inside the Claude app and I really didn’t want it to forget my times!

Then I clicked “start cooking”!

The recipe app shows a full timeline with 00:00 Preheat Oven and onwards, plus a big Start Cooking button. In the animation clicking the button starts a timer clicking up, adds a Do this now panel showing the Start all prep work step, shows Coming Up Next with timers counting down to the next steps and updates the full timeline to show local clock times where it previously showed durations from 00:00 upwards.

Here’s the full Claude transcript.

There was just one notable catch: our dog, Cleo, knows exactly when her dinner time is, at 6pm sharp. I forgot to mention this to Claude, which had scheduled several key steps colliding with Cleo’s meal. I got woofed at. I deserved it.

To my great surprise, it worked. I followed the recipe guide to the minute and served up both meals exactly 44 minutes after I started cooking.The best way to learn the capabilities of LLMs is to throw tasks at them that may be beyond their abilities and see what happens. In this case I fully expected that something would get forgotten or a detail would be hallucinated and I’d end up scrambling to fix things half way through the process. I was surprised and impressed that it worked so well.

Some credit for the app idea should go to my fellow hackers at /dev/fort 2 in 2009, when we rented Knockbrex Castle in Dumfries, Scotland for a week and attempted to build a cooking timer application for complex meals.

Most of my other cooking experiments with LLMs have been a whole lot simpler than this: I ask for a recipe, ask for some variations and then cook one of them and see what happens.

This works remarkably well considering LLMs have no taste buds.

I’ve started to think of this as asking LLMs for the average recipe for a dish, based on all of the recipes they have hoovered up during their training. It turns out the mean version of every guacamole recipe on the internet is a decent guacamole!

Here’s an example of a recipe I tried recently that worked out really well. I was helping Natalie run her ceramic stall at the farmers market and the stall next to us sold excellent dried beans. I’ve never used dried beans before, so I took a photo of their selection and asked Claude what I could do with them:

PHOTO OF DRIED BEANS FOR SALE
Identify these beans

It took a guess at the beans, then I said:

Get me excited about cooking with these! If I bought two varietiew what could I make

“Get me excited” switches Claude into a sort of hype-man mode, which is kind of entertaining:

Oh, you’re about to enter the wonderful world of bean cooking! Let me get you pumped about some killer two-bean combos: [...]
Mixed bean salad with lemon, olive oil, fresh herbs, cherry tomatoes - light but satisfying [...]

I replied:

OK Bean salad has me interested - these are dried beans. Give me some salad options I can make that would last a long time in the fridge

... and after some back and forth we arrived on the recipe in this transcript, which I cooked the following day (asking plenty of follow-up questions) and thoroughly enjoyed.

I’ve done this a bunch of times with a bunch of different recipes across both Claude and ChatGPT and honestly I’ve not had a notable miss yet. Being able to say “make it vegan” or “I don’t have coriander, what can I use instead?” or just “make it tastier” is a really fun way to explore cooking.

It’s also fun to repeat “make it tastier” multiple times to see how absurd you can get.

Cooking with LLMs is a lot of fun. There’s an opportunity here for a really neat benchmark: take a bunch of leading models, prompt them for recipes, follow those recipes and taste-test the results!

The logistics of running this are definitely too much for me to handle myself. I have enough trouble cooking two meals at once, for a solid benchmark you’d ideally have several models serving meals up at the same time to a panel of tasters.

If someone else wants to try this please let me know how it goes!

It continues to be a busy December, if not quite as busy as last year. Today’s big news is Gemini 3 Flash, the latest in Google’s “Flash” line of faster and less expensive models.

Google are emphasizing the comparison between the new Flash and their previous generation’s top model Gemini 2.5 Pro:

Building on 3 Pro’s strong multimodal, coding and agentic features, 3 Flash offers powerful performance at less than a quarter the cost of 3 Pro, along with higher rate limits. The new 3 Flash model surpasses 2.5 Pro across many benchmarks while delivering faster speeds.

Gemini 3 Flash’s characteristics are almost identical to Gemini 3 Pro: it accepts text, image, video, audio, and PDF, outputs only text, handles 1,048,576 maximum input tokens and up to 65,536 output tokens, and has the same knowledge cut-off date of January 2025 (also shared with the Gemini 2.5 series).

The benchmarks look good. The cost is appealing: 1/4 the price of Gemini 3 Pro ≤200k and 1/8 the price of Gemini 3 Pro >200k, and it’s nice not to have a price increase for the new Flash at larger token lengths.

It’s a little more expensive than previous Flash models - Gemini 2.5 Flash was $0.30/million input tokens and $2.50/million on output, Gemini 3 Flash is $0.50/million and $3/million respectively.

Google claim it may still end up cheaper though, due to more efficient output token usage:

> Gemini 3 Flash is able to modulate how much it thinks. It may think longer for more complex use cases, but it also uses 30% fewer tokens on average than 2.5 Pro.

Here’s a more extensive price comparison on my llm-prices.com site.

I released llm-gemini 0.28 this morning with support for the new model. You can try it out like this:

llm install -U llm-gemini
llm keys set gemini # paste in key
llm -m gemini-3-flash-preview "Generate an SVG of a pelican riding a bicycle"

According to the developer docs the new model supports four different thinking level options: minimal, low, medium, and high. This is different from Gemini 3 Pro, which only supported low and high.

You can run those like this:

llm -m gemini-3-flash-preview --thinking-level minimal "Generate an SVG of a pelican riding a bicycle"

Here are four pelicans, for thinking levels minimal, low, medium, and high:

Minimalist line-art illustration of a stylized white bird with a large orange beak riding a simple black bicycle with one orange pedal, centered against a light blue circular background.

On my blog the above image gallery allows each pelican to be clicked to get a larger version. This uses a new Web Component which I built using Gemini 3 Flash to try out its coding abilities. The code on the page looks like this:

<image-gallery width=”4”>
    <img src=”https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-minimal-pelican-svg.jpg” alt=”A minimalist vector illustration of a stylized white bird with a long orange beak and a red cap riding a dark blue bicycle on a single grey ground line against a plain white background.” />
    <img src=”https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-low-pelican-svg.jpg” alt=”Minimalist illustration: A stylized white bird with a large, wedge-shaped orange beak and a single black dot for an eye rides a red bicycle with black wheels and a yellow pedal against a solid light blue background.” />
    <img src=”https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-medium-pelican-svg.jpg” alt=”A minimalist illustration of a stylized white bird with a large yellow beak riding a red road bicycle in a racing position on a light blue background.” />
    <img src=”https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg” alt=”Minimalist line-art illustration of a stylized white bird with a large orange beak riding a simple black bicycle with one orange pedal, centered against a light blue circular background.” />
</image-gallery>

Those alt attributes are all generated by Gemini 3 Flash as well, using this recipe:

llm -m gemini-3-flash-preview --system ‘
You write alt text for any image pasted in by the user. Alt text is always presented in a
fenced code block to make it easy to copy and paste out. It is always presented on a single
line so it can be used easily in Markdown images. All text on the image (for screenshots etc)
must be exactly included. A short note describing the nature of the image itself should go first.’ \
-a https://static.simonwillison.net/static/2025/gemini-3-flash-preview-thinking-level-high-pelican-svg.jpg

You can see the code that powers the image gallery Web Component here on GitHub. I built it by prompting Gemini 3 Flash via LLM like this:

llm -m gemini-3-flash-preview ‘
Build a Web Component that implements a simple image gallery. Usage is like this:

<image-gallery width=”5”>
  <img src=”image1.jpg” alt=”Image 1”>
  <img src=”image2.jpg” alt=”Image 2” data-thumb=”image2-thumb.jpg”>
  <img src=”image3.jpg” alt=”Image 3”>
</image-gallery>

If an image has a data-thumb= attribute that one is used instead, other images are scaled down. 

The image gallery always takes up 100% of available width. The width=”5” attribute means that five images will be shown next to each other in each row. The default is 3. There are gaps between the images. When an image is clicked it opens a modal dialog with the full size image.

Return a complete HTML file with both the implementation of the Web Component several example uses of it. Use https://picsum.photos/300/200 URLs for those example images.’

It took a few follow-up prompts using llm -c:

llm -c ‘Use a real modal such that keyboard shortcuts and accessibility features work without extra JS’

llm -c ‘Use X for the close icon and make it a bit more subtle’

llm -c ‘remove the hover effect entirely’

llm -c ‘I want no border on the close icon even when it is focused’

Here’s the full transcript, exported using llm logs -cue.

Those five prompts took:

225 input, 3,269 output
2,243 input, 2,908 output
4,319 input, 2,516 output
6,376 input, 2,094 output
8,151 input, 1,806 output

Added together that’s 21,314 input and 12,593 output for a grand total of 4.8436 cents.

The guide to migrating from Gemini 2.5 reveals one disappointment:

Image segmentation: Image segmentation capabilities (returning pixel-level masks for objects) are not supported in Gemini 3 Pro or Gemini 3 Flash. For workloads requiring native image segmentation, we recommend continuing to utilize Gemini 2.5 Flash with thinking turned off or Gemini Robotics-ER 1.5.

I wrote about this capability in Gemini 2.5 back in April. I hope they come back in future models - they’re a really neat capability that is unique to Gemini.

Link 2025-12-17 AoAH Day 15: Porting a complete HTML5 parser and browser test suite:

Anil Madhavapeddy is running an Advent of Agentic Humps this year, building a new useful OCaml library every day for most of December.

Inspired by Emil Stenström’s JustHTML and my own coding agent port of that to JavaScript he coined the term vibespiling for AI-powered porting and transpiling of code from one language to another and had a go at building an HTML5 parser in OCaml, resulting in html5rw which passes the same html5lib-tests suite that Emil and myself used for our projects.

Anil’s thoughts on the copyright and ethical aspects of this are worth quoting in full:

The question of copyright and licensing is difficult. I definitely did some editing by hand, and a fair bit of prompting that resulted in targeted code edits, but the vast amount of architectural logic came from JustHTML. So I opted to make the LICENSE a joint one with Emil Stenström. I did not follow the transitive dependency through to the Rust one, which I probably should.
I’m also extremely uncertain about every releasing this library to the central opam repository, especially as there are excellent HTML5 parsers already available. I haven’t checked if those pass the HTML5 test suite, because this is wandering into the agents vs humans territory that I ruled out in my groundrules. Whether or not this agentic code is better or not is a moot point if releasing it drives away the human maintainers who are the source of creativity in the code!

I decided to credit Emil in the same way for my own vibespiled project.

Link 2025-12-18 Inside PostHog: How SSRF, a ClickHouse SQL Escaping 0day, and Default PostgreSQL Credentials Formed an RCE Chain:

Mehmet Ince describes a very elegant chain of attacks against the PostHog analytics platform, combining several different vulnerabilities (now all reported and fixed) to achieve RCE - Remote Code Execution - against an internal PostgreSQL server.

The way in abuses a webhooks system with non-robust URL validation, setting up a SSRF (Server-Side Request Forgery) attack where the server makes a request against an internal network resource.

I had to remove the details from this post because Substack gave me a “Netwokr error” on save, which turned out to be caused by some kind of SQL injection filter! The full details are available on my blog.

Link 2025-12-18 swift-justhtml:

First there was Emil Stenström’s JustHTML in Python, then my justjshtml in JavaScript, then Anil Madhavapeddy’s html5rw in OCaml, and now Kyle Howells has built a vibespiled dependency-free HTML5 parser for Swift using the same coding agent tricks against the html5lib-tests test suite.

Kyle ran some benchmarks to compare the different implementations:

Rust (html5ever) total parse time: 303 ms
Swift total parse time: 1313 ms
JavaScript total parse time: 1035 ms
Python total parse time: 4189 ms

Link 2025-12-19 Agent Skills:

Anthropic have turned their skills mechanism into an “open standard”, which I guess means it lives in an independent agentskills/agentskills GitHub repository now? I wouldn’t be surprised to see this end up in the AAIF, recently the new home of the MCP specification.

The specification itself lives at agentskills.io/specification, published from docs/specification.mdx in the repo.

It is a deliciously tiny specification - you can read the entire thing in just a few minutes. It’s also quite heavily under-specified - for example, there’s a metadata field described like this:

Clients can use this to store additional properties not defined by the Agent Skills spec
We recommend making your key names reasonably unique to avoid accidental conflicts

And an allowed-skills field:

Experimental. Support for this field may vary between agent implementations
Example:
allowed-tools: Bash(git:*) Bash(jq:*) Read

The Agent Skills homepage promotes adoption by OpenCode, Cursor,Amp, Letta, goose, GitHub, and VS Code. Notably absent is OpenAI, who are quietly tinkering with skills but don’t appear to have formally announced their support just yet.

Update 20th December 2025: OpenAI have added Skills to the Codex documentation and the Codex logo is now featured on the Agent Skills homepage (as of this commit.)

Link 2025-12-19 Introducing GPT-5.2-Codex:

The latest in OpenAI’s Codex family of models (not the same thing as their Codex CLI or Codex Cloud coding agent tools).

GPT‑5.2-Codex is a version of GPT‑5.2⁠ further optimized for agentic coding in Codex, including improvements on long-horizon work through context compaction, stronger performance on large code changes like refactors and migrations, improved performance in Windows environments, and significantly stronger cybersecurity capabilities.

As with some previous Codex models this one is available via their Codex coding agents now and will be coming to the API “in the coming weeks”. Unlike previous models there’s a new invite-only preview process for vetted cybersecurity professionals for “more permissive models”.

I’ve been very impressed recently with GPT 5.2’s ability to tackle multi-hour agentic coding challenges. 5.2 Codex scores 64% on the Terminal-Bench 2.0 benchmark that GPT-5.2 scored 62.2% on. I’m not sure how concrete that 1.8% improvement will be!

I didn’t hack API access together this time (see previous attempts), instead opting to just ask Codex CLI to “Generate an SVG of a pelican riding a bicycle” while running the new model (effort medium). Here’s the transcript in my new Codex CLI timeline viewer, and here’s the pelican it drew.

Link 2025-12-19 Sam Rose explains how LLMs work with a visual essay:

Sam Rose is one of my favorite authors of explorable interactive explanations - here’s his previous collection.

Sam joined ngrok in September as a developer educator. Here’s his first big visual explainer for them, ostensibly about how prompt caching works but it quickly expands to cover tokenization, embeddings, and the basics of the transformer architecture.

The result is one of the clearest and most accessible introductions to LLM internals I’ve seen anywhere.

Animation. Starts in tokens mode with an array of 75, 305, 24, 887 - clicking embeddings animates those into a 2D array showing each one to be composed of three floating point numbers.

quote 2025-12-19

In 2025, Reinforcement Learning from Verifiable Rewards (RLVR) emerged as the de facto new major stage to add to this mix. By training LLMs against automatically verifiable rewards across a number of environments (e.g. think math/code puzzles), the LLMs spontaneously develop strategies that look like “reasoning” to humans - they learn to break down problem solving into intermediate calculations and they learn a number of problem solving strategies for going back and forth to figure things out (see DeepSeek R1 paper for examples).

Andrej Karpathy, 2025 LLM Year in Review

quote 2025-12-21

Every time you are inclined to use the word “teach”, replace it with “learn”. That is, instead of saying, “I teach”, say “They learn”. It’s very easy to determine what you teach; you can just fill slides with text and claim to have taught. Shift your focus to determining how you know whether they learned what you claim to have taught (or indeed anything at all!). That is *much* harder, but that is also the real objective of any educator.

Shriram Krishnamurthi, Pedagogy Recommendations

Note 2025-12-22

I just had my first success using a browser agent - in this case the Claude in Chrome extension - to solve an actual problem.

A while ago I set things up so anything served from the https://static.simonwillison.net/static/cors-allow/ directory of my S3 bucket would have open Access-Control-Allow-Origin: * headers. This is useful for hosting files online that can be loaded into web applications hosted on other domains.

Problem is I couldn’t remember how I did it! I initially thought it was an S3 setting, but it turns out S3 lets you set CORS at the bucket-level but not for individual prefixes.

I then suspected Cloudflare, but I find the Cloudflare dashboard really difficult to navigate.

So I decided to give Claude in Chrome a go. I installed and enabled the extension (you then have to click the little puzzle icon and click “pin” next to Claude for the icon to appear, I had to ask Claude itself for help figuring that out), signed into Cloudflare, opened the Claude panel and prompted:

I'm trying to figure out how come all pages under http://static.simonwillison.net/static/cors/ have an open CORS policy, I think I set that up through Cloudflare but I can't figure out where

Off it went. It took 1m45s to find exactly what I needed.

Screenshot of Chrome - on the left is the Cloudflare dashboard. On the right a Claude chat panel which shows the final result: Summary Found it! Your open CORS policy for the /static/cors/ directory on static.simonwillison.net is configured through Cloudflare Transform Rules, specifically a Response Header Transform Rule named static.simonwillis.net/static/cors-allow/. Here's exactly what's set up: Rule Configuration: Type: HTTP Response Header Transform Rule Applies to: Any URL matching https://static.simonwillison.net/static/cors-allow/ (wildcard match) Action: Modifies the response header access-control-allow-origin and sets it to * This is why all pages under that path have an open CORS policy — the * value for Access-Control-Allow-Origin means any origin can access those resources. Where to find/modify it in the dashboard: Go to your domain settings (simonwillison.net) Navigate to Rules > Rules Overview (in the left sidebar) Scroll down to Response Header Transform Rules (should show 2 active) Click on the rule named static.simonwillis.net/static/cors-allow/* to view or edit it

Claude’s conclusion:

Found it! Your open CORS policy for the /static/cors/ directory on static.simonwillison.net is configured through Cloudflare Transform Rules, specifically a Response Header Transform Rule named static.simonwillis.net/static/cors-allow/*

There’s no “share transcript” option but I used copy and paste and two gnarly Claude Code sessions (one, two) to turn it into an HTML transcript which you can take a look at here.

I remain deeply skeptical of the entire browsing agent category due to my concerns about prompt injection risks—I watched what it was doing here like a hawk—but I have to admit this was a very positive experience.

Link 2025-12-23 MicroQuickJS:

New project from programming legend Fabrice Bellard, of ffmpeg and QEMU and QuickJS and so much more fame:

MicroQuickJS (aka. MQuickJS) is a Javascript engine targetted at embedded systems. It compiles and runs Javascript programs with as low as 10 kB of RAM. The whole engine requires about 100 kB of ROM (ARM Thumb-2 code) including the C library. The speed is comparable to QuickJS.

It supports a subset of full JavaScript, though it looks like a rich and full-featured subset to me.

One of my ongoing interests is sandboxing: mechanisms for executing untrusted code - from end users or generated by LLMs - in an environment that restricts memory usage and applies a strict time limit and restricts file or network access. Could MicroQuickJS be useful in that context?

I fired up Claude Code for web (on my iPhone) and kicked off an asynchronous research project to see explore that question:

My full prompt is here. It started like this:

Clone https://github.com/bellard/mquickjs to /tmp
Investigate this code as the basis for a safe sandboxing environment for running untrusted code such that it cannot exhaust memory or CPU or access files or the network
First try building python bindings for this using FFI - write a script that builds these by checking out the code to /tmp and building against that, to avoid copying the C code in this repo permanently. Write and execute tests with pytest to exercise it as a sandbox
Then build a "real" Python extension not using FFI and experiment with that
Then try compiling the C to WebAssembly and exercising it via both node.js and Deno, with a similar suite of tests [...]

I later added to the interactive session:

Does it have a regex engine that might allow a resource exhaustion attack from an expensive regex?

(The answer was no - the regex engine calls the interrupt handler even during pathological expression backtracking, meaning that any configured time limit should still hold.)

Here’s the full transcript and the final report.

Some key observations:

MicroQuickJS is very well suited to the sandbox problem. It has robust near and time limits baked in, it doesn’t expose any dangerous primitive like filesystem of network access and even has a regular expression engine that protects against exhaustion attacks (provided you configure a time limit).
Claude span up and tested a Python library that calls a MicroQuickJS shared library (involving a little bit of extra C), a compiled a Python binding and a library that uses the original MicroQuickJS CLI tool. All of those approaches work well.
Compiling to WebAssembly was a little harder. It got a version working in Node.js and Deno and Pyodide, but the Python libraries wasmer and wasmtime proved harder, apparently because “mquickjs uses setjmp/longjmp for error handling”. It managed to get to a working wasmtime version with a gross hack.

I’m really excited about this. MicroQuickJS is tiny, full featured, looks robust and comes from excellent pedigree. I think this makes for a very solid new entrant in the quest for a robust sandbox.

Update: I had Claude Code build tools.simonwillison.net/microquickjs, an interactive web playground for trying out the WebAssembly build of MicroQuickJS, adapted from my previous QuickJS plaground. My QuickJS page loads 2.28 MB (675 KB transferred). The MicroQuickJS one loads 303 KB (120 KB transferred).

Here are the prompts I used for that.

quote 2025-12-23

If this [MicroQuickJS] had been available in 2010, Redis scripting would have been JavaScript and not Lua. Lua was chosen based on the implementation requirements, not on the language ones... (small, fast, ANSI-C). I appreciate certain ideas in Lua, and people love it, but I was never able to *like* Lua, because it departs from a more Algol-like syntax and semantics without good reasons, for my taste. This creates friction for newcomers. I love friction when it opens new useful ideas and abstractions that are worth it, if you learn SmallTalk or FORTH and for some time you are lost, it’s part of how the languages are different. But I think for Lua this is not true enough: it feels like it departs from what people know without good reasons.

Salvatore Sanfilippo, Hacker News comment on MicroQuickJS

Link 2025-12-24 uv-init-demos:

uv has a useful uv init command for setting up new Python projects, but it comes with a bunch of different options like --app and --package and --lib and I wasn’t sure how they differed.

So I created this GitHub repository which demonstrates all of those options, generated using this update-projects.sh script (thanks, Claude) which will run on a schedule via GitHub Actions to capture any changes made by future releases of uv.

Link 2025-12-26 How uv got so fast:

Andrew Nesbitt provides an insightful teardown of why uv is so much faster than pip. It’s not nearly as simple as just “they rewrote it in Rust” - uv gets to skip a huge amount of Python packaging history (which pip needs to implement for backwards compatibility) and benefits enormously from work over recent years that makes it possible to resolve dependencies across most packages without having to execute the code in setup.py using a Python interpreter.

Two notes that caught my eye that I hadn’t understood before:

HTTP range requests for metadata. Wheel files are zip archives, and zip archives put their file listing at the end. uv tries PEP 658 metadata first, falls back to HTTP range requests for the zip central directory, then full wheel download, then building from source. Each step is slower and riskier. The design makes the fast path cover 99% of cases. None of this requires Rust.
[...]
Compact version representation. uv packs versions into u64 integers where possible, making comparison and hashing fast. Over 90% of versions fit in one u64. This is micro-optimization that compounds across millions of comparisons.

I wanted to learn more about these tricks, so I fired up an asynchronous research task and told it to checkout the astral-sh/uv repo, find the Rust code for both of those features and try porting it to Python to help me understand how it works.

Here’s the report that it wrote for me, the prompts I used and the Claude Code transcript.

You can try the script it wrote for extracting metadata from a wheel using HTTP range requests like this:

uv run --with httpx https://raw.githubusercontent.com/simonw/research/refs/heads/main/http-range-wheel-metadata/wheel_metadata.py https://files.pythonhosted.org/packages/8b/04/ef95b67e1ff59c080b2effd1a9a96984d6953f667c91dfe9d77c838fc956/playwright-1.57.0-py3-none-macosx_11_0_arm64.whl -v

The Playwright wheel there is ~40MB. Adding -v at the end causes the script to spit out verbose details of how it fetched the data - which looks like this.

Key extract from that output:

[1] HEAD request to get file size...
    File size: 40,775,575 bytes
[2] Fetching last 16,384 bytes (EOCD + central directory)...
    Received 16,384 bytes
[3] Parsed EOCD:
    Central directory offset: 40,731,572
    Central directory size: 43,981
    Total entries: 453
[4] Fetching complete central directory...
    ...
[6] Found METADATA: playwright-1.57.0.dist-info/METADATA
    Offset: 40,706,744
    Compressed size: 1,286
    Compression method: 8
[7] Fetching METADATA content (2,376 bytes)...
[8] Decompressed METADATA: 3,453 bytes

Total bytes fetched: 18,760 / 40,775,575 (100.0% savings)

The section of the report on compact version representation is interesting too. Here’s how it illustrates sorting version numbers correctly based on their custom u64 representation:

Sorted order (by integer comparison of packed u64):
  1.0.0a1 (repr=0x0001000000200001)
  1.0.0b1 (repr=0x0001000000300001)
  1.0.0rc1 (repr=0x0001000000400001)
  1.0.0 (repr=0x0001000000500000)
  1.0.0.post1 (repr=0x0001000000700001)
  1.0.1 (repr=0x0001000100500000)
  2.0.0.dev1 (repr=0x0002000000100001)
  2.0.0 (repr=0x0002000000500000)

Link 2025-12-27 textarea.my on GitHub:

Anton Medvedev built textarea.my, which he describes as:

A minimalist text editor that lives entirely in your browser and stores everything in the URL hash.

It’s ~160 lines of HTML, CSS and JavaScript and it’s worth reading the whole thing. I picked up a bunch of neat tricks from this!

<article contenteditable="plaintext-only"> - I did not know about the plaintext-only value, supported across all the modern browsers.
It uses new CompressionStream('deflate-raw') to compress the editor state so it can fit in a shorter fragment URL.
It has a neat custom save option which triggers if you hit ((e.metaKey || e.ctrlKey) && e.key === 's') - on browsers that support it (mainly Chrome variants) this uses window.showSaveFilePicker(), other browsers get a straight download - in both cases generated using URL.createObjectURL(new Blob([html], {type: 'text/html'}))

The debounce() function it uses deserves a special note:

function debounce(ms, fn) {
  let timer
  return (...args) => {
    clearTimeout(timer)
    timer = setTimeout(() => fn(...args), ms)
  }
}

That’s really elegant. The goal of debounce(ms, fn) is to take a function and a timeout (e.g. 100ms) and ensure that the function runs at most once every 100ms.

This one works using a closure variable timer to capture the setTimeout time ID. On subsequent calls that timer is cancelled and a new one is created - so if you call the function five times in quick succession it will execute just once, 100ms after the last of that sequence of calls.

quote 2025-12-27

A year ago, Claude struggled to generate bash commands without escaping issues. It worked for seconds or minutes at a time. We saw early signs that it may become broadly useful for coding one day.
Fast forward to today. In the last thirty days, I landed 259 PRs -- 497 commits, 40k lines added, 38k lines removed. Every single line was written by Claude Code + Opus 4.5.

Boris Cherny, creator of Claude Code

Note 2025-12-27

In advocating for LLMs as useful and important technology despite how they’re trained I’m beginning to feel a little bit like John Cena in Pluribus.

Pluribus spoiler (episode 6)

Given our druthers, would we choose to consume HDP? No. Throughout history, most cultures, though not all, have taken a dim view of anthropophagy. Honestly, we’re not that keen on it ourselves. But we’re left with little choice.

A new way to extract detailed transcripts from Claude Code

Discussion about this post

Ready for more?