I asked an AI to fix my docs and all I got was this lousy Python script
I often struggle with writing catchy titles for blog posts, so on a whim I decided recently to ask Gemini to do it, based on the contents of a post I’d written.
We argued back and forth for about 20 minutes while it gave me 100 variations on a single theme. None were particularly original or interesting, and about 95% of the outputs were extremely similar. Eventually it started randomly talking to me in French. That was when I decided to cut my losses and create the title myself. A bunch of time wasted, and still having to ultimately do the job myself – that felt like a pretty good analogy for all my previous experiments with LLMs.
However, since GitHub offered us a free trial of GitHub Copilot, I decided to put aside my skepticism and make a serious attempt at using the various LLMs, to assess their capabilities and see if I could get something useful out of them. I’ve had a lot of really helpful discussions on this topic with my colleague @akonev so I want to give him a shout-out in particular: he’s shared a lot of tips with me on prompt and context engineering. I also want to give a shout out to the Server team, who let me run my experiments on the Server documentation.
Thanks to @akonev, @iso-maxwell, @paelzer, @sophiepages for reviewing this post.
Contents
- Goals
- Constraints
- Setup
- Model vs. model tests
- Saving time on one-shot tasks
- Scripting and automating
- Conclusions
- The future of documentation?
Goals
- To compare various LLMs against each other, to see whether any of them outperformed the others on real documentation tasks, and to find out which are better suited to different tasks.
- To see whether I could use an agent to make improvements in the documentation quality.
- To try and make real productivity improvements. Not the fake “I saved 2 hours by delegating this one-off task to an LLM (and now my colleague has to spend 6 hours fixing it up)” productivity, but rather, making documentation processes more efficient.
Constraints
Most notably, I would not use an LLM to create documentation, or make major changes to the existing documentation. It’s not within the capability of LLMs to create good documentation where good documentation doesn’t already exist. LLMs require high quality, trusted primary sources to function well. Such trusted sources are human-made, written by experts – not LLM-generated. The Server documentation is a trusted primary source, and we don’t want that to change.
For myself, sustainability/environmental impact is one of the main concerns I have about the proliferation of AI. So, the second constraint I imposed on myself is not to use it frivolously, or to create any workflow, task, or artifact that requires continued dependence on an LLM or agent. Where possible, the outcome should include a reusable script to allow repeat actions/automation. For one-off tasks that can’t be scripted, asking the agent to “produce a prompt that would’ve resulted in the same outcome without the need for my corrections and extra suggestions” at the end of the conversation reduces the costly back-and-forth if someone else wants to achieve the same task (thanks again to @akonev for that suggestion!). All the reusable prompts, and everything useful generated by the LLMs, have been saved to this repository.
Setup
I’m a bit impatient, and don’t like copy/pasting things back and forth – I like my workflows simple and unfussy, so I used VSCode as my editor with the GitHub Copilot extension. Mostly because I’ve used VSCode before, but also because it looked like the simplest option.
After I installed the extension, I asked the agent (which, purely by chance, happened to be running on Claude Sonnet 4.5) to create the copilot-instructions.md file for the Server documentation. This step surveys the entire repository and extracts the rules that agents need to follow (and the context they need about organization, etc.). Our comprehensive contribution guide helped a lot here – Claude in particular seems to find the context this document provides extremely useful.
Model vs. model tests
Test 1: GB to US English (natural language test)
Some time ago, the decision was made to switch all documentation from GB to US English. However, in a documentation set the size of the Server docs, this was a daunting task to do by hand. Because of this, we’ve been changing things as we go along, but this led to a weird and confusing mix of GB and US English – sometimes even on the same page (or paragraph). So, asking the LLM to help us update the spellings from GB to US seemed like a good initial task to test the waters, and to see how each model behaves with the same agent.
The prompt: I didn’t use a full prompt in this case, I just asked (as a beginner, in natural language): “I am halfway through converting my documentation from GB to US English. Can you help me?”
The task was complicated a bit by the fact that our spelling exception list, which every LLM had access to, had (for quirky historical reasons) a lot of US and GB spellings on it, so none of the models did a 100% perfect job – but that wasn’t the point of the task anyway.
- 7/10: Claude Sonnet 4.5 (#413)
  It mostly did a good job. All the things it picked up were correct, but it missed a lot and required a second pass. It also randomly corrected some quotation marks without being asked (helpful, but unexpected).
- 0/10: GPT5
  Straight up just refused to do it. No idea why.
- 4/10: GPT5-mini
  Managed to identify more GB spellings to update than Sonnet did, but it relies heavily on the spelling exception list. The output summaries of what it’s doing at each step are really confusing because it presents too many options. This increases my cognitive load as a user as I have to sift out the junk options. The file it created to map GB to US spellings was incomplete and only mapped specific words.
- 2/10: Gemini 3 Pro (#415)
  This gives nice natural language updates at each step, but seems to be unnecessarily reading and ingesting the files, so it takes ages. It’s also hard to jump in and course-correct because it talks to itself a lot.
  Hilariously, it also changed some US spellings to GB (i.e., the wrong way around), then when I challenged that, it initially agreed with me but then, after chatting to itself, changed its mind and said my complaint was “unfounded”. Nice.
- 5/10: Claude Haiku 4.5
  Haiku 4.5 is cheaper than Sonnet 4.5 (only charged at 0.33x). It mostly worked fine – almost indistinguishable from Sonnet – but after some time I started persistently getting a “request failed” error. Maybe I just picked the wrong time of day.
Overall impression:
The two Claude-based agent runs were by far the easiest and most enjoyable to drive. Gemini was the least enjoyable, given that it was slow and provided little opportunity for steering. GPT5-mini caught slightly more GB spellings overall than Claude, but I found it cumbersome to work with, given the sheer volume of options it presented at every step. Claude strikes, I think, a good balance between requesting user input and driving itself. Its ability to hold a wide context window makes it particularly good at handling long back-and-forth discussions and debugging.
The outputs generated by each LLM were similar in format, since I asked them specifically to provide a summary of what was changed (and the final count) at the end. I was expecting a lot more overlap in the specific words that were changed, but curiously, each seemed to pick up on different sets of words. Some words (like “initialisation”) were ignored completely in all the tests. I haven’t been able to figure out why.
Test 2: clean the spelling exception list (with the same prompt)
After noticing just how many US and GB spellings (and some actual spelling errors) I had in the spelling exception list, I was curious about how the models would cope with making decisions about what words should and shouldn’t be on there.
The prompt: Unlike last time where I asked in natural language for “help” in cleaning up the GB spellings, this time I decided to see if they performed any better when fed a full and proper prompt. It was created by Claude Sonnet 4.5 with an outline from me of all the criteria that should be applied.
The original spelling exception list had 1,127 entries on it. I wanted to remove all correct spellings in GB English (which mask terms that should be in US English), correct spellings in US English (since they’re redundant), all genuine spelling errors that need correcting, duplicated entries, invalid technical terms and incorrectly presented product names (e.g. apparmor and Apparmor should be removed, but AppArmor should be present). In each test, after the LLM had done its work, I ran the results through the spellchecker.
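As an aside on why this is hard to script well: the mechanical part of the cleanup is easy, but the judgment calls aren’t. A minimal sketch of the mechanical part (exact duplicates and correct US spellings) might look something like this – the file names and dictionary path are my assumptions for illustration, not what any of the agents actually produced:

```python
# Minimal sketch (file names are placeholders): handle only the mechanical rules,
# i.e. exact duplicates and entries a standard US dictionary already accepts.
# The judgment calls (product-name casing, invalid technical terms) are exactly
# the part that still needs a human.
from pathlib import Path

exceptions = Path("spelling-exceptions.txt").read_text().splitlines()
# Assumes a US wordlist is available, e.g. from the wamerican package on Ubuntu.
us_dictionary = set(Path("/usr/share/dict/american-english").read_text().splitlines())

seen = set()
cleaned = []
for word in (w.strip() for w in exceptions):
    if not word or word in seen:
        continue          # drop blanks and exact duplicates
    if word in us_dictionary:
        continue          # drop correct US spellings (the spellchecker already knows them)
    seen.add(word)
    cleaned.append(word)

Path("spelling-exceptions.cleaned.txt").write_text("\n".join(cleaned) + "\n")
print(f"Kept {len(cleaned)} of {len(exceptions)} entries")
```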
- 2/10: GPT 4o
  Pretty aggressive – it removed almost the whole file (only 29 entries left), which resulted in 1,347 errors in the spellchecker.
- 5/10: Claude Haiku 4.5 (#416)
  Helpfully created a nice Python script to remove specific bogus terms, but scripting this task requires hardcoding specific terms, which makes it unsuitable for repeat applications on different sets of words. It removed 217 entries, resulting in only 347 spellchecker violations.
- 4/10: Gemini 3 Pro
  Created not one, but two Python scripts, although it didn’t let me see the contents. It removed 200 words, resulting in 385 errors in the spellchecker.
- 3/10: Claude Sonnet 4.5 (#417)
  A bit too permissive – it removed only 90 entries from the list. I quite like the fact that it’s so careful, but this does mean spending way more time on iterations (so it gets a lower score on that front).
Overall impression:
Quite a lot of variation between them, but they all did a pretty poor job.
Ultimately, I did my own manual audit of the spelling exception list as a control sample and discovered 843 invalid entries. Each agent run removed different sets of words for different reasons, and although they gave me a comprehensive summary of removed words, I disagreed with a lot of the decisions they made. Although they do a reasonably good job of cleaning up spelling errors when provided with a clean spelling exception list, it turns out they’re not really capable of making judgment calls about what belongs on the list to begin with.
Saving time on one-shot tasks
Add metadata descriptions
When you search for something on the internet, most results include a short snippet underneath the URL to help you decide if the page is what you’re looking for. This snippet is the metadata description: it describes the contents of the page in about the length of a tweet. If it isn’t set explicitly, a search engine usually takes the first 160 characters of the page – which may not be helpful to users.
We hadn’t explicitly set the metadata descriptions in the Server docs, but given that we have almost 250 pages, and crafting really good (short) metadata descriptions is hard… I asked Claude (yep, Sonnet 4.5 again) to do it. Based on the model vs. model testing, I figured it would be the best tool for this job – even if it’s not the quickest or most direct one.
It did indeed take quite a bit of prompting back and forth, and a lot of time, thanks to the sheer volume of pages we have. About a quarter of the way through, Claude tried to create a Python script to automate the rest of the job, but it gave up on that after it didn’t work very well.
However, it ultimately did a pretty good job of adding the metadata descriptions in the Server docs – they were short but highly relevant. This saved the team the 1–2 weeks’ worth of effort it would have taken to do “by hand”, and we can always tweak the descriptions over time as the pages evolve. As a starting point for iterations, this was a huge win.
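If you want to script the mechanical part yourself (writing good descriptions is still the hard bit), a minimal sketch of inserting a description into a page might look like this. It assumes MyST Markdown with the html_meta front-matter convention; the page path and description are placeholders, and pages that already have front matter would need a YAML-aware merge rather than a plain prepend:

```python
# Minimal sketch: prepend a MyST front-matter block carrying an html_meta
# description to a page that has no front matter yet. Path and description
# are placeholders; adapt the front-matter keys to your own Sphinx/MyST setup.
from pathlib import Path

page = Path("docs/how-to/example-page.md")                    # hypothetical page
description = "A short, human-written summary of this page."  # keep it under ~160 characters

front_matter = (
    "---\n"
    "myst:\n"
    "  html_meta:\n"
    f'    "description lang=en": "{description}"\n'
    "---\n\n"
)

text = page.read_text(encoding="utf-8")
if not text.startswith("---"):        # skip pages that already have front matter
    page.write_text(front_matter + text, encoding="utf-8")
```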
If this is something you’d like to try in your own docs, this is the reusable prompt I created from this task.
Convert landing pages to Markdown
Back when we first moved the documentation from Discourse to Read the Docs, I created a Python script to keep the RTD docs in sync (automatically) with updates being made on Discourse. This included some fully auto-generated landing pages that RTD needed but Discourse didn’t. For scripting convenience (“dammit Jim, I’m a ~~Doctor~~ scientist, not a software engineer”) I created these landing pages in reStructuredText, but the rest of the docs were in Markdown, having been fetched directly from Discourse.
This was fine while we were building up the new docs and during the migration period, but as more people have begun contributing to the docs, this mix of rST and Markdown became an unnecessary source of confusion. I’m aware of the many conversion tools available, such as pandoc, but these tools seldom do a full, clean conversion, and they always require a manual cleanup step afterwards. Out of curiosity, I decided to see how Claude would handle the task.
It turns out, not too badly, even based on a natural language request rather than a prompt. However, I had to argue with Claude a lot more on this task. Most of the results (about 95%) were correct, but there were some things it just didn’t handle at all. It totally failed the :guilabel: to {guilabel} conversion – it just straight up removed the tags. Also, in 2 out of the 53 pages that were edited, the text on the page didn’t match the live docs… for reasons I never managed to get to the bottom of. It was a helpful reminder that, even if an agent does an otherwise excellent job on 99 pages, weird errors or hallucinations can pop up on number 100 without warning.
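For the record, the conversion it fumbled is mechanical enough to handle with a one-line regex – a rough sketch of my own (covering simple, non-nested cases only), not anything the agent produced:

```python
import re

rst = "Click :guilabel:`Apply`, then :guilabel:`OK` to confirm."

# Convert rST roles like :guilabel:`text` into MyST {guilabel}`text`.
# Handles simple cases only: no nested backticks, no multi-line role content.
myst = re.sub(r":guilabel:`([^`]+)`", r"{guilabel}`\1`", rst)

print(myst)  # Click {guilabel}`Apply`, then {guilabel}`OK` to confirm.
```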
Takeaway: thorough, nitpicky reviews are necessary and unavoidable.
Linkchecker optimization
Our linkchecker is slow. Really slow. Slow to the point that I turned it off for a long time. It was painful enough that I decided to ask Claude (guess which one) to optimize it for us – before I set about the onerous task of cleaning up broken links, so I wouldn’t have to battle 10-minute+ CI runtimes each time I pushed some commits.
I started off with a benchmarking test to measure exactly how slow the linkcheck actually is:
Real time: 9 minutes 57.7 seconds (wall clock time)
User time: 2 minutes 25.5 seconds (CPU time in user mode)
System time: 0 minutes 9.5 seconds (CPU time in kernel mode)
The command exited with code 2, which likely indicates some broken links were found (this is normal for linkcheck). The output has been saved to linkcheck.txt for reference.
This establishes our baseline: approximately 10 minutes for a full linkcheck run.
I then asked Claude to analyze the linkcheck output and the documentation configuration settings to identify possible optimizations. Implementing them one by one then let us re-benchmark after each intervention and measure the improvement.
Initial stats:
Total runtime: ~10 minutes (9m 57.7s)
Links checked: ~1,100+ URLs
Successful checks (ok): 812
Broken links: 160
Redirects: 191
Ignored links: 105 (manpages mostly)
Rate-limited: 21 instances
Claude managed to identify six issues causing slower runtimes:
- Rate limiting (high impact) causing forced sleep delays.
- 30-second timeout on slow sites (medium impact): the default setting we use has a linkcheck timeout of 30 seconds, with 3 retries per URL – which is 90 seconds per failing URL. By reducing the number of retries and the timeout, we can cut this down significantly.
- 105 ignored URLs still processed by Sphinx (low impact).
- Broken links re-checked with 3 retries per broken link (medium impact).
- Redirect following (low-medium impact), where Sphinx spends time chasing along chains of redirects.
- No parallel workers (configuration) – Sphinx can run the linkcheck with multiple parallel workers, but our configuration wasn’t taking advantage of them.
In “Round 1” I decided we should save time on timeouts by:
- Adding two common rate-limited domains to the ignore list
- Reducing the number of retries from 3 → 2 to save time on persistent failures
- Reducing the timeout time from 30 to 15 seconds for faster failure on slow/broken links
These changes reduce the wait time per failing URL from 90s to just 30s.
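In Sphinx terms, those Round 1 changes amount to a few lines in conf.py. The domain patterns below are placeholders, since I haven’t named the actual rate-limited domains here:

```python
# conf.py – Round 1 linkcheck settings (the ignore patterns are placeholders)
linkcheck_timeout = 15   # fail slow or broken links faster (was 30 seconds)
linkcheck_retries = 2    # was 3; a persistently failing URL now costs 2 x 15s = 30s
linkcheck_ignore = [
    r"https://rate-limited-example\.com/.*",     # hypothetical rate-limited domain
    r"https://another-throttled-site\.org/.*",   # hypothetical rate-limited domain
]
```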
Round 1: Performance Improvement
Before: 9m 57.7s
After: 2m 48.1s
Improvement: 71.8% reduction in runtime
Pretty good already! The ignored URLs don’t add much to the runtime, and the redirect following would eventually be solved when I cleared up the redirecting URLs, so I ignored those options.
In “Round 2” I decided we should enable parallel workers. The Sphinx default (when not set) is to use 5 parallel workers. After benchmarking again to determine the optimum number of workers (to avoid triggering mass rate-limiting), Claude and I settled on using 20.
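In conf.py, that’s a one-liner:

```python
# conf.py – Round 2: raise the number of parallel linkcheck workers (Sphinx defaults to 5)
linkcheck_workers = 20
```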
| Run | Configuration | Time | Improvement vs Baseline | Improvement vs Previous |
|---|---|---|---|---|
| 1. Baseline | Timeout: 30s, retries: 3, workers: 5 (default) | 9m 57.7s | – | – |
| 2. Config optimization | Timeout: 15s, retries: 2, workers: 5, + ignore rate-limited domains | 2m 48.1s | -71.8% (7m 9.6s) | – |
| 3. + Parallel workers | Same as #2 + workers: 20 | 1m 30.7s | -84.8% (8m 27.0s) | -46.0% (1m 17.4s) |
Adding parallel workers saved an additional 77s, without triggering any rate-limiting issues.
I suspect we could make further gains if, instead of checking every link in the documentation as it appears, we extracted the links into a separate file for checking so that we can de-duplicate links that appear more than once in the docs. This would not only speed up the linkcheck, but would also make us a better neighbor by not pinging the same URLs multiple times per pull request.
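A rough sketch of the extraction-and-deduplication half of that idea (checking the resulting list would still be a separate step, and the pattern below catches plain Markdown links only):

```python
# Minimal sketch: collect the unique external URLs across all Markdown pages,
# so each link only needs to be checked once. The "docs" path is a placeholder,
# and MyST directives or rST pages would need extra patterns.
import re
from pathlib import Path

LINK = re.compile(r"\]\((https?://[^)\s]+)\)")

urls = set()
for page in Path("docs").rglob("*.md"):
    urls.update(LINK.findall(page.read_text(encoding="utf-8")))

Path("links-to-check.txt").write_text("\n".join(sorted(urls)) + "\n")
print(f"{len(urls)} unique URLs to check")
```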
While you should get good improvements by changing the config to match the parameters above, the test itself was interesting to run, so I created another reusable prompt off the back of this experiment.
Scripting and automating
Automatically updating redirected links
This test was based on another hint I got from @akonev, about the power of metaprompting – the idea of creating prompts that explicitly pre-define the structure, constraints, goals, and reasoning behavior an LLM agent should use when responding to subsequent prompts.
The task was simple: given the output from our linkchecker, construct a Python script that will automatically replace redirecting links in the documentation (and add a GitHub workflow that will run once a week to create a PR based on changes created by the script). @akonev very kindly shared a pre-made metaprompt with me, which I fed into Google Gemini alongside the task I wanted to complete; then I received the starting prompt for this task.
The prompt(s): I’ve included the initial prompts in the PR descriptions linked below.
To check whether there’s a difference in results when feeding a metaprompt to the agent, I used the same model (Claude Sonnet 4.5, again) both times:
Overall impressions:
As you might expect, the metaprompt test was a one-shot, with no additional prompting required, whilst the non-metaprompt test required some additional iteration. Interestingly, in the metaprompt test, Claude didn’t create a README to go with the script, or update the repo’s README to point to it. The non-metaprompt run of Claude was much more diligent about documentation (which I appreciated).
The metaprompt test seems to have created a better GitHub action, in the sense that it runs the linkcheck and checks the output before running everything else. This means if there are no redirecting links, it doesn’t run anything – skipping unnecessary steps in the name of efficiency is behavior that I prefer. However, when I compared the results of the scripts (all the amended links), they were overwhelmingly similar, and when the scripts failed, they both failed in the same way.
Although there are some problems with some of the edge cases, both tests handled the majority of the redirecting link updates well. With a bit of work, this little proof-of-concept script could be reworked into a useful GitHub action to improve the user experience of the docs (no more getting redirected all over the place) and the maintainers’ experience (from having this work be mostly automated).
The better-performing script and associated GitHub action (from the metaprompt test) are in the GitHub repo alongside some instructions from the LLM on how to use it.
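To give a flavor of what the core of such a script does, here is a hand-written sketch of my own – not the generated script. It assumes Sphinx linkcheck’s JSON output, where a redirected entry carries the target URL in its info field; the docs path is a placeholder:

```python
# Hand-written sketch, not the generated script. Assumes the linkcheck JSON
# output lists one entry per line, with redirected entries carrying the new
# target in "info". The "docs" path is a placeholder.
import json
from pathlib import Path

redirects = {}
for line in Path("_build/linkcheck/output.json").read_text().splitlines():
    entry = json.loads(line)
    if entry.get("status") == "redirected" and entry.get("info"):
        redirects[entry["uri"]] = entry["info"]

for page in Path("docs").rglob("*.md"):
    text = original = page.read_text(encoding="utf-8")
    for old, new in redirects.items():
        text = text.replace(old, new)
    if text != original:
        page.write_text(text, encoding="utf-8")
        print(f"Updated redirected links in {page}")
```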
Feeding the robots
Every scientist sadly knows that you have to stop experimenting at some point and write up what you found. However, I managed to squeeze in one final experiment.
As technical authors, we traditionally use Search Engine Optimization (SEO) and analytics data to measure how well our documentation serves our readers. Since LLMs and other AI tools are becoming more popular, understanding how these tools ingest our docs and regenerate them on behalf of readers is becoming increasingly important – especially as these tools are continuously evolving. If we are a trusted primary source, and we want these tools to accurately reflect the content of our documentation, we need to make sure we pay attention to how the robots eat the docs.
The problem is, we don’t have the equivalent of Google Analytics for the various AI applications on the market. So, we only have indirect methods to tell how well docs are being represented through these tools, such as “coming up with huge lists of questions that you prompt in all the engines periodically, and manually monitoring the accuracy of the answers over time.”
Not very efficient, and way too subjective for my liking. Also, when everyone gets a personalized answer depending on their previous interactions with a tool, I can’t guarantee that what I’m seeing is the same as anyone else saw.
So I put my science hat on and wondered, “what if?” What if we go in the other direction? Instead of looking at the output of these engines, what would happen if we take the known information about what robots like to eat, and compare the input (our documentation) against those criteria?
So, just for the fun of it, I asked ChatGPT, Gemini and Claude what they like to eat. And then I asked Claude Sonnet 4.5 (because of course I did) to create a framework around that, along with a Python script that will measure every page of our documentation against that framework. I then asked it to interpret the results, and provide me actionable feedback ordered by impact.
To my amazement, after a couple of hours of prodding and tweaking, it actually did. It created a comprehensive, proof-of-concept toolkit that includes a Python script to measure every page’s “deliciousness,” a GitHub action that runs the script every 2 months, and a re-usable prompt to draw insights from the data. Claude identified improvements that could be made to the Server docs in several categories, including a critical weakness of the Server docs around internal linking.
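The real toolkit is in the repo linked below; to give a flavor of the approach, a stripped-down sketch of the kind of per-page heuristics such a script might measure could look like this (the criteria, names, and thresholds here are illustrative, not the framework’s actual ones):

```python
# Illustrative only: a few per-page heuristics of the kind such a framework
# might score. The real criteria and weights live in the toolkit in the repo.
import re
from pathlib import Path

def score_page(path: Path) -> dict:
    text = path.read_text(encoding="utf-8")
    return {
        "page": str(path),
        "has_meta_description": text.startswith("---") and "description" in text.split("---", 2)[1],
        "headings": len(re.findall(r"^#{1,4} ", text, flags=re.MULTILINE)),
        "internal_links": len(re.findall(r"\]\((?!https?://)[^)]+\)", text)),
        "words": len(text.split()),
    }

for page in sorted(Path("docs").rglob("*.md")):   # "docs" is a placeholder path
    print(score_page(page))
```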
There is an obvious caveat here that can’t go unmentioned. The AI landscape is changing so swiftly that the input measurement criteria are liable to change often and without notice. However, in the absence of official analytics tooling from the various engines, this would at least represent a semi-automated and more objective measure of how nutritious our docs are as robot food, even if the criteria do periodically change. For this reason, my technical author colleagues and I (particularly those who, unlike me, are actually trained in software engineering!) intend to develop this prototype into a viable tool for documentation measurement.
If you’d like to play with this prototype toolkit yourself, you can find all the relevant parts here.
Conclusions
These experiments, focused as heavily on Claude as they are, mirror much of Anthropic’s own findings about how AI is changing the workplace. As someone who knows enough Python to build “scripts that (basically, more-or-less) work”, I’ve come to really appreciate AI agents as a tool for rapid prototyping and developing early proof-of-concept ideas that otherwise wouldn’t be made. I’ve gone from “I have a vague idea about how this could be done, and maybe I can take a week to hack something together,” to “I know what I want to do, show me how it could be done, and let me iterate over the idea and not the code”.
The hidden cost for learners
I always learn something from my inefficient hacking. Deep learning comes from struggle, from friction. I learn which approaches are just plain bad or won’t work by working through them, and discovering for myself why they don’t work. In all of these experiments, I learned a lot about how the different tools approach problems, which has given me a sense of which tools are good for what tasks.
However, I didn’t learn anything about the outputs they created – why they chose certain paths and not others. Why did metaprompting result in a GitHub action that used a workflow that I would have chosen to create, but the other didn’t? Why did the agent handle converting some standard Sphinx markup roles ({term}, {ref}) but completely fail at others ({guilabel})?
There is still so much about how the machinery works that is completely opaque to the user, by design. The scripts created by these agents are a great starting point, at least to see if an idea has legs. But I still have to be able to understand the workflow and be able to judge if the approach is good or not. I’m fortunate, in that I already know enough Python to understand what’s being generated and when an approach is faulty. AI agents may make coding more accessible to the absolute beginner, but they don’t make learning easier. If anything, they put up an invisible barrier to deeper skills development by removing all the friction that teaches you the right and wrong ways to do things.
Trading coding for reviewing
While I saved time on the pure code-generation front, I spent considerably more time parsing, judging, and understanding the results. Some of the scripts will (once we’ve re-written the code) save us time by automating away routine maintenance tasks (like updating redirecting links picked up by the linkchecker), but edge cases still trip up the machinery in ways that a human would handle. This means that good quality, in-depth reviews are more critical than ever – and increasingly more costly in terms of time. We can’t just trust that the agents will do a good job. Sometimes they do random and unexpected things, and the better a job they do, the greater the risk that we’ll be lulled into a false sense of security and reduce the amount of scrutiny we apply.
So, for each task that saved me time on the front end, the corresponding review cost me double or even triple what I’d spend reviewing work from my teammates. It’s easy to think of these tools as “productivity enhancers” if you exclude reviewing time. But this is just passing the burden of the work along to someone else; kicking the can down the road. To that, one might then reasonably ask “Why not get the agents to do the reviews?”. A fair question, but that would be like allowing students to mark their own homework. I’m sure we all remember how well that worked in school, on the rare occasions we were allowed to do it.
Can we trust LLMs to review?
There is an open question about whether agents should do automated pre-reviews on human-created pull requests. This is something I’ve been arguing with myself about, as a maintainer. The Server docs are an artifact of human generosity and community. They have always been created by humans, for humans, and when our community takes the time to contribute to the docs, it feels right that they should be greeted by humans, and not robots (sorry Claude). Would agent pre-reviews save us time? Maybe, although I sincerely doubt it when they need so much oversight. Would it enhance the contributor experience? I doubt that too. Being greeted by a robot never feels like a warm and welcoming experience. There could be a case for having agents “help” reviewers by pointing out things that might have been missed, at least where the human is doing the review. This is something I’m going to test in my next round of experiments, but the automated pre-reviews will stay turned off in the Server docs for now.
The future of documentation?
I see a lot of “doomer-ism” online about the role AI has to play in documentation. I don’t buy into it. The main takeaway I got from this experience is:
We don’t have to choose between writing for people or writing for AI. To the contrary, what’s good for people is also good for AI.
Having a really good, detailed, and comprehensive contributors guide for people provided excellent context for Claude when it created the copilot-instructions.md file. Making your documentation well organized for people to navigate also helps engines understand the structure of the information. Having a logical page structure and good division of sections helps engines to retrieve the right answers for user questions. A good technical writer already pays attention to the information architecture, user experience, and performance of their docs. These are the more challenging aspects of the job – and the most enjoyable – and not only are they not going anywhere, they are going to be even more crucial in the future.