Overview
This post covers how best to use the Pydantic library to manipulate LLM responses. It starts off pretty simple, but we'll dive into some esoteric gotchas that may be novel even to hardcore LLM hackers.
I'm assuming some prerequisite knowledge here, such as adequate Python skills and familiarity with Anthropic's SDK.
💡 I don't have access to the code powering OpenAI's structured outputs, but I'm curious as to whether they address the problems outlined in this post. My guess is they don't, but get in touch if you think otherwise.
Basic Structured Outputs
In a nutshell, structured outputs allow us to enforce some schema onto an LLM. Take a prompt that looks like:
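The original snippet isn't shown here, so below is a minimal sketch of the kind of plain prompt being described. The question and wording are invented for illustration:

```python
# A plain, unstructured prompt: we ask a question and get back free text.
question = "Has the patient received treatment X in the past two years?"

prompt = f"""You are a careful medical records assistant.

Answer the following question about the attached medical record:

{question}
"""

# Sent via your LLM client of choice, e.g. Anthropic's SDK:
#   client.messages.create(model=..., messages=[{"role": "user", "content": prompt}])
```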
The LLM will generate a raw string containing the answer to the question, great. Now let's look at the structured output approach:
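Again, the original code isn't shown, so here's a hedged sketch of what the structured variant might look like. The JSON template keys and the mock model response are invented:

```python
import json

question = "Has the patient received treatment X in the past two years?"

# Structured variant: pin the model to a JSON template instead of free text.
template = {"answer": "<yes or no>", "confidence": "<low, medium or high>"}

prompt = f"""Answer the question below.

Respond ONLY with JSON matching this template:

{json.dumps(template, indent=2)}

Question: {question}
"""

# An illustrative (mock) model response we can now parse deterministically:
raw_response = '{"answer": "yes", "confidence": "high"}'
parsed = json.loads(raw_response)
```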
Now there's nothing special about doing this on its own, but it starts to get fun when you approach this problem from first-principles.
LLMs are autoregressive, meaning that each token is conditioned on the tokens that came before it, i.e. the model defines a conditional distribution over tokens at each time step:
$$P(x_t | x_{t-1},\ \ldots, x_0)$$
This means we can use structure to force the model to generate some preamble that might improve its answer i.e.
$$P(answer\ |\ preamble,\ \ldots, context)$$
Let's look at a prompt that conditions the answer on a reasoning preamble - this is loosely similar to what reasoning models do, the difference being that the format of our reasoning isn't optimised via a reinforcement learning process:
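A sketch of such a prompt (key names invented): because generation is autoregressive, putting `reasoning` before `answer` in the template means the answer tokens are conditioned on the reasoning tokens.

```python
import json

# Force a reasoning preamble *before* the answer.
template = {
    "reasoning": "<think step by step about the question here>",
    "answer": "<yes or no>",
}

prompt = f"""Answer the question using the JSON template below.
Fill in "reasoning" first, then "answer".

{json.dumps(template, indent=2)}

Question: Has the patient received treatment X in the past two years?
"""
```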
Great! But our model is hallucinating and for some reason always messes up answers which require reasoning over some time period!
If we give the model a full medical record and a question such as "has the patient received treatment X in the past two years?" the model might be inclined to say "yes, absolutely" at any sign of treatment X, disregarding the temporal aspect or whether the treatment took place 10 years ago. Let's update our prompt with a simple solution which absolutely will not work:
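Something like the following, where we simply ask the model nicely to pay attention to dates (the exact wording is invented):

```python
# The "simple solution which absolutely will not work": just asking nicely.
prompt = """Answer the question about the medical record below.

IMPORTANT: pay close attention to dates. Only count treatments that fall
within the time period mentioned in the question.

Question: Has the patient received treatment X in the past two years?
"""
```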
A much more robust solution is to use structured outputs; remember that we can manipulate what the model generates:
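One way this might look (the key names are my own): we make the model do the temporal work explicitly, key by key, before it is allowed to commit to an answer.

```python
import json

# Every key is generated in order, so "answer" is conditioned on all the
# temporal groundwork that precedes it.
template = {
    "relevant_events": "<list each mention of treatment X with its date>",
    "current_date": "<the date the record was written>",
    "temporal_reasoning": "<which events fall inside the time window?>",
    "answer": "<yes or no>",
}

prompt = f"""Answer the question using this JSON template, filling keys in order:

{json.dumps(template, indent=2)}

Question: Has the patient received treatment X in the past two years?
"""
```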
There's more we could say here but that's the gist of it. Let's move on to Pydantic.
Introducing Pydantic
Enter Pydantic, Python's library for data validation and structured schemas. Let's rewrite the last example:
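A sketch of the rewrite, assuming Pydantic v2 (the field names mirror the hand-written template above; the post's actual model may differ):

```python
from pydantic import BaseModel

# The temporal-reasoning schema, expressed as a Pydantic model.
class Response(BaseModel):
    relevant_events: str
    current_date: str
    temporal_reasoning: str
    answer: str

# Pydantic gives us parsing + validation of the raw LLM output for free.
# An illustrative (mock) model response:
raw = (
    '{"relevant_events": "treatment X on 2015-03-01",'
    ' "current_date": "2025-01-01",'
    ' "temporal_reasoning": "2015 is ~10 years ago, outside the window",'
    ' "answer": "no"}'
)
response = Response.model_validate_json(raw)
```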
With the overview out of the way, the rest of this blog will dive into the implementation of call_anthropic_structured(). Some people might stop here and ask:
"But wait, can't we just use libraries such as instructor?"
The answer is sometimes. Let's keep going, starting with simplifying our prompt using Pydantic's out-of-the-box features:
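For instance, `model_json_schema()` lets us derive the JSON template straight from the model, so the prompt and the parser can never drift apart. A sketch (model fields are my own):

```python
import json
from pydantic import BaseModel

class Response(BaseModel):
    temporal_reasoning: str
    answer: str

# Derive the schema from the model instead of hand-writing it.
prompt = f"""Respond ONLY with JSON conforming to this schema:

{json.dumps(Response.model_json_schema(), indent=2)}

Question: Has the patient received treatment X in the past two years?
"""
```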
This is very useful for when our models get increasingly complex because we don't want human-error introducing discrepancies between our prompt and output schema. Here's what the output of Response.model_json_schema() looks like:
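Assuming a two-field `Response` model (`reasoning`, `answer`), the dump looks roughly like this in Pydantic v2:

```python
from pydantic import BaseModel

class Response(BaseModel):
    reasoning: str
    answer: str

schema = Response.model_json_schema()
# Roughly (Pydantic v2):
# {
#   "properties": {
#     "reasoning": {"title": "Reasoning", "type": "string"},
#     "answer": {"title": "Answer", "type": "string"}
#   },
#   "required": ["reasoning", "answer"],
#   "title": "Response",
#   "type": "object"
# }
```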
💡 Another easy win is to use pydantic.Field(description="your description here"), which helps the LLM understand exactly what it's looking for at each attribute, but I've excluded this from my examples because it adds clutter.
Of course, this still doesn't explain why we can't always use Instructor. Let's look at a more complicated real-world example: helping a farmer decide which pesticides they should purchase for their crops.
Assume we have two inputs:
- The farm record which contains all irrigation and crop details
- The guidance on pesticides as released by the farmer's association
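A sketch of the two inputs and a recommendation schema (the records, field names, and products are all invented for illustration):

```python
from pydantic import BaseModel

farm_record = """Crop: winter wheat, 40 ha. Irrigation: drip, weekly.
Known issues: aphids observed in field 3 last season."""

guidance = """Association guidance 2025: pyrethroid-based products are
approved for aphids on cereals; apply before flowering only."""

class PesticideRecommendation(BaseModel):
    crop_summary: str          # what the farm record tells us
    relevant_guidance: str     # which guidance clauses apply
    recommended_products: list[str]
    reasoning: str

# What a parsed LLM response might look like:
example = PesticideRecommendation(
    crop_summary="40 ha winter wheat, drip irrigated",
    relevant_guidance="pyrethroids approved for aphids on cereals",
    recommended_products=["pyrethroid spray"],
    reasoning="aphids observed last season; guidance permits pre-flowering use",
)
```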
We've used a string for the guidance here, but you'll appreciate that most LLM providers support direct PDF byte ingestion, meaning this structured approach generalises pretty well to more complicated examples. Real PDFs won't have such lovely structure, of course, but you can simply make your schemas more general.
The Meat of the Issue
Now let's make the problem a little harder: suppose the developer must dynamically render a frontend that shows the reasoning for each bullet point and whether each bullet's logical conditions were met.
We want to capture everything in an easily parseable form, and we can achieve a rough template with the following schema:
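One way this schema might look (all names here are illustrative, not the post's actual code): per bullet we capture the reasoning, each logical condition checked, and the verdict.

```python
from pydantic import BaseModel

class ConditionCheck(BaseModel):
    condition: str   # e.g. "crop is a cereal"
    met: bool
    evidence: str    # the snippet of the record supporting the verdict

class BulletAssessment(BaseModel):
    bullet_text: str
    reasoning: str
    conditions: list[ConditionCheck]
    applies: bool    # overall verdict for this bullet

class GuidanceAssessment(BaseModel):
    bullets: list[BulletAssessment]

# A frontend can now iterate over bullets/conditions directly:
example = GuidanceAssessment(bullets=[
    BulletAssessment(
        bullet_text="Apply before flowering only",
        reasoning="The crop has not yet flowered",
        conditions=[ConditionCheck(
            condition="pre-flowering",
            met=True,
            evidence="record dated before the flowering window",
        )],
        applies=True,
    )
])
```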
But if we use this pattern everywhere, we'd likely create some base abstraction, e.g.:
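The natural refactor is to hoist the shared fields into a base class. The class names `BaseReasoning` and `TemporalReasoning` come from the post; the exact field names are my assumption:

```python
from pydantic import BaseModel

class BaseReasoning(BaseModel):
    reasoning: str
    answer: str

class TemporalReasoning(BaseReasoning):
    temporal_reasoning_required: bool
    temporal_reasoning: str
```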
The problem is that if we dump this schema via Instructor or some other method like Response.model_json_schema(), we're going to hit some highly esoteric bugs - and herein lies the core issue of this blog post.
If we try to dump the schema for TemporalReasoning, we'll get the following:
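A self-contained reproduction (classes repeated here so the snippet stands alone, with assumed field names):

```python
from pydantic import BaseModel

class BaseReasoning(BaseModel):
    reasoning: str
    answer: str

class TemporalReasoning(BaseReasoning):
    temporal_reasoning_required: bool
    temporal_reasoning: str

schema = TemporalReasoning.model_json_schema()
print(list(schema["properties"]))
# ['reasoning', 'answer', 'temporal_reasoning_required', 'temporal_reasoning']
#               ^^^^^^ the parent's "answer" comes BEFORE the child's
#                      temporal fields - the bug in a nutshell.
```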
If you immediately understood the issue from just reading the schema, take a bow - I was stumped for hours. Essentially, Pydantic dumps keys in order of inheritance.
In case the issue isn't fully clear: Pydantic serialises schema attributes in inheritance order, meaning it outputs the schema for BaseReasoning first and the attributes of TemporalReasoning second. When we dump the schema, the LLM will therefore generate BaseReasoning's attributes before TemporalReasoning's, essentially nullifying the point of the child class.
💡 What this means is that your responses will contain all the expected keys, but the order in which the LLM generates them won't force the answer to be conditioned on the reasoning. Note how temporal_reasoning_required is only generated AFTER the model has generated its answer.
Your responses will therefore contain all the correct keys, but we'll have been generating
$$P(reasoning\ |\ answer,\ \ldots, context)$$
instead of
$$P(answer\ |\ reasoning,\ \ldots, context)$$
which is even more pernicious than you might initially realise, because if a model's accuracy is low, a sensible dev's first instinct is to inspect the reasoning…
What's the fix?
I'm sure there are lots of great fixes out there, but I've settled on a custom function that recursively re-orders the schema according to a preferred key order. It's not much code, and I've attached it to a mixin:
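The post's actual implementation isn't shown, so here's one way such a mixin might look - a sketch under my own naming (`OrderedSchemaMixin`, `schema_key_order`, `reorder_schema` are all invented), assuming Pydantic v2:

```python
from typing import ClassVar
from pydantic import BaseModel

def reorder_schema(schema: dict, preferred_order: list[str]) -> dict:
    """Recursively rebuild every `properties` dict so that keys listed in
    `preferred_order` come first (in that order); other keys keep their place."""
    if not isinstance(schema, dict):
        return schema
    out = {}
    for key, value in schema.items():
        if key == "properties" and isinstance(value, dict):
            first = [k for k in preferred_order if k in value]
            rest = [k for k in value if k not in preferred_order]
            value = {k: reorder_schema(value[k], preferred_order) for k in first + rest}
        elif isinstance(value, dict):
            value = reorder_schema(value, preferred_order)
        elif isinstance(value, list):
            value = [reorder_schema(v, preferred_order) for v in value]
        out[key] = value
    return out

class OrderedSchemaMixin:
    # Subclasses declare the order they want keys generated in.
    # ClassVar keeps this out of Pydantic's field collection.
    schema_key_order: ClassVar[list[str]] = []

    @classmethod
    def ordered_json_schema(cls) -> dict:
        return reorder_schema(cls.model_json_schema(), cls.schema_key_order)

class BaseReasoning(BaseModel):
    reasoning: str
    answer: str

class TemporalReasoning(OrderedSchemaMixin, BaseReasoning):
    temporal_reasoning_required: bool
    temporal_reasoning: str
    schema_key_order: ClassVar[list[str]] = [
        "reasoning", "temporal_reasoning_required",
        "temporal_reasoning", "answer",
    ]

print(list(TemporalReasoning.ordered_json_schema()["properties"]))
# ['reasoning', 'temporal_reasoning_required', 'temporal_reasoning', 'answer']
```

Using `ordered_json_schema()` (rather than `model_json_schema()` directly) when building the prompt means `answer` is always generated last, regardless of where the inheritance hierarchy put it.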
There are other problems that can arise of course, such as ensuring the model dump actually contains all the child attributes, but that's a deeper dive into Python's many polymorphism pitfalls and would require a much longer post.
Appendix
1. Handling Model Validation Failures
I didn't cover extracting JSON or what to do when model validation fails. Take a look at Instructor, but I've always wondered whether there's a better approach that retries responses that failed validation using smaller/faster/local models.
2. Self-Checking Validation
A clever approach might be to get the model to generate post-JSON reasoning that examines whether it got all the right keys - this works, and it will often regenerate entirely new, whole responses within the same response, but you need to watch out for context length (and handle parsing multiple ```json blocks).
3. Improving Accuracy with Field Descriptions
A very easy win for accuracy is using pydantic.Field and adding a description for each key. I didn't bother in the examples because it adds clutter.
4. Missing Child Attributes in Polymorphic Classes
The below pytest file contains an illustrative dive into the missing child attributes that can occur if you're not careful. These issues are really tricky to spot, especially if you can't see into the internals of the library you're using, e.g. Instructor or the generic OpenAI client.
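The original file isn't shown, so here's a sketch of the kind of pytest file meant (class names invented). It demonstrates the classic Pydantic v2 pitfall: when a field is annotated with a base class, `model_dump()` serialises by the declared type and silently drops the child's extra attributes, and how `SerializeAsAny` avoids it:

```python
# test_polymorphic_dump.py - run with `pytest test_polymorphic_dump.py`
from pydantic import BaseModel, SerializeAsAny

class BaseStep(BaseModel):
    reasoning: str

class TemporalStep(BaseStep):
    temporal_reasoning: str

class Report(BaseModel):
    step: BaseStep                      # annotated as the BASE class

class ReportFixed(BaseModel):
    step: SerializeAsAny[BaseStep]      # serialise by runtime type instead

def test_child_attributes_silently_dropped():
    report = Report(step=TemporalStep(reasoning="r", temporal_reasoning="t"))
    dumped = report.model_dump()
    # Pydantic v2 serialises by the *declared* type, so the child's
    # extra attribute vanishes without any warning:
    assert "temporal_reasoning" not in dumped["step"]

def test_serialize_as_any_keeps_child_attributes():
    report = ReportFixed(step=TemporalStep(reasoning="r", temporal_reasoning="t"))
    assert report.model_dump()["step"]["temporal_reasoning"] == "t"
```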