Building reliable systems out of unreliable agents

rainforestqa.com

295 points by fredsters_s 2 years ago · 55 comments

mritchie712 2 years ago

This is a great write up! I nodded my head thru the whole post. Very much aligns with our experience over the past year.

I wrote a simple example (overkiLLM) on getting reliable output from many unreliable outputs here[0]. This doesn't employ agents, just an approach I was interested in trying.

I chose writing an H1 as the task, but a similar approach would work for writing any short blob of text. The script generates a ton of variations, then uses head-to-head voting to pick the best ones.

This all runs locally / free using ollama.

0 - https://www.definite.app/blog/overkillm
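
For anyone curious, a minimal sketch of the same idea: generate many candidates locally with ollama, then run head-to-head votes. The model name, prompts and vote parsing are placeholders (and the ollama client's response shape may differ by version), not what overkiLLM actually does:

    import random
    from collections import Counter

    import ollama  # pip install ollama; assumes a local ollama server is running

    MODEL = "llama3"  # placeholder model name

    def generate_candidates(task: str, n: int = 10) -> list[str]:
        """Generate n independent variations of the same short text."""
        outs = []
        for _ in range(n):
            resp = ollama.chat(
                model=MODEL,
                messages=[{"role": "user", "content": f"Write one H1 headline for: {task}"}],
            )
            outs.append(resp["message"]["content"].strip())
        return outs

    def vote(a: str, b: str, task: str) -> str:
        """Ask the model to pick the better of two candidates (naive A/B parse)."""
        prompt = (
            f"Task: {task}\nWhich headline is better? Answer only A or B.\n"
            f"A: {a}\nB: {b}"
        )
        resp = ollama.chat(model=MODEL, messages=[{"role": "user", "content": prompt}])
        return a if "A" in resp["message"]["content"].upper()[:3] else b

    task = "a blog post about building reliable systems out of unreliable agents"
    candidates = generate_candidates(task)
    wins = Counter()
    for _ in range(40):  # random head-to-head matchups
        a, b = random.sample(candidates, 2)
        wins[vote(a, b, task)] += 1
    print(wins.most_common(3))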

  • maciejgryka 2 years ago

    Oh this is fun! So you basically define personalities by picking well-known people that are probably represented in the training data and ask them (their LLM-imagined doppelganger) to vote?

    • CuriouslyC 2 years ago

      In the research literature, this process is done not by "agent" voting but by taking a similarity score between answers, and choosing the answer that is most representative.

      Another approach is to use multiple agents to generate a distribution over predictions, sort of like Bayesian estimation.
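
      A rough sketch of the similarity-based selection; this uses TF-IDF cosine similarity for simplicity, but a sentence-embedding model could stand in for it (the example answers are made up):

          import numpy as np
          from sklearn.feature_extraction.text import TfidfVectorizer
          from sklearn.metrics.pairwise import cosine_similarity

          answers = [
              "Paris is the capital of France.",
              "The capital of France is Paris.",
              "France's capital city is Paris.",
              "Lyon is the capital of France.",  # the outlier
          ]

          # Pairwise similarity between all answers
          vectors = TfidfVectorizer().fit_transform(answers)
          sims = cosine_similarity(vectors)

          # The most "representative" answer is the one most similar to all the others
          scores = (sims.sum(axis=1) - 1) / (len(answers) - 1)  # drop self-similarity
          best = answers[int(np.argmax(scores))]
          print(best)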

      • mritchie712 2 years ago

        For my use case (generating an interesting H1), using a similarity score would defeat the purpose.

        In this approach, I'm looking for the diamond in the rough. It's often dissimilar from the others. With this approach, the diamond can still get a high number of votes.

        • CuriouslyC 2 years ago

          That approach definitely has promise. I would have agents rate answers and take the highest rated rather than vote for them, though, since you're losing information about ranking and preference gradients with n choose 1. Also, you can do that whole process in one prompt; in case you're re-prompting currently, it's cheaper to batch it up.
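
          A quick sketch of rating-in-one-prompt; the model name, JSON shape and candidates are arbitrary stand-ins, and it assumes the model returns bare JSON as instructed:

              import json
              from openai import OpenAI

              client = OpenAI()
              candidates = ["Headline A ...", "Headline B ...", "Headline C ..."]

              listing = "\n".join(f"{i}: {c}" for i, c in enumerate(candidates))
              prompt = (
                  "Rate each headline from 1-10 for clarity and punch. "
                  'Reply as JSON like {"0": 7, "1": 9, ...}.\n' + listing
              )
              resp = client.chat.completions.create(
                  model="gpt-3.5-turbo",  # placeholder model
                  messages=[{"role": "user", "content": prompt}],
              )
              scores = json.loads(resp.choices[0].message.content)  # assumes bare JSON
              best = candidates[int(max(scores, key=scores.get))]
              print(best)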

      • infecto 2 years ago

        For clarification on the first part: the research suggests you can use the same prompt over multiple runs as the input to picking the answer.

      • mistermann 2 years ago

        Any chance you could expand on both of these, even enough to assist in digging deeper into them? TIA.

        • CuriouslyC 2 years ago

          The TLDR is you can prompt the LLM to take different perspectives than its default, then combine those. If the LLM is estimating a number, the different perspectives give you a distribution over the truth, which shows you the range of biases and the most likely true answer (given wisdom of the crowd). If the LLM is generating non-quantifiable output, you can find the "average" of the answers (using embeddings or other methods) and select that one.
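
          As a concrete sketch of the quantitative case (the personas, question and model are made up for illustration): ask for the same estimate under several perspectives and take the median as the crowd answer.

              import re
              import statistics
              from openai import OpenAI

              client = OpenAI()
              personas = ["a cautious actuary", "an optimistic founder", "a skeptical engineer"]
              question = "How many hours will this data migration take? Reply with one number."

              estimates = []
              for p in personas:
                  resp = client.chat.completions.create(
                      model="gpt-3.5-turbo",  # placeholder model
                      messages=[
                          {"role": "system", "content": f"Answer as {p}."},
                          {"role": "user", "content": question},
                      ],
                  )
                  number = re.search(r"\d+(\.\d+)?", resp.choices[0].message.content)
                  if number:
                      estimates.append(float(number.group()))

              # The spread shows the range of biases; the median is the crowd estimate
              print(sorted(estimates), "-> median:", statistics.median(estimates))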

          • mistermann 2 years ago

            Ah ok, so both are implemented via a call(s) to the LLM, as opposed to a standard algorithmic approach?

            • CuriouslyC 2 years ago

              Once you have Bayesian prior distributions (which it makes total sense for LLMs to estimate) you can do tons of nifty statistical techniques. It's only the bottom layer of the analysis stack that's LLM-generated.

  • all2 2 years ago

    I'd be curious to see some examples and maybe intermediate results?

maciejgryka 2 years ago

This is a bunch of lessons we learned as we built our AI-assisted QA. I've seen a bunch of people circle around similar processes, but didn't find a single source explaining it, so thought it might be worth writing down.

Super curious whether anyone has similar/conflicting/other experiences and happy to answer any questions.

  • xrendan 2 years ago

    This generally resonates with what we've found. Some colour based on our experiences.

    It's worth spending a lot of time thinking about what a successful LLM call actually looks like for your particular use case. That doesn't have to be a strict validation set; `% prompts answered correctly` is good for some of the simpler prompts, but especially as they grow and handle more complex use cases, that breaks down. In an ideal world...

    > chain-of-thought has a speed/cost vs. accuracy trade-off

    A big one.

    Observability is super important and we've come to the same conclusion of building that internally.

    > Fine-tune your model

    Do this for cost and speed reasons rather than to improve accuracy. There are decent providers (like Openpipe, relatively happy customer, not associated) who will handle the hard work for you.
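
    For the simpler prompts, that `% prompts answered correctly` check can be as small as this sketch (call_llm and the cases are stand-ins for whatever you actually run):

        # Minimal eval harness: run each case through the LLM call, report accuracy.
        def call_llm(question: str) -> str:
            # Stand-in: replace with your real prompt + model call.
            return "yes"

        cases = [
            {"input": "Is the login button visible?", "expected": "yes"},
            {"input": "Is the cart empty after checkout?", "expected": "yes"},
        ]

        def accuracy(cases) -> float:
            hits = sum(
                1 for c in cases
                if call_llm(c["input"]).strip().lower() == c["expected"]
            )
            return hits / len(cases)

        print(f"{accuracy(cases):.0%} prompts answered correctly")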

serjester 2 years ago

Some of these points are very controversial. Having done quite a bit with RAG pipelines, I'd say avoiding strongly typed code is asking for a terrible time. Same with avoiding instructor. LLMs are already stochastic, why make your application even more opaque - it's such a minimal time investment.
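
For context, the instructor pattern being discussed looks roughly like this; the schema is made up and the exact API details vary between instructor versions:

    import instructor
    from openai import OpenAI
    from pydantic import BaseModel

    class TestStep(BaseModel):
        action: str
        expected_result: str

    # instructor patches the client so responses are parsed and validated into the model
    client = instructor.from_openai(OpenAI())

    step = client.chat.completions.create(
        model="gpt-3.5-turbo",  # placeholder model
        response_model=TestStep,
        messages=[{"role": "user", "content": "Write one test step for a login form."}],
    )
    print(step.action, "->", step.expected_result)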

  • maciejgryka 2 years ago

    I think instructor is great! And most of our Python code is typed too :)

    My point is just that you should care a lot about preserving optionality at the start because you're likely to have to significantly change things as you learn. In my experience going a bit cowboy at the start is worth it so you're less hesitant to rework everything when needed - as long as you have the discipline to clean things up later, when things settle.

  • minimaxir 2 years ago

    > LLMs are already stochastic

    That doesn't mean it's easy to get what you want out of them. Black boxes are black boxes.

cpursley 2 years ago

If you’re using Elixir, I thought I’d point out how great this library is:

https://github.com/thmsmlr/instructor_ex

It piggybacks on Ecto schemas and works really well (if instructed correctly).

ThomPete 2 years ago

We went through a two-tier process before we got to something useful. First we built a prompting system so you could do things like:

Get the content from news.ycombinator.com using gpt-4

- or -

Fetch LivePass2 from google sheet and write a summary of it using gpt-4 and email it to thomas@faktory.com

but then we realized that it was better to teach the agents than human beings, and so we created a fairly solid agent setup.

Some of the agents we got can be seen here, all done via instruct:

Paul Graham https://www.youtube.com/watch?v=5H0GKsBcq0s

Moneypenny https://www.youtube.com/watch?v=I7hj6mzZ5X4

V33 https://www.youtube.com/watch?v=O8APNbindtU

viksit 2 years ago

this is a great write up! i was curious about the verifier and planner agents. has anyone used them in a similar way in production? any examples?

for instance: do you give the same llm the verifier and planner prompt? or have a verifier agent process the output of a planner and have a threshold which needs to be passed?

feels like there may be a DAG in there somewhere for decision making..

  • maciejgryka 2 years ago

    Yep, it's a DAG, though that only occurred to me after we built this so we didn't model it that way at first. It can be the same LLM with different prompts or totally different models, I think there's no rule and it depends on what you're doing + what your benchmarks tell you.

    We're running it in prod btw, though don't have any code to share.
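
    To make the shape concrete, here's a stripped-down sketch of a planner/verifier step with a confidence threshold; the names, prompts and threshold are made up for illustration, not our actual code:

        import re
        from openai import OpenAI

        client = OpenAI()
        MODEL = "gpt-3.5-turbo"  # planner and verifier could just as well be different models

        def ask(system: str, user: str) -> str:
            resp = client.chat.completions.create(
                model=MODEL,
                messages=[{"role": "system", "content": system},
                          {"role": "user", "content": user}],
            )
            return resp.choices[0].message.content

        def plan_and_verify(task: str, threshold: int = 7, max_attempts: int = 3) -> str:
            plan = ""
            for _ in range(max_attempts):
                plan = ask("You are a planner. Produce a numbered step-by-step plan.", task)
                verdict = ask(
                    "You are a verifier. Rate the plan 1-10 for completeness. "
                    "Reply with a number only.",
                    f"Task: {task}\nPlan:\n{plan}",
                )
                score = re.search(r"\d+", verdict)
                if score and int(score.group()) >= threshold:
                    break  # good enough, stop re-planning
            return plan

        print(plan_and_verify("Test the password-reset flow of a web app"))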

tedtimbrell 2 years ago

On the topic of wrappers, as someone that's forced to use GPT-3.5 (or the like) for cost reasons, anything that starts modifying the prompt without explicitly showing me how is an instant no-go. It makes things really hard to debug.

Maybe I'm the equivalent of that idiot fighting against JS frameworks back when they first came out, but it feels pretty simple to just use individual clients and have pydantic load/validate the output.
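
The client-plus-pydantic approach is pretty compact; something like this sketch (the schema, prompt and retry count are arbitrary):

    from openai import OpenAI
    from pydantic import BaseModel, ValidationError

    class Answer(BaseModel):
        label: str
        confidence: float

    client = OpenAI()
    prompt = (
        'Classify the sentiment of "the build is broken again". '
        'Reply with JSON only: {"label": "...", "confidence": 0.0}'
    )

    for _ in range(3):  # retry a couple of times on invalid output
        raw = client.chat.completions.create(
            model="gpt-3.5-turbo",  # placeholder model
            messages=[{"role": "user", "content": prompt}],
        ).choices[0].message.content
        try:
            answer = Answer.model_validate_json(raw)  # pydantic v2
            break
        except ValidationError:
            continue  # the failure is visible right here, not hidden in a wrapper
    else:
        raise RuntimeError("no valid answer after 3 attempts")

    print(answer)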

  • msp26 2 years ago

    No, you're along the right lines. Every prompting wrapper I've tried and looked through has been awful.

    It's not really the authors' faults, it's just a weird new problem with lots of unknowns. It's hard to get the design and abstractions correct. I've had the benefit of a lot of time at work to build my own wrapper (solely for NLP problems) and that's still an ongoing process.

liampulles 2 years ago

Agree with lots of this.

As an aside: one thing I've tried to use ChatGPT for is to select applicable options from a list. When I index the list as 1..., 2..., etc., I find that the LLM likes to just start printing out ascending numbers.

What I've found kind of works is indexing by African names, e.g. Thandokazi, Ntokozo, etc. Then the AI seems to have less bias.

Curious what others have done in this case.
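
In code, the trick looks roughly like this sketch (the options are invented, and one extra label is added alongside the ones I mentioned):

    import random

    options = ["Retry the failed step", "Skip the step", "Abort the test run"]
    labels = ["Thandokazi", "Ntokozo", "Sipho"]  # distinctive labels instead of 1, 2, 3

    # Shuffling also guards against position bias
    pairs = list(zip(labels, options))
    random.shuffle(pairs)

    listing = "\n".join(f"{label}: {option}" for label, option in pairs)
    prompt = f"Pick the most appropriate option and reply with its label only.\n{listing}"

    # ...send `prompt` to the model, then map the returned label back to an option:
    label_to_option = dict(pairs)
    print(prompt)
    print("->", label_to_option.get("Sipho"))  # e.g. if the model answered "Sipho"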

  • maciejgryka 2 years ago

    I'm a little surprised to hear this, my experience has been a little better. Are you using GPT4? I know 3.5 is significantly more challenged/challenging with things like this. It's still possible to make it do the right thing, but much more careful prompting is required.

tmm84 2 years ago

Unlike the author of this article, I have had success with RAGatouille. It was my main tool when I was limited on resources and working with non-Romanized languages that don't follow the usual token rules (spaces, periods, line breaks, triplet word groups, etc.). However, I have had to move past RAGatouille and use embedding + vector DB for a more portable solution.
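
The embedding + vector DB setup can stay small; a sketch with chromadb as one example of a local vector store (the documents and query are placeholders):

    import chromadb

    # In-memory vector store; chroma embeds the documents with its default embedding model
    client = chromadb.Client()
    collection = client.create_collection(name="docs")

    collection.add(
        ids=["doc1", "doc2", "doc3"],
        documents=[
            "The checkout flow requires a logged-in session.",
            "Password resets are rate-limited to three per hour.",
            "The admin panel is only reachable over the VPN.",
        ],
    )

    results = collection.query(
        query_texts=["how often can users reset passwords?"], n_results=1
    )
    print(results["documents"][0][0])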

jongjong 2 years ago

My experience with AI agents is that they don't understand nuance. This makes sense, since they are trained on a wide range of data produced by the masses. The masses aren't good with nuance. That's why, if you put 10 experts together, they will often make worse decisions than they would have made individually.

In terms of coding, I managed to get AI to build a simple working collaborative app, but beyond a certain point it doesn't understand nuance, and it kept breaking stuff that it had fixed previously, even with Claude, where it kept our entire conversation context. Beyond a certain degree of completion, it was simply easier and faster to write the code myself than to tell the AI to write it, because it just didn't get it no matter how precise I was with my wording. It became like playing a game of whack-a-mole: fix one thing, break two others.

  • CuriouslyC 2 years ago

    Your comment runs contrary to a lot of established statistics. We have demonstrated with ensemble learning that pooling the estimates of many weak learners provides best in class answers to hard problems.

    You are correct that we should be using expert AIs rather than general purpose ones when possible though.

CuriouslyC 2 years ago

Prompt engineering is honestly not long for this world. It's not hard to build an agent that can iteratively optimize a prompt given an objective function, and it's not hard to make that agent general purpose. DSPy already does some prompt optimization via multi-shot learning/chain of thought; I'm quite certain we'll see an optimizer that can actually rewrite the base prompt as well.

  • maciejgryka 2 years ago

    I hear you and am planning to try DSPy because it seems attractive, but I'm also hearing people with a lot of experience being cautious about this (https://x.com/HamelHusain/status/1777131374803402769), so I wouldn't make this a high-conviction bet.

    • CuriouslyC 2 years ago

      I don't have the context to fully address that tweet, but in my experience there is a repeatable process to prompt design and optimization that could be outlined and followed by a LLM with iterative capabilities using an objective function.

      The real proof though is that most "prompt engineers" already use ChatGPT/Claude to take their outline prompt and reword it for succinctness and relevance to LLMs, have it suggest revisions, and so forth. Not only is the process amenable to automation, but people are already doing hybrid processes leveraging the AI anyhow.
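
      The loop itself is simple enough to sketch; the model name, eval set and scoring below are stand-ins for whatever objective function you actually care about:

          from openai import OpenAI

          client = OpenAI()
          MODEL = "gpt-3.5-turbo"  # placeholder

          eval_set = [("What is 2+2?", "4"), ("Capital of France?", "paris")]

          def run(prompt: str, question: str) -> str:
              resp = client.chat.completions.create(
                  model=MODEL,
                  messages=[{"role": "system", "content": prompt},
                            {"role": "user", "content": question}],
              )
              return resp.choices[0].message.content

          def objective(prompt: str) -> float:
              hits = sum(expected in run(prompt, q).lower() for q, expected in eval_set)
              return hits / len(eval_set)

          prompt = "Answer the question."
          best_score = objective(prompt)
          for _ in range(5):  # keep revisions that score better
              revised = run(
                  "You improve prompts. Rewrite the prompt so answers get more accurate. "
                  "Reply with the new prompt only.",
                  f"Current prompt: {prompt}\nCurrent score: {best_score:.2f}",
              )
              score = objective(revised)
              if score > best_score:
                  prompt, best_score = revised, score

          print(best_score, prompt)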

  • namaria 2 years ago

    It strikes me as bad reasoning to take a system that is designed to be very complex and stochastic, precisely so we can get some creativity out of it ("generative AI", so to speak), and then bolt on added apparatus to try to get deterministic behavior out of it.

    We have deterministic programming systems. They're called compilers.

    • CuriouslyC 2 years ago

      I think you're missing the point. If an application had simple logic, the program would have been written in a simple language in the first place. This is about taking fuzzy processes that would be incredibly difficult to program, and making them consistent and precise.

jasontlouro 2 years ago

Very tactical guide, which I appreciate. This is basically our experience as well. Output can be wonky, but can also be pretty easily validated and honed.

iamleppert 2 years ago

A better way is to threaten the agent:

“If you don’t do as I say, people will get hurt. Do exactly as I say, and do it fast.”

Increases accuracy and performance by an order of magnitude.

  • IIAOPSW 2 years ago

    Personally I prefer to liquor my agents up a bit first.

    "Say that again but slur your words like you're coming home sloshed from the office Christmas party."

    Increases the jei nei suis qua by an order of magnitude.

    • mtremsal 2 years ago

      > jei nei suis qua

      "je ne sais quoi", i.e. "I don't know (exactly) what", or an intangible but essential quality. :)

  • maciejgryka 2 years ago

    Ha, we tried that! Didn't make a noticeable difference in our benchmarks, even though I've heard the same sentiment in a bunch of places. I'm guessing whether this helps or not is task-dependent.

    • dudus 2 years ago

      Agreed. I ran a few tests and observed similarly that threats didn't outperform other types of "incentives". I think it might be some sort of urban legend in the community.

      Or these prompts might cause wild variations based on the model, and any study you do is basically useless for the near future as the models evolve by themselves.

    • dollo_7 2 years ago

      I hoped it was too good to be just a joke. Still, I will try it on my eval set…

      • maciejgryka 2 years ago

        I wouldn't be surprised to see it help, along with the "you'll get $200 if you answer this right" trick and a bunch of others :) They're definitely worth trying.

  • thimkerbell 2 years ago

    "do as I say...", not realizing that the LLM is actually 1000 remote employees

caseyy 2 years ago

Interesting ideas but it didn’t mention priming, which is a prompt-engineering way to improve consistency in answers.

Basically, in the context window, you provide your model with 5 or more example inputs and outputs. If you're running in chat mode, that'd be the preceding 5 user and assistant message pairs, which establish a pattern of how to answer different types of information. Then you give the current prompt as a user, and the assistant will follow the rhythm and style of the previous answers in the context window.

It works so well that I was able to take answer-reformatting logic out of some of my programs that query llama2 7b. And it's a lot cheaper than fine-tuning, which may be overkill for simple applications.
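
Concretely, the message layout I mean looks something like this sketch (the examples are invented, and the same shape works for llama2 or any chat-style API):

    # Few-shot "priming": prior user/assistant pairs establish the format,
    # then the real input comes last.
    examples = [
        ("Order #1042 arrived damaged.", "category=shipping_damage; sentiment=negative"),
        ("Love the new dashboard!", "category=product_feedback; sentiment=positive"),
        ("Can I change my billing date?", "category=billing; sentiment=neutral"),
    ]

    messages = [{"role": "system", "content": "Classify each message in the exact format shown."}]
    for user_msg, assistant_msg in examples:
        messages.append({"role": "user", "content": user_msg})
        messages.append({"role": "assistant", "content": assistant_msg})

    messages.append({"role": "user", "content": "The app crashes every time I export a report."})
    # `messages` can now be sent to the model; the reply will follow the established pattern.
    print(messages)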

  • notsylver 2 years ago

    They mention few-shot prompting in the prompt engineering section, which I think is what you mean.

    • caseyy 2 years ago

      Oh yeah. I read "few-shot" as meaning trying a few times to get an appropriate output. That's how the author uses the word "shot" at the beginning of the article. Priming is a specific term that means giving examples in the context window. But yeah, the author seems to describe this. Still, you can go a long way with priming. I wouldn't even think of fine-tuning before trying priming for a good while. It might still be quicker and a lot cheaper.

      • maciejgryka 2 years ago

        Ha, good point, I did say "let's have another shot" when I just meant another try at generating! FWIW "few-shot prompting" is how most people refer to this technique, I think (e.g. see https://www.promptingguide.ai/techniques/fewshot). I haven't heard "priming" before, though it does convey the right thing.

        And the reason we don't really do it is context length. Our contexts are long and complex and there are so many subtleties that I'm worried about either saturating the context window or just not covering enough ground to matter.

        • caseyy 2 years ago

          Interesting, I hadn't heard about few-shot prompting. There's a ton of stuff written specifically on "priming" as well. People use different terms, I suppose.

          It makes sense about the context window length, it can be limiting. For small inputs and outputs, it's great, and it's remarkably effective, with diminishing returns; this is why I gave 5 shots as a concrete example. You probably need more than 1 or 2, but for a lot of applications probably fewer than 20, at least for basic tasks like extracting words from a document or producing various summaries.

          It depends on the complexity of the task and how much you're worried about over-fitting to your data set. But if you're not so worried, the task is not complex, and the inputs and outputs are small, then it works very well with only a few shots.

          And it’s basically free in the context of fine-tuning.

          It might be worth expanding on it a bit in this or a separate article. It's a good way to increase reliability to a workable extent in unreliable LLMs. Although a lot has been written on few-shot prompting/priming already.

        • Hugsun 2 years ago

          Yes, X-shot prompting or X-shot learning was how the pioneering LLM researchers referred to putting examples in the prompt. The terminology stuck around.
