Show HN: Composo – Evaluating and improving performance of LLM applications

composo.ai

30 points by sfox100 2 years ago · 6 comments

Hey HN!

We’ve been building Composo, a platform that helps teams achieve high performance, guarantee accuracy & minimise the cost of LLM applications.

Problem we’re solving:

LLM applications are non-deterministic, so evaluating whether results are good or bad is highly subjective and often requires domain expertise. Iterating over 1000s of combinations of prompts, models, temperatures, RAG settings (& many other elements) is therefore very manual & time-consuming.
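
To make the scale of that concrete, here’s a quick illustrative sketch (plain Python; the grid values are made up) of how fast a sweep over those combinations grows:

    from itertools import product

    # Illustrative parameter grid for an LLM app (values are made up)
    models = ["gpt-4", "gpt-3.5-turbo", "claude-2"]
    temperatures = [0.0, 0.3, 0.7, 1.0]
    prompts = [f"prompt_v{i}" for i in range(10)]
    chunk_sizes = [256, 512, 1024]   # a RAG setting
    top_ks = [3, 5, 10]              # another RAG setting

    combinations = list(product(models, temperatures, prompts, chunk_sizes, top_ks))
    print(len(combinations))  # 3 * 4 * 10 * 3 * 3 = 1080 runs to review by hand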

How we are solving it:

Composo links directly to your application (simple to set up, but highly flexible), enabling it to function like a remote control for your app. Once set up, anyone on your team (including non-technical domain experts or PMs) can use Composo to easily test your application with different models, prompts, temperatures & RAG settings (or any other variable in your codebase you decide to make available at initial set up). Crucially, this is simple enough to be used by anyone, but powerful enough for any application (e.g. real apps built in code, using agents, etc.).
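
To give a rough sense of the integration idea (a simplified, hypothetical sketch only, not our actual SDK or API), "making a variable available at initial set up" amounts to something like registering defaults that the platform can then override on each run:

    # Hypothetical illustration only -- not Composo's real API.
    # Variables registered here become remotely tunable, so a PM can
    # change them from a UI without touching the code.
    DEFAULTS = {"model": "gpt-4", "temperature": 0.2, "top_k": 5}

    def run_app(question, overrides=None):
        # `overrides` stands in for per-run settings sent by the platform
        cfg = {**DEFAULTS, **(overrides or {})}
        # ...the real app would run its LLM / RAG pipeline with cfg here...
        return {"question": question, "config_used": cfg}

    print(run_app("What is RAG?", {"temperature": 0.9}))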

This testing can be done in both our playground & our evaluation suite:

1) Playground: Here you can ‘chat’ with your application in a UI similar to the OpenAI playground, but with inputs run against your actual application rather than a simple LLM call, & with the ability to change any variable you like directly within the Composo UI (e.g. system message, temperature, model, RAG settings).

2) Evaluation suite: Here you can conduct rigorous testing & evaluation of your application, either ad hoc during development or repeated over time to check for performance regressions. Our test suite contains automated evaluation tools including: comparison to ground-truth answers (with exact match, vector similarity & LLM-graded similarity), checks against specific criteria (e.g. code validity, JSON validity, specific keyword inclusion or exclusion) & AI grading (this uses the Composo AI critic, which leverages the latest research in LLM auto-evaluation under the hood).
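
For anyone unfamiliar with these ground-truth comparison styles, here’s a rough illustrative sketch of the three (generic Python, not our actual implementation; `embed` & `llm` are placeholders for any embedding model & grader LLM you supply):

    import numpy as np

    def exact_match(output, truth):
        # Strictest check: normalised string equality
        return output.strip().lower() == truth.strip().lower()

    def vector_similarity(output, truth, embed):
        # Cosine similarity between embeddings; `embed` is any
        # text -> vector callable (e.g. an embedding model API)
        a, b = np.asarray(embed(output)), np.asarray(embed(truth))
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

    def llm_graded_similarity(output, truth, llm):
        # Ask a grader model to score semantic agreement; `llm` is any
        # prompt -> completion callable (assumes it returns just a digit)
        prompt = ("Rate from 1-5 how well the ANSWER matches the REFERENCE.\n"
                  f"REFERENCE: {truth}\nANSWER: {output}\nScore:")
        return int(llm(prompt).strip())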

The easiest things to get started with, without having to link an application or even sign up, are:

1) Play with different models in our playground by chatting directly or using our demo apps (e.g. an AI doctor)

2) Automate your prompt writing & optimisation with our AI prompt writer

Thanks so much, and we’d be hugely grateful for any feedback!

llm-dev 2 years ago

How is this different from existing solutions such as Humanloop or Agenta? seems as though there are loads of companies popping up doing similar stuff

  • sfox100OP 2 years ago

    Great question!

    We see ourselves targeting different segments of the market.

    1) Composo focusses on non-technical users like PMs or domain experts (in contrast to dev-focussed & open-source tools like Agenta).

    2) With Composo, you build your application in code in a codebase (rather than building it in a low-code UI builder, like Humanloop). This enables teams to develop & test applications of any level of complexity, with no constraints on functionality.

wheld1 2 years ago

codebase integration sounds nice. always missing that when playing around with the existing open-source solutions

  • pggeek 2 years ago

    doesn't langsmith offer that as well? i don't see how this would be too different

    • sfox100OP 2 years ago

Yes, great point. The difference is that LangSmith is more focussed on dev users, rather than PM/domain expert users. LangSmith is also more on the observability & monitoring side, whereas we’re more about testing & optimisation during development

  • arminsommer 2 years ago

    glad to hear you like it! always open to suggestions on how we could improve it.
