Given examples of desired output, automatically discover the best system prompt for any LLM task.
The Problem
Any codebase that uses LLMs has prompts — for text generation, data transformation, classification, summarization. These prompts are hand-written, manually iterated, and frozen. When requirements shift or quality degrades, a developer tweaks the prompt by hand again. There's no systematic way for agents to self-optimize their own prompts given examples of what good output looks like.
The Idea
Treat prompt authoring as a learning problem. The LLM is both the optimizer and the evaluator:
- Feed in examples of desired output (and optionally an existing prompt to refine).
- A thinking model analyzes examples to discover dimensions of quality — the independent axes along which "good" varies for this specific task.
- It generates a candidate prompt from scratch, or starts from the existing one.
- It tests the prompt by generating new output, then scores it against the originals on each discovered dimension.
- It hypothesizes what's weak, refines the prompt.
- Repeat N times. Keep the best-scoring prompt.
The output is a single system prompt, ready for production. No fine-tuning, no training data formatting, no GPU.
Why It Works
The LLM discovers what to optimize for. Hand-written rubrics reflect the prompt engineer's assumptions. By analyzing examples with extended thinking, the model finds dimensions the engineer wouldn't think to specify — domain-specific quality signals that are obvious in the examples but invisible to introspection.
Evaluation is grounded in examples, not self-assessment. The model doesn't ask "did I follow my own instructions?" — it asks "does this look like the reference examples on each dimension?" This catches blind spots the prompt itself doesn't address.
Best-of-N selection absorbs noise. Refinement can overshoot. Tracking the best across all iterations means you never lose a good prompt to a bad edit.
Analysis and generation are deliberately separate. Going straight from examples to a prompt produces shallow results. Forcing the model to observe before acting yields richer understanding and faster convergence.
The Protocol
Input
Required: Examples of desired output — writing samples, good outputs from a manual process, reference documents, or labeled input-output pairs. Minimum ~1,000 words or ~10 pairs.
Optional: An existing prompt to use as starting point.
Step 1: Analyze
Read all examples with a thinking model (extended thinking / chain-of-thought). Identify:
- Dimensions of quality — the axes that define "good" for this task. These become the evaluation rubric for all subsequent iterations. Dimensions can be LLM-judged qualitative axes (tone, structure, specificity) or computable metrics (cosine similarity to reference embeddings, exact-match accuracy, BLEU score) — whatever fits the task. The thinking model decides.
- Patterns — recurring structural, stylistic, or logical choices.
- Anti-patterns — what the examples consistently avoid.
This is the most consequential step. If the analysis is shallow, the loop optimizes for the wrong things. Use the strongest available reasoning model.
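A computable dimension needs no LLM judge. As a minimal sketch of the "cosine similarity to reference embeddings" case: the function below is plain vector math, and the embedding vectors are hypothetical placeholders standing in for whatever embedding API you use.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical embeddings of a test output and a reference example.
test_vec = [0.1, 0.9, 0.2]
ref_vec = [0.2, 0.8, 0.1]
score = cosine_similarity(test_vec, ref_vec)  # close to 1.0 for similar texts
```

Scores like this slot directly into the rubric alongside LLM-judged dimensions.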
Step 2: Generate Candidate Prompt
If no existing prompt was provided, generate one from the analysis. Otherwise, refine the existing prompt using the analysis. Write as direct instructions, not descriptions: "Write sentences that average 12-18 words" not "The author tends to use medium-length sentences." Include an explicit "avoid" section. Target 500-2000 words.
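In the from-scratch case, the candidate prompt is ordinary string assembly over the analysis output. A sketch, assuming Step 1 returned patterns and anti-patterns as lists of strings (the function name and prompt wording here are illustrative, not part of the protocol):

```python
def build_candidate_prompt(patterns: list[str], anti_patterns: list[str]) -> str:
    """Assemble a first candidate: direct instructions plus an explicit avoid section."""
    lines = ["You are writing in a learned style. Follow these rules:"]
    lines += [f"- {p}" for p in patterns]        # direct instructions, not descriptions
    lines.append("")
    lines.append("Avoid:")
    lines += [f"- {a}" for a in anti_patterns]   # the explicit "avoid" section
    return "\n".join(lines)

prompt = build_candidate_prompt(
    ["Write sentences that average 12-18 words.", "Open with a concrete claim."],
    ["Cliche transitions such as 'in today's fast-paced world'."],
)
```

In practice the refinement model writes this text itself; the point is only that the result is instructions with an avoid section, not a description of the examples.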
Step 3: Iteration Loop
Run N times (default 5, adjust by task complexity):
3a. Generate — Produce test output using the current prompt as system instruction. Moderate temperature (0.6-0.8), neutral topic — test robustness, not memorization.
3b. Evaluate — Thinking model scores test output against reference examples on each dimension. For qualitative dimensions, LLM-judged 0-100. For computable dimensions, run the metric directly (cosine similarity, exact match, etc.). Returns per-dimension scores, overall score, and 2-3 actionable weaknesses.
3c. Track best — If overall score exceeds previous best, save this prompt.
3d. Refine — Current prompt + scores + feedback → improved prompt. Focus on lowest-scoring dimensions, preserve high-scoring sections. Low temperature (0.2-0.4).
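The loop itself is a small amount of orchestration around three model calls. A sketch with stubbed callables: `generate_output`, `evaluate`, and `refine` stand in for whatever LLM client you use, and none of these names come from a real API.

```python
def learn_prompt(initial_prompt: str, generate_output, evaluate, refine, n: int = 5) -> dict:
    """Best-of-N refinement: track the highest-scoring prompt across all iterations."""
    prompt = initial_prompt
    best = {"learned_prompt": prompt, "best_score": -1, "best_iteration": 0}
    log = []
    for i in range(1, n + 1):
        output = generate_output(prompt)            # 3a: generate at moderate temperature
        overall, weaknesses = evaluate(output)      # 3b: score vs. reference examples
        chosen = overall > best["best_score"]
        if chosen:                                  # 3c: never lose a good prompt to a bad edit
            best.update(learned_prompt=prompt, best_score=overall, best_iteration=i)
        log.append({"iteration": i, "overall": overall, "chosen": chosen})
        prompt = refine(prompt, overall, weaknesses)  # 3d: low-temperature refinement
    best["conformity_log"] = log
    return best

# Usage with deterministic stubs in place of real model calls:
scores = iter([68, 74, 82, 70, 75])
result = learn_prompt(
    "draft prompt",
    generate_output=lambda p: "test output",
    evaluate=lambda o: (next(scores), ["too generic"]),
    refine=lambda p, s, w: p + " (refined)",
)
```

Note that the prompt saved at step 3c is the one that *produced* the best score, before the refinement that follows it.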
Output
{
  "learned_prompt": "...",
  "best_score": 82,
  "best_iteration": 3,
  "dimensions": ["tone", "cosine_similarity", "structure", "specificity"],
  "conformity_log": [
    {"iteration": 1, "overall": 68, "scores": {...}, "chosen": false},
    {"iteration": 2, "overall": 74, "scores": {...}, "chosen": false},
    {"iteration": 3, "overall": 82, "scores": {...}, "chosen": true}
  ]
}

Where It's Been Used
Writing style learning. A publishing platform ingests a user's past writing, runs the protocol, and produces a learned prompt that generates new content in the user's voice. The anti-pattern discovery turned out to be critical — the system also maintains a "slop detection" list (cliche transitions, hollow superlatives, structural monotony) injected alongside the learned prompt. Users edit generated drafts lightly rather than rewriting them.
Cold-start recommendations. A B2B recommendation engine needs to describe a new user's ideal customer profile from just their website, with no interaction history. The protocol learns a prompt from historical data (users who liked certain items, paired with their websites), evaluated by embedding distance between generated descriptions and actual liked-item descriptions. The learned prompt replaced a hand-written one in production.
Comparison With Alternatives
vs. Few-shot prompting. Few-shot puts examples in the prompt at inference time, consuming tokens on every call. Prompt learning distills examples into instructions once — smaller, faster, cheaper at inference.
vs. Fine-tuning. Fine-tuning modifies model weights. Prompt learning stays at the prompt level — works with any API model, and the result is inspectable, editable text.
vs. DSPy / OPRO. These optimize prompts programmatically with gradient-like signals. Prompt learning is simpler — no framework, works in a single conversation. Less rigorous for tasks with large labeled datasets and clear metrics; better where quality is subjective or multidimensional.
Limitations
Convergence is not guaranteed — the conformity log tells you when to stop. Examples must be representative. The method produces one prompt; for multi-step pipelines, run it separately per stage.
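One simple stopping heuristic over the conformity log: halt when the overall score has stopped improving. A sketch; the patience and minimum-gain thresholds are arbitrary illustrative choices, not part of the protocol.

```python
def should_stop(conformity_log: list[dict], patience: int = 2, min_gain: int = 1) -> bool:
    """Stop when overall score has not improved by min_gain over the last `patience` iterations."""
    if len(conformity_log) <= patience:
        return False  # not enough history to judge a plateau
    recent = [entry["overall"] for entry in conformity_log[-(patience + 1):]]
    return max(recent[1:]) - recent[0] < min_gain

# Scores plateaued at 74 for two iterations -> stop.
plateaued = should_stop([{"overall": 68}, {"overall": 74}, {"overall": 74}, {"overall": 74}])
# Scores still climbing -> keep going.
climbing = should_stop([{"overall": 68}, {"overall": 74}, {"overall": 82}])
```

Since best-of-N already protects against regressions, stopping early only saves cost; it never degrades the result.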
Developed in production at Leadbay.ai.
License
MIT License.