prompt-learning.md


Given examples of desired output, automatically discover the best system prompt for any LLM task.

The Problem

Any codebase that uses LLMs has prompts — for text generation, data transformation, classification, summarization. These prompts are hand-written, manually iterated, and frozen. When requirements shift or quality degrades, a developer tweaks the prompt by hand again. There's no systematic way for agents to self-optimize their own prompts given examples of what good output looks like.

The Idea

Treat prompt authoring as a learning problem. The LLM is both the optimizer and the evaluator:

  1. Feed in examples of desired output (and optionally an existing prompt to refine).
  2. A thinking model analyzes examples to discover dimensions of quality — the independent axes along which "good" varies for this specific task.
  3. It generates a candidate prompt from scratch, or starts from the existing one.
  4. It tests the prompt by generating new output, then scores it against the originals on each discovered dimension.
  5. It hypothesizes what's weak, refines the prompt.
  6. Repeat N times. Keep the best-scoring prompt.

The output is a single system prompt, ready for production. No fine-tuning, no training data formatting, no GPU.

Why It Works

The LLM discovers what to optimize for. Hand-written rubrics reflect the prompt engineer's assumptions. By analyzing examples with extended thinking, the model finds dimensions the engineer wouldn't think to specify — domain-specific quality signals that are obvious in the examples but invisible to introspection.

Evaluation is grounded in examples, not self-assessment. The model doesn't ask "did I follow my own instructions?" — it asks "does this look like the reference examples on each dimension?" This catches blind spots the prompt itself doesn't address.

Best-of-N selection absorbs noise. Refinement can overshoot. Tracking the best across all iterations means you never lose a good prompt to a bad edit.

Analysis and generation are deliberately separate. Going straight from examples to a prompt produces shallow results. Forcing the model to observe before acting yields richer understanding and faster convergence.

The Protocol

Input

Required: Examples of desired output — writing samples, good outputs from a manual process, reference documents, or labeled input-output pairs. Minimum ~1,000 words or ~10 pairs.

Optional: An existing prompt to use as starting point.

Step 1: Analyze

Read all examples with a thinking model (extended thinking / chain-of-thought). Identify:

  • Dimensions of quality — the axes that define "good" for this task. These become the evaluation rubric for all subsequent iterations. Dimensions can be LLM-judged qualitative axes (tone, structure, specificity) or computable metrics (cosine similarity to reference embeddings, exact-match accuracy, BLEU score) — whatever fits the task. The thinking model decides.
  • Patterns — recurring structural, stylistic, or logical choices.
  • Anti-patterns — what the examples consistently avoid.

This is the most consequential step. If the analysis is shallow, the loop optimizes for the wrong things. Use the strongest available reasoning model.
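The analysis output might take a shape like the following — a minimal sketch; the field names (`dimensions`, `patterns`, `anti_patterns`, `kind`) are illustrative assumptions, not part of the protocol:

```python
# Illustrative shape of the Step 1 analysis output (field names are assumptions).
# Each dimension is either LLM-judged or backed by a computable metric.
analysis = {
    "dimensions": [
        {"name": "tone", "kind": "llm_judged",
         "description": "Conversational but precise; no hollow superlatives."},
        {"name": "cosine_similarity", "kind": "computable",
         "description": "Embedding similarity to reference examples."},
    ],
    "patterns": [
        "Opens with a concrete claim, not a throat-clearing preamble.",
        "Sentences average 12-18 words.",
    ],
    "anti_patterns": [
        "Cliche transitions ('moreover', 'in conclusion').",
        "Structural monotony: every paragraph the same length.",
    ],
}

# Sanity check: every dimension must be one of the two kinds.
assert all(d["kind"] in {"llm_judged", "computable"}
           for d in analysis["dimensions"])
```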

Step 2: Generate Candidate Prompt

If no existing prompt was provided, generate one from the analysis. Otherwise, refine the existing prompt using the analysis. Write as direct instructions, not descriptions: "Write sentences that average 12-18 words" not "The author tends to use medium-length sentences." Include an explicit "avoid" section. Target 500-2000 words.
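Rendering the analysis into direct instructions can be sketched as follows — `render_prompt` and the input shape are hypothetical, but the wording rules (imperative instructions, explicit "avoid" section) are the ones stated above:

```python
def render_prompt(analysis: dict) -> str:
    """Turn an analysis dict into a candidate system prompt:
    direct instructions plus an explicit 'avoid' section."""
    lines = ["You are writing in a specific learned style. Follow these rules:"]
    lines += [f"- {p}" for p in analysis["patterns"]]
    lines.append("")
    lines.append("Avoid:")
    lines += [f"- {a}" for a in analysis["anti_patterns"]]
    return "\n".join(lines)

candidate = render_prompt({
    "patterns": ["Write sentences that average 12-18 words."],
    "anti_patterns": ["Hollow superlatives ('incredible', 'game-changing')."],
})
```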

Step 3: Iteration Loop

Run N times (default 5, adjust by task complexity):

3a. Generate — Produce test output using the current prompt as system instruction. Moderate temperature (0.6-0.8), neutral topic — test robustness, not memorization.

3b. Evaluate — Thinking model scores the test output against reference examples on each dimension. For qualitative dimensions, LLM-judged 0-100. For computable dimensions, run the metric directly (cosine similarity, exact match, etc.). Returns per-dimension scores, overall score, and 2-3 actionable weaknesses.

3c. Track best — If overall score exceeds previous best, save this prompt.

3d. Refine — Current prompt + scores + feedback → improved prompt. Focus on lowest-scoring dimensions, preserve high-scoring sections. Low temperature (0.2-0.4).
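The loop in 3a-3d can be sketched as follows. `generate`, `evaluate`, and `refine` stand in for the model calls and are passed as callables here; only the best-tracking and logging logic is concrete:

```python
from typing import Callable

def learn_prompt(initial_prompt: str,
                 generate: Callable[[str], str],      # prompt -> test output (3a)
                 evaluate: Callable[[str], dict],     # output -> {"overall": int, ...} (3b)
                 refine: Callable[[str, dict], str],  # prompt + feedback -> new prompt (3d)
                 n_iterations: int = 5) -> dict:
    best = {"prompt": initial_prompt, "score": -1, "iteration": 0}
    log = []
    prompt = initial_prompt
    for i in range(1, n_iterations + 1):
        output = generate(prompt)                     # 3a: generate test output
        result = evaluate(output)                     # 3b: score against references
        chosen = result["overall"] > best["score"]
        if chosen:                                    # 3c: track best across iterations
            best = {"prompt": prompt, "score": result["overall"], "iteration": i}
        log.append({"iteration": i, "overall": result["overall"], "chosen": chosen})
        prompt = refine(prompt, result)               # 3d: refine for next round
    return {"learned_prompt": best["prompt"], "best_score": best["score"],
            "best_iteration": best["iteration"], "conformity_log": log}
```

Because the best prompt is tracked across all iterations, a refinement that overshoots in 3d never destroys an earlier good result.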

Output

{
  "learned_prompt": "...",
  "best_score": 82,
  "best_iteration": 3,
  "dimensions": ["tone", "cosine_similarity", "structure", "specificity"],
  "conformity_log": [
    {"iteration": 1, "overall": 68, "scores": {...}, "chosen": false},
    {"iteration": 2, "overall": 74, "scores": {...}, "chosen": false},
    {"iteration": 3, "overall": 82, "scores": {...}, "chosen": true}
  ]
}

Where It's Been Used

Writing style learning. A publishing platform ingests a user's past writing, runs the protocol, and produces a learned prompt that generates new content in the user's voice. The anti-pattern discovery turned out to be critical — the system also maintains a "slop detection" list (cliché transitions, hollow superlatives, structural monotony) injected alongside the learned prompt. Users edit generated drafts lightly rather than rewriting them.

Cold-start recommendations. A B2B recommendation engine needs to describe a new user's ideal customer profile from just their website, with no interaction history. The protocol learns a prompt from historical data (users who liked certain items, paired with their websites), evaluated by embedding distance between generated descriptions and actual liked-item descriptions. The learned prompt replaced a hand-written one in production.
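The embedding-distance evaluation in that deployment is one example of a computable dimension. A plain cosine-similarity metric, assuming embeddings are already computed, could look like this sketch:

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def mean_similarity(generated: list[float],
                    references: list[list[float]]) -> float:
    """Score a generated description by its mean similarity
    to the embeddings of actually-liked items."""
    return sum(cosine_similarity(generated, r) for r in references) / len(references)
```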

Comparison With Alternatives

vs. Few-shot prompting. Few-shot puts examples in the prompt at inference time, consuming tokens on every call. Prompt learning distills examples into instructions once — smaller, faster, cheaper at inference.

vs. Fine-tuning. Fine-tuning modifies model weights. Prompt learning stays at the prompt level — works with any API model, and the result is inspectable, editable text.

vs. DSPy / OPRO. These optimize prompts programmatically with gradient-like signals. Prompt learning is simpler — no framework, works in a single conversation. Less rigorous for tasks with large labeled datasets and clear metrics; better where quality is subjective or multidimensional.

Limitations

Convergence is not guaranteed — the conformity log tells you when to stop. Examples must be representative. The method produces one prompt; for multi-step pipelines, run it separately per stage.
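A simple stopping rule over the conformity log might look like this sketch — the window size and minimum gain are arbitrary defaults, not part of the protocol:

```python
def should_stop(log: list[dict], window: int = 3, min_gain: int = 2) -> bool:
    """Stop when the overall score has not improved by at least
    `min_gain` points over the last `window` iterations."""
    if len(log) < window + 1:
        return False
    recent = [entry["overall"] for entry in log[-(window + 1):]]
    return max(recent[1:]) - recent[0] < min_gain
```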


Developed in production at Leadbay.ai.

License

MIT License.

Go Beyond

/refine — a Skill based on this protocol. Thanks @raywu.