Background On Skills And How DSPy Can Help Us Optimize Them
OpenAI added Skills to Codex this week. Anthropic released the Agent Skills spec a few days ago.
The format: reusable prompt files (SKILL.md) that agents can invoke:
code-review.md # Find security vulnerabilities
sql-optimization.md # Optimize queries
api-design.md # Review endpoints
These are 100-200 line markdown files with instructions, examples, and output formats.
A Skill can be as narrow as "Replace text in a PDF," which probably doesn't need any optimization at all. It can also be as broad as "Find security issues in the code," which we can optimize almost endlessly.
Skills are structured prompts. DSPy is a framework for optimizing prompts. What if we could optimize Skills with DSPy-style optimization?
The main idea: if Skills are just structured prompts, they should be programmatically optimizable. You could benchmark them, version them, and auto-generate model-specific variants.
I tested this hypothesis with a code security review skill.
Approach To Test If Skills Can Be Optimized
DSPy is a prompt optimization framework from Stanford. Instead of manually tweaking prompts, you:
- Define task as a signature (inputs/outputs)
- Provide training examples
- Run optimizer
- Get improved prompt with auto-selected few-shot examples
The optimizer tests different instruction phrasings, example combinations, and reasoning strategies, then picks the best performing variant.
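To make that concrete, here's a minimal sketch of the loop using a generic summarization signature and a toy metric (the signature, example, and metric are illustrative, not from the repo; it assumes a local Ollama model is available):
import dspy
from dspy.teleprompt import BootstrapFewShot

dspy.configure(lm=dspy.LM("ollama_chat/qwen3"))  # any model works; this assumes a local Ollama qwen3

# 1. Define the task as a signature (inputs/outputs)
class Summarize(dspy.Signature):
    """Summarize the document in one sentence."""
    document = dspy.InputField()
    summary = dspy.OutputField()

# 2. Provide training examples
trainset = [
    dspy.Example(document="DSPy compiles prompts from training examples.",
                 summary="DSPy builds prompts automatically.").with_inputs("document"),
]

# 3. Run the optimizer with a (toy) metric
def metric(example, prediction, trace=None):
    return float(len(prediction.summary) > 0)

optimizer = BootstrapFewShot(metric=metric)

# 4. Get an improved module with auto-selected few-shot demonstrations
optimized = optimizer.compile(dspy.Predict(Summarize), trainset=trainset)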
The Pipeline
To test if Skills can be optimized with DSPy:
1. Parse SKILL.md → DSPy Signature
2. Create training data (vulnerable code + expected issues)
3. Run DSPy optimization
4. Extract improvements → SKILL-optimized.md
What this proves: If DSPy can improve a Skill's output quality, then Skills aren't just static markdown files - they're improvable programs.
I tested with a code security review skill that finds SQL injection, weak crypto, etc.
Setup to Optimize code-review Skill
Input: code-review.md - 80-line skill that finds security vulnerabilities
Training data: 10 vulnerable code samples:
- SQL injection (Python, JavaScript)
- Weak crypto (MD5, hardcoded secrets)
- Command injection
- XSS, path traversal, IDOR, etc.
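For illustration, a vulnerable training sample along the lines of the SQL injection and weak-crypto cases might look like this (a hypothetical snippet, not the repo's actual examples/sql_injection.py):
import hashlib
import sqlite3

def authenticate_user(username, password):
    conn = sqlite3.connect("app.db")
    cursor = conn.cursor()
    # Vulnerable: user input concatenated directly into the SQL string
    query = "SELECT * FROM users WHERE name = '" + username + "' AND pw = '" + password + "'"
    cursor.execute(query)
    return cursor.fetchone()

def hash_password(password):
    # Vulnerable: MD5 is not suitable for password hashing
    return hashlib.md5(password.encode()).hexdigest()
The matching entry in the training data then lists the expected findings (SQL injection, weak password hashing) with severities and suggested fixes.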
DSPy Signature:
class CodeReview(dspy.Signature):
    """Find security vulnerabilities and bugs in code."""
    code = dspy.InputField()
    language = dspy.InputField()
    critical_issues = dspy.OutputField()
    high_issues = dspy.OutputField()
Optimization:
optimizer = dspy.BootstrapFewShot(metric=quality_metric)
optimized = optimizer.compile(baseline, trainset=examples)
Models tested: GPT-4o (Azure), Qwen3 (Ollama)
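For reference, wiring up those two backends looks roughly like this, assuming DSPy's LiteLLM-backed dspy.LM (the deployment name comes from the env vars used later in this post; your Azure setup may also need an API version):
import os
import dspy

# Azure-hosted GPT-4o
azure_lm = dspy.LM(
    f"azure/{os.environ['AZURE_DEPLOYMENT']}",
    api_key=os.environ["AZURE_API_KEY"],
    api_base=os.environ["AZURE_API_BASE"],
)

# Local Qwen3 served by Ollama
qwen_lm = dspy.LM("ollama_chat/qwen3", api_base="http://localhost:11434", api_key="")

dspy.configure(lm=qwen_lm)  # switch to azure_lm to test GPT-4o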
Did We Succeed In Optimizing Skills?
Bottom line: Yes, Skills can be programmatically optimized.
I tested on two very different models to see how broadly the concept holds:
- GPT-4o (frontier model, Azure)
- Qwen3 (local model, Ollama)
The results show something interesting about Skills themselves.
GPT-4o (Azure)
Baseline: 40.6% quality score
baseline = dspy.Predict(CodeReview)
Optimized variants:
# Chain of Thought
cot = dspy.ChainOfThought(CodeReview)
# 39.2%
# Enhanced signature
enhanced = dspy.ChainOfThought(EnhancedCodeReview)
# 38.8%
# BootstrapFewShot
optimized = optimizer.compile(module, trainset=10_examples)
# ~40% ❌
Result: 0% improvement. Some variants performed worse.
What this tells us: The baseline code-review.md didn't improve. This most likely means our set of 10 examples wasn't comprehensive enough; with more examples we would likely see at least some improvement. Also, since the security field keeps evolving, this Skill will need re-optimization from time to time. Revisiting Skills with more examples, whether collected from users or written manually, should help optimize them further.
It's easier to prove that optimization works with a less capable model - say, a locally deployed quantized model. The next section uses qwen3:8B.
Qwen3 (Ollama, local)
Tested with REAL DSPy BootstrapFewShot optimization on 3 code examples:
BASELINE: No optimization
Testing 1/3... Score: 61.6%
Testing 2/3... Score: 56.0%
Testing 3/3... Score: 10.0%
- Baseline Average: 42.5%
OPTIMIZED: BootstrapFewShot (automated)
• DSPy automatically selected best examples
• Generated optimized prompt with demonstrations
• No manual prompt engineering
Testing 1/3... Score: 77.0%
Testing 2/3... Score: 65.7%
Testing 3/3... Score: 22.5%
- Optimized Average: 55.1%
Result: BootstrapFewShot showed improvement of 12.5 percentage points (42.5% → 55.1%)
What this tells us: DSPy's automated optimization works significantly better than manual prompt variations. The optimizer automatically selected which training examples work best and compiled an optimized module - exactly what DSPy is designed to do.
Can Skills Be More Than Mere Prompts?
What Did We Enable for Skills Here
1. Skills can be benchmarked objectively
Right now, Skills are shared as markdown files with no quality metrics. With DSPy optimization:
my-skill/
├── SKILL.md # Baseline
├── TRAINING.json # Test cases
├── METRICS.json # Quality scores
└── OPTIMIZED.md # Auto-generated variants
You could compare Skills objectively: "code-review-v2.md scores 15% better on SQL injection detection"
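A sketch of what such a comparison could look like with DSPy's built-in evaluator - devset, quality_metric, and the two skill modules are placeholders you'd supply:
from dspy.evaluate import Evaluate

# Same held-out test cases and metric for every Skill variant
evaluator = Evaluate(devset=devset, metric=quality_metric, display_progress=True)

baseline_score = evaluator(baseline_skill)    # module compiled from SKILL.md
optimized_score = evaluator(optimized_skill)  # module compiled from OPTIMIZED.md
print(f"baseline: {baseline_score}  optimized: {optimized_score}")
Running both variants against the same devset and metric is what makes a "15% better" claim checkable.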
2. Skills can be auto-adapted per model
Instead of one-size-fits-all:
- code-review-gpt4.md - minimal (frontier models work well at baseline)
- code-review-qwen.md - heavy few-shot (local models need examples)
- Auto-generated based on model capabilities
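One way to produce those variants is to re-run the same optimizer under each model configuration; a sketch reusing the CodeReview signature, quality_metric, and trainset from earlier (model identifiers and save paths are illustrative):
import dspy
from dspy.teleprompt import BootstrapFewShot

# Hypothetical per-model registry; identifiers follow LiteLLM naming
models = {
    "gpt4": dspy.LM("azure/your-gpt-4o-deployment"),
    "qwen": dspy.LM("ollama_chat/qwen3", api_base="http://localhost:11434", api_key=""),
}

for name, lm in models.items():
    dspy.configure(lm=lm)
    optimizer = BootstrapFewShot(metric=quality_metric, max_bootstrapped_demos=3)
    optimized = optimizer.compile(dspy.Predict(CodeReview), trainset=trainset)
    optimized.save(f"skills/code-review-{name}.json")  # per-model compiled state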
3. Skills become versionable and improvable
Like code dependencies:
{
  "skill": "code-review",
  "version": "2.1.0",
  "optimized_for": ["gpt-4o", "qwen3", "llama-3"],
  "quality_score": 0.82
}
Skills could evolve over time with community contributions to training data.
4. Skills can self-optimize as models improve
When GPT-6 releases, re-run optimization on existing Skills. No manual rewrites needed.
You Can Try It Yourself (5-Minute Setup)
All code is available at github.com/instavm/skill-optimization.
Quick Start (Local, No API Keys)
# Install Ollama: https://ollama.ai
ollama pull qwen3
# Install dependencies
pip install -r requirements.txt
# Run REAL DSPy BootstrapFewShot optimization
python scripts/optimize_qwen.py
You'll see the before/after quality comparison in ~5 minutes.
With Your Own OpenAI/Azure Key
# Set environment variables
export AZURE_API_KEY="your-key"
export AZURE_API_BASE="your-endpoint"
export AZURE_DEPLOYMENT="your-deployment"
# Run optimization
python scripts/run_azure_optimization.py
What You'll See
Qwen output (before optimization):
1. SQL Injection Vulnerabilities: Functions concatenate user inputs
2. Insecure Password Hashing: Uses MD5
Qwen output (after optimization):
SQL Injection - User input concatenated allows arbitrary SQL execution.
Attackers can inject malicious SQL to bypass authentication or extract data.
Fix: Use parameterized queries: cursor.execute("SELECT * FROM users WHERE id = ?", (user_id,))
The difference is in the specificity and actionability.
How To Optimize Your Own Skill.md
Want to optimize your own Skill? Here's the actual workflow:
Step 1: Understand Your Skill's Structure
Look at your my-skill.md and identify:
- Inputs: What data does the skill receive? (e.g., code, user query, document)
- Outputs: What should it produce? (e.g., bug report, summary, recommendations)
- Task: What's the core transformation? (e.g., "find security bugs", "extract key points")
Example - for a code review skill:
Input: code snippet + language
Output: critical_issues, high_issues, medium_issues
Task: "Find security vulnerabilities in code"Step 2: Create Training Examples (Manual)
You need to manually create 5-10 examples with expected outputs. This is the hardest part.
For code review, I created data/training_data.json:
{
  "file_path": "examples/sql_injection.py",
  "expected_issues": [
    {
      "title": "SQL Injection",
      "severity": "Critical",
      "description": "User input concatenated into SQL query",
      "locations": ["authenticate_user:10"],
      "fix": "Use parameterized queries"
    }
  ]
}
You write these by hand, or draft them with an LLM and verify them manually - DSPy doesn't auto-generate training data (yet).
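To turn entries like that into DSPy examples, a small loader along these lines works - a sketch that assumes the file holds a list of such entries and keeps expected_issues on the example so the metric can score against it:
import json
from pathlib import Path

import dspy

def load_trainset(path="data/training_data.json"):
    examples = []
    for entry in json.loads(Path(path).read_text()):
        code = Path(entry["file_path"]).read_text()
        language = Path(entry["file_path"]).suffix.lstrip(".")   # "py", "js", ...
        expected = json.dumps(entry["expected_issues"], indent=2)
        # expected_issues rides along on the Example so the metric can compare against it
        examples.append(
            dspy.Example(code=code, language=language, expected_issues=expected)
            .with_inputs("code", "language")
        )
    return examples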
Step 3: Define DSPy Signature
Convert your Skill to DSPy code:
import dspy

class MySkill(dspy.Signature):
    """Your skill's task description."""
    input_field = dspy.InputField(desc="What goes in")
    output_field = dspy.OutputField(desc="What comes out")

# Create baseline module
baseline = dspy.Predict(MySkill)
Step 4: Create Evaluation Metric
Write a function that scores output quality (0-100%):
def quality_metric(example, prediction, trace=None):
    score = 0
    # Check if output has required elements
    if "key_term" in prediction.output_field:
        score += 25
    if len(prediction.output_field) > 100:
        score += 25
    # ... more checks
    return score / 100  # Return 0.0-1.0
Step 5: Run BootstrapFewShot
from dspy.teleprompt import BootstrapFewShot

# Load your training examples
trainset = [dspy.Example(**ex).with_inputs('input_field')
            for ex in training_data]

# Run optimization
optimizer = BootstrapFewShot(
    metric=quality_metric,
    max_bootstrapped_demos=3,
    max_labeled_demos=3
)
optimized = optimizer.compile(baseline, trainset=trainset)
This takes 5-10 minutes and automatically selects which examples improve performance.
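Once compile() finishes, you can inspect exactly what was selected and persist it; a sketch assuming DSPy's named_predictors() and Module.save(), which recent versions provide:
# Inspect the few-shot demonstrations the optimizer selected
for name, predictor in optimized.named_predictors():
    print(f"{name}: {len(predictor.demos)} demos selected")
    for demo in predictor.demos:
        print(demo)

# Persist the compiled program (instructions + demos) for reuse
optimized.save("my_skill_optimized.json")
Those demos are what you'd copy back into your Skill.md in the next step.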
Step 6: Extract Results Back to Skill.md
DSPy gives you an optimized module with:
- Better instruction phrasing
- Auto-selected few-shot examples
- Improved reasoning strategies
You manually create a new my-skill-optimized.md by:
- Looking at what examples DSPy selected
- Adding those examples to your Skill.md
- Updating instructions based on what worked
OR keep using the DSPy module directly in your code (no need for .md file).
What You DON'T Do
- DSPy doesn't parse your Skill.md automatically
- DSPy doesn't generate training examples for you
- DSPy doesn't output a new Skill.md file
What You Do
- You manually convert Skill → DSPy Signature
- You manually create training examples
- DSPy automatically finds which examples work best
- You manually update your Skill.md with the results (or use DSPy module directly)
The automation is in finding WHICH examples to use, not creating them.
Open Questions
Should Skills ship with training data? Like:
SKILL.md + TRAINING.json + benchmarks. This would enable community optimization.
Can we standardize Skill quality metrics? My metrics are custom. We need standard benchmarks for comparing Skills objectively.
Do optimizations transfer across model families? If I optimize for Qwen, does it work for Llama? Mistral? This would reduce optimization overhead.
Can we auto-generate training data? I hand-wrote 10 vulnerable code examples. Could GPT-4 generate these systematically?
What about Skill versioning? How do we track improvements?
code-review@2.1.0 with changelog and quality deltas?
What This Experiment Tells Us And Where We Go Next
Skills are exploding. OpenAI yesterday, Anthropic three days ago. Everyone's building new Skills.
But here's what nobody's talking about: Skills can be programmatically optimized.
What This Proves
1. Skills are not just markdown files
They're structured prompts that can be:
- Benchmarked with quality metrics
- Optimized for different models
- Versioned and improved over time
- Auto-generated in model-specific variants
2. The Skills + DSPy connection unlocks infrastructure
Current state: Skills are manually crafted, shared as .md files, no quality metrics
Possible future:
my-skill/
├── SKILL.md # Baseline
├── TRAINING.json # Test cases
├── OPTIMIZED.md # Auto-generated variants
└── METRICS.json # Quality benchmarks
This enables:
- "This Skill scores 82% on security detection benchmarks"
- "Auto-optimized for GPT-4o, Qwen3, Llama 3.1"
- "Version 2.1 - +15% improvement over v2.0"
- Community contributions to training data
3. Skills become model-agnostic
Write the Skill once. Auto-optimize for:
- Frontier models (minimal prompting needed)
- Local models (heavy few-shot guidance)
- Future models (re-optimize when GPT-5 drops)
No manual prompt engineering per model.
Limitations & Next Steps
This is early exploration. I only tested:
- One skill type (code review)
- Two models (GPT-4o, Qwen3)
- Simple metrics (10 training examples)
Questions to explore:
- Does this work for other skill types? (data analysis, API design, etc.)
- Do optimizations transfer across model families?
- Can we auto-generate training data with GPT-4?
- What's the right standardized format for Skill training data?
I'm sharing this as: "These two ecosystems connect nicely - here's proof it works."
Not as: "This is the final solution."
Try it yourself: https://github.com/instavm/skill-optimization
Appendix: Technical Implementation
DSPy Signatures
class CodeReview(dspy.Signature):
    """Find security vulnerabilities and bugs in code."""
    code = dspy.InputField(desc="Source code to analyze")
    language = dspy.InputField(desc="Programming language")
    critical_issues = dspy.OutputField(desc="Critical security vulnerabilities")
    high_issues = dspy.OutputField(desc="High priority bugs")
Evaluation Metric
def evaluate_quality(output, expected):
    # Parse issues from output (helper defined in the repo)
    issues = parse_issues(output)
    # Calculate precision/recall over matched issues
    matched = count_matching_issues(issues, expected)  # hypothetical helper: issues present in both lists
    precision = matched / len(issues)
    recall = matched / len(expected)
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    # Evaluate explanation quality
    has_impact = check_for_impact_explanation(issues)
    has_fix = check_for_fix_suggestion(issues)
    # Weighted score
    return 0.4 * f1 + 0.3 * has_impact + 0.3 * has_fix
Optimization Loop
# Create optimizer
optimizer = dspy.BootstrapFewShot(
    metric=quality_metric,
    max_bootstrapped_demos=4,
    max_labeled_demos=4
)

# Compile
optimized = optimizer.compile(
    student=module,
    trainset=training_examples
)

# The optimized module now has:
# - Better instructions
# - 4 selected few-shot examples
# - Improved reasoning strategy
Results Data
{
  "gpt4o": {
    "baseline": 0.406,
    "optimized": 0.388,
    "improvement": -0.044
  },
  "qwen": {
    "baseline": 0.425,
    "optimized": 0.551,
    "improvement": 0.125,
    "relative_improvement": 0.294
  }
}
All data, code, and examples available in the repository.