Bosun-XS (0.6B)
Launch post: Introducing Bosun →
The judge that keeps an agent's memory — its knowledge graph — clean. As an agent accumulates memory as a graph of facts linked by relationships, Bosun-XS decides, edge by edge, which connections are warranted — supported, non-redundant, still-true — so the graph stays useful instead of growing into noise that drowns the model reading it back. Nothing else scores that "judge" step; Bosun-XS is a small, fast, calibrated model built for it, and you program it with a sentence.
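As a rough illustration of the edge-by-edge decision described above, the sketch below filters candidate edges by a score threshold. Everything named here is hypothetical: `score_fn` stands in for a call to Bosun-XS, and the `Edge` type, `prune_edges`, and the 0.5 threshold are assumptions made for the example, not part of this repo.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Edge:
    """A candidate link between two findings (hypothetical structure)."""
    text_a: str
    text_b: str


# (rule, text_a, text_b) -> probability; a stand-in for a Bosun-XS call.
ScoreFn = Callable[[str, str, str], float]


def prune_edges(edges: List[Edge], rule: str, score_fn: ScoreFn,
                threshold: float = 0.5) -> List[Edge]:
    """Keep only the edges whose score under `rule` reaches `threshold`."""
    return [e for e in edges if score_fn(rule, e.text_a, e.text_b) >= threshold]


def toy_score(rule: str, a: str, b: str) -> float:
    # Demo stub only: scores high when the two texts share a capitalised word.
    shared = {w for w in a.split() if w.istitle()} & set(b.split())
    return 0.9 if shared else 0.1


candidates = [
    Edge("Acme raised prices", "Regulators fined Acme"),
    Edge("Acme raised prices", "Rainfall was low"),
]
kept = prune_edges(candidates, "Share a specific named entity.", toy_score)
```

In practice the threshold would be tuned per rule and per graph; 0.5 is only a neutral starting point.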
Given two findings and an instruction, it emits P = sigmoid(logit_yes - logit_no) ∈ [0,1], a measure of how strongly
the pair satisfies the rule you supplied, with no opinion of its own. "Warranted" isn't one fixed rule
(same-entity, cross-domain bridge, not-a-duplicate, still-supported-by-evidence), so you define it per graph;
Bosun-XS follows the rule, respects negation, and generalizes to rules it never trained on. That same
capability is exactly what RAG filtering, content moderation, and deduplication need too — knowledge-graph
curation is simply where the need bites first and hardest.
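To make the scoring formula concrete, the sketch below computes P = sigmoid(logit_yes - logit_no) and checks that it matches a softmax restricted to the two answer tokens. The function names and the logit values are illustrative assumptions; only the formula comes from this card.

```python
import math


def pair_score(logit_yes: float, logit_no: float) -> float:
    """P = sigmoid(logit_yes - logit_no), written to avoid overflow."""
    d = logit_yes - logit_no
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


def two_way_softmax(logit_yes: float, logit_no: float) -> float:
    """Probability of "yes" when only the yes/no logits compete."""
    m = max(logit_yes, logit_no)  # subtract the max for numerical stability
    ey, en = math.exp(logit_yes - m), math.exp(logit_no - m)
    return ey / (ey + en)


# Made-up logits: "yes" leads "no" by 2.0.
p = pair_score(3.0, 1.0)
q = two_way_softmax(3.0, 1.0)
```

Equal logits give exactly 0.5, and only the gap between the two logits matters, not their absolute values.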
LoRA fine-tune of Qwen/Qwen3-Reranker-0.6B, scored on the native reranker yes/no logits.
Changelog
v1.1 — broader general judgment (current)
Same architecture and inference contract as v1.0; retrained on an expanded blend (DialAM-2024 argument edges, NLI, PAWS, e-CARE/COPA causal, dedup hard-negatives, completeness, and synthetic directional data, on top of v1.0). Still one model, programmed by a sentence — no per-task fine-tuning.
New: directional & typed-edge judgment — supersession ("B replaces A"), depends-on, supports / contradicts (ordered-pair relations, not just symmetric similarity).
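Because these relations are ordered pairs, which finding goes in slot A and which in slot B matters. One way to use that, sketched here under assumptions, is to score both orderings and keep the one that better satisfies the rule. `score_fn`, `directed_edge`, and the threshold are hypothetical names for illustration; `toy_score` is a stub, not the model.

```python
from typing import Callable, Optional, Tuple

# (rule, finding_a, finding_b) -> probability; hypothetical stand-in for Bosun-XS.
ScoreFn = Callable[[str, str, str], float]


def directed_edge(rule: str, x: str, y: str, score_fn: ScoreFn,
                  threshold: float = 0.5) -> Optional[Tuple[str, str]]:
    """Return the (A, B) ordering that better satisfies `rule`, or None.

    Both orderings are scored because a directional rule such as
    "B replaces A" is not symmetric in its arguments.
    """
    forward = score_fn(rule, x, y)   # x as FINDING A, y as FINDING B
    backward = score_fn(rule, y, x)  # roles swapped
    if max(forward, backward) < threshold or forward == backward:
        return None
    return (x, y) if forward > backward else (y, x)


def toy_score(rule: str, a: str, b: str) -> float:
    # Demo stub only: pretends anything tagged "v2" replaces anything tagged "v1".
    return 0.9 if ("v1" in a and "v2" in b) else 0.1


edge = directed_edge("B replaces A", "pricing v2", "pricing v1", toy_score)
```

Scoring both directions doubles the number of model calls per pair, so it is worth doing only for rules that are genuinely directional.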
Generality on held-out public benchmarks (one instruction each), vs a frontier LLM on the same items:
| benchmark | Bosun-XS v1.1 | gemini-3.1-flash-lite | similarity baseline |
|---|---|---|---|
| PAWS (adversarial paraphrase) | 0.90 | 0.81 | ~chance (0.53 AUROC) |
| e-CARE (causal direction) | 0.72 | 0.86 | 0.60 |
| ANLI (adversarial NLI) | 0.44 | 0.74 | 0.33 |
At 0.6B, Bosun-XS beats gemini-3.1-flash-lite on PAWS — a frontier LLM out-paraphrased by a small local reranker. (For the strongest e-CARE / ANLI numbers, see Bosun-4B.)
No regression: FollowIR held (and improved at this size); WarrantBench steerability is unchanged at 0.935.
v1.0 — launch
Symmetric programmable judge. WarrantBench steerability 0.935; FollowIR +10.5 p-MRR.
Inference contract
Native Qwen3-Reranker template; read the last-token logits:
<Instruct>: <your rule, e.g. "Connected only if the two findings share a specific named entity.">
<Query>: These two findings share the specified relationship.
<Document>: FINDING A:\n<text_a>\n\nFINDING B:\n<text_b>
score = sigmoid(logits[yes_id] - logits[no_id]) at the final position (logits_to_keep=1). The exact
yes_id / no_id / template prefix+suffix and max_len are in serving.json.
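The three template fields above can be assembled with a small helper like the one below. This is a sketch: `build_body` is a hypothetical name, and joining the fields with single newlines is an assumption. The authoritative prefix, suffix, and exact whitespace are whatever `serving.json` specifies.

```python
# Fixed query line taken from the template above.
QUERY = "These two findings share the specified relationship."


def build_body(rule: str, text_a: str, text_b: str) -> str:
    """Fill the <Instruct>/<Query>/<Document> fields for one pair of findings.

    The returned string still needs the template prefix and suffix from
    serving.json wrapped around it before tokenization.
    """
    document = f"FINDING A:\n{text_a}\n\nFINDING B:\n{text_b}"
    return f"<Instruct>: {rule}\n<Query>: {QUERY}\n<Document>: {document}"


body = build_body(
    "Connected only if the two findings share a specific named entity.",
    "Acme raised prices in Q3.",
    "Regulators opened an inquiry into Acme.",
)
```

Only the rule and the two finding texts vary between calls; the query line stays constant, which is what lets a single sentence reprogram the judge.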
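import json
import torch
from huggingface_hub import hf_hub_download
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel

repo = "Hanno-Labs/bosun-xs"
# serving.json holds base_model, yes_id / no_id, the template prefix+suffix, and max_len
with open(hf_hub_download(repo, "serving.json")) as f:
    cfg = json.load(f)

# left-padding keeps each sequence's last real token at the final position
tok = AutoTokenizer.from_pretrained(repo, subfolder="tokenizer", padding_side="left")
base = AutoModelForCausalLM.from_pretrained(cfg["base_model"], torch_dtype=torch.bfloat16,
                                            attn_implementation="sdpa", trust_remote_code=True)
# merge the LoRA adapter into the base weights for inference
model = PeftModel.from_pretrained(base, repo).merge_and_unload().eval().cuda()

# build ids = prefix + <Instruct/Query/Document> + suffix, then:
# with torch.no_grad():
#     lg = model(input_ids=input_ids, attention_mask=attention_mask, logits_to_keep=1).logits[:, -1, :]
#     p = torch.sigmoid(lg[:, cfg["yes_id"]] - lg[:, cfg["no_id"]])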
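The snippet above reads logits at the final position, which only works for a batch if every sequence ends on a real token. That is why the tokenizer is loaded with `padding_side="left"`. The pure-Python sketch below shows what left-padding does to a batch of token ids. `left_pad` and the token ids are made up for illustration; in practice the tokenizer does this itself.

```python
from typing import List, Tuple


def left_pad(batch: List[List[int]], pad_id: int) -> Tuple[List[List[int]], List[List[int]]]:
    """Pad every sequence on the left to the batch's longest length.

    Returns (input_ids, attention_mask); the mask is 0 over padding
    and 1 over real tokens.
    """
    width = max(len(ids) for ids in batch)
    input_ids, attention_mask = [], []
    for ids in batch:
        n_pad = width - len(ids)
        input_ids.append([pad_id] * n_pad + ids)
        attention_mask.append([0] * n_pad + [1] * len(ids))
    return input_ids, attention_mask


# Two sequences of different length; ids are arbitrary.
ids, mask = left_pad([[5, 6, 7], [8]], pad_id=0)
# The last column of `mask` is all ones, so position -1 is a real token in every row.
```

With right-padding, the shorter row would end in a pad token and the yes/no logits read at position -1 would be meaningless for it.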
Run locally (GGUF / llama.cpp)
CPU / Apple-Silicon / edge builds (f16, Q8_0, Q4_K_M) live at Hanno-Labs/bosun-xs-GGUF.
⚠️ Do not use llama.cpp's --rerank mode — it silently discards the <Instruct> and returns
degenerate, instruction-blind scores. Use the completion + logits path documented in that repo
(verified end-to-end via stock llama-server, matching this model within ~0.01 at Q8_0).
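If the serving path you use reports log-probabilities rather than raw logits, the score is unaffected, because the softmax normalizer cancels in the yes/no difference. The sketch below checks only that arithmetic, with made-up logits; it does not reproduce the request format, which is documented in the GGUF repo. Both the "yes" and "no" entries must be present in whatever the server returns.

```python
import math
from typing import List


def log_softmax(logits: List[float]) -> List[float]:
    """Log-probabilities over the given logits (stable log-sum-exp)."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def sigmoid(d: float) -> float:
    """Numerically stable logistic function."""
    if d >= 0:
        return 1.0 / (1.0 + math.exp(-d))
    e = math.exp(d)
    return e / (1.0 + e)


# Made-up values over a tiny vocabulary; index 0 plays "yes", index 1 plays "no".
logits = [4.0, 1.5, 0.3, -2.0]
logprobs = log_softmax(logits)

from_logits = sigmoid(logits[0] - logits[1])
from_logprobs = sigmoid(logprobs[0] - logprobs[1])
```

Here `from_logits` and `from_logprobs` agree up to floating-point error.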
Results
WarrantBench (Hanno-Labs/warrantbench) — it out-steers a frontier LLM:
| | cosine | Bosun-XS | gemini-3.1-flash-lite |
|---|---|---|---|
| steerability — score flips with the rule | 0.00 | 0.94 | 0.58 |
| negation — "NOT the same topic" | 0.00 | 0.97 | 0.996 |
| cross-domain bridge | 0.32 | 0.83 | 0.38 |
On novel rules it never trained on: 0.95 ("both mention a figure ≥ $1B") and 0.95 ("both involve a government or regulator"), vs 0.35 / 0.63 for flash-lite.
FollowIR (public instruction-following retrieval, p-MRR): Bosun-XS tops the board where most retrievers score zero or negative — they read the instruction as keywords; Bosun reads it as a rule.
Files
| file | what |
|---|---|
| adapter_model.safetensors, adapter_config.json | the LoRA adapter (load with PEFT over the base) |
| serving.json | inference contract: template + yes_id/no_id + max_len |
| tokenizer/ | Qwen tokenizer (left-padding) |
Links
- Launch post — Introducing Bosun
- GGUF (run locally) — Hanno-Labs/bosun-xs-GGUF
- WarrantBench — github.com/Hanno-Labs/warrantbench (dataset)
From Hanno Labs.
