Small Model Forensics

Claude and I made 2,000 API calls to nine small closed-weight models across three providers, with prompt sizes ranging from 100 to 1M tokens. We ended up discovering some interesting things about how providers scale inference, or fail to, in 2026.

You can view all the measurements in the interactive viewer. Code and raw dataset on GitHub.

01 Every model's prefill is nowhere near O(n²)

Plot the minimum first-token latency (excluding network) against input size, log-log. A textbook dense-attention model with O(n²) prefill would give you a line with slope near 2 at large n: every doubling of context should cost roughly four times the prefill wall time. What we actually see is a much flatter curve. Even at >100K contexts, where opaque provider overheads become negligible, prefill still scales sub-linearly for almost every model.

fig. 1 TTFT floor vs context: seven models, four orders of magnitude · log-log · min over repeats

Each line is the per-cell minimum of first_content_delta_ms across prompt shapes. gemini-3.1-flash-lite stays under 5 s at ~870 k input tokens; gpt-4.1-nano exceeds 23 s at half that context. None of the seven curves bends like a quadratic.

Fit a power law C · n^α to the floor of each curve and read off the exponent.

tab. 1 Fitted scaling exponent α of min TTFT, by model.
model                    α (all data)   α (≥ 10 k tok)   character
gemini-3.1-flash-lite    0.15           0.29             remarkably flat
claude-haiku-4-5         0.19           0.58             smooth, sub-linear
gpt-5.4-mini             0.28           0.69             smooth, sub-linear
gemini-2.5-flash         0.34           0.70             smooth, sub-linear
gemini-3-flash-preview   0.31           0.73             sub-linear, late break
gpt-4o-mini              0.34           0.84             step-laden
gpt-4.1-nano             0.40           1.02             linear-or-worse at top end
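
A power-law fit of this kind can be sketched as an ordinary least-squares line in log-log space. This is a minimal sketch, assuming a plain unweighted fit; the article does not say which fitting method produced tab. 1. The function name `fit_power_law`, the `min_tokens` cutoff parameter, and the sample data are hypothetical.

```python
import math

def fit_power_law(tokens, ttft_ms, min_tokens=0):
    """Fit ttft = C * n**alpha by least squares on (log n, log ttft).

    Points with fewer than min_tokens input tokens are dropped, which
    mirrors the ">= 10 k tok" column. Returns (C, alpha).
    """
    pts = [(math.log(n), math.log(t))
           for n, t in zip(tokens, ttft_ms) if n >= min_tokens]
    if len(pts) < 2:
        raise ValueError("need at least two points to fit a line")
    mean_x = sum(x for x, _ in pts) / len(pts)
    mean_y = sum(y for _, y in pts) / len(pts)
    sxx = sum((x - mean_x) ** 2 for x, _ in pts)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pts)
    alpha = sxy / sxx                      # slope of the log-log line
    c = math.exp(mean_y - alpha * mean_x)  # intercept, back in linear units
    return c, alpha

# Synthetic data for illustration only, not measurements from the dataset.
n = [1e2, 1e3, 1e4, 1e5, 1e6]
t = [50.0 * x ** 0.5 for x in n]
c_all, alpha_all = fit_power_law(n, t)
c_big, alpha_big = fit_power_law(n, t, min_tokens=10_000)
```

An exponent of 1 means latency grows in proportion to context; anything below 1 means each extra token costs less than the one before.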

Gemini 3.1 Flash Lite goes from 204 input tokens to 866 k input tokens, a factor of roughly 4,200 in context, while wall time moves only from 0.7 s to 5 s. That is seven times more latency for four thousand times more context. GPT-4.1 Nano exceeds 23 s at half that context.

02 Providers are doing decode very differently

The textbook says decode cost per token should stay nearly flat, or rise a touch, as the prefix grows. We saw models with decode costs that rise significantly or even fall.

fig. 2 decode ms per output token vs context · linear y · mean per cell · all prompt shapes
tab. 2 Decode ms/token at representative prompt sizes.
model                    tiny    128 KB   256 KB   601 KB   1 MB
gemini-3.1-flash-lite    4.6     4.7      4.8      3.3      3.3
gemini-3-flash-preview   7.1     8.5      8.0      12.5     11.7
claude-haiku-4-5         11.8    11.1     11.0     12.5
gpt-5.4-mini             7.1     9.1      8.9      31.8     108.4
gpt-4.1-nano             14.1    12.0     18.3     17.7     67.5
gpt-4o-mini              18.5    31.9     34.6
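
The article does not spell out how decode ms per token is derived. One plausible reading, sketched here as an assumption, is the time after the first content delta divided by the remaining output tokens. `first_content_delta_ms` is the field named earlier in the article; `total_ms` and `output_tokens` are hypothetical names, and treating the first delta as exactly one token is a simplification.

```python
def decode_ms_per_token(first_content_delta_ms, total_ms, output_tokens):
    """Average time per output token after the first content delta.

    Assumes the first delta carries one token, so the remaining
    (output_tokens - 1) tokens share the post-first-token time.
    """
    if output_tokens < 2:
        raise ValueError("need at least two output tokens to measure decode")
    return (total_ms - first_content_delta_ms) / (output_tokens - 1)

# Hypothetical call record, for illustration only.
per_token = decode_ms_per_token(first_content_delta_ms=500.0,
                                total_ms=1500.0,
                                output_tokens=101)
```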

03 Fastest model at 1 KB context is the slowest at 1 MB

People quote first-token latency numbers as if a model has a latency. It has a curve, and the curves cross. Choose wisely.

fig. 3 TTFT floor vs prompt size · log-log · min per byte bucket

gpt-4.1-nano wins on tiny, sub-second queries; gpt-5.4-mini leads at 64 KB and 256 KB; gemini-3.1-flash-lite wins on large queries of 600 KB and up (~150K tokens).


tab. 3 TTFT floor rankings at representative byte sizes (rank, then floor in ms; lower is faster).
model                    tiny (<1 KB)   64 KB      256 KB       601 KB       1 MB
gpt-4.1-nano             #1 (176)       #2 (359)   #2 (882)     #3 (1,730)   #5 (4,876)
gpt-5.4-mini             #2 (233)       #1 (349)   #1 (779)     #2 (1,376)   #4 (3,353)
gemini-2.5-flash         #5 (289)       #3 (590)   #6 (2,290)   #5 (2,482)   #6 (4,650)
claude-haiku-4-5         #7 (391)       #4 (608)   #4 (1,241)   #4 (2,229)   rejected
gemini-3.1-flash-lite    #8 (461)       #5 (733)   #3 (1,181)   #1 (912)     #1 (1,732)
gemini-3-flash-preview   #9 (492)       #6 (911)   #5 (1,268)   #6 (2,127)   #3 (2,861)

04 Gemini Flash Lite goes the wrong way

A query with 144 k input tokens is faster than one with 62 k by a good margin. 2.3x more input tokens will give you a 1.5x faster response. Reproducibly!

fig. 4 Gemini Lite negative-scaling zone, TTFT and total latency · song-lyrics-prompt · min per token bucket

Breaking it down by stage, both inferred prefill and decode times drop around the same threshold. The simplest story is a routing transition to different hardware somewhere near the 100 k-token band.

fig. 5 Gemini Lite decode cost vs input tokens · song-lyrics-prompt · median · p10–p90 bars
fig. 6 prefill throughput, KB / sec at the latency floor · log-log · prompt bytes / min TTFT

It's almost like Google rewards and OpenAI punishes large context sizes.
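
The throughput in fig. 6 is prompt bytes divided by the TTFT floor. As a rough illustration, the sketch below applies that ratio to the gemini-3.1-flash-lite floors listed in tab. 3. It assumes 1 MB = 1,000 KB, which the article does not specify, and the function name is hypothetical.

```python
def prefill_throughput_kb_per_s(prompt_kb, min_ttft_ms):
    """Prompt size divided by the TTFT floor, in KB per second."""
    return prompt_kb / (min_ttft_ms / 1000.0)

# gemini-3.1-flash-lite TTFT floors from tab. 3: prompt size in KB -> floor in ms.
# The 1 MB column is entered as 1000 KB (an assumption about the unit).
floors_ms = {64: 733, 256: 1181, 601: 912, 1000: 1732}

throughput = {kb: prefill_throughput_kb_per_s(kb, ms)
              for kb, ms in floors_ms.items()}

for kb, rate in throughput.items():
    print(f"{kb:>5} KB  {rate:8.1f} KB/s")
```

With these numbers the 601 KB prompt gets roughly three times the throughput of the 256 KB prompt, which is the negative-scaling effect expressed as a rate.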

05 Tokens are not created equal

Ending on a practical note. Each provider uses the same tokenizer across all models tested. Across providers, tokens are not apples-to-apples: going from OpenAI to Anthropic will cost you an additional 14% in tokens, which most people don't account for in their math.

fig. 7 bytes per token by provider family and content type · avg at > 100 KB prompt

As expected, 4 chars per token is a good estimate for English text. With our random prompt (hex-encoded random bytes), the ratio was closer to 1 char per token, so keep that in mind if you're sending high-entropy content.
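
These ratios are enough for a back-of-the-envelope token budget. This is a minimal sketch, assuming the roughly 4 bytes per token for English and 1 for hex-encoded random data quoted above, plus the 14% OpenAI-to-Anthropic overhead. The function and parameter names are hypothetical, and real counts should come from each provider's own tokenizer.

```python
def estimate_tokens(n_bytes, bytes_per_token=4.0, provider_factor=1.0):
    """Rough token count from prompt size.

    bytes_per_token: ~4 for English text, ~1 for high-entropy hex
    (figures from the article). provider_factor scales the estimate
    relative to the baseline tokenizer, e.g. 1.14 for the 14% overhead.
    """
    return round(n_bytes / bytes_per_token * provider_factor)

english_baseline = estimate_tokens(1_000_000)                        # English, baseline tokenizer
english_plus_14 = estimate_tokens(1_000_000, provider_factor=1.14)   # same text, 14% more tokens
hex_baseline = estimate_tokens(1_000_000, bytes_per_token=1.0)       # high-entropy content
```

The same 1 MB prompt can land anywhere from about a quarter of a million tokens to a million, depending on content, before any provider difference is applied.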