Benchmarking open-weight models for security research

dualuse.dev

1 point by lebovic a month ago · 1 comment

lebovic (OP) a month ago

GLM 5.1 is surprisingly capable. Anecdotally, I couldn't notice a difference until ~120K tokens.

Qwen 3.6 35B A3B also exceeded my expectations. It performs surprisingly well, even though the previous generation couldn't even use the testing harness.

(TBD on Kimi K2.6; the eval is still running.)