Show HN: Actual Claude Tokenizer

tokenizer.robkopel.me

3 points by robkop 3 days ago · 1 comment · 1 min read

I've seen a few "Claude tokenizers" floating around lately with all the 4.7 chatter, but most of them just hit the count_tokens endpoint and hand you back a number. You don't actually see how your text gets split or understand the changes from 4.6 to 4.7.

I built this a while back for some mech interp research. It faithfully represents how Claude splits text into tokens, showing hidden tokens, real boundaries and so on. It is not cheap to run (essentially n^2 cost). You could optimise it for longer sequences, but then you are no longer guaranteed a faithful representation.
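A rough sketch of where a quadratic cost can come from when all you have is a count-only oracle. This is an illustration under assumptions, not necessarily how the tool works: the post does not spell out its method. `toy_count_tokens` is a hypothetical local stand-in for a remote counting call (a simple regex splitter, nothing to do with Claude's vocabulary), and `probe_boundaries` is a made-up name. The idea is to count every prefix and start a new piece wherever the count goes up.

```python
import re
from typing import Callable, List, Tuple

# Hypothetical stand-in for a remote "count tokens" call.
# Toy rule: a word or number with an optional leading space, else one char.
_TOY_PATTERN = re.compile(r" ?[A-Za-z]+| ?[0-9]+|.", re.S)

def toy_count_tokens(text: str) -> int:
    return len(_TOY_PATTERN.findall(text))

def probe_boundaries(
    text: str, count_tokens: Callable[[str], int]
) -> Tuple[List[str], int]:
    """Split `text` by counting every prefix.

    Returns the pieces and the total number of characters sent to the
    counter, which grows as n(n+1)/2 for a text of length n.
    Assumes prefix counts never decrease, which real tokenizers do not
    guarantee, so this can misplace boundaries on real vocabularies.
    """
    pieces: List[str] = []
    start = 0          # index where the current piece begins
    prev_count = 0
    chars_sent = 0
    for end in range(1, len(text) + 1):
        prefix = text[:end]
        chars_sent += len(prefix)
        count = count_tokens(prefix)
        # A higher count means the character at end-1 opened a new piece.
        if count > prev_count and end - 1 > start:
            pieces.append(text[start:end - 1])
            start = end - 1
        prev_count = count
    if text:
        pieces.append(text[start:])
    return pieces, chars_sent

if __name__ == "__main__":
    pieces, cost = probe_boundaries("hello world 42", toy_count_tokens)
    print(pieces, cost)
```

With a real tokenizer a longer prefix can merge into fewer tokens, so the count can drop and this naive rule picks the wrong split. That is the kind of faithfulness trade-off that shows up once you start taking shortcuts.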

Open Source: https://github.com/R0bk/claude-tokenizer

Feedback welcome, let me know if there are any edge cases that look wrong.

P.S. I'd expect this to face a similar fate to the streaming-chunk and prefill-based token extraction methods. I do worry about the ability to do independent research once it's fully closed off, and I would love it if there were more public frontier tokenizers.

anonymoushn 2 days ago

You can't reliably obtain correct token boundaries with this method. For example, "'d" is 1 token, but the API will return "d" stuck to the next token. Weirdly, this seems to be specific to the letter "d". Similar stuff happens around "<". As for all-caps words, some are in the vocab in all caps, such as MERCHANTABILITY.
