Show HN: Actual Claude Tokenizer
tokenizer.robkopel.me

I've seen a few "Claude tokenizers" floating around lately with all the 4.7 chatter, but most of them just hit the count_tokens endpoint and hand you back a number. You don't actually see how your text gets split, or understand what changed from 4.6 to 4.7.
I built this a while back for some mech interp research. It faithfully reproduces Claude's token splitting, showing hidden tokens, real boundaries, and so on. It is not cheap to run, though: the cost is essentially O(n^2) in the input length. You could optimise for longer sequences, but then you're no longer guaranteed a faithful representation.
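For the curious, here's a rough sketch of where the quadratic cost comes from. This is a simplification, not the tool's actual algorithm: the toy greedy tokenizer below stands in for the real count_tokens API call, and the additivity check (a split point is a boundary if the two halves' token counts sum to the total) glosses over edge cases where counts coincide by accident.

```python
def make_toy_counter(vocab):
    """Greedy longest-match tokenizer standing in for an API count_tokens call."""
    max_len = max(len(t) for t in vocab)

    def count_tokens(text):
        i, count = 0, 0
        while i < len(text):
            # Try the longest vocab entry first, shrinking to single characters.
            for length in range(min(max_len, len(text) - i), 0, -1):
                if text[i:i + length] in vocab:
                    i += length
                    break
            else:
                i += 1  # unknown character: count it as one token
            count += 1
        return count

    return count_tokens


def token_boundaries(text, count_tokens):
    """Recover token boundaries purely from token counts.

    Position i is treated as a boundary when splitting there changes nothing:
    count(text[:i]) + count(text[i:]) == count(text). That's one counting call
    per character, each costing O(n) itself, hence the ~n^2 overall cost.
    """
    total = count_tokens(text)
    return [
        i for i in range(1, len(text))
        if count_tokens(text[:i]) + count_tokens(text[i:]) == total
    ]
```

With a toy vocab like {"he", "llo", " ", "wor", "ld"} plus single letters, "hello world" tokenizes as ["he", "llo", " ", "wor", "ld"], and the boundary scan recovers the split points [2, 5, 6, 9] from counts alone.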
Open Source: https://github.com/R0bk/claude-tokenizer
Feedback welcome, let me know if there are any edge cases that look wrong.
P.S. I'd expect this to face a similar fate to the streaming-chunk and prefill-based token extraction methods. I do worry about the ability to do independent research once it's fully closed off, and I'd love to see more public frontier tokenizers. You can't reliably obtain correct token boundaries with those extraction methods. For example, "'d" is one token, but the API will return the "d" stuck to the next token; weirdly, this seems specific to the letter "d". Similar things happen around "<". As for all-caps words, some are in the vocab in all caps, such as MERCHANTABILITY.