Show HN: Actual Claude Tokenizer
tokenizer.robkopel.meI've seen a few "Claude tokenizers" floating around lately with all the 4.7 chatter, but most of them just hit the count_tokens endpoint and hand you back a number. You don't actually see how your text gets split or understand the changes from 4.6 to 4.7.
I built this a while back for doing some mech interp research. It faithfully represents Claude token splitting - showing hidden tokens, real boundaries and so on. It is not cheap to run - essentially n^2 cost - you could optimise for longer sequences but you are not guaranteed a faithful representation if so.
Open Source: https://github.com/R0bk/claude-tokenizer
Feedback welcome, let me know if there are any edge cases that look wrong.
P.S. I'd expect this to face a similar fate as streaming chunk and prefill based token extraction methods did. I do worry about the ability to do independent research once it's fully closed off and would love it if there was more public frontier tokenizers. You can't reliably obtain correct token boundaries with this method. For example, "'d" is 1 token, but the API will return "d" stuck to the next token. Weirdly this seems to be specific to the letter "d". Similar stuff happens around "<". About all caps words, some words are in the vocab in all caps, such as MERCHANTABILITY. Could you please elaborate a bit more for my understanding? What in particular about this method breaks correct token boundaries? On my first read I read your comment as there are special tokens that require multiple tokens to emit, hence you can't get certain tokens emitted alone - but I don't think that's what you're getting at on a second read? Interesting that you've found similarities between "d" and the hidden tokens for opening an xml tag, pressing caps lock and the other hidden tokens of note. I haven't run into any trouble extracting "d" tokens, is it a particular model that you see create that pattern? That's " 'd ".strip(), an english contraction suffix. it's 1 token, but using this echo approach you will be served the apostrophe and the subsequent letter for the first time in different steps. I couldn't reproduce this behavior with Sonnet 4, and Sonnet 3.7 has been deprecated since I messed with this stuff. You can try tokenizing the string "<hello> </hello>" I think the correct tokenization of the string will not have any tokens that contain mixed punctuation and letters, but the result of this approach does contain such claimed tokens.