Show HN: Actual Claude Tokenizer

tokenizer.robkopel.me

3 points by robkop 2 months ago · 4 comments · 1 min read

I've seen a few "Claude tokenizers" floating around lately with all the 4.7 chatter, but most of them just hit the count_tokens endpoint and hand you back a number. You don't actually see how your text gets split or understand the changes from 4.6 to 4.7.

I built this a while back for some mech interp research. It faithfully represents how Claude splits text into tokens, showing hidden tokens, real boundaries and so on. It is not cheap to run: the cost is essentially n^2. You could optimise for longer sequences, but then you are no longer guaranteed a faithful representation.

Open Source: https://github.com/R0bk/claude-tokenizer

Feedback welcome, let me know if there are any edge cases that look wrong.

P.S. I'd expect this to meet the same fate as the streaming-chunk and prefill-based token extraction methods did. I do worry about the ability to do independent research once it's fully closed off, and would love it if there were more public frontier tokenizers.

anonymoushn 2 months ago

You can't reliably obtain correct token boundaries with this method. For example, "'d" is 1 token, but the API will return "d" stuck to the next token. Weirdly this seems to be specific to the letter "d". Similar stuff happens around "<". As for all-caps words, some are in the vocab in all caps, such as MERCHANTABILITY.

robkopOP a month ago

Could you please elaborate a bit more for my understanding?
What in particular about this method breaks correct token boundaries?
On my first read I took your comment to mean that there are special tokens that require multiple tokens to emit, hence you can't get certain tokens emitted alone, but on a second read I don't think that's what you're getting at?
Interesting that you've found similarities between "d" and the hidden tokens for opening an XML tag, pressing caps lock and the other hidden tokens of note. I haven't run into any trouble extracting "d" tokens. Is there a particular model where you see that pattern?
- anonymoushn a month ago
  
  That's " 'd ".strip(), an English contraction suffix. It's 1 token, but with this echo approach you will be served the apostrophe and the subsequent letter for the first time in different steps.
  - anonymoushn a month ago
    
    I couldn't reproduce this behavior with Sonnet 4, and Sonnet 3.7 has been deprecated since I messed with this stuff. You can try tokenizing the string "<hello> </hello>"
    I think the correct tokenization of the string will not have any tokens that contain mixed punctuation and letters, but the result of this approach does contain such claimed tokens.
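
The check suggested above can be run mechanically. This is a minimal sketch, assuming you already have a claimed token list as plain Python strings. The helper names `is_mixed` and `mixed_tokens` are made up for illustration, and both example splits are hypothetical, not real output from any tokenizer or from this tool. It is a heuristic for this particular string only: a real single token such as "'d" also mixes punctuation and a letter, so a hit is a reason to look closer, not proof of a wrong boundary.

```python
import string


def is_mixed(token):
    """Return True if the token contains both a letter and ASCII punctuation."""
    has_letter = any(ch.isalpha() for ch in token)
    has_punct = any(ch in string.punctuation for ch in token)
    return has_letter and has_punct


def mixed_tokens(tokens):
    """Return the claimed tokens that mix letters and punctuation, in order."""
    return [t for t in tokens if is_mixed(t)]


text = "<hello> </hello>"

# Hypothetical splits, invented for illustration only.
suspect_split = ["<h", "ello", ">", " </", "hello", ">"]
clean_split = ["<", "hello", ">", " </", "hello", ">"]

# Both splits must reassemble to the original text to be valid candidates.
assert "".join(suspect_split) == text
assert "".join(clean_split) == text

print(mixed_tokens(suspect_split))
print(mixed_tokens(clean_split))
```

Whitespace is deliberately ignored, so a piece like " </" counts as pure punctuation rather than mixed.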
