Background
Not too long ago, I was tangentially involved in a discussion around scanning for leaked secret tokens.
While tools exist, it's also generally worth designing tokens to be more easily identifiable.
For the specific discussion I was in, the problem was that the token might end up base64 encoded, which is a very common thing to do with secrets.
I needed to explain how we can't "just" scan for base64 encoded secrets, as the preceding bytes will change how the secret gets encoded; we need to decode anything that looks like base64 to scan it for tokens.
Or, potentially, we might be able to scan for the different encoded variants of the tokens, assuming it's a small set of strings and we can ignore the first and last characters (as base64 can mix those with adjacent characters). Though then we're scanning for lots of extra strings, the approach doesn't generalize to arbitrary secrets, we need different searches for different base64 variations, and so on.
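To make the "different encoded variants" idea concrete, here is a rough sketch that computes the three alignment-dependent base64 fragments of a token. The helper name `encoded_variants` and the sample token are made up for illustration, and the trimming is deliberately conservative: it drops whole 4-character groups at the edges rather than single characters.

```python
from base64 import b64encode

def encoded_variants(token: bytes) -> list[bytes]:
    """Return one base64 fragment per byte alignment (0, 1, 2).

    Hypothetical helper: each fragment only covers the 3-byte groups
    made up entirely of token bytes, so it does not depend on whatever
    surrounds the token in the encoded data.
    """
    variants = []
    for offset in range(3):
        # Shift the token so it starts `offset` bytes into a 3-byte group.
        padded = b"\x00" * offset + token
        # Keep only complete 3-byte groups; a partial tail mixes with what follows.
        usable = len(padded) - (len(padded) % 3)
        encoded = b64encode(padded[:usable])
        # When shifted, the first 4-character group mixes in the filler bytes.
        skip = 0 if offset == 0 else 4
        variants.append(encoded[skip:])
    return variants

if __name__ == "__main__":
    for variant in encoded_variants(b"secret-token-123"):
        print(variant.decode())
```

A scanner would then have to search for all three fragments per token, per base64 alphabet, which is the extra work described above.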
Idea
But what if we could design a token to be easily identified in base64?
What if we could have the token look the same in base64?
Well, that's impossible. But we can try making part of the token the same.
It's fairly trivial to find a short string that copies itself when base64 encoded:
$ printf "Vm0w" | base64
Vm0wdw==
(If you're bored, you can extend this as long as you want: just add the next encoded character to the original string and iterate)
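As a sketch of that iteration, assuming we start from a seed like "Vm0w" that already encodes to itself: each round appends the next character of the encoded output. The function name `extend_fixed_prefix` is just a label I picked for this example.

```python
from base64 import b64encode

def extend_fixed_prefix(seed: bytes, length: int) -> bytes:
    """Grow a string whose base64 encoding starts with the string itself."""
    s = seed
    while len(s) < length:
        encoded = b64encode(s)
        if not encoded.startswith(s):
            raise ValueError("seed does not base64-encode to itself")
        # The next output character depends only on bytes already in s,
        # so appending it keeps the self-encoding property intact.
        s += encoded[len(s):len(s) + 1]
    return s

if __name__ == "__main__":
    print(extend_fixed_prefix(b"Vm0w", 8).decode())
```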
If this were all there was to it, then prefixing tokens with "Vm0w" would be sufficient for our purposes (assuming four characters is enough).
But we can't guarantee that our token will be at the start of what gets encoded, and changing the prefix changes the result:
$ printf "_Vm0w" | base64
X1ZtMHc=
$ printf "__Vm0w" | base64
X19WbTB3
...Until we loop back around to a multiple of three, at least:
$ printf "___Vm0w" | base64
X19fVm0wdw==
    ^--^
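The same experiment can be scripted for a few more prefix lengths. This is only a quick check using "_" as filler, and `survives` is a hypothetical helper name:

```python
from base64 import b64encode

def survives(marker: bytes, prefix_len: int) -> bool:
    """True if marker appears verbatim in base64("_" * prefix_len + marker)."""
    return marker in b64encode(b"_" * prefix_len + marker)

if __name__ == "__main__":
    for n in range(7):
        print(n, survives(b"Vm0w", n))
```

With this filler, only the prefix lengths that are multiples of three report True.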
Is there a way to guarantee we always have at least one copy of the string that lands in the right place?
Sure: just make three copies of it.
$ printf "Vm0wVm0wVm0w" | base64
Vm0wd1ZtMHdWbTB3
^--^
$ printf "_Vm0wVm0wVm0w" | base64
X1ZtMHdWbTB3Vm0wdw==
            ^--^
$ printf "__Vm0wVm0wVm0w" | base64
X19WbTB3Vm0wd1ZtMHc=
        ^--^
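For a slightly broader check than three hand-run commands, something like the following could be used. It is a sketch with fixed filler bytes ("#" before, "!" after), and `found_at_every_offset` is a name invented here:

```python
from base64 import b64encode

def found_at_every_offset(tag: bytes, needle: bytes, max_prefix: int = 9) -> bool:
    """Check that needle appears in base64(prefix + tag + suffix)
    for every prefix length from 0 to max_prefix - 1."""
    return all(
        needle in b64encode(b"#" * n + tag + b"!")
        for n in range(max_prefix)
    )

if __name__ == "__main__":
    print(found_at_every_offset(b"Vm0w" * 3, b"Vm0w"))  # three copies
    print(found_at_every_offset(b"Vm0w", b"Vm0w"))      # single copy
```

Three copies work because consecutive copies start at offsets 0, 4, and 8, which cover all three residues modulo 3, so one copy always begins on a 3-byte group boundary.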
An additional trick to expand the search needle from four characters to five is to include the trailing "d" that always gets generated from a properly-aligned "Vm0w" in the original string:
                  v---v
$ printf "Vm0wVm0wVm0wd" | base64
Vm0wd1ZtMHdWbTB3ZA==
^---^
Note that we're now no longer looking for a "copy" of anything, but rather a substring that has been prefixed in a way that it will reliably base64-encode to itself.
If we ignore the prefix and search on the suffix, it's fairly trivial to brute-force different candidate strings and hopefully find one with a little more character:
#!/usr/bin/env python3
from base64 import b64encode
from itertools import product

# Note: sticking to the shared 62 character subset lets us work across different base64 variations
albet = [c.encode() for c in '0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz']

def check(s: bytes) -> None:
    """Check if base64 encoding preserves suffix of s.

    If so, print out a searchable string and tag generated from s.
    """
    c = s[1:]
    x = s * 3
    x64 = b64encode(x)
    if c not in x64:
        return
    # Shift the input by one and by two bytes to cover the other alignments
    y64 = b64encode(b' ' + x)
    z64 = b64encode(b'  ' + x)
    if c in y64 and c in z64:
        o = x64.find(c)
        needle = x64[o:o+len(s)]  # Include trailing character
        tag = x[:2*len(s) + 1] + needle
        print(needle.decode(), tag.decode(), '*' if tag[0] == tag[-1] else '')

# Try every 4- and 5-character candidate
for r in (4, 5):
    for inp in product(albet, repeat=r):
        check(b''.join(inp))

Note that a trailing "*" indicates that the tag begins and ends with the same character, so that the final character becomes optional.
Results
There are about six hundred 13- and 16-character strings that will retain a common substring when base64 encoded. Many of them are minor variations on "Vm0w", but there are plenty of other options to choose from.
Aesthetically, I feel like there's one result worth calling out:
w3M00w3M00w3M00w
The string "w3M00w3M00w3M00w" (final "w" optional) is visually distinct, easy to remember (as it almost describes itself), and always produces "3M00w" when base64 encoded.
Using this as part of a secret will produce something that can be easily searched for, even if it ends up in an HTTP header:
                                v---v
dmVyeXNlY3JldHRva2Vu_w3M00w3M00w3M00w_c29tZXRva2Vuc2VjcmV0
Authorization: Basic Ym9iOmRtVnllWE5sWTNKbGRIUnZhMlZ1X3czTTAwdzNNMDB3M00wMHdfYzI5dFpYUnZhMlZ1YzJWamNtVjA=
                                                                    ^---^
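Here is a small sketch of what that search could look like, using the example token above with a made-up user "bob". The helper names `basic_auth_header` and `contains_needle` are hypothetical, and a real scanner would presumably search logs or files rather than a single string.

```python
from base64 import b64encode

NEEDLE = "3M00w"  # substring that survives base64 encoding of the tag

def basic_auth_header(user: str, token: str) -> str:
    """Build an HTTP Basic Authorization header line for user:token."""
    creds = f"{user}:{token}".encode()
    return "Authorization: Basic " + b64encode(creds).decode()

def contains_needle(text: str) -> bool:
    """Plain substring search; no base64 decoding needed."""
    return NEEDLE in text

if __name__ == "__main__":
    token = "dmVyeXNlY3JldHRva2Vu_w3M00w3M00w3M00w_c29tZXRva2Vuc2VjcmV0"
    print(contains_needle(token))                            # raw token
    print(contains_needle(basic_auth_header("bob", token)))  # encoded in a header
```

The same plain substring search covers both the raw token and its encoded form, which is the whole point of the tag.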
Is this actually useful? 🤷