The Base45 Data Encoding

datatracker.ietf.org

93 points by b5 5 years ago · 49 comments

Reader

Confiks 5 years ago

Note that this encoding isn't of the same efficiency as QR binary mode, as it converts 3 bytes into 2 base45 characters. So it's more like 'base41 using the base45' charset.

I'm still a bit sad that with this standard and the packages available now, the namespace of 'base45' is clobbered with this suboptimal implementation. It can best just be renamed to 'base41'. It's a good tradeoff for the DCC, but not for the rest of possible implementations.

For the Dutch variant of the green pass using unlinkable signatures [1], we need all the space we can get, so we use a base45 encoding that uses the exact same method as base58 [2][3], and which has the exact same efficiency as QR binary mode.

[1] https://github.com/minvws/nl-covid19-coronacheck-app-coordin...

[2] https://gist.github.com/confiks/8fcb480d87a50cf1bb5e40e2f093...

[3] https://github.com/confiks/base45-go/tree/main/base45

mjevans 5 years ago

32 bits raw <==> 33 bits (3 pairs of 11 bits) QR alphanum.
In blocks of 4 bytes this encodes as 6 'base45' (QR alphanum) characters, and uses the same lookup table.
https://en.wikipedia.org/wiki/QR_code#Encoding
The "Alphanumeric character codes" table, at least at a visual glance, is identical to the RFC's lookup table.
- lifthrasiir 5 years ago
  
  The GP is saying that the equal efficiency is possible with using only 41 out of 45 characters, so reducing the symbol set would make base45 (now base41) more useful as a general encoding, not just an encoding for QR codes.
  - mjevans 5 years ago
    
    That's a fair point that wasn't made clearly enough. I have two questions though:
    1) Is it worth no longer sharing the same lookup table? My supposition is that this wouldn't practically matter in the modern bloated environment anyway, but it's still annoying and might be a source of additional bugs.
    2) Which 4 characters would be skipped for what reason?
    I'd initially propose not including: non-printing character space (b45[36]), possible variable wildcard dollar-sign (also a currency symbol that might get translated) (b45[37]), html escape and database wildcard percent (b45[38]), then typical wildcard asterisk (star) (b45[39]). Thus the sequence string would be.
    b41 = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ+-./:";
pulse7 5 years ago

One could take 10 digits, 26 letters and 5 operands (+ - * / %) for Base41 encoding. 41 * 41 * 41 = 68921 is enough for 2 bytes (65536 combinatons).
- lifthrasiir 5 years ago
  
  / and % don't play well with URLs. I would take + - * . $ instead.
  - jalk 5 years ago
    
    '+' doesn't play well with URLs
    
    lifthrasiir 5 years ago
    
    Less so than %, for example. It is classified as a sub-delim subset of reserved characters [1], meaning that they won't conflict with the generic URI components syntax and thus are reasonably safe. (* and $ are also included in this set.) You might be worrying that it gets "converted" into %2B but this conversion is not required and fully reversible.
    [1] https://datatracker.ietf.org/doc/html/rfc3986#section-2.2

radicalbyte 5 years ago

A nice tidbit: this RFC has its roots in the EU Covid Certificate project. The encoding was designed to cut the size of the QR payload (which for DCC is a CBOR - binary encoded - object) :)

The smaller the payload, the better and faster the scanning. Which is important for something that is designed to be used during border crossings and the like.

We have a number of implementations here:

https://github.com/ehn-dcc-development/

lawl 5 years ago

Interesting. The swiss system, which is supposedly compatible with the EU uses JSON + base64 [0].
Seems we also use RSA, and from a quick glance the EU seems to take any x509 certificate authority [1].
Does anyone know if there is a reason elliptic curves weren't been mandated, which should cause smaller signatures than RSA and thus a smaller payload?
[0] https://github.com/admin-ch/CovidCertificate-Apidoc/#respons...
[1] https://github.com/ehn-dcc-development/dgc-java/blob/main/cr...
- Nursie 5 years ago
  
  >> Does anyone know if there is a reason elliptic curves weren't been mandated
  Support still patchy ?
  I work in this area (certificates mostly) and find issues every so often with various platforms having gaps in their EC stuff. That said most things can now deal with the base subset and sticking to (say) P-256 would probably be pretty trouble-free. Getting europe-wide agreement on this would likely be hard though, and european standards[0] often have stuff in that's not widely supported, like FRP256v1, which doesn't seem to be in openssl yet, or wasn't last time I looked, let alone more obscure or outdated implementations.
  [0] - e.g. https://www.etsi.org/deliver/etsi_ts/119300_119399/119312/01...
- chugy1 5 years ago
  
  base64 is just used to encode the PDF and PNG to send them over JSON (JSON does not support binary content). The data in the QR code is encoded in base45 as in the EU.
thehappypm 5 years ago

Honestly feels like premature/unnecessary optimization.

EdSchouten 5 years ago

The idea of this encoding is to store two bytes of data in three characters. To me it's not obvious why you need a base as high as 45 for that.

Assuming you either want to store two bytes, or a trailing one, you have 256*256 + 256 combinations: 65792. Using three base45 characters, you can get up to 45^3=91125 combinations. It looks like base41 would have been sufficient. That way you can get rid of some of those special characters, making it easier to use through different transports.

dolmen 5 years ago

+1
This would allow to avoid space, %, / and + to be more URL friendly, and so allow more usages.
- kristov 5 years ago
  
  Having these chars is a pain (particularly space). Manipulating lists of these in files with standard Unix tools will mean having to escape, quote, etc.
  - mr_mitm 5 years ago
    
    I don't understand why they didn't use lower case letters instead of special characters. Does anyone here know?
    
    lifthrasiir 5 years ago
    
    You can't put lowercase letters in the alphanumeric mode (5.5 bits per character) and need to switch to less efficient byte mode (8 bits per character).

kstenerud 5 years ago

Unfortunately, the QR code "binary" mode specification defaults to ISO 8859-1 for the encoding (because it was not originally intended to store actual binary data), and there's also no way to indicate what format is actually encoded. So all decoders of course just assume ISO 8859-1 because they have no way of knowing otherwise.

However, we could in theory get around this by using binary data formats that always begin with an invalid text character (such as 0x80-0x9f). This way, an implementation can know that the data is not ISO 8859-1, and try to decode whatever format it discovers through the beginning byte signature.

I've actually put this into Concise Encoding [1]

[1] https://github.com/kstenerud/concise-encoding/blob/master/cb...

nly 5 years ago

For context, it seems this encoding was covered by this discussion a few days ago:

What's Inside the EU Green Pass QR Code?

https://news.ycombinator.com/item?id=27589913

codeflo 5 years ago

So instead of extending QR codes, which are inherently binary, to efficiently handle binary payloads, we invent yet another ASCII-based tunneling scheme. Why ever fix any problem when we can just pile workaround upon workaround upon workaround?

masklinn 5 years ago

Qrcode already has a binary mode. The problem is that qrcode software treats qrcode data as text and fucks up. You’re not going to get every software out there fixed.
This does not extent qrcode at all, instead it defines a binary-to-text encoding designed to fit in qrcode’s existing alphanumeric mode, exactly like base32, base64 or base85 (or base36, base62, binhex, quopri) but fitting the specific constraints of the qrcode medium.
- dolmen 5 years ago
  
  > You’re not going to get every software out there fixed.
  You don't need every QR code software to be fixed anyway. Only the one which will use those QR codes which don't yet exist. These softwares will have to understand Base45 anyway.
  If you wrote that QRcode is not extensible (I don't know if that's the case) I would have agreed.
  - masklinn 5 years ago
    
    > You don't need every QR code software to be fixed anyway. Only the one which will use those QR codes which don't yet exist.
    No, the existing software misbehaves (and possibly crashes) when scanning pure binary data. If it did nothing problematic then the use of binary data would not matter, only the sofware which “will have to understand base45 anyway” would need to be binary-clean.
    And here qrcode software does not need to understand base45: the user can copy the base45 textual data to whichever other program cares for it.
    > If you wrote that QRcode is not extensible (I don't know if that's the case) I would have agreed.
    Why would I write that when it has nothing to do with the issue?
    And qrcode can be extended: the encoding mode is specified by a nibble, only 9 values are currently in use.
- rkangel 5 years ago
  
  Realistically, if you test with the camera apps from Apple, Google and Samsung I suspect you have 90% of the public's usage covered and everyone else will be forced to follow.
kstenerud 5 years ago

Unfortunately, the QR code "binary" mode specification defaults to ISO 8859-1 for the encoding (because it was not originally intended to store actual binary data), and there's also no way to indicate what format is actually encoded. So all decoders of course just assume ISO 8859-1 because they have no way of knowing otherwise.
However, we could in theory get around this by using binary data formats that always begin with an invalid text character (0x80-0x9f). This way, an implementation can know that the data is not ISO 8859-1, and try to decode whatever format it discovers through the beginning byte signature.
I've actually put this into Concise Encoding [1]
[1] https://github.com/kstenerud/concise-encoding/blob/master/cb...
atoav 5 years ago

Becaus sometimes you need your solution to be used be the public, voluntarily. If they have to install an extra app to read your "looks-like-QR-code-but-not-quite" thing, then you have at least an increased risk of not getting the adoption you want to get.
This is not a realistic choice for many projects.
- dolmen 5 years ago
  
  Do you mean that humans will have to decode Base45?
  If not, there is a software layer anyway. Adding another skin to the onion is not the best way to compress data.
  - cellularmitosis 5 years ago
    
    The issue is that there are two layers of software: the SDK, and the software which you write which uses the SDK.
    If the crash is happening in the SDK (on binary QR codes), then your only option is to ditch binary QR and implement a workaround.
tlamponi 5 years ago

Because QR code scanners are very widespread already, can be understood by even basic cameras nowadays and some devices are hard to update or would need yet another self-invented thingy, which would make people rather more suspicious if done for something like the use case this initially had – codes for verifying that a person is tested, vaccinated or recovered OK on borders and the like.
I at least find it good that existing solutions still get optimizations, even and are enhanced, vs. just throwing everything away at the slightest issue and redo everything, the that churn just costs a lot of $€£ while never bringing out something mature, i.e., useful for the masses.

mjevans 5 years ago

This could be useful for storing non-ascii armored crypto keys within a printed QR code format that can be stored in a fireproof safe and reasonably OCRed and decoded for use in disaster recovery or other applications.

chrismorgan 5 years ago

So, it’s using a 45-character alphabet which matches the QR code alphanumeric values table, which lets the QR code encoder switch to a more efficient mode that takes less space.

I just tried rendering a QR code of the 692 characters of the introductory paragraph (with lines joined appropriately), and compared it with a QR code of the same text, uppercased and with out-of-range characters `,`, `[` and `]` changed to %. This reduced an 89×89 code down to 77×77, a 24% reduction in area. If this is roughly the ratio, then Base45-encoding binary data by QR code will yield roughly 17% area savings compared with Base64. (Base45 gets 50% bloat, then 24% shrink = multiple of 1.14; Base64 gets 33% bloat = multiple of 1.33; Base45 / Base64 = 1.14 / 1.33 = 0.83.)

[Edit: edflsafoiewq’s figures on 40-L codes at https://news.ycombinator.com/item?id=27627915 come to about 23% savings, markedly more than my 17%.]

I can’t help but wonder if any of the other modes could be more efficient still—numeric mode, kanji mode and byte mode.

Of course, the specs for all this are ISO specs, so I can’t read them without coughing up the moolah.

In case it’s not clear, I am utterly inexpert in this domain.

Further thoughts:

Base45 encoding two octets in three characters is pretty wasteful: 45³ ÷ 256² ≈ 1.39, which is 39% waste. (By contrast, Base64 is 100% efficient with its alphabet: 64⁴ = 256³.) This means that if you were willing to do more complex encoding and decoding, you could shrink your QR code by roughly 39% more—to about 52% of the size of the Base64, rather than 83%. Leaving such a huge gap on the table puzzles me—I’d have thought that either you’d want something simple (where Base64 is well-understood) or want to minimise your QR codes, and Base45 sits in an awkward place in the middle.

For UTF-8, base-128 will be the most efficient you get. That’ll be ~14% inflation (7 bytes in 8 characters). Which… huh, that looks to be within ε of Base45’s 50% bloat and 24% shrinkage. Not sure if that’s a coincidence or not because I don’t know how alphanumeric mode versus byte mode works in QR codes. But this suggests that alphanumeric mode and Base45-but-not-wasteful would be markedly more efficient than byte mode and Base-122. Still leaves numeric and kanji modes open as possibilities. Again, I’m inexpert and don’t know how the encodings are actually done, and that’ll matter.

On edflsafoiewq’s 40-L figures: Base64 gets 2214 bytes, Base45 gets 2864 bytes, optimally-efficient base-45 would get log₂₅₆ 45⁴²⁹⁶ = 2949 bytes, only around 3% more. I think I must have made a mistake somewhere with some of my numbers.

anderskaseorg 5 years ago

The right way to measure the area ratio is using entropy. An optimal encoding would save at most 3% area over Base45:
Base64 in binary: log₂ 64 / 8 = 75.000% efficient
Base45 in alphanumeric: 4 log₂ 256 / 33 = 96.970% efficient
Optimal numeric: 3 log₂ 10 / 10 = 99.657% efficient
Optimal alphanumeric: 2 log₂ 45 / 11 = 99.851% efficient
Optimal binary (ISO 8859-1): log₂ 191 / 8 = 94.718% efficient
Optimal binary (UTF-8, single-byte subset): log₂ 128 / 8 = 87.500% efficient
Optimal binary (UTF-8, full): 1 = 128 / 2^(8α) + 1920 / 2^(16α) + 63488 / 2^(24α) + 1048576 / 2^(32α) ⇒ α = 89.706% efficient
Optimal kanji (JIS X 0208): log₂ 6879 / 13 = 98.061% efficient
The mistake in your 39% calculation is that you forgot to take logarithms before calculating the ratio.
- chrismorgan 5 years ago
  
  Ah hah, yes, the 39% was linear but needed to be log. Thanks for that, and all the other figures too.
lifthrasiir 5 years ago

> I can’t help but wonder if any of the other modes could be more efficient still—numeric mode, kanji mode and byte mode.
Byte mode is obviously most efficient. Alphanumeric mode uses 5.5 bits per each character, so combined with base45 it uses 5.5 * 3 = 16.5 bits to pack two octets. Base45 is actually not that bad as it seems (3.1% overhead). [EDIT: I've since seen edflsafoiewq's mention that typical QR readers try to interpret binary data as UTF-8, so it is not pointless.]
[MORE EDIT: I've completely missed the last paragraph, so I was actually just confirming what chrismorgan said.]
- chrismorgan 5 years ago
  
  > Maybe a compatibility issue?
  From the introduction:
  > Even in Byte mode a typical QR-code reader tries to interpret a byte sequence as an UTF-8 or ISO/IEC 8859-1 encoded text. Thus QR-codes cannot be used to encode arbitrary binary data directly.
  Back to your comment:
  > Base45 is actually not that bad as it seems (3.1% overhead)
  That’s very close to my logarithm calculations from edflsafoiewq’s 40-L figures (the log₂₅₆ 45⁴²⁹⁶ bit of my comment, just added). Would you mind explaining to me what I did wrong with my 45³ ÷ 256² calculation that said 39%?
  - lifthrasiir 5 years ago
    
    Oops, I've completely missed the last paragraph and thought that you were giving a wildly larger figure with a wrong assumption. Your actual assumption for optimally-efficient base-45 is correct and close enough to the actual QR code spec.
    
    chrismorgan 5 years ago
    
    I’ve been adding bits to my comment as I went. It wasn’t there when you wrote your comment. I’m stopping now and going to go eat, confused about the 39% and 3% and what’s going on. I haven’t worked with QR codes enough!

eventreduce1 5 years ago

What are the benefits to base58?

Base45 uses chars like backslash. This is super annoying when the encoded string is used in an url.

edflsafoiewq 5 years ago

It's not for avoiding looks-similar-to-a-human chars like Base58. It's for QR codes. The alphabet is exactly the alphabet for Alphanumeric mode QR codes.
A 40-L code allows 4296 characters in Alphanumeric-mode = 2864 bytes after Base45 decoding.
The same code allows 2953 characters in Byte mode. If you use Byte mode to hold Base64, that's 2214 bytes after decoding, so Base45 is more efficient.
The reason it gives for why you can't use Byte mode to directly hold binary data is
> Even in Byte mode a typical QR-code reader tries to interpret a byte sequence as an UTF-8 or ISO/IEC 8859-1 encoded text. Thus QR-codes cannot be used to encode arbitrary binary data directly.
It also says you're not supposed to use it anywhere but QR codes (like URLs)
> If the data is to be sent via some other transport [not stored in a QR-code], a transport encoding suitable for that transport should be used instead of Base45. It is not recommended to first encode data in Base45 and then encode the resulting string in for example Base64 if the data is to be sent via email. Instead the Base45 encoding should be removed, and the data itself should be encoded in Base64.
DocTomoe 5 years ago

The draft specifically tells you not to do that:
"If the data is to be sent via some other transport, a transport encoding suitable for that transport should be used instead of Base45."
Base45 is mainly useful for binary information in QR codes.
- travisgriggs 5 years ago
  
  I wish they would call it something like QR45 then, not Base45.
  - masklinn 5 years ago
    
    It’s a B2T encoding, not a qrcode mode (the mode is 2).
zaxomi 5 years ago

slash = /
backslash = \
They are using slash.
And space, which is confusing. Did not see the space in the "Hello!!" example since the space is the last character on the line.
- sixothree 5 years ago
  
  Trailing spaces too?

simojk 5 years ago

How does this compare with Data Matrix and C40 encoding?

upofadown 5 years ago

Why would this be a RFC? It seems quite specific to QR codes. Nothing specific to the Internet.

Settings

The Base45 Data Encoding

Keyboard Shortcuts