Part 1 discussed the last two or four unused bits at the end of the Base64 string. Those bits should be zeroed but generally nothing verifies that, so they can also be used for different purposes. In this final part 2, we will focus on the more obvious non-canonical forms of Base64.
Posts in the series:
- Base64 Beyond Encoding – Steganography and Canonical Form (part 1)
- Base64 Beyond Encoding – Steganography and Canonical Form (part 2)
Base64 alphabet
Formally, Base64 uses characters from the ranges A-Z, a-z, 0-9, as well as + and /. Both special characters can differ in other Base64 variants. For example, both + and / have a special meaning in URLs: + in the query part replaces a space (it is the shorter form of %20), and / is of course the path separator. This means that Base64 in its standard version is not very usable in web addresses. Here the “web version”, that is base64url, comes to the rescue and replaces + with - and / with _.
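As a small illustration, Python's standard base64 module exposes both alphabets. The two input bytes below are an arbitrary choice, picked only because they happen to produce both special characters in the standard encoding.

```python
import base64

# Two bytes picked so that the standard encoding contains both + and /.
data = b"\xfb\xff"

standard = base64.b64encode(data).decode("ascii")
url_safe = base64.urlsafe_b64encode(data).decode("ascii")

print(standard)  # +/8=
print(url_safe)  # -_8=
```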
Regardless of the variant, all other characters are not part of the Base64 alphabet. Even if we stick to basic 7-bit ASCII, that leaves 31 unused printable characters (counting the space and the padding character =) and 32 unused control characters (33 if DEL is included). What if such a character appeared in the Base64 stream?
It depends on the specification in use. The base specification (RFC 4648) says that implementations must reject strings that contain any characters outside of the Base64 alphabet… unless the referring specification says otherwise:
> Implementations MUST reject the encoded data if it contains
> characters outside the base alphabet when interpreting base-encoded
> data, unless the specification referring to this document explicitly
> states otherwise.
This puts general-purpose implementations in a slightly awkward spot: on the one hand, the specification says that extra characters must be rejected; on the other, users will probably want to use the decoder in a variety of situations. As a compromise of sorts, implementations often ignore extra characters by default:
// PHP
$s = ':S!G@V#4$Q%X^J&j*Y(W)5[h]';
base64_decode($s); // "HexArcana"
base64_decode($s, /*strict=*/true); // false
# Python
import base64
s = ':S!G@V#4$Q%X^J&j*Y(W)5[h]'
base64.b64decode(s)  # b"HexArcana"
base64.b64decode(s, validate=True)  # binascii.Error: Only base64 data is allowed
// Node.js
const s = ':S!G@V#4$Q%X^J&j*Y(W)5[h]';
Buffer.from(s, "base64").toString(); // "HexArcana"
What follows is that nothing stands in the way of weaving a second data stream, encoded using the remaining characters, into the Base64 string. Actually, all that is needed are two characters and binary encoding.
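A minimal sketch of the two-character idea follows. The marker characters ! and $, as well as the helper names weave and unweave, are assumptions made for this illustration; any two characters outside the alphabet would do, and a lenient decoder is assumed on the receiving side.

```python
import base64

ZERO, ONE = "!", "$"  # arbitrary choice: any two characters outside the alphabet


def weave(b64_text: str, secret: bytes) -> str:
    """Insert one marker character per secret bit in front of successive characters."""
    bits = "".join(f"{byte:08b}" for byte in secret)
    if len(bits) > len(b64_text):
        raise ValueError("cover text too short for this sketch")
    out = []
    for i, ch in enumerate(b64_text):
        if i < len(bits):
            out.append(ONE if bits[i] == "1" else ZERO)
        out.append(ch)
    return "".join(out)


def unweave(text: str) -> bytes:
    """Collect the marker characters and turn them back into bytes."""
    bits = "".join("1" if ch == ONE else "0" for ch in text if ch in (ZERO, ONE))
    return bytes(int(bits[i:i + 8], 2) for i in range(0, len(bits), 8))


cover = base64.b64encode(b"HexArcana is cool").decode("ascii")
woven = weave(cover, b"hi")

print(base64.b64decode(woven))  # the lenient decoder still sees the cover data
print(unweave(woven))           # the second stream
```

A strict decoder (for example validate=True in Python) would reject the woven string, which is exactly why the default lenient mode matters here.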
A single character is also enough if the data is encoded in the distance (measured in characters) between its occurrences. For example, in the specification for encoding email attachments (MIME, RFC 2045), the following can be found:
> The encoded output stream must be represented in lines of no more
> than 76 characters each.
As such, we can encode additional data in the lengths of individual lines. After all, "no more than 76 characters each" does not require lines to be exactly 76 characters long (although it should be pointed out that the same specification says that all characters outside the alphabet should be ignored).
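The line-length variant can be sketched as follows. The mapping (a line of 60 to 75 characters carries one hex digit, and the first full 76-character line ends the hidden data) is an assumption invented for this example, as are the function names; it is not taken from any specification.

```python
import base64

BASE = 60  # assumed mapping: line length 60..75 carries nibble 0..15
FULL = 76  # MIME maximum; used for lines that carry no hidden data


def hide_in_line_lengths(b64_text: str, secret: bytes) -> str:
    """Break b64_text into lines whose lengths spell out the secret's hex digits."""
    nibbles = [n for byte in secret for n in (byte >> 4, byte & 0x0F)]
    lines, pos = [], 0
    for nibble in nibbles:
        lines.append(b64_text[pos:pos + BASE + nibble])
        pos += BASE + nibble
    if len(b64_text) - pos < FULL:
        raise ValueError("cover text too short: need a full line after the data")
    # The rest is wrapped at 76, so the first 76-character line marks the end.
    lines += [b64_text[i:i + FULL] for i in range(pos, len(b64_text), FULL)]
    return "\n".join(lines)


def read_line_lengths(text: str) -> bytes:
    """Read nibbles from line lengths until a line falls outside 60..75."""
    nibbles = []
    for line in text.split("\n"):
        if not BASE <= len(line) <= BASE + 15:
            break
        nibbles.append(len(line) - BASE)
    return bytes(hi << 4 | lo for hi, lo in zip(nibbles[::2], nibbles[1::2]))


cover = bytes(range(256)) * 2  # 512 arbitrary bytes as cover data
wrapped = hide_in_line_lengths(base64.b64encode(cover).decode("ascii"), b"\xbe\xef")

print(read_line_lengths(wrapped))           # the hidden bytes
print(base64.b64decode(wrapped) == cover)   # newlines are ignored by the lenient decoder
```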
Anyone interested is invited to test the above methods and uncover the data “hidden” in the string below (hint: hex):
TG9y$ZW0gaXBz$dW0gZG$9sb3I$gc2l0IG$FtZXQsIG$Nvbn$N$lY3RldH$VyIGFkaXBpc$2Npb$mcgZWxpd$C4gUGV$sbGVudGVz$cXVlIH$NvZG$FsZXMg$YSBua$XNs$IGVnZX$QgY$WNjd$W1zY$W$4uIE1v$cmJpIGdyYXZpZG$EsIGVs$aXQg$YWMgZ$3Jh$dmlkYS$Bjb25$2YWxsa$XMsIG$1hZ25h$IGFudGUgdWx$0cm$ljZXMg$YXJ$jdSw$gdGVtcG$9yIHRyaXN0aXF$1ZSBlbmltIHB1cnVzIGVnZXQgdXJuYS4gU2VkIHRyaXN0aXF1ZSB0aW5jaWR1bnQgYXVndWUsIHZlbCBpbXBlcmRpZXQgZXJhdC4gUHJhZXNlbnQgY29uc2VjdGV0dXIgdWx0cmljaWVzIGVzdCBhdCBtb2xsaXMuIEFlbmVhbiBldCBwbGFjZXJhdCBudWxsYS4gU3VzcGVuZGlzc2UgdGluY2lkdW50IHRlbXBvciBxdWFtLCBzZWQgdGVtcHVzIHF1YW0gc2NlbGVyaXNxdWUgYS4gRG9uZWMgdmVzdGlidWx1bSwgZWxpdCBhdCBydXRydW0gc29kYWxlcywgdHVycGlzIGxlY3R1cyBoZW5kcmVyaXQgbmVxdWUsIHF1aXMgZWdlc3RhcyBsb3JlbSBxdWFtIG9ybmFyZSB2ZWxpdC4gTW9yYmkgbG9ib3J0aXMgYWNjdW1zYW4gcGVsbGVudGVzcXVlLg==
And one more random tidbit: Python's implementation, even in validate=True mode, ignores extra padding characters (=) at the end of the string (this behaviour may depend on the Python version):
# Python
import base64
s = 'SGV4QXJjYW5h==========='
base64.b64decode(s)  # b"HexArcana"
base64.b64decode(s, validate=True)  # b"HexArcana"
Base64 as a key and the canonical form
Finally, I would like to point out one more essential thing. Using Base64 as a key (as in a lookup, e.g. a key in a dictionary/map, an element in a set, etc.) is a bit dangerous, because in practice, outside of the very restrictive canonical form, numerous different more-or-less compliant Base64 strings decode to the same output. That is, the same characteristics that enable hiding additional data in a Base64 string (unused bits, ignored characters outside the alphabet, ignored padding characters) also cause non-restrictive Base64 to NOT BE a 1-to-1 mapping.
For example, each of the strings below is decoded to the same output stream:
# Python
from base64 import b64decode
b64decode("SGV4QXJjYW5hIGlzIGNvb2w=") # b'HexArcana is cool'
b64decode("SGV4QXJjYW5hIGlzIGNvb2x=") # b'HexArcana is cool'
b64decode("SGV4QXJjYW5hIGlzIGNvb2w====") # b'HexArcana is cool'
b64decode("@SGV$4QXJjYW5h%IGl*zIGNv(b2:w=") # b'HexArcana is cool'
As a consequence, banning public keys in their encoded Base64 form might not work as expected, and that's without even delving into ambiguities stemming from the binary formats used to encode public keys.
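One possible way around this, sketched below, is to never compare the strings as received: either normalize them by decoding and re-encoding, or reject anything that is not already in canonical form. The helper names canonical_key and is_canonical are made up for this example, and the approach assumes the decoded bytes themselves are what should be compared.

```python
import base64
import binascii


def canonical_key(b64_text: str) -> str:
    """Decode leniently, then re-encode: equal payloads give equal keys."""
    return base64.b64encode(base64.b64decode(b64_text)).decode("ascii")


def is_canonical(b64_text: str) -> bool:
    """True only if re-encoding the decoded bytes reproduces the input exactly."""
    try:
        raw = base64.b64decode(b64_text, validate=True)
    except binascii.Error:
        return False
    return base64.b64encode(raw).decode("ascii") == b64_text


variants = [
    "SGV4QXJjYW5hIGlzIGNvb2w=",
    "SGV4QXJjYW5hIGlzIGNvb2x=",
    "SGV4QXJjYW5hIGlzIGNvb2w====",
    "@SGV$4QXJjYW5h%IGl*zIGNv(b2:w=",
]

print({canonical_key(v) for v in variants})  # a set with a single element
print([is_canonical(v) for v in variants])   # only the first string is canonical
```

The re-encoding comparison in is_canonical does not rely on how strict a particular decoder version is, since any string that differs from the canonical encoding of its own payload fails the final equality check.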
Summary
Some final random tidbits:
- A Base64 string must be decoded starting from the first character of a 4-character "chunk" to obtain sensible output. Nonetheless, if the initial characters are missing or we do not know which character is first, it is enough to try decoding the string four times: starting from the first available character, then the second, then the third, and finally the fourth. One of those four attempts is guaranteed to “synchronize” with the actual Base64 stream.
- Corruption of Base64 characters is local and does not influence the decoding of the remaining, uncorrupted data. More specifically, each corrupted (replaced) Base64 character influences at most two decoded bytes. Inserted or deleted characters are a different matter, as they shift the alignment of everything that follows.
- Password hashes written in /etc/shadow also use Base64 but with a different alphabet – specifically, the order of characters in the alphabet is completely different. In classic Base64, we use A-Za-z0-9+/, whereas the encoding (called B64) used under *nix for hashes uses the ./0-9A-Za-z alphabet.
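The first two tidbits can be checked with a short experiment. This is only a sketch: the sample text is chosen to be 18 bytes long so that no padding gets in the way, and the helper names try_all_offsets and changed_bytes are invented for the example.

```python
import base64


def try_all_offsets(fragment: str) -> list[bytes]:
    """Decode a fragment four times, skipping 0, 1, 2 and 3 leading characters."""
    results = []
    for offset in range(4):
        chunk = fragment[offset:]
        chunk = chunk[:len(chunk) // 4 * 4]  # keep whole 4-character groups only
        results.append(base64.b64decode(chunk))
    return results


def changed_bytes(a: bytes, b: bytes) -> list[int]:
    """Indexes at which two equally long byte strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


plain = b"HexArcana is cool!"  # 18 bytes, so no padding is involved
encoded = base64.b64encode(plain).decode("ascii")

# Resynchronization: the first 5 characters are lost, so offset 3 realigns (5 + 3 = 8).
for offset, decoded in enumerate(try_all_offsets(encoded[5:])):
    print(offset, decoded)

# Locality: overwrite a single character and list the decoded bytes that changed.
damaged = encoded[:9] + ("A" if encoded[9] != "A" else "B") + encoded[10:]
print(changed_bytes(plain, base64.b64decode(damaged)))
```

Only one of the four printed candidates is a readable suffix of the original text; the other three are misaligned and come out as garbage.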
In the end, Base64 is quite a simple encoding, but it still has its curiosities and quirks.