A mistake beginners make very often (and which IS NOT the topic of this post) is calling Base64 encoding “encryption”. Base64 is, of course, not encryption, as Dr. Iwona Polak brilliantly explains in her newest LinkedIn post (in Polish only). That said, Base64 and some of its implementations have curious quirks, and these are the topic of this series of posts.
Posts in the series:
- Base64 Beyond Encoding – Steganography and Canonical Form (part 1)
- Base64 Beyond Encoding – Steganography and Canonical Form (part 2)
In a nutshell, Base64 is an encoding that transforms 3 raw bytes into 4 text characters, which is very useful when we need to print or share binary data through a text medium. A good modern example is the data: scheme, which enables, among other things, the use of Base64 to place binary image files directly in the HTML code:
<img
src="data:image/png;base64,
iVBORw0KGgoAAAANSUhEUgAAAAUAAAAEAQMAAAB8/WcDAAAABlBMVEUAAAD
///+l2Z/dAAAAEElEQVQI12NYwfCDoYChAwALUAKZCQz4kwAAAABJRU5Erk
Jggg=="
style="width: 120px; height: 100px; image-rendering: pixelated;">
Effect of the code above:
Base64, in its many variants, can be found practically everywhere: from web technologies, through cryptography (storing binary keys, JWT), to network protocols (such as SMTP or IMAP, where it encodes attachments). As a result, we deal with many different implementations that are not always fully compatible with each other and do not always follow the specification (RFC 4648) to a T (although it should be added that this is sometimes caused by a referring specification that enforces the discrepancies).
Steganography in unused bits
Let’s start with what is, in my opinion, the most interesting Base64 feature: the unused bits in the last encoded character.
Where do these “unused bits” even come from? As I mentioned before, Base64 encodes (maps) 3 bytes to 4 characters. To be more precise, and getting into “bit maths”: 24 input bits, originally packed into 3 bytes of 8 bits each (which is the standard today, but wasn’t always), are output as 4 characters carrying 6 bits each (3 * 8 = 4 * 6).
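The 24-bit regrouping can be sketched in a few lines of Python. This is only an illustration of the bit arithmetic for one full 3-byte group; the helper name `encode_group` is made up for this example, and the alphabet is the standard one from RFC 4648.

```python
import base64

# Standard Base64 alphabet (RFC 4648, section 4).
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode_group(three_bytes: bytes) -> str:
    """Encode exactly 3 bytes into 4 Base64 characters (hypothetical helper)."""
    if len(three_bytes) != 3:
        raise ValueError("expected exactly 3 bytes")
    # Pack the 3 bytes into one 24-bit integer.
    value = int.from_bytes(three_bytes, "big")
    # Cut the 24 bits into four 6-bit indices, most significant first.
    indices = [(value >> shift) & 0x3F for shift in (18, 12, 6, 0)]
    return "".join(ALPHABET[i] for i in indices)

# The manual result should match the standard library encoder.
print(encode_group(b"Man"), base64.b64encode(b"Man").decode("ascii"))
```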
What if there is less data? That is, 1 byte or 2 bytes? Let’s calculate:
1 byte == 8 bits » that gives us 1 character (6 bits)
and 2 more bits in the second character
... and 4 unused bits?
2 bytes == 16 bits » that gives us 2 characters (12 bits)
and 4 more bits in the third character
... and 2 unused bits?
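The same calculation, written as a small Python sketch (the function name `unused_bits` is just an illustrative choice):

```python
def unused_bits(n_bytes: int) -> int:
    """Number of unused bits in the last Base64 character for n_bytes of input."""
    # Bits in the final, incomplete 3-byte group (0, 8 or 16).
    leftover_bits = (n_bytes % 3) * 8
    # Bits still needed to fill the last 6-bit character completely.
    return -leftover_bits % 6

for n in (1, 2, 3):
    print(n, "byte(s):", unused_bits(n), "unused bit(s)")
```

A full 3-byte group leaves nothing unused, so only strings that end in = or == have room to spare.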
The specification says (section 3.5. Canonical Encoding) that those unused bits need to be zeroed. Do popular decoders check that, though? No, because they don’t have to (the keyword below is MAY):
> In some environments, the alteration is critical and therefore
> decoders MAY chose to reject an encoding if the pad bits have not
> been set to zero.
How to test that? The simplest way is to encode one or two zero bytes, which canonically gives AA== and AAA= respectively. Now we can change one of the final characters (that is, the last A) into, for example, B, so that we obtain AB== and AAB= (which in effect flips the last unused bit from 0 to 1).
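To see which bits the swapped character actually touches, here is a small sketch that splits a padded Base64 string into its data bits and its unused bits. The helper `split_bits` is hypothetical and assumes the standard alphabet.

```python
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def split_bits(encoded: str) -> tuple[str, str]:
    """Return (data bits, unused bits) of a padded Base64 string."""
    chars = encoded.rstrip("=")
    # Every character contributes 6 bits.
    bits = "".join(format(ALPHABET.index(c), "06b") for c in chars)
    # Only whole bytes count as data; whatever is left over is unused.
    n_data = len(bits) // 8 * 8
    return bits[:n_data], bits[n_data:]

for s in ("AA==", "AB==", "AAA=", "AAB="):
    print(s, split_bits(s))
```

In each pair the data bits are identical; only the trailing unused bits differ.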
We can now pass the strings prepared this way to decoders, e.g.:
// PHP
base64_decode("AB=="); // Will return \0 (no errors).
base64_decode("AB==", /*strict=*/true); // Ditto (also no errors).
# Python
import base64
base64.b64decode("AB==")  # Returns b'\x00' (no errors).
base64.b64decode("AB==", validate=True)  # Ditto (also no errors).
// Node.js
Buffer.from("AB==", "base64"); // Returns a buffer holding a single 0x00 byte (no errors).
And indeed, even in strict mode (PHP) or with validate=True set (Python), the standard library decoders have no problem with non-zero unused bits.
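If an application does need to reject such strings, one possible approach (a sketch, not something the standard library offers as a ready-made option as far as this post is concerned) is to decode and then re-encode, and to accept the input only when the round trip reproduces it exactly. The function name `is_canonical` is an assumption of this example.

```python
import base64

def is_canonical(encoded: str) -> bool:
    """True if the string decodes and re-encodes to exactly itself."""
    try:
        raw = base64.b64decode(encoded, validate=True)
    except ValueError:
        # binascii.Error is a subclass of ValueError: not valid Base64 at all.
        return False
    # The encoder always emits zeroed unused bits, so any difference
    # means the input was not in canonical form.
    return base64.b64encode(raw).decode("ascii") == encoded

for s in ("AA==", "AB==", "AAA=", "AAB="):
    print(s, is_canonical(s))
```

The round trip costs an extra encoding pass, but it avoids reimplementing the bit checks by hand.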
Which brings us to steganography (hiding information): those unused bits can be used to hide four or two bits of data.
Of course, 2 or 4 bits is very little. Therefore, when this steganography method is used, the hidden data is split into numerous 2- or 4-bit packets that are hidden in many separate Base64-encoded strings.
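A minimal sketch of the idea in Python follows. The function names `hide` and `extract` and the carrier data are made up for this example, and the choice to store the secret bits most-significant-first, in the order of the strings, is an assumption; real tools may order the bits differently.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"
PAD_BITS = {0: 0, 1: 4, 2: 2}  # len(raw) % 3 -> unused bits in the last character

def hide(chunks: list[bytes], secret_bits: str) -> list[str]:
    """Encode each chunk separately and put secret bits into the unused bits."""
    out, pos = [], 0
    for raw in chunks:
        encoded = base64.b64encode(raw).decode("ascii")
        n = PAD_BITS[len(raw) % 3]
        payload = secret_bits[pos:pos + n]
        if n and payload:
            payload = payload.ljust(n, "0")  # zero-fill a short final packet
            pos += n
            body = encoded.rstrip("=")
            padding = encoded[len(body):]
            # The canonical last character has zeroed low bits, so OR is enough.
            last = ALPHABET.index(body[-1]) | int(payload, 2)
            encoded = body[:-1] + ALPHABET[last] + padding
        out.append(encoded)
    if pos < len(secret_bits):
        raise ValueError("not enough carrier strings for the secret")
    return out

def extract(encoded_strings: list[str]) -> str:
    """Collect the unused bits of every string, in order."""
    bits = []
    for encoded in encoded_strings:
        body = encoded.rstrip("=")
        n = {0: 0, 1: 2, 2: 4}[len(encoded) - len(body)]  # '=' count -> unused bits
        if n:
            low = ALPHABET.index(body[-1]) & ((1 << n) - 1)
            bits.append(format(low, f"0{n}b"))
    return "".join(bits)

carriers = [b"a", b"bc", b"def", b"g"]  # made-up cover data: 4 + 2 + 0 + 4 bits
stego = hide(carriers, "1011011101")
print(stego)
print(extract(stego))
# A lenient decoder still returns the original cover data.
print([base64.b64decode(s) for s in stego])
```

Note that `extract` returns the unused bits of every string, including zero bits from strings that carry no payload, so the receiver has to know (or guess) where the message ends.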
Hint for CTF players: if you receive a lot of small Base64-encoded files, or one big file with many separate Base64 strings, then you are very likely dealing with the steganographic trick described above.
Of course, such things are best put into practice right away, so... good luck!
TG9yZW1= aXBzdW0= ZG9sb3K= c2l0 YW1ldCw= Y29uc2VjdGV0dXJ= YWRpcGlzY2luZ5== ZWxpdC5= UGVsbGVudGVzcXVl c29kYWxlc3== YY== bmlzbE== ZWdldB== YWNjdW1zYW4u TW9yYml= Z3JhdmlkYSz= ZWxpdL== YWN= Z3JhdmlkYQ== Y29udmFsbGlzLF== bWFnbmE= YW50ZS== dWx0cmljZXN= YXJjdSy= dGVtcG9y dHJpc3RpcXVl ZW5pbZ== cHVydXN= ZWdldN== dXJuYS4= U2Vk dHJpc3RpcXVl dGluY2lkdW50 YXVndWUs dmVs aW1wZXJkaWV0 ZXJhdC5= UHJhZXNlbnQ= Y29uc2VjdGV0dXL= dWx0cmljaWVz ZXN0 YXT= bW9sbGlzLm== QWVuZWFu ZXR= cGxhY2VyYXS= bnVsbGEu U3VzcGVuZGlzc2V= dGluY2lkdW50 dGVtcG9y cXVhbSw= c2Vk dGVtcHVz cXVhbS== c2NlbGVyaXNxdWV= YS6= RG9uZWM= dmVzdGlidWx1bSx= ZWxpdH== YXQ= cnV0cnVt c29kYWxlcyz= dHVycGlz bGVjdHVz aGVuZHJlcml0 bmVxdWUs cXVpc2== ZWdlc3Rhc1== bG9yZW0= cXVhbd== b3JuYXJl dmVsaXQu TW9yYmm= bG9ib3J0aXM= YWNjdW1zYW7= cGVsbGVudGVzcXVlLk== U2Vk ZWdldH== c2FwaWVu dXT= bGlndWxh c2VtcGVy cG9ydHRpdG9yLk==
There are a few more tidbits connected with Base64, but since this post is already quite long, we will leave them for another day (we have RSS/Atom and a Newsletter so that you don't miss a thing).