Base64 Beyond Encoding – Steganography and Canonical Form (part 1)

Gynvael Coldwind

2024-08-16

A mistake beginners make very often (and which IS NOT the topic of this post) is calling Base64 encoding “encryption”. Base64 is of course not encryption, as Dr. Iwona Polak brilliantly explains in her newest LinkedIn post (in Polish only). That being said, Base64 and some of its implementations have some curious quirks, and those are the topic of this series of posts.

In a nutshell, Base64 is an encoding which transforms 3 raw bytes to 4 text characters, which is very useful if we need to print or share binary data through a text medium. A good modern example is the data: scheme, which enables, among other things, the use of Base64 to place binary image files directly in the HTML text code:

<img
  src="data:image/png;base64,
    iVBORw0KGgoAAAANSUhEUgAAAAUAAAAEAQMAAAB8/WcDAAAABlBMVEUAAAD
    ///+l2Z/dAAAAEElEQVQI12NYwfCDoYChAwALUAKZCQz4kwAAAABJRU5Erk
    Jggg=="
  style="width: 120px; height: 100px; image-rendering: pixelated;">

Effect of the code above:

Base64, in its various variants, can be found practically everywhere: from web technologies, through cryptography (storing binary keys, JWT), to network protocols (such as SMTP or IMAP, where it encodes attachments). As a result, we deal with many different implementations that are not always fully compatible with each other and don’t always follow the specification (RFC 4648) to a T, although it should be added that the discrepancies are sometimes mandated by the referring specification itself.

Steganography in unused bits

Let’s start with what is, in my opinion, the most interesting Base64 quirk (or rather, feature): the unused bits in the last encoded character.

Where do these “unused bits” even come from? As I mentioned before, Base64 encodes (maps) 3 bytes to 4 characters. To be more precise, and getting into “bit maths”: 24 input bits, originally packed into 3 bytes of 8 bits each (which is the standard today but wasn’t always), are output as 4 characters carrying 6 bits each (3 * 8 = 4 * 6).

base64-explained-en.png
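To make the 3 * 8 = 4 * 6 mapping concrete, here is a minimal sketch that encodes one full 3-byte group by hand and compares the result with Python's standard library. The helper name encode_triplet is made up for this illustration, and the sketch deliberately ignores padding, which is covered next.

```python
import base64

# Standard Base64 alphabet from RFC 4648 (index 0 is 'A', index 63 is '/').
ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def encode_triplet(chunk: bytes) -> str:
    """Encode exactly 3 bytes into 4 Base64 characters (hypothetical helper)."""
    if len(chunk) != 3:
        raise ValueError("this sketch only handles full 3-byte groups")
    # Pack the three bytes into a single 24-bit integer.
    value = int.from_bytes(chunk, "big")
    # Slice it into four 6-bit indices, most significant first.
    indices = [(value >> shift) & 0x3F for shift in (18, 12, 6, 0)]
    return "".join(ALPHABET[i] for i in indices)

print(encode_triplet(b"Man"))             # TWFu
print(base64.b64encode(b"Man").decode())  # TWFu
```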

What if there is less data? That is, 1 byte or 2 bytes? Let’s calculate:

1 byte  ==  8 bits » that gives us 1 character (6 bits)
                     and 2 more bits in the second character
                     ... and 4 unused bits?

2 bytes == 16 bits » that gives us 2 characters (12 bits)
                     and 4 more bits in the third character
                     ... and 2 unused bits?

base64-unused-en.png
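The same arithmetic can be written down as a tiny function. This is only a sketch of the calculation above; the name unused_bits is an assumption of this example and not part of any library.

```python
def unused_bits(n_bytes: int) -> int:
    """Number of unused bits in the last Base64 character for n_bytes of input."""
    total_bits = 8 * n_bytes
    # Round up to a whole number of 6-bit characters (ceiling division).
    n_chars = -(-total_bits // 6)
    # Whatever the characters can hold beyond the real data is unused.
    return 6 * n_chars - total_bits

for n in (1, 2, 3, 4, 5, 6):
    print(n, unused_bits(n))
# 1 4
# 2 2
# 3 0
# 4 4
# 5 2
# 6 0
```

Only the remainder of the input length modulo 3 matters: a remainder of 1 leaves 4 unused bits, a remainder of 2 leaves 2, and a full group leaves none.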

The specification says (section 3.5. Canonical Encoding) that conforming encoders must set those unused bits to zero. Do popular decoders check that, though? No, because they don’t have to (keyword below: MAY):

> In some environments, the alteration is critical and therefore
> decoders MAY chose to reject an encoding if the pad bits have not
> been set to zero.

How to test that? The simplest way is to encode one or two zero bytes, which canonically gives AA== and AAA=, respectively. Now we can replace the last character before the padding (that is, the last A) with, for example, B, so that we obtain AB== and AAB= (which in effect flips the lowest unused bit from 0 to 1).
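Swapping A for B is just one of several possibilities. As a rough sketch, the hypothetical helper below (variants is a name invented here) lists every string that differs from a canonical, padded Base64 string only in its unused bits. It assumes well-formed input using the standard alphabet.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def variants(canonical: str) -> list[str]:
    """All strings that differ from `canonical` only in the unused bits."""
    body = canonical.rstrip("=")
    padding = canonical[len(body):]
    # "==" leaves 4 unused bits, "=" leaves 2, no padding leaves none.
    free_bits = {0: 0, 1: 2, 2: 4}[len(padding)]
    if free_bits == 0:
        return [canonical]
    # Clear the unused bits of the last character, then try every value.
    base = ALPHABET.index(body[-1]) & ~((1 << free_bits) - 1)
    return [body[:-1] + ALPHABET[base | extra] + padding
            for extra in range(1 << free_bits)]

print(variants("AAA="))       # ['AAA=', 'AAB=', 'AAC=', 'AAD=']
print(len(variants("AA==")))  # 16

# In its default (non-validating) mode, Python decodes all of them identically.
print({base64.b64decode(v) for v in variants("AAA=")})  # {b'\x00\x00'}
```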

We can then pass the strings prepared this way to decoders, e.g.:

// PHP
base64_decode("AB==");  // Will return \0 (no errors).
base64_decode("AB==", /*strict=*/true);  // Ditto (also no errors).

# Python
import base64
base64.b64decode("AB==")  # Will return b'\x00' (no errors).
base64.b64decode("AB==", validate=True)  # Ditto (also no errors).

// Node.js
Buffer.from("AB==", "base64");  // Will return a single \0 byte (no errors).

And indeed, even in the strict (PHP) mode or with validate=True (Python) set, standard library decoders have no problem with non-zero unused bits.
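If an application does care about canonical form, one common workaround is to decode the input, encode the result again, and compare it with what was received. The sketch below assumes the standard alphabet with padding; is_canonical is a hypothetical helper name, not a standard library function.

```python
import base64

def is_canonical(encoded: str) -> bool:
    """True if `encoded` is exactly what a conforming encoder would emit."""
    try:
        raw = base64.b64decode(encoded, validate=True)
    except ValueError:  # binascii.Error is a subclass of ValueError
        return False
    # A conforming encoder zeroes the unused bits, so any difference
    # between the round trip and the input means it was not canonical.
    return base64.b64encode(raw).decode("ascii") == encoded

print(is_canonical("AA=="))  # True
print(is_canonical("AB=="))  # False
print(is_canonical("AAB="))  # False
```

The round trip costs an extra encoding pass, but it avoids reimplementing the bit checks by hand.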

Which brings us to steganography (hiding information): those unused bits can be used to hide four or two bits of data per encoded string.

Of course, 2 or 4 bits is very, very little. Therefore, when this steganography method is used, the hidden data is split into numerous 2- or 4-bit packets that are hidden in many separate Base64-encoded strings.
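Here is a minimal round-trip sketch of that idea. The function names (free_bits, hide, extract) are invented for this example, and the bit order is an assumption: the secret is consumed most significant bit first and strings without padding are skipped. Other tools may use a different convention.

```python
import base64

ALPHABET = "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/"

def free_bits(encoded: str) -> int:
    """Unused bits in a padded Base64 string: 4 for '==', 2 for '=', else 0."""
    padding = len(encoded) - len(encoded.rstrip("="))
    return {0: 0, 1: 2, 2: 4}[padding]

def hide(covers: list[bytes], bits: str) -> list[str]:
    """Encode each cover and store secret bits (a '0'/'1' string) in its unused bits."""
    out = []
    for cover in covers:
        enc = base64.b64encode(cover).decode("ascii")
        n = free_bits(enc)
        if n and bits:
            # Take the next n secret bits, zero-filled if the secret runs out.
            chunk, bits = bits[:n].ljust(n, "0"), bits[n:]
            body = enc.rstrip("=")
            last = ALPHABET.index(body[-1]) | int(chunk, 2)
            enc = body[:-1] + ALPHABET[last] + "=" * (len(enc) - len(body))
        out.append(enc)
    if bits:
        raise ValueError("not enough cover strings for the secret")
    return out

def extract(lines: list[str]) -> str:
    """Collect the unused bits of every padded line, in order."""
    bits = ""
    for line in lines:
        n = free_bits(line)
        if n:
            value = ALPHABET.index(line.rstrip("=")[-1]) & ((1 << n) - 1)
            bits += format(value, f"0{n}b")
    return bits

covers = [b"a", b"bc", b"def", b"g"]
stego = hide(covers, "01001000")  # the bits of the letter 'H'
print(stego)                      # ['YU==', 'YmO=', 'ZGVm', 'Zw==']
print(extract(stego))             # 0100100000
# The visible payloads are untouched for a lenient decoder.
print([base64.b64decode(s) for s in stego] == covers)  # True
```

Note that the extracted stream is longer than the secret: the pad bits of the remaining cover strings stay zero and show up as trailing zeros, so the receiver needs some way of knowing where the message ends.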

Hint for CTF players: If you receive a lot of small files encoded with Base64, or one big file with many separate Base64 strings, then you are very likely dealing with the steganographic trick described above.

Of course, such things are best put into practice right away, so... good luck!

TG9yZW1=
aXBzdW0=
ZG9sb3K=
c2l0
YW1ldCw=
Y29uc2VjdGV0dXJ=
YWRpcGlzY2luZ5==
ZWxpdC5=
UGVsbGVudGVzcXVl
c29kYWxlc3==
YY==
bmlzbE==
ZWdldB==
YWNjdW1zYW4u
TW9yYml=
Z3JhdmlkYSz=
ZWxpdL==
YWN=
Z3JhdmlkYQ==
Y29udmFsbGlzLF==
bWFnbmE=
YW50ZS==
dWx0cmljZXN=
YXJjdSy=
dGVtcG9y
dHJpc3RpcXVl
ZW5pbZ==
cHVydXN=
ZWdldN==
dXJuYS4=
U2Vk
dHJpc3RpcXVl
dGluY2lkdW50
YXVndWUs
dmVs
aW1wZXJkaWV0
ZXJhdC5=
UHJhZXNlbnQ=
Y29uc2VjdGV0dXL=
dWx0cmljaWVz
ZXN0
YXT=
bW9sbGlzLm==
QWVuZWFu
ZXR=
cGxhY2VyYXS=
bnVsbGEu
U3VzcGVuZGlzc2V=
dGluY2lkdW50
dGVtcG9y
cXVhbSw=
c2Vk
dGVtcHVz
cXVhbS==
c2NlbGVyaXNxdWV=
YS6=
RG9uZWM=
dmVzdGlidWx1bSx=
ZWxpdH==
YXQ=
cnV0cnVt
c29kYWxlcyz=
dHVycGlz
bGVjdHVz
aGVuZHJlcml0
bmVxdWUs
cXVpc2==
ZWdlc3Rhc1==
bG9yZW0=
cXVhbd==
b3JuYXJl
dmVsaXQu
TW9yYmm=
bG9ib3J0aXM=
YWNjdW1zYW7=
cGVsbGVudGVzcXVlLk==
U2Vk
ZWdldH==
c2FwaWVu
dXT=
bGlndWxh
c2VtcGVy
cG9ydHRpdG9yLk==

There are a few more tidbits connected with Base64, but since this post is already quite long, we will leave them for another day (we have RSS/Atom and a Newsletter so that you don't miss a thing).
