Blogs & Stories

SpiderLabs Blog

Attracting more than a half-million annual readers, this is the security community's go-to destination for technical breakdowns of the latest threats, critical vulnerability disclosures and cutting-edge research.

Base64 versus Plaintext Observations

Recently we have been working on the libmodsecurity project. As part of the project we no longer use the Apache Portable Runtime (APR) as part of the core ModSecurity. While this change has allowed us to increase performance, portability, and reliability it means that some aspects of the code that we previously relied on APR for had to be recreated. One such aspect was base64 encoding/decoding support. While developing our base64 implementation I came to ponder an interesting security question. Given an unknown string, such as a password, is it quicker to bruteforce the base64-encoded version of the password or the plaintext version?

This is not an all-together an unreasonable question as one of the most common forms of authentication on the web is still Basic Authentication. There are many reasons for this; chief among them is the simplicity of configuring it. However, generally, Basic Authentication uses Base64 to encode the sensitive information (username, password) as it goes across the wire. Of course base64 is a two way function and as a result our VAT product Trustwave Vulnerability Scanner will detect this as a vulnerability if there is no additional form of communication security, such as SSL.

There is of course some logic behind this question as well. A quick glance at the base64-encoding standard will tell you that it has a 64 character alphabet (hence the name). Meanwhile a quick glance at ASCII will tell you there are 95 printable characters (https://en.wikipedia.org/wiki/ASCII#ASCII_printable_characters). This is really a minimum subset because the astute reader will mention that technically base64 can encode non-printable characters and Unicode; but for now let us only consider printable ASCII.

From these naïve assumptions one might assume that it is significantly easier to brute force characters within the base64-encoded charset. The first assumption, mostly for simplicity, is that any given password is made up of the 95 printable ASCII characters. This assumption means that in order to fully search through a one character password in plaintext, it would take 95^1 (95) iterations. Meanwhile, knowing that Base64 encoding has an alphabet of 64 characters, one could assume that it would take 65^1 iterations to completely try all 1 character password combinations.

Given the above information you may be inclined to try this at home, but these assumptions fail to take into account the nature Base64. Consider Base64 as if it was a hashing algorithm (it's not), the above assumption would be indicative of many, many collisions within the space. But we know that Base64 can be both encoded and decoded quite easily, so how does Base64 encode variable sized input within a smaller character set? The answer is that it uses variable sized - that is to say longer – output. A little bit of background is needed in order to fully understand the previous statement.

Base64 encoding is actually quite simple. Take a simple word like 'Mana'. We first take the ASCII value of each letter 77, 97, 110, and 97 respectively. We then convert these each into their binary representations: 01001101, 01100001, 01101111, and 01100001. When finished we combine these into a single 'bitstring': 01001101011000010110111101100001. Next we take this bit string and break it into 6 bit chunks: 010011 010110 000101 101111 011000 01. We'll notice that 32 is not divisible by 6 and so we pad our last number with zeros until we get a multiple of 6, it then becomes 01000. Our final set appears as follows: 010011 010110 000101 101111 011000 01000. We now do the relatively simple task of converting these back to decimal and looking them up in the table below (19->T, 22->W, 5->F, 46->u, 24->y, and 16->q). We end up with the letters TWFuYQ.


It is important to note that technically according to the standard if our output is not a multiple of 4 then we will pad with equal signs until we achieve this feature (TWFuYQ==). However, since the decoding process is simply the encoding process in reverse, this isn't strictly needed. Additionally, given an encoding we can very easily determine how much padding is need.

Using the example above we can extrapolate that given an input of length n as input our base64 output length (without padding) would be (ceil(n/3.0)) + n. This now allows us to calculate our given search space more accurately as 64^(ceil(n/3.0)) + n). This pattern means that for every three plaintext additional characters we add one additional base64 bit character. Of course, when we when used as the exponent this gives us a really undesirable feature that after every three characters of input our search space will quadruple (since increasing an exponent by one doubles it).

So we got through a little bit of math. But what does this mean for our original question? It turns out that while when using plaintext we have a larger base (96), the exponent involved with searching a Base64 encoded space grows at a much faster rate. In layman's terms we would never want to search using Base64 versus ASCII.

Below I've visualized what the difference between these search spaces look like for a text of a particular length. When we get to longer passwords these sizes really, for instance searching 32 character text using Base64 encoding you'd have to search ~463167999999998062890000000000000000000000000000000000000000000000000000000000 more combinations. As a result of this I've expressed the graph below on a logarithmic scale. Even here you can see that the differences are marked:


In summary, it makes sense on the surface to assume that Base64 is easier to search than ASCII due to the smaller keyspace. However, as we have shown, keyspace doesn't tell the whole story. Due to the nature of how Base64 encodes strings it is decidedly harder to bruteforce an unknown string encoded in Base64 than it is in ASCII.