How much data? Apache, Ubuntu and the Lies of the Logs.

Forensic investigators rely heavily on log file data in order to analyse attacks and draw conclusions regarding attacker actions. However, this blog post shows that we don't just have to worry about attackers trying to cover their tracks by destroying logs; Linux distributions themselves mess about with log file formats in a misleading manner and further thwart investigations.

As part of the research into a presentation I gave with Tom Mackenzie last year we took a close look at Apache logs. One key aspect of our talk was designing web based malware which could operate while making the log file information from the website server useless by crafting data flow which was indistinguishable from legitimate access.

Our work looked at the most frequently used Apache log formats; common and combined. These default formats are used by the vast majority of servers. For the purposes of the blog post we will concentrate on the following field which is present in both the common and combined log formats.

[This field] indicates the size of the object returned to the client, not including the response headers.

Source: http://httpd.apache.org/docs/2.2/logs.html

This field measures the data in bytes, therefore, if we return a document of 150 bytes the field in the log should read '150'. This field is useful for a forensic investigator as we can show how much data is transferred in a request. For example: If we find a unexplained dump of a mysql database, and we can see that within minutes of that dump file being created a data transfer of the same size is made via HTTP, we can reasonably conclude that the database backup was transferred out of the network.

The first step of the research was to be able to confirm and reproduce this behaviour so that we could then start to abuse it. We used an install of Ubuntu Linux with the default Apache and php install. However, when we looked to reproduce the documented behaviour we were presented with some unexpected results.

After placing a "Hello World!" (that's 13 bytes including a new line character) text file on the server and accessing it with the rekonq browser, we see the following line in the log file. I have highlighted the relevant field.

127.0.0.1 - - [24/Jun/2012:19:13:38 +0000] "GET /hello.txt HTTP/1.1" 200 369 "-" "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/534.34 (KHTML, like Gecko) rekonq Safari/534.34"

We can see from this line that we have reportedly transferred 369 bytes. How can this be? We only requested 13 bytes of data. It was time to understand this behaviour further.

The first step was to take a look at the data which the client was receiving. A request to the same resource was made with the wget program:

$ wget localhost/hello.txt
--2012-06-24 19:17:51-- http://localhost/hello.txt
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13 [text/plain]
Saving to: `hello.txt'

100%[=====================================================================>] 13 --.-K/s in 0s

2012-06-24 19:17:51 (648 KB/s) - `hello.txt' saved [13/13]

We can see from the wget output that thirteen bytes were transferred. This is as we would expect so what does the Apache log state in this situation?

127.0.0.1 - - [24/Jun/2012:19:17:51 +0000] "GET /hello.txt HTTP/1.1" 200 325 "-" "Wget/1.13.4 (linux-gnu)"

In this case the log line reports that 325 bytes have been transferred. This is a really unexpected result, as it is neither the 13 bytes we were expecting to see nor the 369 bytes that resulted from the Reconq request. Why does the log not read 13 bytes and why does this differ depending on whether we access the data with wget or rekonq?

Time to look inside the configuration files of the server to understand the logging more clearly. The relevant line of the log file is the following:

CustomLog ${APACHE_LOG_DIR}/access.log combined

This configuration files shows that the combined log file format is in use and therefore that the log should show 13 when a 13 byte file is transferred.

On order to try to understand this behaviour we then launched a barrage of other tests. Time passed….

Balboa_10_bg_082904

After running a number of tests from which no discernible pattern emerged, I returned to the configuration files for a closer look. This is when things started to become clear. The following line from the Ubuntu Apache configuration files defines what the combined log format is.

LogFormat "%h %l %u %t \"%r\" %>s %O \"%{Referer}i\" \"%{User-Agent}i\"" combined

However, if we compare this to the extract from the Apache documentation:

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-agent}i\"" combined

We can see a clear difference; in the Ubuntu apache.conf file the seventh field '%b' has been changed to '%O'! Let us take a further look at the Apache documentation and understand what the '%O' writes to the logs:

Bytes sent, including headers, cannot be zero.

Source: http://httpd.apache.org/docs/2.2/mod/mod_logio.html

This explains the difference we are seeing. If we download the same file again with wget but save the headers and data into a file the number in the logs then matches the data transferred.

Saving the headers and resource data to a file:

$wget –save-headers localhost/hello.txt
--2012-06-24 19:47:13-- http://localhost/hello.txt
Resolving localhost (localhost)... 127.0.0.1
Connecting to localhost (localhost)|127.0.0.1|:80... connected.
HTTP request sent, awaiting response...
HTTP/1.1 200 OK
Date: Sun, 24 Jun 2012 19:47:13 GMT
Server: Apache/2.2.22 (Ubuntu)
Last-Modified: Sun, 24 Jun 2012 19:11:31 GMT
ETag: "a378f-d-4c33ca211d57d"
Accept-Ranges: bytes
Content-Length: 13
Vary: Accept-Encoding
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/plain
Length: 13 [text/plain]
Saving to: `hello.txt'

100%[===================================================================================================================>] 13 --.-K/s in 0s

Checking the size of the file:

$ du --bytes hello.txt
325 hello.txt

Checking the contents of the file:

$ cat hello.txt.6
HTTP/1.1 200 OK
Date: Sun, 24 Jun 2012 19:47:13 GMT
Server: Apache/2.2.22 (Ubuntu)
Last-Modified: Sun, 24 Jun 2012 19:11:31 GMT
ETag: "a378f-d-4c33ca211d57d"
Accept-Ranges: bytes
Content-Length: 13
Vary: Accept-Encoding
Keep-Alive: timeout=5, max=100
Connection: Keep-Alive
Content-Type: text/plain

Hello World!

Checking that this matches the log:

127.0.0.1 - - [24/Jun/2012:19:47:13 +0000] "GET /hello.txt HTTP/1.1" 200 325 "-" "Wget/1.13.4 (linux-gnu)"

Success!

So we have shown how the changes implemented in the Ubuntu Apache package affect the log file evidence available but what are the possible advantages of this modification?

'%O' vs '%b'

It is likely that the change from '%b' to '%O' was made in order for bandwidth utilisation to be measured more accurately. My preference from a forensic standpoint is to have '%b' as it is more useful given that it records exact and predictable data. However, the apache developers have considered this and have included the following code in the default configuration file available with the apache source.

<IfModule logio_module>

# You need to enable mod_logio.c to use %I and %O

LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" %I %O" combinedio

</IfModule>

This is an elegant solution to the problem; a log file format with a unique and clear name which records:

'$b' The size of the object returned to the client, not including the response headers

'$O' Actual number of bytes sent on the network.

'$I' Actual number of bytes received on the network.

This is a perfect way to meet the requirements of both accurate reporting of transferred resource size and more accurate bandwidth usage. So why has a distribution decided that their new solution is better and what can we do moving forward?

For forensic investigators:

Be extremely cautious as to the format of log files. Dig right through the configuration files to understand the exact makeup of the log file format for each specific case. By just looking at the log from Ubuntu Apache it is almost impossible to tell that it is different from the vanilla Apache logs; in both logs the seventh field is a numeric value which fluctuates in relation to the amount of data downloaded. Do not trust distributers to follow common sense.

For Distributers:

In short: 'If it ain't broke, don't fix it'. Please treat your users nicely and don't mess around with configurations without making it really clear to users what you have done (and preferably why you have done it). For instance, the solution the apache development team have implemented is practical and elegant. However, even something as simple as a single line comment above the definition in the configuration file would have at least halved the time I personally spent understanding the issue.

Note: All the issues found here are reproduceable in Debian. A bug report has been raised.

Trustwave reserves the right to review all comments in the discussion below. Please note that for security and other reasons, we may not approve comments containing links.