Analyzing PDF Malware - Part 2

Where were we?

As the title states, this is the second part of Analyzing PDF Malware. If you haven't read the first part, you can find it here. Go ahead and read it now if you haven't already; we'll wait.

In Part 1 we identified JavaScript embedded within our suspect file 'sample1.pdf'. We analyzed that code and discovered that the output was more JavaScript, a second stage. Once we cleaned up the code of the second stage, it was easy to recognize the specific attacks being used by identifying calls to well-known exploited functions: Collab.collectEmailInfo (CVE-2007-5659), util.printf (CVE-2008-2992), Collab.getIcon (CVE-2009-0927), and media.newPlayer (CVE-2009-4324).

Our suspicions have been confirmed. We know that our sample contains malicious code. So what is the next step? In some cases, just knowing that the file is "bad" can be the endgame of the analysis. However, in other situations we may want to dig even deeper. What if one or more of the exploits we identified was successful in executing on the target system? How was the system affected? Does it cripple the device, install additional software, or steal passwords? The more succinct question is: "What is the payload of the attack?"

To find out the answer to that question we will use a blend of both static and dynamic methodologies to dig into the sample and analyze exactly what is happening under the hood of this PDF malware, step-by-step.

The Game Plan:

  1. Study JavaScript to identify payload
  2. Identify and isolate shellcode within payload
  3. Analyze shellcode to determine its capabilities

Combing the Desert

So let's jump right in. Where should we start looking for our payload? Hmmm, I don't know about you, but this line sure seems to stand out to me and may be worth some further investigation:


Yes, the variable "payload" is actually in the second stage JavaScript even prior to our cleanup of the code. Why would someone who is trying to hide the true nature of his or her malware be so obvious with a variable name? More than likely the author was confident enough in the preceding layers of obfuscation that by this point in the game they weren't concerned with hiding the meaning of every variable name. This definitely makes things easier for our purposes. As the function(void) says, "No arguments here." :) Sorry, bad joke, I couldn't resist.

We have already 'beautified' our second stage JavaScript by adding some meaningful variable names and proper spacing to make it more readable for analysis (see full listing below), but I wanted to backtrack a bit and point out that the original code was fairly informative prior to the cleanup. That being said, we shouldn't just accept a variable name to be accurately self-describing; this is malware after all. Instead, we will confirm that 'payload' is indeed a payload and not some sort of red herring. If we continue by looking at our seemingly obvious lines of JavaScript, we see that the 'payload' variable is assigned the unescaped value of 'bjsg'. We find a similar line of code in all four of our identified exploits:

jsdecode_stage2.js:16: var payload = unescape(bjsg);
jsdecode_stage2.js:37: var payload = unescape(bjsg);
jsdecode_stage2.js:61: var payload = unescape(bjsg);
jsdecode_stage2.js:104: var e = unescape(bjsg);

This makes sense, as each exploit would need a payload, and it appears that all of them are using the same one. The 'bjsg' variable being assigned to 'payload' is a long string (1062 bytes total) made up of groups of four hexadecimal characters, each prefixed with '%u'.


This is a common way to encode binary data into a string (UCS-2/UTF-16), and it immediately seems like a good shellcode candidate. Before we get too excited, let's look a little closer and make sure our assumption makes sense. The first unit is 0x9090, which is two copies of 0x90, the Intel x86 opcode for NOP (No Operation). It is common for shellcode to be preceded by NOPs. This gives the attack code a little padding so that the exploit only needs to land close to the memory location of the shellcode to be successful. This is commonly known as a NOP sled. So far, so good. The next non-NOP value we see is 0x16EB, which, stored in little-endian order, becomes the bytes EB 16: a relative short jump (JMP) forward by 0x16 bytes. Again, this makes sense in the context of shellcode. Since all of the puzzle pieces seem to be lining up just right, we will go ahead and rename 'bjsg' to 'shellcode':
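To make the byte-order reasoning concrete, here is a minimal Python sketch. It is not part of the original analysis: the helper name `unit_to_bytes` is hypothetical, and only the two units quoted in the text are used, so the real NOP sled may well be longer.

```python
import struct

def unit_to_bytes(unit_hex: str) -> bytes:
    """Convert one %uXXXX unit (passed as 'XXXX') to its two little-endian bytes."""
    return struct.pack("<H", int(unit_hex, 16))

# The two units discussed in the text: a pair of NOPs, then the jump.
head = unit_to_bytes("9090") + unit_to_bytes("16EB")

# Count the leading 0x90 bytes (the NOP sled).
sled_len = len(head) - len(head.lstrip(b"\x90"))

# 0xEB is the x86 opcode for JMP rel8. The byte after it is a signed
# displacement measured from the end of the two-byte instruction.
opcode = head[sled_len]
disp = struct.unpack("b", head[sled_len + 1:sled_len + 2])[0]
jump_target = sled_len + 2 + disp  # offset relative to the start of 'head'

print(head.hex(), sled_len, hex(opcode), jump_target)
```

The point of the sketch is simply that '%u16EB' ends up in memory as EB 16, not 16 EB, which is why it disassembles as a short jump.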

Second Stage JavaScript - Full Listing

2-byte Monte

Now that we have our shellcode payload, let's begin to dissect it to find out how it does what it does. The first thing we need to do is convert it from its current printable form into something more suitable for analysis.

We left off Part 1 of this blog post with our second stage JavaScript loaded up in the analysis tool Malzilla. Since we're already set up, let's continue with that tool for now. Copy the value of our shellcode into the 'MiscDecoder' tab.


Selecting the "UCS2 to Hex" button will give us our little-endian hexadecimal byte representation of the shellcode.


Selecting "Hex to File" will save off our file to disk as a much more useable '.bin' file.
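If you don't have Malzilla handy, the same two steps can be approximated in a few lines of Python. This is a rough sketch rather than what Malzilla does internally; the function name `unescape_ucs2` is made up for illustration, and the sample string contains only the two units quoted earlier, not the full payload.

```python
import re

def unescape_ucs2(escaped: str) -> bytes:
    """Turn a '%uXXXX%uYYYY...' string into raw bytes, little-endian per unit."""
    out = bytearray()
    for unit in re.findall(r"%u([0-9A-Fa-f]{4})", escaped):
        out += int(unit, 16).to_bytes(2, "little")
    return bytes(out)

sample = "%u9090%u16EB"  # placeholder input, not the real 'shellcode' value
raw = unescape_ucs2(sample)
print(raw.hex())

# To save the result for other tools, write it out in binary mode:
# with open("hexfile.bin", "wb") as fh:
#     fh.write(raw)
```

Note that the sketch silently ignores anything that is not a '%uXXXX' group, so it is worth comparing the output length against the input before trusting it.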

Plumbing Rabbit Holes

Now we come to a fork in the analysis and need to choose what we want to do next from a list of options. We can take our newly created '.bin' and disassemble it to study it statically, we can execute our code dynamically in an emulator, or wrap it in an executable shell to debug the code step-by-step in a debugger. My vote is to start with the easiest of the bunch, emulated execution, and see if we can get a quick win. I'm guessing you agree.

Libemu is a C-based x86 emulation library used for the detection of shellcode, among other things. It's a great tool, but it's not foolproof. Included with the libemu test suite is a tool called sctest. Its authors describe the sctest tool as the following:

"…a useful source for those who want to setup shellcode emulation allowing win32 api calls and offering hooks on these calls."

With a description like that, this tool is definitely made-to-order for our task. If you don't want to bother with installing it yourself, sctest comes pre-compiled along with a collection of other great tools on Lenny Zeltser's REMnux Linux distro. A brand new version, REMnux 3, was just made available for download in mid-December 2011.

Running sctest to emulate our hexfile.bin we can attempt to hook the Win32 API calls being made by the malware in an effort to see if they reveal any juicy bits of info:


From a snippet of the very verbose sctest output shown above, we discover that LoadLibrary is being used to first load the urlmon library. We also discover that the URLDownloadToCacheFile function of urlmon.dll is specifically being called. According to Microsoft's MSDN documentation, this function "downloads data to the Internet cache and returns the file name of the cache location for retrieving the bits". This is a huge clue as to what the malware is attempting to do on the victim system. It doesn't answer all of our questions but it gives us some good indications for now. Specifically, it appears that we are dealing with the "download and execute" class of shellcode.

So a file is being downloaded, but where is the file coming from? Where does the malware call home? Maybe we will get lucky and find an IP address or URL that the shellcode is attempting to contact by running strings against the file:


No dice. I guess we're not going to get that lucky with this sample. No meaningful clear text strings in the file at all. It was worth a shot. Before moving on, there is one other bit of sctest functionality worth mentioning. Issuing commands like the following will create a nice instruction call graph of our shellcode:

$ sctest -Ss 1000000000 -G shellcode_graph.dot < hexfile.bin
$ dot -T png -o shellcode_graph.png shellcode_graph.dot

Shellcode Instruction Call Graph


Who doesn't like a nice graph? This is more than just nerd art though; we can use it to quickly see conditional branching and looping structures before looking at a single assembly code operation. For example, if we zoom in and look at the first loop at memory location 0x00417015 we see something interesting:

Call Graph Loop

Ok, maybe it's not that interesting yet, but the graph gives us a place to start investigating. To get to the interesting stuff we need to see what is actually going on in the code at those locations, and to do that we need to get a disassembly of our shellcode. There are many ways to do this, but I recently came across a standalone tool for Windows written by Alain Rioux called ConvertShellcode.exe. You can paste in escaped shellcode on the command line as an argument and get your dead listing of the assembly code.


Now, if we investigate those four instructions from our call graph that we were interested in, we can finally see what is going on in that loop.


The loop reads a byte of shellcode, applies an XOR mask, writes the result back, and repeats. This is what's known as a staged XOR loader. The first stage of the shellcode is just enough to run this loop, leaving the remaining shellcode obfuscated. The second stage only becomes readable at run time, after the code modifies itself by applying the XOR key (0x19). This is one of the simpler techniques malware authors use to complicate static analysis.
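The effect of that loop is easy to reproduce outside of assembly. The Python sketch below is only an illustration: it runs on made-up data rather than the real sample, and it assumes the mask is applied to every byte it is given, whereas the real loader only touches the encoded second-stage region.

```python
def xor_decode(data: bytes, key: int) -> bytes:
    """Apply a single-byte XOR mask to every byte, as the loader loop does."""
    return bytes(b ^ key for b in data)

KEY = 0x19  # the key reported in the text

# Demonstration on made-up data: encode a string, then decode it again.
encoded = xor_decode(b"http", KEY)
decoded = xor_decode(encoded, KEY)
print(encoded.hex(), decoded)
```

Because XOR is its own inverse, the same function both obfuscates and recovers the data, which is exactly why this trick is so cheap for a malware author to implement.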

Armed with this newfound information, let's confirm what we know by running Didier Stevens' XORSearch tool against the shellcode and see if that uncovers anything. XORSearch will brute force XOR, ROL, and ROT encodings looking for the string or a file of strings you provide. Case does matter with this type of searching, so keep that in mind or you may get confusing results.


We finally get a hit on the search for "http". This gives us confirmation that the XOR key being used to encode the shellcode is 0x19 and it also gives us an interesting URL: hxxp:// This is likely the answer to the question about where the malware is attempting to download a file.

Let's Sum Up

So far we've analyzed the final stage of JavaScript that we extracted from sample1.pdf and successfully located the payload. We discovered through investigation that all four exploits were using the same shellcode. We emulated the shellcode using libemu to discover the Win32 API calls being made by the malware. This allowed us to classify the shellcode as "download and execute". We then created an instruction call graph to quickly identify interesting structures and disassembled the code to analyze a particular loop of assembly code. The loop turned out to be a staged XOR loader. To confirm our suspected key, we brute forced the XOR obfuscation and got our confirmation, along with the added benefit of revealing a very interesting URL string.

In the next part of this series we will decode the second stage shellcode and load it into our favorite disassembler. We will trace through the functions and fully map the code. We will also investigate the URL we discovered and see what sort of things that might get us tangled up with.

Stay tuned…

Tools used:

  • ConvertShellcode.exe - Converts shellcode strings into x86 assembly instructions
  • Malzilla - Malware hunting tool
  • XORSearch - Tool to search in an XOR, ROL or ROT encoded binary file
  • Libemu - x86 Shellcode Emulation
  • Graphviz - Open Source graph visualization software
  • REMnux - A Linux Distribution for Reverse-Engineering Malware
