When trying to identify crimeware/malware, it's a good idea to design a multi-part system that deploys a variety of detection techniques to increase your chances of detection. You can start with one technique and then layer on additional techniques as time and resources allow.
In this short blog post, I'm going to share just one of those techniques (using edit distance) that you can plug into your multi-part system to perform rudimentary detection of popular crimeware admin panel strains like Pony, Citadel, and Zeus.
Edit Distance Basics
Edit distance (also known as Levenshtein distance) is a measure of how different two strings are from one another. The basic idea is that we take string A ("bananas") and string B ("apples") and determine how many individual changes would be required to make the first string equal the second string. Each change can be an insertion, a deletion, or a substitution.
For example, we can compute the edit distance between A and B manually like so:
- Delete the 'b' (ananas)
- Sub first 'n' for 'p' (apanas)
- Sub second 'a' for 'p' (appnas)
- Sub second 'n' for 'l' (applas)
- Sub last 'a' for 'e' (apples)
So, assuming we took the most efficient path from bananas to apples, we have an edit distance of 5 between the two strings.
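The walkthrough above can be verified programmatically. Here's a minimal sketch of the classic dynamic-programming approach to Levenshtein distance (a hypothetical helper written for this post, not code from any particular library):

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions, and substitutions
    needed to turn string `a` into string `b`."""
    if len(a) < len(b):
        a, b = b, a  # keep `b` as the shorter string
    # prev[j] holds the distance from the current prefix of `a`
    # to the first j characters of `b`.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # deletion
                cur[j - 1] + 1,            # insertion
                prev[j - 1] + (ca != cb),  # substitution (free on a match)
            ))
        prev = cur
    return prev[-1]

print(levenshtein("bananas", "apples"))  # 5, matching the manual walkthrough
```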
It's a very simple concept, but how can something this simple help us identify crimeware?
Let's start by getting our hands on some crimeware.
Obtaining Crimeware Samples
There is a metric ton of web-based crimeware available in the wild, much of which we at Trustwave already classify using more sophisticated means. To demonstrate this technique, I've taken 2 separate instances of 3 different "strains" of web-based crimeware (Pony, Citadel, and Zeus) from our malware repositories.
These are the files I'm starting with:
Now that we have some samples, let's identify them with edit-distance.
Identifying Crimeware Strains
We start this process by identifying a baseline sample for each strain. Let's use sample #1 for each strain. We'll take the baseline samples and place them in a templates folder, then move the remaining items into a samples folder. We can also add 100 normal HTTP responses and play a little game called "find the crimeware."
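As a rough sketch, the setup described above might look something like this on the command line (the file names here are placeholders I've invented for illustration, not the actual sample names):

```shell
# Hypothetical layout: baseline copies go into templates/, everything else
# (including ~100 benign HTTP responses) into samples/.
mkdir -p templates samples

# Seed the folders with placeholder files standing in for the real captures.
for strain in pony citadel zeus; do
    printf 'baseline %s panel\n' "$strain" > "templates/${strain}_1.html"
    printf 'variant %s panel\n'  "$strain" > "samples/${strain}_2.html"
done
for i in $(seq 1 100); do
    printf 'normal response %s\n' "$i" > "samples/normal_${i}.html"
done
```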
Now, on disk, our footprint looks like this:
I've written this small proof-of-concept code to demonstrate the process, with a couple of performance and tuning tweaks added, including normalized edit distance and a sample-qualifying pre-processor:
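To give a feel for those two tweaks, here's a minimal sketch of how they might be implemented (the function names, folder layout, and 0.5 threshold are my own assumptions, not the original PoC). Normalizing by the longer string's length makes scores comparable across files of different sizes, and the length-ratio pre-filter skips full distance computations for pairs that could never score under the threshold, since the length difference alone is a lower bound on edit distance:

```python
def levenshtein(a: str, b: str) -> int:
    """Standard two-row dynamic-programming edit distance."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def normalized_distance(a: str, b: str) -> float:
    """Edit distance scaled to [0, 1] by the longer string's length."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0

def qualifies(a: str, b: str, threshold: float) -> bool:
    """Cheap pre-filter: if the lengths alone differ by more than the
    threshold allows, the pair can't match, so skip the expensive DP."""
    longest = max(len(a), len(b))
    return longest > 0 and abs(len(a) - len(b)) / longest < threshold

def classify(sample: str, templates: dict, threshold: float = 0.5):
    """Return (template_name, score) for the closest qualifying template,
    or (None, best_score) if nothing scores under the threshold."""
    best_name, best_score = None, 1.0
    for name, template in templates.items():
        if not qualifies(sample, template, threshold):
            continue
        score = normalized_distance(sample, template)
        if score < best_score:
            best_name, best_score = name, score
    return (best_name, best_score) if best_score < threshold else (None, best_score)
```

In practice you'd load each baseline file from the templates folder into the `templates` dict, run `classify()` over every file in the samples folder, and flag anything scoring under the threshold as a likely variant of that strain.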