SpiderLabs Blog

Optimizing Regular Expressions

Written by | Jun 27, 2007 6:22:00 AM

As many of you have noticed, the Core Rule Set contains very complex regular expressions. For example:

(?:\b(?:(?:s(?:elect\b(?:.{1,100}?\b(?:(?:length|count|top)\b.{1,100}
?\bfrom|from\b.{1,100}?\bwhere)|.*?\b(?:d(?:ump\b.*\bfrom|ata_type)|
(?:to_(?:numbe|cha)|inst)r))|p_(?:(?:addextendedpro|sqlexe)c|...

These regular expressions are assembled from a list of simpler regular expressions for efficiency reasons. A single optimized regular expression test takes much less time than a series of simpler regular expression tests. The downside is that the result is hard to read and hard to edit. A future version of ModSecurity will overcome this limitation, but in the meantime, if you want the performance benefit, you have to do the optimization yourself.
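
To make the trade-off concrete, here is a minimal sketch that compares a series of simple tests with one hand-merged expression. The three patterns are hypothetical examples chosen for illustration, not rules taken from the Core Rule Set, and the helper names matches_any and matches_combined are invented here.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical simple patterns (illustrative only, not from the Core Rule Set).
my @simple = (qr/\bselect\b/, qr/\bsleep\b/, qr/\bsp_execute\b/);

# Hand-merged equivalent: the shared leading "s" is factored out,
# so each input is tested once instead of once per pattern.
my $combined = qr/\bs(?:elect|leep|p_execute)\b/;

# Returns 1 if any of the simple patterns matches, 0 otherwise.
sub matches_any {
    my ($text) = @_;
    for my $re (@simple) {
        return 1 if $text =~ $re;
    }
    return 0;
}

# Returns 1 if the single combined pattern matches, 0 otherwise.
sub matches_combined {
    my ($text) = @_;
    return $text =~ $combined ? 1 : 0;
}

# Both approaches should agree on every input.
for my $input ('select name from users', 'exec sp_execute', 'harmless text') {
    printf "%-25s series=%d combined=%d\n",
        $input, matches_any($input), matches_combined($input);
}
```

Both approaches give the same answers; the combined form simply does the work in one pass. Even with three patterns the merged expression is already harder to read, which is the readability cost described above.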

Manual assembly and optimization are both hard and error-prone, so for the Core Rule Set we use a clever Perl module: Regexp::Assemble. As the name suggests, Regexp::Assemble knows how to assemble a number of regular expressions into one optimized regular expression.
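
As a rough illustration of what the module does, the sketch below feeds it three patterns and uses the assembled result. It assumes Regexp::Assemble is already installed, and the patterns are hypothetical examples rather than actual Core Rule Set expressions.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Regexp::Assemble;

# Hypothetical input patterns (illustrative only).
my $ra = Regexp::Assemble->new;
$ra->add('\bselect\b', '\bsleep\b', '\bsp_execute\b');

# re() returns the assembled pattern as a compiled regular expression.
my $re = $ra->re;
print "assembled: $re\n";

# The assembled expression should match whatever the originals matched.
for my $input ('select name from users', 'exec sp_execute', 'harmless text') {
    printf "%-25s %s\n", $input, ($input =~ $re ? 'match' : 'no match');
}
```

The exact text of the assembled expression depends on the module version, so it is best checked by running the sketch rather than assumed.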

Since Regexp::Assemble is not a program, but rather a Perl module, you will need some glue code to use it. The following instructions will help if you are not a Perl wizard.

If you don't have Perl, you will need to install it. The easiest Perl distribution to install, especially if you use Windows, is ActivePerl.

Now install Regexp::Assemble. If you used ActivePerl, you can use the following command:

ppm install regexp-assemble

If you use another Perl distribution, you will need to download the module and use the normal Perl module installation procedure as outlined in the README file.

Once you have Perl and Regexp::Assemble installed, all you need is this little script:

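#!/usr/local/bin/perl
# Read one regular expression per line and print a single combined expression.
use strict;
use warnings;
use Regexp::Assemble;

my $ra = Regexp::Assemble->new;
while (my $line = <>)
{
    chomp $line;                 # drop the trailing newline
    next if $line =~ /^\s*$/;    # skip blank lines
    $ra->add($line);
}
print $ra->as_string() . "\n";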

The script reads regular expressions, one per line, from standard input or from an input file, and prints the optimized expression:

regexp_builder.pl simple_regexps.txt > optimized_regexp.txt

On a Unix system you might need to change the first line to point to the local Perl interpreter. On Windows you may need to precede the script name with the command 'perl'.

And if all this is too complex, you can just download the pre-compiled version for Windows.