CVSS for Penetration Test Results (Part I)

Trustwave has been adding support for the Common Vulnerability Scoring System (CVSS) in PenTest Manager, our online reporting portal used for all SpiderLabs penetration tests. While this is a great step toward better metrics for our penetration test results, the exercise has revealed limitations in the industry's current vulnerability taxonomies. Applying CVSS scores to penetration test results feels like pounding square pegs into round holes. Is there a better way? Can we stretch CVSS to cover this new domain? Or should we go back to the drawing board?

I see two problems with using CVSS for penetration test results. One is easy to fix. The current documentation on how to score vulnerabilities for CVSS assumes the audience is the collective owners of the vulnerabilities -- i.e., the vendors producing the buggy code. The published guidance is misleading, and sometimes just plain wrong, when applied to the pentesting community.

The second problem will be harder to solve than updating some reference manuals. CVSS needs combinatorics.

Some background

To understand the strengths and weaknesses of CVSS, you need to understand something about its history and the alphabet soup of related industry projects: CVE, NVD, and CWE, to name a few. Common Vulnerabilities and Exposures – aka CVE – is a simple index of unique vulnerabilities. (Technically it's a list of "vulnerabilities" and "exposures", where "exposures" are vulnerabilities that cannot be exploited for unauthorized access but provide information that can aid an attacker. We can safely ignore this distinction.) The CVE project was kicked off back in 1999. Vulnerabilities get submitted to CVE Numbering Authorities, who review the submissions. Submissions that meet the criteria receive an official CVE Identifier number (e.g., CVE-2012-0002) that, together with a description and a list of references, forms the CVE ID. The CVE project does not attempt to provide a taxonomy to categorize or organize the vulnerabilities it lists. In fact it doesn't provide any metrics, such as "risk" or "frequency", at all. It is simply a list intended to provide a consistent point of reference for vulnerability databases, vulnerability scanners, and other security products. As stated in the original 1999 paper, its design goals were simplicity, completeness, public availability, interoperability, and independence.

CVSS is the Common Vulnerability Scoring System. CVSS version 1 was created in 2004 by a subgroup of a working group of the Department of Homeland Security's National Infrastructure Advisory Council. Its goal was to simplify and unify the disparate vulnerability scoring systems then being used and promoted by major public- and private-sector players. CVSS provides a set of metrics for measuring properties of vulnerabilities to help defenders prioritize their work -- and prioritization is sorely needed. Some 14,000 vulnerabilities have been added to the National Vulnerability Database in the last three years. That's roughly 13 new vulnerabilities per day. How is a defender to determine which vulnerabilities need to be addressed, and in which order of priority? That's the problem CVSS is intended to solve.

CVSS scores vulnerabilities according to 14 different metrics arrayed into three separate groups: Base Score Metrics, Temporal Score Metrics, and Environmental Score Metrics. Base metrics are themselves divided into two sets: Exploitability Metrics and Impact Metrics. The first set defines how exploitable the vulnerability is: can it be exploited remotely or only locally, does it require authentication, how complex is the attack? Impact metrics cover whether the vulnerability poses a risk to confidentiality, integrity, or availability.

The eight metrics in the Temporal and Environmental groups are designed to tailor the risk of a vulnerability to the specifics of time and place. Time affects the scores as a vulnerability moves through its lifecycle of rumor, proof-of-concept exploit, working exploit, workaround, and patch release. Place affects the scores because not every environment is equally exposed, given differences in the percentage of vulnerable systems and in the potential for loss.

The theory is that the organizations producing and maintaining CVEs and CVSS scores complete only the Base scores, leaving the Temporal and Environmental scores to the local end user. The Base score is fixed once and for all; the full CVSS score -- Base + Temporal + Environmental -- is unique to each site and changes over time.
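To make the layering concrete, here is a minimal sketch of the CVSS v2 temporal equation, which discounts a published Base score by three multipliers (Exploitability, Remediation Level, Report Confidence). The multiplier values are taken from the CVSS v2 guide; the 7.7 Base score is just an arbitrary example:

```python
# CVSS v2 temporal multipliers (values from the CVSS v2 guide).
E  = {"U": 0.85, "POC": 0.90, "F": 0.95, "H": 1.00, "ND": 1.00}  # Exploitability
RL = {"OF": 0.87, "TF": 0.90, "W": 0.95, "U": 1.00, "ND": 1.00}  # Remediation Level
RC = {"UC": 0.90, "UR": 0.95, "C": 1.00, "ND": 1.00}             # Report Confidence

def temporal_score(base, e, rl, rc):
    """TemporalScore = round_to_1_decimal(BaseScore * E * RL * RC)."""
    return round(base * E[e] * RL[rl] * RC[rc], 1)

# A base-7.7 vulnerability with a proof-of-concept exploit, no fix,
# and a confirmed report keeps most of its score...
print(temporal_score(7.7, "POC", "U", "C"))   # 6.9
# ...while an unproven, officially patched, unconfirmed one drops sharply.
print(temporal_score(7.7, "U", "OF", "UC"))   # 5.1
```

Note how each multiplier can only lower (or preserve) the Base score: the temporal layer models a vulnerability becoming less urgent as fixes ship, never more severe.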

Problem #1 and solution

Given the history of the project, it makes perfect sense that CVSS was designed primarily to meet the following use case: vendor learns of vulnerability in their product, vendor requests and receives CVE ID, vendor calculates vulnerability's CVSS Base score and initial Temporal score, and then users of vulnerable product complete full CVSS score by adding Environmental metrics. This is in fact spelled out explicitly in the CVSS FAQ maintained by FIRST (the Forum of Incident Response and Security Teams):

How is the scoring done?

A: Scoring is the process of combining all the metric values according to specific formulas.

Base Scoring is computed by the vendor or originator with the intention of being published and once set, is not expected to change. [...]

Temporal Scoring is also computed by vendors and coordinators for publication, and modifies the Base score. [...]

Environmental Scoring is optionally computed by end-user organizations and adjusts combined Base-Temporal score. This should be considered the FINAL score and represents a snapshot in time, tailored to a specific environment. User organizations should use this to prioritize responses within their own environments.

But there are more things in heaven and earth than are dreamt of in this philosophy -- namely, penetration testers! It is ironic that the penetration testing community, which is uniquely positioned to witness and understand vulnerabilities first-hand, is completely out of the picture in the industry's most mature vulnerability scoring system.

If it is still not clear that the CVSS authors were blind to the perspective offered by penetration testers, consider these guidelines from the scoring advice offered by the primary CVSS manual:

Scoring Tip #3:
Many applications, such as Web servers, can be run with different privileges, and scoring the impact involves making an assumption as to what privileges are used. Therefore, vulnerabilities should be scored according to the privileges most commonly used.

Scoring Tip #4:
When scoring the impact of a vulnerability that has multiple exploitation methods (attack vectors), the analyst should choose the exploitation method that causes the greatest impact, rather than the method which is most common, or easiest to perform. For example, if functional exploit code exists for one platform but not another, then Exploitability should be set to "Functional". If two separate variants of a product are in parallel development (e.g. PHP 4.x and PHP 5.x), and a fix exists for one variant but not another, then the Remediation Level should be set to "Unavailable".

Someone who is actually exploiting a web server doesn't need to make any assumptions about what privileges are owned by the compromised process -- they will know first-hand. Someone who is demonstrating a vulnerability in PHP needn't "choose the exploitation method that causes the greatest impact", but simply the exploitation method that was actually used.

Happily, these shortcomings in CVSS are easy to fix. We just need the CVSS scoring guidelines rewritten to provide specific advice to penetration testers. For example, since penetration testers -- good ones, at least -- work closely with their clients and come to a solid understanding of their environments and the risks to their data, they should simply fill in the CVSS Environmental metrics themselves.

The root cause of this issue is that CVSS is designed to score vulnerabilities at a level of abstraction above their particular instantiations during exploitation. It's not at the top of the abstraction food chain -- CVE's cousin CWE, Common Weakness Enumeration, is more abstract yet. But I don't see any issues with allowing CVSS scores to be applied to particular exploitation events in addition to the higher-level vulnerabilities to which they are traditionally applied.

Problem #2

Unfortunately, a deeper problem is lurking. In fact it is called out in the CVSS guide's "Scoring Tip #1":

Vulnerability scoring should not take into account any interaction with other vulnerabilities. That is, each vulnerability should be scored independently.

In the real world, however, vulnerabilities don't live in a vacuum. They are part of a complex ecosystem, and their real impact cannot be understood apart from this surrounding context. Take, for example, this exceedingly common and successful internal network attack path.

  1. Malory spoofs responses to NetBIOS Name Service broadcast requests.
  2. Malory captures Alice's domain NTLMv2 challenge-response (computed against Malory's own server challenge).
  3. Malory cracks Alice's hash using brute-force methods.
  4. Malory connects to Alice's desktop over SMB and executes -- a la psexec -- a utility to dump hashes from the registry and from memory, and obtains the NTLM hash to the local Administrator account.
  5. Malory uses Alice's credentials to enumerate group membership data from a domain controller and learns the members of the Domain Admins group.
  6. Malory attempts to authenticate to each system on the local network as the local Administrator using pass-the-hash with Alice's local Administrator NTLM hash. For each system where authentication is successful, Malory attempts to dump hashes from memory. Malory obtains numerous domain LM hashes until she discovers the LM hash of a Domain Admin.

The point here is not to debate how easy or difficult this attack path would be in any given environment. The point is to identify which vulnerabilities it exploits, and what their individual CVSS Base scores are. (The string of abbreviations following each numerical score is the CVSS vector, a compact encoding of the individual metric values.)

  • NetBIOS enabled
    CVSS=5.8 (AV:A/AC:L/Au:N/C:P/I:P/A:P)
  • Desktop firewall allows SMB
    CVSS=5.2 (AV:A/AC:L/Au:S/C:P/I:P/A:P)
  • User has local administrator privileges
    CVSS=6.8 (AV:L/AC:L/Au:S/C:C/I:C/A:C)
  • Shared local administrator passwords
    CVSS=7.7 (AV:A/AC:L/Au:S/C:C/I:C/A:C)
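These numbers aren't pulled from thin air: they follow from the CVSS v2 base equation. Here is a minimal sketch of that equation in Python -- the metric weights are copied from the CVSS v2 guide -- which reproduces the four scores above from their vectors:

```python
# CVSS v2 base-equation weights (values from the CVSS v2 guide).
AV  = {"L": 0.395, "A": 0.646, "N": 1.0}    # Access Vector
AC  = {"H": 0.35, "M": 0.61, "L": 0.71}     # Access Complexity
Au  = {"M": 0.45, "S": 0.56, "N": 0.704}    # Authentication
CIA = {"N": 0.0, "P": 0.275, "C": 0.660}    # Conf./Integ./Avail. impact

def base_score(vector):
    """Compute a CVSS v2 base score from a vector like 'AV:A/AC:L/Au:N/C:P/I:P/A:P'."""
    m = dict(part.split(":") for part in vector.split("/"))
    impact = 10.41 * (1 - (1 - CIA[m["C"]]) * (1 - CIA[m["I"]]) * (1 - CIA[m["A"]]))
    exploitability = 20 * AV[m["AV"]] * AC[m["AC"]] * Au[m["Au"]]
    f = 0.0 if impact == 0 else 1.176
    return round((0.6 * impact + 0.4 * exploitability - 1.5) * f, 1)

for vec in ("AV:A/AC:L/Au:N/C:P/I:P/A:P",   # NetBIOS enabled          -> 5.8
            "AV:A/AC:L/Au:S/C:P/I:P/A:P",   # firewall allows SMB      -> 5.2
            "AV:L/AC:L/Au:S/C:C/I:C/A:C",   # local admin privileges   -> 6.8
            "AV:A/AC:L/Au:S/C:C/I:C/A:C"):  # shared admin passwords   -> 7.7
    print(vec, base_score(vec))
```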

Now, I've made these vulnerabilities up and scored them from scratch, since I'm fairly certain I won't find any of them -- or any other relevant vulnerability -- in the NVD or any other index of CVEs. The fact that many, probably a majority, of the vulnerabilities that penetration testers exploit day-to-day aren't listed in the NVD is an important issue, but it's not what I'm going after at the moment. Right now I want to focus on the fact that a complete Windows Domain compromise can occur by stringing together the exploitation of a chain of vulnerabilities, none of which score higher than CVSS 7.7.

You're welcome to quibble over the details I've laid out in this little scenario, but don't miss the forest for the trees. The point is that CVSS cannot handle sets of vulnerabilities, or chains, or vectors, or whatever you want to call them. In fact it's explicitly designed not to do so. If you aren't convinced, go ahead and try it. Is the attack vector described above "local"? Is its complexity "high", "medium" or "low"? Does it require authentication? None of these properties quite hit the mark, and they do nothing in terms of describing why this attack vector is so successful.

This is a problem if defenders are prioritizing their work with the help of CVSS scores. So what are we to do? In my next post I'll look at some related work and share my thoughts on what a solution might look like.
