The SAMATE Project Department of Homeland Security

Static Analysis Tool Exposition (SATE) IV


Note: The planning meeting for SATE V was held on Monday, March 4, 2013 at NIST, from 1 to 4pm.


Introduction

Static Analysis Tool Exposition (SATE) is designed to advance research (based on large test sets) in, and improvement of, static analysis tools that find security-relevant defects in source code. Briefly, participating tool makers run their tools on a set of programs. Researchers led by NIST analyze the tool reports. The results and experiences are reported at a workshop. The tool reports and analysis are made publicly available later.

SATE's purpose is NOT to evaluate or choose the "best" tools. Rather, it aims to explore the following characteristics of tools: the relevance of warnings to security, their correctness, and their prioritization. Its goals are:

  • To enable empirical research based on large test sets,
  • To encourage improvement of tools,
  • To speed adoption of tools by objectively demonstrating their use on real software.

SATE IV is the fourth occurrence of SATE. There is information about and results from SATE 2010, SATE 2009 and SATE 2008 on-line.

Note. A warning is an issue (usually, a weakness) identified by a tool. A (tool) report is the output from a single run of a tool on a test case. A tool report consists of warnings.

Publication

"Report on the Static Analysis Tool Exposition (SATE) IV," Vadim Okun, Aurelien Delaitre, and Paul E. Black, U.S. National Institute of Standards and Technology (NIST) Special Publication (SP) 500-297, January, 2013

Abstract

The NIST Software Assurance Metrics And Tool Evaluation (SAMATE) project conducted the fourth Static Analysis Tool Exposition (SATE IV) to advance research in static analysis tools that find security defects in source code. The main goals of SATE were to enable empirical research based on large test sets, encourage improvements to tools, and promote broader and more rapid adoption of tools by objectively demonstrating their use on production software.

Briefly, eight participating tool makers ran their tools on a set of programs. The programs were four pairs of large code bases selected in regard to entries in the Common Vulnerabilities and Exposures (CVE) dataset and approximately 60 000 synthetic test cases, the Juliet 1.0 test suite. NIST researchers analyzed approximately 700 warnings by hand, matched tool warnings to the relevant CVE entries, and analyzed over 180 000 warnings for Juliet test cases by automated means. The results and experiences were reported at the SATE IV Workshop in McLean, VA, in March, 2012.

Download NIST SP 500-297

Data

Cautions on Interpreting and Using the SATE Data

SATE IV, as well as its predecessors, taught us many valuable lessons. Most importantly, our analysis should NOT be used as a basis for rating or choosing tools; this was never the goal.

There is no single metric or set of metrics that is considered by the research community to indicate or quantify all aspects of tool performance. We caution readers not to apply unjustified metrics based on the SATE data.

Due to the nature and variety of security weaknesses, defining clear and comprehensive analysis criteria is difficult. While the analysis criteria have been much improved since the first SATE, further refinements are necessary.

The test data and analysis procedure employed have limitations and might not indicate how these tools perform in practice. The results may not generalize to other software because the choice of test cases, as well as the size of test cases, can greatly influence tool performance. Also, we analyzed a small subset of tool warnings.

The procedure that was used for finding CVE locations in the CVE-selected test cases and selecting related tool warnings, though improved since SATE 2010, has limitations, so the results may not indicate tools’ ability to find important security weaknesses.

Synthetic test cases are much smaller and less complex than production software. Weaknesses may not occur with the same frequency in production software. Additionally, for every synthetic test case with a weakness, there is one test case without a weakness, whereas in practice, sites with weaknesses appear much less frequently than sites without weaknesses. Due to these limitations, tool results, including false positive rates, on synthetic test cases may differ from results on production software.
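
To illustrate why the balance of test cases matters, the short Python sketch below computes the share of warnings that are true for one hypothetical tool at two prevalence levels. The detection rate, the false alarm rate, and the 1-in-100 prevalence are invented for illustration only; they are not SATE measurements.

```python
def precision(true_positive_rate, false_positive_rate, prevalence):
    """Share of warnings that are true, given the share of sites that are weak."""
    true_warnings = true_positive_rate * prevalence
    false_warnings = false_positive_rate * (1.0 - prevalence)
    return true_warnings / (true_warnings + false_warnings)


# Hypothetical tool behavior, chosen only for illustration.
TPR, FPR = 0.60, 0.10

balanced = precision(TPR, FPR, prevalence=0.50)  # one weak case per clean case
sparse = precision(TPR, FPR, prevalence=0.01)    # assumed: 1 weak site in 100

print(f"balanced test suite: {balanced:.2f}")
print(f"sparse weaknesses:   {sparse:.2f}")
```

With these assumed rates, the share of true warnings falls from about 0.86 on the balanced suite to about 0.06 when weaknesses are sparse, even though the tool itself has not changed.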

The tools were used in this exposition differently from their use in practice. We analyzed tool warnings for correctness and looked for related warnings from other tools, whereas developers use tools to determine what changes need to be made to software, and auditors look for evidence of assurance. Also in practice, users write special rules, suppress false positives, and write code in certain ways to minimize tool warnings.

We did not consider the tools’ user interfaces, integration with the development environment, and many other aspects of the tools, which are important for a user to efficiently and correctly understand a weakness report.

Teams ran their tools against the test sets in August through October 2011. The tools continue to progress rapidly, so some observations from the SATE data may already be out of date.

Because of the stated limitations, SATE should not be interpreted as a tool testing exercise. The results should not be used to make conclusions regarding which tools are best for a particular application or the general benefit of using static analysis tools.

Download SATE IV data (open with 7-zip)

Original call for participation

We invite participation from makers of static analysis tools that find weaknesses relevant to security. We welcome commercial, research, and open source tools. Participation in SATE is free.

If you have any questions, please email Vadim Okun.

Participating teams

  • Buguroo BugScout
  • Concordia University Marfcat
  • Cppcheck
  • Grammatech Codesonar
  • LDRA Testbed
  • Monoidics INFER
  • Parasoft C++test & Jtest
  • Red Lizard Software Goanna

Changes from SATE 2010

  • Participants will have more time to run their tools.
  • Test cases will be provided pre-compiled in a virtual machine.
  • Teams provide a sample tool report in either the SATE or the SAFES output format. Organizers review it and suggest formatting corrections if needed.
  • Introduction of a new track: PHP. Alternatives: C#, Python.
  • Introduction of synthetic test cases.
  • Use of the same test cases for the CVE-based analysis and the warning subset analysis.

Steps in the SATE procedure

The following summarizes the steps in the SATE procedure. The dates are subject to change.

  • Step 1: Plan and prepare
    • Organizers choose test sets
    • (Tentative) Security experts analyze selected test cases
    • Teams sign up to participate
  • Step 2: Provide test sets to teams (July 31, 2011)
  • Step 3: Teams run their tool on the test set(s)
    • Teams provide a sample tool report in either the SATE or the SAFES output format. Organizers review it and suggest formatting corrections if needed. (September 30, 2011)
    • Teams submit their report(s) (October 31, 2011)
    • Teams can withdraw from the exposition prior to this deadline. If a team withdraws, their intention to participate and decision to withdraw will never be disclosed.
  • Step 4: Organizers analyze the reports
    • Organizers identify tool warnings that match the CVEs
    • Organizers select a subset of tool warnings for analysis and share it with teams
    • (Optional) Teams return their review of the selected warnings from their tool's reports
    • Organizers provide preliminary analysis to the teams (mid-December 2011)
    • (Optional) Teams return their corrections to the preliminary analysis (mid-January 2012)
    • Organizers provide final analysis to the teams (January 31, 2012)
  • Step 5: Organizers and teams report and discuss their experience and observations at a workshop (February or March 2012)
  • Step 6: Teams are encouraged to submit a research report, to be published as part of a NIST Special Publication (SP), describing their experience running the tool, discussion of their tool's results, etc. (by May 2012)
  • Step 7: Publish NIST SP, including our research report, submitted reports and all data (between September and December 2012)

Details

Test sets

The exposition consists of 3 language tracks: C/C++, Java and PHP (Alternatives: C#, Python).

  • A test set for each language track
  • A test set consists of
    1. CVE-selected - pairs of open source programs: a vulnerable version with one or more publicly reported vulnerabilities (CVEs) and a fixed version. We will provide the list of CVEs to teams.
    2. The fixed version of the CVE-based test cases is used as a test case for the warning subset analysis.
    3. A set of synthetic test cases.
  • The CVE-based analysis and warning subset analysis differ. (See analysis procedure below.)
  • Size of each program is at least several thousand lines of code. (Except for the synthetic test cases.)
  • Each program has aspects relevant to security.
  • We expect programs to have various kinds of security defects.
  • We expect the code to be representative of today’s state of practice.
  • Compilable on a Linux OS using a commonly available compiler. Virtual machines containing the compiled test cases will be provided.

Coverage

SATE addresses different aspects of static analysis tools by using complementary kinds of test cases and analysis methods.

  • CVE-based test cases focus analysis on real-world exploitable vulnerabilities
  • Warning subset analysis helps understand what weaknesses are found by tools in real-world software
  • Synthetic test cases contain precisely characterized weaknesses.

Conditions for tool runs and submissions

Teams run their tools and submit reports following specified conditions.

  • Teams can participate in one, several, or all of the language tracks.
  • Teams cannot modify the code of the test cases, except possibly for comments (e.g., annotations).
  • For each test case, teams do one or more runs and submit the report(s).
    • Teams are encouraged to do a custom run (e.g., the tool is configured with custom rules). For a custom run, specify the affected settings (e.g., custom rules) in enough detail so that the run can be reproduced independently.
    • Teams may do a run that uses the tool in default configuration.
  • Teams cannot do any hand editing of tool reports.
  • Teams convert the reports to a common XML format. See SATE output format for description of the SATE format. We also accept reports in the SAFES format.
    • Teams are also encouraged to submit the original reports from their tools, in addition to the reports in the common output format.
  • Teams specify the environment (including the operating system and version of compiler) in which they ran the tool.

Analysis procedure

Finding all weaknesses in a reasonably large program is impractical. Also, due to the likely high number of tool warnings, analyzing all warnings may be impractical. Therefore, we select subsets of tool warnings for analysis.

Generally, the analyst first selects issues for analysis and then finds the associated warnings from tools. This results in a subset of tool warnings, which is then analyzed.

Methods 1 and 2 below apply to the general programs only. Method 3 applies to the CVE-selected programs. We will perform separate analysis and reporting for the resulting subsets.

Method 1: Statistical subset of tool warnings

Statistically select the same number of warnings from each tool report, assigning higher weight to categories of warnings with higher severity and avoiding categories of warnings with low severity.

This selection method is useful to the tool users because it considers warnings from each tool.

Tool warning selection procedure

We selected 30 warnings from each tool report using the following procedure:

  • Randomly select one warning from each warning class (identified by a warning name or by CWE id) with severities 1 through 4.
  • While more warnings are needed, repeat:
    • Randomly select 3 of the remaining warnings (or all remaining warnings if there are fewer than 3 left) from each warning class with severity 1,
    • Randomly select 2 of the remaining warnings (or all remaining warnings if there are fewer than 2 left) from each warning class with severity 2,
    • Randomly select 1 of the remaining warnings from each warning class (if it still has any warnings left) with severity 3.
  • If more warnings are still needed, select warnings from warning classes with severity 4, then select warnings from warning classes with severity 5.

If a tool did not assign severity, we assigned severity based on weakness names and our understanding of their relevance to security.
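
The selection procedure above can be sketched in Python as follows. This is an illustrative reading, not the script the organizers used: the tuple layout `(warning_id, warning_class, severity)`, the function name `select_warnings`, drawing one warning per class per round in the severity 4 and 5 fallback, and stopping as soon as the target count is reached are all assumptions.

```python
import random
from collections import defaultdict


def select_warnings(warnings, target=30, seed=None):
    """Pick up to `target` warnings, favoring the more severe classes.

    `warnings` is a list of (warning_id, warning_class, severity) tuples;
    this layout is an assumption made for the sketch.
    """
    rng = random.Random(seed)

    # Pool the warnings per (severity, class) and shuffle each pool, so that
    # popping from a pool is a random draw without replacement.
    pools = defaultdict(list)
    for warning in warnings:
        _, warning_class, severity = warning
        pools[(severity, warning_class)].append(warning)
    keys = sorted(pools)
    for key in keys:
        rng.shuffle(pools[key])

    selected = []

    def take(severity, count):
        # Draw up to `count` remaining warnings from every class of `severity`.
        for key in keys:
            if key[0] != severity:
                continue
            for _ in range(count):
                if len(selected) >= target or not pools[key]:
                    break
                selected.append(pools[key].pop())

    def any_left(severities):
        return any(pools[key] for key in keys if key[0] in severities)

    # First pass: one warning from each class with severity 1 through 4.
    for severity in (1, 2, 3, 4):
        take(severity, 1)

    # Weighted rounds: 3, 2, and 1 warnings per class for severities 1, 2, 3.
    while len(selected) < target and any_left((1, 2, 3)):
        take(1, 3)
        take(2, 2)
        take(3, 1)

    # Fallback: severity 4 first, then severity 5 (one per class per round,
    # which is an assumption; the procedure does not give a count).
    for severity in (4, 5):
        while len(selected) < target and any_left((severity,)):
            take(severity, 1)

    return selected
```

Passing a fixed `seed` makes a selection repeatable, which may help when a subset has to be shared and re-derived later.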

Method 2: Select tool warnings related to manually identified weaknesses

Security experts manually analyze the test cases and identify the most important weaknesses (manual findings). Analyze for both design weaknesses and source code weaknesses, focusing on the latter. Since manual analysis combines multiple weaknesses with the same root cause, we anticipate a small number of manual findings, e.g., 10-25 per test case. Take special care to confirm that the manual findings are indeed weaknesses. Tools may be used to aid human analysis, but static analysis tools cannot be the main source of manual findings.

Check the tool reports to find warnings related to the manual findings. For each manual finding, for each tool: find at least one related warning, or conclude that there are no related warnings.

This method is useful because it is largely independent of tools and thus includes weaknesses that may not be found by any tools. It also focuses analysis on weaknesses found most important by security experts.

Method 3: Select tool warnings related to the CVEs

For each CVE-selected pair of test cases, check the tool reports to find warnings that identify the CVEs in the vulnerable version. Check whether the warnings are still reported for the fixed version.

This method is useful because it focuses analysis on exploited weaknesses.

Analysis guidelines

SATE IV will use the same guidelines as SATE 2010. The detailed criteria for analysis of correctness and significance and criteria for associating warnings are at http://samate.nist.gov/SATE2010/resources/sate_analysis/AnalysisCriteria.pdf.

Analysis of correctness

Assign one of the following categories to each warning analyzed.

  • True security weakness - a weakness relevant to security.
  • True quality weakness - requires developer's attention, poor code quality, but may not be relevant to security. Example: buffer overflow where the input comes from the user and the program is not run as SUID. Example: "locally true" - function has a weakness, but the function may always be called with safe parameters.
  • True but insignificant weakness. Example: database tainted during configuration. Example: a warning that describes properties of a standard library function without regard to its use in the code.
  • Weakness status unknown - unable to determine correctness
  • Not a weakness - false

Associating warnings

For each tool warning in the list of selected warnings, find warnings from other tools that refer to the same (or related) weakness. For each selected warning instance, our goal is to find at least one related warning instance (if it exists) from each of the other tools. While there may be many warnings reported by a tool that are related to a particular warning, we do not attempt to find all of them.

We will use the following degrees of association:

  • Equivalent - weakness names are the same or semantically similar; locations are the same, or in case of paths, the source and the sink are the same and the variables affected are the same.
  • Strongly related - the paths are similar, where the sinks are the same conceptually (e.g., one tool may report a shorter path than another tool).
  • Weakly related - warnings refer to different parts of a chain or composite; weaknesses are different but related in some ways, e.g., one weakness may lead to the other, even if there is no clear chain; the paths are different but have a filter location or another important attribute in common.

Criteria for analysis of warnings related to manual findings

Mark tool warnings related to manual findings with one of the following:

  • Same instance.
  • Same instance, different perspective.
  • Same instance, different paths. Example: different sources, but the same sink.
  • Coincidental - tool reports a similar weakness (the same weakness type).
  • Other instance - tool reports a similar weakness (the same weakness type) elsewhere in the code.

Intended summary analysis

We plan to analyze the data collected and present the following in our report:

  • Number of warnings by weakness category and weakness severity
  • Summaries for the analyzed warnings, e.g., number of true tool warnings by weakness category

SATE output format

The SATE IV output format is the same as the SATE 2010 format. SATE 2008 and 2009 outputs are subsets and are therefore compliant with the latest version.

In the SATE tool output format, each warning includes:

  • Id - a simple counter, unique within the report.
  • (Optional) tool specific id.
  • One or more locations, where each location has:
    • (Optional) id - path id. If a tool produces several paths for a weakness, id can be used to differentiate between them.
    • line - line number.
    • path - file path.
    • (Optional) fragment - a relevant source code fragment at the location.
    • (Optional) explanation - why the location is relevant or what variable is affected.
  • Name (class) of the weakness, e.g., "buffer overflow".
  • (Optional) CWE id, where applicable.
  • Weakness grade (assigned by the tool):
    • Severity on the scale 1 to 5, with 1 - the highest.
    • (Optional) probability that the problem is a true positive, from 0 to 1.
    • (Optional) tool_specific_rank - tool specific metric - useful if a tool does not use severity and probability. If a team uses this field, it would have to separately provide definition, scale, and possible values.
  • Output - original message from the tool about the weakness, either in plain text, HTML, or XML.
  • (Optional) An evaluation of the warning by a human; not considered to be part of tool output, including:
    • (Optional) correctness - human analysis of the weakness, one of several categories. Use this instead of the deprecated "falsepositive" attribute.
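
As a rough aid for checking a report before conversion, the fields listed above can be modeled in Python as shown below. The class, field, and function names are illustrative assumptions and are not the element or attribute names of the XML schema; the schema file remains the authoritative definition. The sketch only checks the constraints stated in the list: at least one location, severity from 1 to 5, probability from 0 to 1, and ids unique within a report.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Location:
    line: int
    path: str
    id: Optional[int] = None           # path id, when a tool reports several paths
    fragment: Optional[str] = None     # relevant source code fragment
    explanation: Optional[str] = None  # why the location matters


@dataclass
class ToolWarning:
    id: int                            # simple counter, unique within the report
    name: str                          # weakness name (class), e.g. "buffer overflow"
    severity: int                      # 1 (highest) to 5
    output: str                        # original tool message
    locations: List[Location]
    tool_specific_id: Optional[str] = None
    cwe_id: Optional[int] = None
    probability: Optional[float] = None
    tool_specific_rank: Optional[str] = None  # type is an assumption; teams define it
    correctness: Optional[str] = None         # human evaluation, not tool output

    def problems(self) -> List[str]:
        """Return a list of violated constraints (empty if none)."""
        issues = []
        if not self.locations:
            issues.append("at least one location is required")
        if not 1 <= self.severity <= 5:
            issues.append("severity must be between 1 (highest) and 5")
        if self.probability is not None and not 0.0 <= self.probability <= 1.0:
            issues.append("probability must be between 0 and 1")
        return issues


def duplicate_ids(warnings: List[ToolWarning]) -> List[int]:
    """Return the warning ids that occur more than once in a report."""
    counts = Counter(w.id for w in warnings)
    return sorted(i for i, n in counts.items() if n > 1)
```

A check of this kind complements schema validation and does not replace it.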

The latest SATE XML schema file can be downloaded from
http://samate.nist.gov/SATE2010/resources/sate_2010.xsd

Teams are encouraged to use the schema file for validation, for example:

xmllint --schema sate_2010.xsd tool_report1.xml

Virtual Machine

To save participants the time of figuring out how to compile the test cases, we provide a VM. It contains all the test cases for all the tracks, and all dependencies required to compile them are already installed. Follow the compilation instructions below to compile the test cases.

Participants will need to download VMware Player to run the VM. It is available for free for several operating systems.

The VM runs on Ubuntu Linux 11.04. Sun JavaEE 5 is installed in the directory "/opt". You may want to tune the number of virtual CPUs and the amount of memory of the VM. The virtual machine needs to be shut down to make these changes.

The main account on the VM is "sate" and its password is "sate". It has administration privileges through the "sudo" command.

Download the SATE IV Virtual Machine: part 0, part 1, part 2.
SHA256: 03052ede3248d9df32027eb302b81a53fa7003286ac1ce04fe843441fd3880ee

Merging on Windows:

	> copy /b SATE4-VM.tar.bz2.part0+SATE4-VM.tar.bz2.part1+SATE4-VM.tar.bz2.part2 SATE4-VM.tar.bz2

Merging on Linux/Mac:

	$ cat SATE4-VM.tar.bz2.part0 SATE4-VM.tar.bz2.part1 SATE4-VM.tar.bz2.part2 > SATE4-VM.tar.bz2
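
Where neither command is convenient, a rough Python equivalent is sketched below; it also compares the result with a SHA256 digest. It assumes the SHA256 value listed above is the digest of the merged archive, which this page does not state explicitly. The function names are illustrative.

```python
import hashlib
import shutil

PARTS = [
    "SATE4-VM.tar.bz2.part0",
    "SATE4-VM.tar.bz2.part1",
    "SATE4-VM.tar.bz2.part2",
]
MERGED = "SATE4-VM.tar.bz2"
# Assumed to be the digest of the merged archive.
EXPECTED_SHA256 = "03052ede3248d9df32027eb302b81a53fa7003286ac1ce04fe843441fd3880ee"


def merge_parts(parts, merged):
    """Concatenate the part files, in order, into one file."""
    with open(merged, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)


def sha256_of(path, chunk_size=1 << 20):
    """Hash a file in chunks so a large archive need not fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def merge_and_check(parts, merged, expected_sha256):
    """Merge the parts and report whether the digest matches."""
    merge_parts(parts, merged)
    return sha256_of(merged) == expected_sha256.lower()
```

With the three parts in the current directory, `merge_and_check(PARTS, MERGED, EXPECTED_SHA256)` returns `True` when the digest of the merged file matches.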


Test sets for SATE IV

For each test case, we provide the download link(s), additional information about test cases if applicable, and compilation instructions. While we provide the compilation instructions for Ubuntu Linux 10.04, the test cases should compile on other operating systems.

Track: C/C++

  • Dovecot: secure IMAP and POP3 server
    • Vulnerable version: 1.2.0
    • Fixed version: 1.2.17
  • Wireshark: network protocol analyzer
    • Vulnerable version: 1.2.0
    • Fixed version: 1.2.18
  • Synthetic test cases

Dovecot

Website: http://www.dovecot.org/

  • Download Dovecot 1.2.0 from our server or another server.
    SHA256: ca834f6aa0fc76bdbdf1f273f53b41cc229112b0d0eb60d70a41b1e11ce0f3a2
  • Download Dovecot 1.2.17 from our server or another server.
    SHA256: 6f39f86a06ddbaa8e264d03046c2fc8870f97ea27c1e52eb5bc96c3b754f0bed

Compilation:

	$ ./configure
	$ make

NOTE. Dovecot does memory allocation differently from other C programs. Its memory management is described here.

Wireshark

Website: https://www.wireshark.org/

  • Download Wireshark 1.2.0 from our server or another server.
    SHA256: 74607d5fde2766e64a9be6f94f1f9bc5a6df64a6c522bfbcba9235af77b44b72
  • Download Wireshark 1.2.18 from our server or another server.
    SHA256: 8d75323efd0746aca78c1b3cbe6b9401095ee1947067b091d5a967a02867c8bd

Compilation, on a fresh installation of Ubuntu 11.04 with GCC 4.5.2:

	$ sudo apt-get install bison flex libgtk2.0-dev libgnutls-dev libpcap-dev
	$ ./configure
	$ make

Synthetic test cases

Download the C/C++ synthetic test cases from our server.
SHA256: 453603fc77ff7dc48cb61bb2b5c1955aaade1ca5fc8e5064c8be600bf62c908a

Compilation:

	$ make

Track: Java

  • Apache Tomcat: servlet container
    • Vulnerable version: 5.5.13
    • Fixed version: 5.5.33
  • Jetty: HTTP server
    • Vulnerable version: 6.1.16
    • Fixed version: 6.1.26
  • Synthetic test cases

NOTE. For the Java test cases, you need to download and install JDK 5.0 with Java EE. Then point your "JAVA_HOME" environment variable to where the JRE is installed ("/opt/SDK/jdk/jre" by default).

Apache Tomcat

Website: http://tomcat.apache.org/

  • Download Tomcat 5.5.13 from our server.
    SHA256: a4d9f7ac885687e1631acb2d9da6c897199eb219c66a9db0654a0675bda5e5bf
  • Download Tomcat 5.5.33 from our server or another server.
    SHA256: 7723b6fdbd5d4844e043ef4aa81d6c618c3c183088902e28b40ae6c54f5189cd

Compilation:

	$ sudo su
	# apt-get install ant
	# export JAVA_HOME=/opt/SDK/jdk/jre
	# ant

NOTE. We updated some of the compilation scripts for the 5.5.13 version of Tomcat. The code remains unchanged.

NOTE. To compile different versions of Tomcat on the same computer, it may be necessary to remove files left over from a previous compilation in /usr/share/java.

Jetty

Website: http://jetty.codehaus.org/jetty/

  • Download Jetty 6.1.16 from our server or another server.
    SHA256: 8d975ad64fe86fe6c75ff6d54b1ef1df7c2c70f6649dbd85414c26211ea0a7fa
  • Download Jetty 6.1.26 from our server or another server.
    SHA256: 17a903f77ede991833dce212ee5834398bcd242f75fa9329df31bb865d702bad

Compilation:

	$ sudo apt-get install maven2
	$ export JAVA_HOME=/opt/SDK/jdk/jre
	$ mvn compile test-compile

Synthetic test cases

Download the Java synthetic test cases from our server.
SHA256: f87a1dc73fb22cbf8f3b65711058b64f3079a3d3cbeaa5d36a53f6dee7f3b32f

Compilation:

	$ unset JAVA_HOME
	$ ant

Other details

The SATE IV organizing meeting was Friday, 4 March 2011 in McLean, Virginia co-located with the 14th Semi-Annual Software Assurance Forum.
