Static Analysis Tool Exposition (SATE) 2010
SATE 2010 is the third annual SATE. The experience workshop was held on 1 October 2010.
The NIST Software Assurance Metrics And Tool Evaluation (SAMATE) project conducted the third Static Analysis Tool Exposition (SATE) in 2010 to advance research in static analysis tools that find security defects in source code. The main goals of SATE were to enable empirical research based on large test sets, encourage improvements to tools, and promote broader and more rapid adoption of tools by objectively demonstrating their use on production software.
Briefly, participating tool makers ran their tools on a set of programs. Researchers led by NIST performed a partial analysis of the tool reports. The results and experiences were reported at the SATE 2010 Workshop in Gaithersburg, MD, in October 2010.
"Report on the Third Static Analysis Tool Exposition (SATE 2010)", Vadim Okun, Aurelien Delaitre, Paul E. Black, editors, U.S. National Institute of Standards and Technology (NIST) Special Publication (SP) 500-283, October 2011. This special publication consists of the following three papers.
- "The Third Static Analysis Tool Exposition (SATE 2010)," by Vadim Okun, Aurelien Delaitre, and Paul E. Black, describes the SATE procedure and provides observations based on the data collected.
- "Goanna Static Analysis at the NIST Static Analysis Tool Exposition," by Mark Bradley, Ansgar Fehnker, Ralf Huuck, and Paul Steckler, introduces Goanna, which uses a combination of static analysis with model checking, and describes its SATE experience, tool results, and some of the lessons learned in the process.
- Serguei A. Mokhov introduces a machine learning approach to static analysis and presents MARFCAT's SATE 2010 results in "The use of machine learning with signal- and NLP processing of source code to fingerprint, detect, and classify vulnerabilities and weaknesses with MARFCAT."
The data includes tool reports in the SATE output format, analysis of the tool reports (tool warnings selected randomly, based on CVEs, and based on manual findings), and additional information submitted by teams.
Cautions on Interpreting and Using the SATE Data
SATE 2010, as well as its predecessors, taught us many valuable lessons. Most importantly, our analysis should NOT be used as a basis for rating or choosing tools; this was never the goal of SATE.
There is no single metric or set of metrics that is considered by the research community to indicate or quantify all aspects of tool performance. We caution readers not to apply unjustified metrics based on the SATE data.
Due to the variety and different nature of security weaknesses, defining clear and comprehensive analysis criteria is difficult. While the analysis criteria have been much improved since the previous SATEs, further refinements are necessary.
The test data and analysis procedure employed have limitations and might not indicate how these tools perform in practice. The results may not generalize to other software because the choice of test cases, as well as the size of test cases, can greatly influence tool performance. Also, we analyzed a small subset of tool warnings.
In SATE 2010, we added CVE-selected programs to the test sets for the first time. The procedure that was used for finding CVE locations in code and selecting tool warnings related to the CVEs has limitations, so the results may not indicate tools' ability to find important security weaknesses.
The tools were used in this exposition differently from their use in practice. We analyzed tool warnings for correctness and looked for related warnings from other tools, whereas developers use tools to determine what changes need to be made to software, and auditors look for evidence of assurance. Also in practice, users write special rules, suppress false positives, and write code in certain ways to minimize tool warnings.
We did not consider the user interface, integration with the development environment, and many other aspects of the tools, which are important for a user to efficiently and correctly understand a weakness report.
Teams ran their tools against the test sets in July 2010. The tools continue to progress rapidly, so some observations from the SATE data may already be out of date.
Because of the stated limitations, SATE should not be interpreted as a tool testing exercise. The results should not be used to make conclusions regarding which tools are best for a particular application or the general benefit of using static analysis tools.
Note. Per requests by Coverity and Grammatech, their tool output is not released as part of SATE data. Consequently, our detailed analysis of their tool warnings is not released either. However, the observations and summary analysis in our paper are based on the complete data set.
Participating teams
- Armorize CodeSecure
- Concordia University MARFCAT
- Coverity Static Analysis for C/C++
- Grammatech CodeSonar
- LDRA Testbed
- Red Lizard Software Goanna
- Seoul National University Sparrow
- SofCheck Inspector for Java
Test cases: C/C++ track
- Dovecot: secure IMAP and POP3 server
- Wireshark: network protocol analyzer
- Vulnerable version: 1.2.0
- Fixed version: 1.2.9
- Chrome: web browser
- Vulnerable version: 5.0.375.54
- Fixed version: 5.0.375.70
Test cases: Java track
- Pebble: weblog software
- Apache Tomcat: servlet container
- Vulnerable version: 5.5.13
- Fixed version: 5.5.29
Program Planning Committee
- Redge Bartholomew, Rockwell Collins
- Steve Christey, MITRE
- Romain Gaucher, Cigital
- Raoul Jetley, FDA
- Scott Kagan, Lockheed Martin
- Ajoy Kumar, VP (Software Security)
- Michael Lowry, NASA
- Jaime Merced, DoD
- Frédéric Painchaud, DRDC
Paul Anderson wrote a detailed proposal for using CVE-based test cases to provide ground truth for analysis. Romain Gaucher helped with planning SATE. Romain Gaucher and Ramchandra Sugasi of Cigital are the security experts who quickly and accurately performed human analysis of the test cases. We thank Sue Wang, now at MITRE, for great help with all phases of SATE 2010, including planning, selection of CVE-based test cases, and analysis. All members of the NIST SAMATE team contributed to SATE 2010.
SATE is modeled on the Text REtrieval Conference (TREC): http://trec.nist.gov/
Bill Pugh first proposed organizing a TREC-like exposition for static analysis tools: http://www.cs.umd.edu/~pugh/JudgingStaticAnalysis.pdf (slides 48-50)
SATE 2010 plan (as of Fall 2010)
Static Analysis Tool Exposition (SATE) is designed to advance research (based on large test sets) in, and improvement of, static analysis tools that find security-relevant defects in source code. Briefly, participating tool makers run their tools on a set of programs. Researchers led by NIST analyze the tool reports. The results and experiences are reported at a workshop. The tool reports and analysis are made publicly available later.
The goals of SATE are:
- To enable empirical research based on large test sets
- To encourage improvement of tools
- To speed adoption of tools by objectively demonstrating their use on real software
Our goal is not to evaluate or to choose the "best" tools.
SATE is aimed at exploring the following characteristics of tools: relevance of warnings to security, their correctness, and prioritization.
Note. A warning is an issue (usually, a weakness) identified by a tool. A (tool) report is the output from a single run of a tool on a test case. A tool report consists of warnings.
Major changes from SATE 2009
- Add CVE-selected programs to the test sets. For these, focus on tool warnings that identify the CVEs.
- Expand the C track to a C/C++ track; that is, it may include C++ programs in addition to C programs.
- Update correctness categories and analysis criteria.
Steps in the SATE procedure
The following summarizes the steps in the SATE procedure. The dates are subject to change.
- Step 1 Plan and prepare
- Organizers choose test sets
- Teams sign up to participate (by 25 June)
- Step 2 Provide test sets to teams (28 June)
- Step 3 Teams run their tool on the test set(s) and return their report(s) (30 July - extended)
- Teams can withdraw from the exposition prior to this deadline. If a team withdraws, their intention to participate and decision to withdraw will never be disclosed.
- Step 4 Organizers analyze the reports
- Step 4a Organizers select a subset of tool warnings for analysis and share it with teams (9 Aug)
- Step 4b (Optional) Teams return their review of the selected warnings from their tool's reports (27 Aug)
- Step 4c Organizers provide final analysis to the teams (13 Sep)
- Step 5 Organizers and teams report and discuss their experience and observations at a workshop (1 October)
- Step 6 Teams are encouraged to submit a research report, to be published as part of a NIST special publication, describing their experience running the tool, discussion of their tool's results, etc. (by December)
- Step 7 Publish reports and data (between Feb and May 2011)
The exposition consists of two language tracks: a C/C++ track and a Java track.
- A test set for each language track
- A test set consists of
- General - 1 or 2 open source programs (or program components)
- CVE-selected - pairs of open source programs: a vulnerable version with one or more publicly reported vulnerabilities (CVEs) and a fixed version. We will provide the list of CVEs to teams.
- The general programs and the CVE-selected programs are analyzed differently (see Analysis procedure below)
- Size of each program is at least several thousand lines of code
- Each program has aspects relevant to security
- We expect programs to have various kinds of security defects
- We expect the code to be representative of today’s state of practice
- Compilable on a Linux OS using a commonly available compiler
Conditions for tool runs and submissions
Teams run their tools and submit reports following specified conditions.
- Teams can participate in either language track or both
- Teams cannot modify the code of the test cases, except possibly for comments (e.g., annotations).
- For each test case, teams do one or more runs and submit the report(s).
- Teams are encouraged to do a custom run (e.g., the tool is configured with custom rules). For a custom run, specify the affected settings (e.g., custom rules) in enough detail so that the run can be reproduced independently.
- Teams may do a run that uses the tool in default configuration.
- Teams cannot do any hand editing of tool reports.
- Teams convert the reports to a common XML format. See SATE output format for description of the format.
- Teams are also encouraged to submit the original reports from their tools, in addition to the reports in the SATE output format.
- Teams specify the environment (including the operating system and version of compiler) in which they ran the tool.
Finding all weaknesses in a reasonably large program is impractical. Also, due to the likely high number of tool warnings, analyzing all warnings may be impractical. Therefore, we select subsets of tool warnings for analysis.
Generally, the analyst first selects issues for analysis, then finds warnings from the tool reports associated with those issues. This yields a subset of tool warnings, which is then analyzed.
Methods 1 and 2 below apply to the general programs only. Method 3 applies to the CVE-selected programs. We will perform separate analysis and reporting for the resulting subsets.
Method 1: Statistical subset of tool warnings
Statistically select the same number of warnings from each tool report, assigning higher weight to categories of warnings with higher severity and avoiding categories of warnings with low severity.
This selection method is useful to the tool users because it considers warnings from each tool.
Tool warning selection procedure
We selected 30 warnings from each tool report using the following procedure:
- Randomly select one warning from each warning class (identified by a warning name or by CWE id) with severities 1 through 4.
- While more warnings are needed, repeat:
- Randomly select 3 of the remaining warnings (or all remaining warnings if there are fewer than 3 left) from each warning class with severity 1,
- Randomly select 2 of the remaining warnings (or all remaining warnings if there are fewer than 2 left) from each warning class with severity 2,
- Randomly select 1 of the remaining warnings from each warning class (if it still has any warnings left) with severity 3.
- If more warnings are still needed, select warnings from warning classes with severity 4, then select warnings from warning classes with severity 5.
If a tool did not assign severity, we assigned severity based on weakness names and our understanding of their relevance to security.
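The selection procedure above can be sketched in Python. This is a simplified illustration, not the actual SATE script; warning records are assumed to be dicts carrying a class name and a severity from 1 (highest) to 5.

```python
import random
from collections import defaultdict

def select_warnings(warnings, quota=30, seed=0):
    """Sketch of the SATE statistical selection (not the actual script).

    Each warning is a dict with a 'class' (warning name or CWE id) and a
    'severity' from 1 (highest) to 5.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)          # group warnings by (class, severity)
    for w in warnings:
        by_class[(w["class"], w["severity"])].append(w)

    def draw(ws):                         # remove and return one random warning
        return ws.pop(rng.randrange(len(ws)))

    selected = []
    # Step 1: one random warning from each class with severity 1 through 4.
    for (_, sev), ws in by_class.items():
        if sev <= 4 and ws:
            selected.append(draw(ws))
    # Step 2: repeatedly draw 3/2/1 warnings per class of severity 1/2/3.
    progress = True
    while len(selected) < quota and progress:
        progress = False
        for sev, n in ((1, 3), (2, 2), (3, 1)):
            for (_, s), ws in by_class.items():
                if s == sev:
                    for _ in range(min(n, len(ws), quota - len(selected))):
                        selected.append(draw(ws))
                        progress = True
    # Step 3: if still short, fall back to severity 4, then severity 5.
    for sev in (4, 5):
        for (_, s), ws in by_class.items():
            while s == sev and ws and len(selected) < quota:
                selected.append(draw(ws))
    return selected
```

The seed makes the selection reproducible; if a report has fewer warnings than the quota, all of them are returned.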
Method 2: Select tool warnings related to manually identified weaknesses
Security experts manually analyze the test cases and identify the most important weaknesses (manual findings). The analysis covers both design weaknesses and source code weaknesses, focusing on the latter. Since manual analysis combines multiple weaknesses with the same root cause, we anticipate a small number of manual findings, e.g., 10-25 per test case. Take special care to confirm that the manual findings are indeed weaknesses. Tools may be used to aid human analysis, but static analysis tools cannot be the main source of manual findings.
Check the tool reports to find warnings related to the manual findings. For each manual finding, for each tool: find at least one related warning, or conclude that there are no related warnings.
This method is useful because it is largely independent of tools and thus includes weaknesses that may not be found by any tools. It also focuses analysis on weaknesses found most important by security experts.
Method 3: Select tool warnings related to the CVEs
For each CVE-selected pair of test cases, check the tool reports to find warnings that identify the CVEs in the vulnerable version. Check whether the warnings are still reported for the fixed version.
This method is useful because it focuses analysis on exploited weaknesses.
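A rough version of this matching can be sketched in Python. This is only a heuristic illustration, not the actual SATE criteria: it assumes the CVE fix locations are known as (file path, line) pairs and treats a warning as related if it points near such a location; in practice line numbers shift between versions, so human review is still required.

```python
def cve_related_warnings(vuln_warnings, fixed_warnings, cve_locations, slack=5):
    """Heuristic sketch of Method 3 (not the actual SATE criteria).

    cve_locations maps a CVE id to the known (file path, line) locations of
    the weakness in the vulnerable version. A warning is considered related
    if it points within `slack` lines of such a location in the same file.
    """
    def near(warning, path, line):
        return warning["path"] == path and abs(warning["line"] - line) <= slack

    results = {}
    for cve, locs in cve_locations.items():
        hits = [w for w in vuln_warnings
                if any(near(w, p, l) for p, l in locs)]
        # Check whether a matching warning is still reported for the fixed
        # version (line numbers shift in practice, so this needs human review).
        persisting = [w for w in hits
                      if any(near(f, w["path"], w["line"]) for f in fixed_warnings)]
        results[cve] = {"vulnerable": hits, "still_in_fixed": persisting}
    return results
```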
The detailed criteria for analysis of correctness and significance and criteria for associating warnings are at http://samate.nist.gov/SATE2010/resources/sate_analysis/AnalysisCriteria.pdf.
Analysis of correctness
Assign one of the following categories to each warning analyzed.
- True security weakness - a weakness relevant to security.
- True quality weakness - requires developer's attention; poor code quality, but may not be relevant to security. Example: a buffer overflow where the input comes from the user, but the program is not run as SUID. Example: "locally true" - a function has a weakness, but the function may always be called with safe parameters.
- True but insignificant weakness. Example: database tainted during configuration. Example: a warning that describes properties of a standard library function without regard to its use in the code.
- Weakness status unknown - unable to determine correctness
- Not a weakness - false
For each tool warning in the list of selected warnings, find warnings from other tools that refer to the same (or related) weakness. For each selected warning instance, our goal is to find at least one related warning instance (if it exists) from each of the other tools. While there may be many warnings reported by a tool that are related to a particular warning, we do not attempt to find all of them. We will use the following degrees of association:
- Equivalent - weakness names are the same or semantically similar; locations are the same, or in case of paths, the source and the sink are the same and the variables affected are the same.
- Strongly related - the paths are similar, where the sinks are the same conceptually (e.g., one tool may report a shorter path than another tool).
- Weakly related - warnings refer to different parts of a chain or composite; weaknesses are different but related in some ways, e.g., one weakness may lead to the other, even if there is no clear chain; the paths are different but have a filter location or another important attribute in common.
Criteria for analysis of warnings related to manual findings
Mark tool warnings related to manual findings with one of the following:
- Same instance.
- Same instance, different perspective.
- Same instance, different paths. Example: different sources, but the same sink.
- Coincidental - tool reports a similar weakness (the same weakness type).
- Other instance - tool reports a similar weakness (the same weakness type) elsewhere in the code.
Intended summary analysis
We plan to analyze the data collected and present the following in our report:
- Number of warnings by weakness category and weakness severity
- Summaries for the analyzed warnings, e.g., number of true tool warnings by weakness category
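The tallies above are straightforward to compute once the analyzed warnings are in a structured form. A minimal illustration, assuming each analyzed warning carries 'category', 'severity', and 'correctness' fields matching the correctness categories described earlier:

```python
from collections import Counter

def summarize(analyzed):
    """Tally analyzed warnings for the summary report (illustrative only).

    Each item is assumed to carry 'category', 'severity', and 'correctness'
    fields; these names are hypothetical, not the actual SATE data schema.
    """
    by_category = Counter(w["category"] for w in analyzed)
    by_severity = Counter(w["severity"] for w in analyzed)
    true_by_category = Counter(w["category"] for w in analyzed
                               if w["correctness"] == "true security weakness")
    return by_category, by_severity, true_by_category
```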
SATE output format
The SATE 2010 output format is the same as the SATE 2009 format, except for an additional correctness category in the evaluation section. SATE 2008 and 2009 outputs are subsets and are therefore valid for 2010.
In the SATE tool output format, each warning includes:
- Id - a simple counter, unique within SATE 2010.
- (Optional) tool specific id.
- One or more locations, where each location has:
- (Optional) id - path id. If a tool produces several paths for a weakness, id can be used to differentiate between them.
- line - line number.
- path - file path.
- (Optional) fragment - a relevant source code fragment at the location.
- (Optional) explanation - why the location is relevant or what variable is affected.
- Name (class) of the weakness, e.g., "buffer overflow".
- (Optional) CWE id, where applicable.
- Weakness grade (assigned by the tool):
- Severity on the scale 1 to 5, with 1 - the highest.
- (Optional) probability that the problem is a true positive, from 0 to 1.
- (Optional) tool_specific_rank - tool specific metric - useful if a tool does not use severity and probability. If a team uses this field, it would have to separately provide definition, scale, and possible values.
- Output - original message from the tool about the weakness, either in plain text, HTML, or XML.
- (Optional) An evaluation of the warning by a human; not considered to be part of tool output, including:
- (Optional) correctness - human analysis of the weakness, one of several categories. Use this instead of the deprecated "falsepositive" attribute.
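A warning element along these lines can be generated programmatically. The sketch below uses Python's standard library; the element and attribute names are inferred from the field list above and may not match the schema exactly, so consult sate_2010.xsd for the authoritative structure.

```python
import xml.etree.ElementTree as ET

# Illustrative only: element and attribute names below are inferred from the
# field list above; the authoritative structure is defined by sate_2010.xsd.
warning = ET.Element("warning", id="1")
location = ET.SubElement(warning, "location", line="42", path="src/example.c")
ET.SubElement(location, "fragment").text = "strcpy(buf, input);"
ET.SubElement(warning, "name", cweid="120").text = "buffer overflow"
ET.SubElement(warning, "grade", severity="1", probability="0.8")
ET.SubElement(warning, "output").text = "Possible buffer overflow at line 42."
xml_text = ET.tostring(warning, encoding="unicode")
```

The resulting file should then be validated against the schema, e.g., with xmllint as shown below.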
The SATE 2010 XML schema file can be downloaded from
Teams are encouraged to use the schema file for validation, for example:
xmllint --schema sate_2010.xsd tool_report1.xml
Instructions for downloading and installing the test cases
Download Dovecot from our server.
SHA256: 3f9b4d0501bf04b4bb940b8bf66e43265b53b0165293c166f4428d182b6e8587 dovecot-2.0.beta6.20100626.tar.gz
NOTE. Dovecot does memory allocation differently from other C programs. Its memory management is described here:
- Download Wireshark 1.2.0 from our server or another server.
SHA256: bd8558ec36e2d31a628c3bdc70027487b79dad3a51fb5f0f79375c768b984e97 wireshark-1.2.0.tar.bz2
- Download Wireshark 1.2.9 from our server or another server.
SHA256: 078b3dca26c562989281d2c842bbbd1ebcef570dcc06a832ec0c31f1b8076152 wireshark-1.2.9.tar.bz2
On a fresh installation of Ubuntu 10.04 with GCC 4.4.3, install the following packages, then configure and compile:
sudo apt-get install bison flex libgtk2.0-dev libgnutls-dev libpcap-dev
We used a fresh install of Ubuntu 10.04 with GCC 4.4.3
NOTE. Compiling Chrome requires a lot of memory. It succeeded on a computer with 2 GB of RAM and 20 GB of disk space.
Install dependencies:
wget http://src.chromium.org/svn/trunk/src/build/install-build-deps.sh
chmod +x install-build-deps.sh
Install code downloading tools:
wget http://src.chromium.org/svn/trunk/tools/depot_tools.tar.gz
tar zxf depot_tools.tar.gz
Alternatively, download from our server. The download may fail due to the large file size (1.2GB).
SHA256: 8aa860fa8b05ae619db2427d25a1549335ce5fb6ecdaea964a94aa5359feec57 chrome-5.0.375.54.tar.gz
Download and compile the sources for the fixed version:
gclient config http://src.chromium.org/svn/releases/5.0.375.70
Alternatively, download from our server. The download may fail due to the large file size (1.2GB).
Download Pebble 2.5-M2 from our server.
SHA256: 02885022103cfdbaf984cfe72f84bf4ce0c7841003343b8b8058c27cdd413315 pebble.tar.gz
Compile:
sudo apt-get install subversion maven2 default-jdk
Pebble requires Java 6.0. To install Java EE 6:
sudo apt-get install default-jdk ant
chmod +x java_ee_sdk-6u1-unix.sh
Apache TomcatWebsite: http://tomcat.apache.org/
- Download Tomcat 5.5.13 from our server.
SHA256: 5064eb0eb8992faa198e1577e1205cad41463526df4183ff56c67435ff7fe030 apache-tomcat-5.5.13-src.SATEFIXED.tar.gz
- Download Tomcat 5.5.29 from our server.
SHA256: 82520f433025072fc7e7d8cfbdde4131d36fdfdb6b336a479c7dbce407abf3f9 apache-tomcat-5.5.29-src.tar.gz
NOTE. To compile different versions of Tomcat on the same computer, it may be necessary to remove files left over from a previous compilation in /usr/share/java.
On a fresh installation of Ubuntu 10.04, install the latest version of Sun JDK 5 (1.5), install ant, and compile:
sudo apt-get install ant