Static Analysis Tool Exposition (SATE) 2008
The NIST SAMATE project conducted the first Static Analysis Tool Exposition (SATE) in 2008 to advance research in static analysis tools that find security defects in source code. The main goals of SATE were to enable empirical research based on large test sets and to encourage improvement and speed adoption of tools. The exposition was planned to be an annual event.
Briefly, participating tool makers ran their tools on a set of programs. Researchers led by NIST performed a partial analysis of the tool reports. The results and experiences were reported at the Static Analysis Workshop (SAW) in Tucson, AZ, in June 2008.
Published as "Static Analysis Tool Exposition (SATE) 2008", Vadim Okun, Romain Gaucher, Paul E. Black, editors, U.S. National Institute of Standards and Technology (NIST) Special Publication (SP) 500-279, June, 2009.
This special publication consists of the following papers. "Review of the First Static Analysis Tool Exposition (SATE 2008)," by Vadim Okun, Romain Gaucher, and Paul E. Black, describes the SATE procedure, provides observations based on the data collected, and critiques the exposition, including the lessons learned that may help future expositions. Paul Anderson’s "Commentary on CodeSonar’s SATE Results" has comments by one of the participating tool makers. Steve Christey presents his experiences in analysis of tool reports and discusses the SATE issues in "Static Analysis Tool Exposition (SATE 2008) Lessons Learned: Considerations for Future Directions from the Perspective of a Third Party Analyst".
Download NIST SP 500-279.
The data includes tool reports in the SATE output format, our analysis of the tool reports, and additional information submitted by participants.
Cautions on Interpreting and Using the SATE Data
SATE 2008 was the first such exposition that we conducted, and it taught us many valuable lessons. Most importantly, our analysis should NOT be used as a direct source for rating or choosing tools; this was never the goal of SATE.
There is no metric or set of metrics that is considered by the research community to indicate all aspects of tool performance. We caution readers not to apply unjustified metrics based on the SATE data.
Due to the variety and different nature of security weaknesses, defining clear and comprehensive analysis criteria is difficult. As SATE progressed, we realized that our analysis criteria were not adequate, so we adjusted the criteria during the analysis phase. As a result, the criteria were not applied consistently. For instance, we were inconsistent in marking the severity of the warnings where we disagreed with the tool’s assessment.
The test data and analysis procedure employed have serious limitations and may not indicate how these tools perform in practice. The results may not generalize to other software because the choice of test cases, as well as the size of test cases, can greatly influence tool performance. Also, we analyzed a small, non-random subset of tool warnings and in many cases did not associate warnings that refer to the same weakness.
The tools were used in this exposition differently from their use in practice. In practice, users write special rules, suppress false positives, and write code in certain ways to minimize tool warnings.
We did not consider the user interface, integration with the development environment, and many other aspects of the tools. In particular, the tool interface is important for a user to efficiently and correctly understand a weakness report.
Participants ran their tools against the test sets in February 2008. The tools continue to progress rapidly, so some observations from the SATE data may already be obsolete.
Because of the above limitations, SATE should not be interpreted as a tool testing exercise. The results should not be used to make conclusions regarding which tools are best for a particular application or the general benefit of using static analysis tools.
Download SATE 2008 data.
Participating teams and tools
- Aspect Security ASC
- Checkmarx CxSuite
- Fortify SCA
- Grammatech CodeSonar
- HP DevInspect
- SofCheck Inspector for Java
- UMD FindBugs
- Veracode SecurityReview
Test cases: C track
- lighttpd: web server
  - Local download link: http://samate.nist.gov/SATE2008/resources/C/lighttpd-1.4.18.tar.gz
  - Website: http://www.lighttpd.net
- nagios: host, service and network monitoring program
  - Local download link: http://samate.nist.gov/SATE2008/resources/C/nagios-2.10.tar.gz
  - Website: http://www.nagios.org
- naim: console instant messenger application
  - Local download link: http://samate.nist.gov/SATE2008/resources/C/naim-0.11.8.3.1.tar.bz2
  - Website: http://code.google.com/p/naim/
Test cases: Java track
- DSpace: content management system
  - Local download link: http://samate.nist.gov/SATE2008/resources/Java/dspace-1.4.2-source.tgz
  - Website: http://www.dspace.org
- mvnForum: forum
  - Local download link: http://samate.nist.gov/SATE2008/resources/Java/mvnforum-1.1-src.zip
  - Website: http://www.mvnforum.com
- OpenNMS: network management system
  - Local download link: http://samate.nist.gov/SATE2008/resources/Java/opennms_1.2.9.tar.gz
  - Website: http://www.opennms.org
We plan for the exposition to be an annual event. Some possible future plans include the following.
- Multiple tracks for different domains
- More languages
- Other tool classes
- Web Application Security Scanners
- Binary analysis
- Requirements analysis
- Specification-to-code verifiers
- Static analysis for purposes other than "finding bugs"
  - e.g., metrics: program size, assurance level, size of security problem, etc.
- Test applications with deliberately inserted back doors
- Interactive track: to measure the way the tool is used by the programmers
- Track to generate code from requirements
We thank Steve Christey, Bob Schmeichel, and Bob Martin of the MITRE Corporation for contributing their time and expertise to the analysis of tool reports.
SATE is modeled on the Text REtrieval Conference (TREC): http://trec.nist.gov/
Bill Pugh first proposed organizing a TREC-like exposition for static analysis tools: http://www.cs.umd.edu/~pugh/JudgingStaticAnalysis.pdf (slides 48-50)
SATE 2008 plan (as of mid 2008)
- To enable empirical research based on large test sets
- To encourage improvement of tools
- To speed adoption of the tools by objectively demonstrating their use on real software
Briefly, organizers provide test sets of programs to tool makers who wish to participate. Participants run their tools on the test cases and return the tool reports. Organizers perform a limited analysis of the results and watch for interesting aspects. Participants and organizers report their experience running the tools and their results at the Static Analysis Workshop (SAW). Organizers make the test sets, tool reports, and results publicly available six months after the workshop. See the Protocol for more detail.
Our goal is not to choose the "best" tools: many other factors determine which tool or tools are appropriate in each situation.
Characteristics to be considered
- Relevance of warnings to security
- Correctness of warnings (true positive or false positive)
- Prioritization of warnings (high, medium, ...)
Note. A warning is an issue identified by a tool. A tool report is the output from a single run of a tool on a test case. A tool report consists of warnings.
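To make this terminology concrete, here is a minimal Python sketch of how the warnings in one analyzed report might be tallied by the three characteristics listed above. The class and field names (`AnalyzedWarning`, `security_relevant`, `true_positive`, `priority`) and the sample warnings are illustrative assumptions; they are not part of the SATE data or output format.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class AnalyzedWarning:
    """One tool warning plus an analyst's judgments (hypothetical fields)."""
    name: str
    security_relevant: bool  # relevance of the warning to security
    true_positive: bool      # correctness: True = true positive, False = false positive
    priority: str            # prioritization label, e.g. "high" or "medium"


def summarize_report(warnings):
    """Count the warnings of one tool report by each characteristic."""
    return {
        "total": len(warnings),
        "security_relevant": sum(w.security_relevant for w in warnings),
        "true_positives": sum(w.true_positive for w in warnings),
        "false_positives": sum(not w.true_positive for w in warnings),
        "by_priority": dict(Counter(w.priority for w in warnings)),
    }


# A report is modeled here as the list of warnings from one run on one test case.
# These three warnings are made-up sample data.
report = [
    AnalyzedWarning("buffer overflow", True, True, "high"),
    AnalyzedWarning("unused variable", False, True, "low"),
    AnalyzedWarning("SQL injection", True, False, "high"),
]
summary = summarize_report(report)
```

Counts like these only describe the warnings that were actually analyzed; they are not intended as a rating of a tool.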
- The exposition consists of 2 language tracks:
  - C track
  - Java track
- Participants can enter either track or both
- Separate analysis and reporting for each track
Here is the detailed interaction with due dates.
Step 1 Prepare
Step 1a Organizers choose test sets
- A test set for each language track
- A test set consists of up to 3 open source programs (or program components)
- Size of each program is at least several thousand lines of code
- We anticipate that some of the test cases will be tens or hundreds of thousands of lines of code
- Each program has aspects relevant to security
- We expect programs to have various kinds of security defects
- We expect the code to be representative of today's state of practice
- Compilable on a Linux OS using a commonly available compiler
Step 1b Tool makers sign up to participate (8 Feb 2008)
- Participants specify which track(s) they wish to enter
- For each track, participants specify the exact version(s) of the tool(s) that they will run on the test set. The version must have a release or build date that is earlier than the date when they receive the test set.
Step 2 Organizers provide test set(s) (15 Feb 2008)
- Organizers will specify the method of distribution in advance
Step 3 Participants run their tool on the test set(s) and return their report(s) (by 29 Feb 2008)
- Participants cannot modify the code of the test cases, except possibly for comments (e.g. annotations).
- If annotations were manually added, note this and send back the modified test case with annotations.
- For each test case, participants can do one or more runs and submit the report(s)
  - Participants are encouraged to do a run that uses the tool in its default configuration.
  - Participants may do custom runs (e.g., the tool is configured with custom rules). For a custom run, specify the affected settings (e.g., custom rules) in enough detail so that the run can be reproduced independently.
- Participants specify the environment (including the OS, version of compiler, etc.) in which they ran the tool
- Hand editing of the tool reports (e.g., manually removing false positives or adding true positives) is not allowed.
- The reports are in a common format (XML). See Tool output format.
- Participants can withdraw from any language track or from the exposition prior to this deadline. In that case, their intention to participate and decision to withdraw will not be disclosed.
Step 3a (optional) Participants return their review of their tool's report(s) (by 15 Mar 2008)
Step 4 Organizers analyze the reports (by 15 April 2008)
- For each test case, combine all submitted reports and information from other analysis
- Compile a master reference list of security-relevant weaknesses, that is, the true positives
- Compare each tool's security warnings against the master reference: true positives, false positives
- Participants receive the master reference list, comparison of their report with the master reference list, reports from other tools
Note. We do not expect (and will emphasize this in our report) that the master reference list will be perfect. Participants are welcome to submit a critique of the master reference list, identifying items that are missing or incorrectly included.
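The comparison in Step 4 can be pictured as simple set operations. The Python sketch below assumes, purely for illustration, that a weakness is identified by a (pathname, line number, weakness name) triple. The plan does not define a matching rule, and the function name, paths, and sample entries are hypothetical.

```python
def compare_with_master(tool_warnings, master_reference):
    """Split a tool's warnings against a master reference list.

    Both arguments are iterables of hashable weakness keys. Using
    (pathname, line, weakness name) as the key is an assumption made
    for this sketch only.
    """
    tool = set(tool_warnings)
    master = set(master_reference)
    return {
        "true_positives": tool & master,   # reported and in the master list
        "false_positives": tool - master,  # reported but not in the master list
        "not_reported": master - tool,     # in the master list, not reported by this tool
    }


# Made-up sample data.
master = {
    ("src/request.c", 120, "buffer overflow"),
    ("src/log.c", 48, "format string"),
}
reported = [
    ("src/request.c", 120, "buffer overflow"),
    ("src/util.c", 10, "null dereference"),
]
result = compare_with_master(reported, master)
```

Because the master reference list is not expected to be perfect, a warning that falls outside it is a false positive only relative to that list; exact-key matching would also miss warnings that describe the same weakness at a different line.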
Step 4a (Optional) Participants return their corrections to the master reference list (by 29 April 2008)
Step 4b Participants receive an updated master reference list and an updated comparison of their report with the master reference list (by 13 May 2008)
Step 4c Participants submit a report for SAW (by 30 May 2008)
- The participant's report presents experience running the tool, discussion of their tool's results, etc.
- The report is a paper up to 10 pages long
Step 5 Report Comparisons at SAW (June 2008)
- Organizers report comparisons and any interesting observations.
- Participants receive the detailed comparisons for all participating tools (see next step for what these include)
- Participants report their experience running the tools and discuss their results
- Discuss comments, suggestions, plans for the next exposition
Step 5a Participants submit final version of report (from Step 4c) (by 30 June 2008)
- To be published as NIST special publication or NIST technical report
Step 6 Publish Results (Dec 2008)
- Organizers publish test sets, master reference list, and detailed comparisons, including
  - tool version and any configuration parameters (e.g., custom rule set) used
  - verbatim textual report from the tool
  - warning-by-warning comparison with the master list
Tool output format
The tool output format is an annotation for the original tool report. We would like to preserve all content of the original tool report.
Each warning includes
- weakness id - a simple counter
- (optional) tool specific unique id
- one or more locations, each including line number and pathname
- name of the weakness
- (optional) CWE id, where applicable
- weakness grade
  - severity on a scale of 1 to 5, with 1 being the highest
  - (optional) probability that the problem is a true positive, from 0 to 1
- output - original message from the tool about the weakness, either in plain text, HTML, or XML
- (optional) An evaluation of the issue by a human; not considered to be part of tool output
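As an illustration of the fields listed above, the following Python sketch models a single warning as a record and checks the stated ranges. It is not the SATE XML schema: the class names (`Location`, `ToolWarning`), the field names, and the sample values are assumptions chosen for this example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Location:
    path: str  # pathname of the source file
    line: int  # line number within that file


@dataclass
class ToolWarning:
    weakness_id: int                        # a simple counter
    locations: List[Location]               # one or more locations
    name: str                               # name of the weakness
    severity: int                           # 1 (highest) to 5
    output: str                             # original message from the tool
    tool_specific_id: Optional[str] = None  # optional tool-specific unique id
    cwe_id: Optional[int] = None            # optional CWE id, where applicable
    probability: Optional[float] = None     # optional, 0 to 1
    evaluation: Optional[str] = None        # optional human evaluation, not tool output

    def __post_init__(self):
        # Enforce the constraints stated in the field list.
        if not self.locations:
            raise ValueError("a warning needs at least one location")
        if not 1 <= self.severity <= 5:
            raise ValueError("severity must be between 1 (highest) and 5")
        if self.probability is not None and not 0.0 <= self.probability <= 1.0:
            raise ValueError("probability must be between 0 and 1")


# Made-up sample warning.
w = ToolWarning(
    weakness_id=1,
    locations=[Location("src/request.c", 120)],
    name="buffer overflow",
    severity=2,
    output="Possible buffer overflow in copy to fixed-size buffer",
    cwe_id=120,
    probability=0.7,
)
```

Keeping the human evaluation as a separate optional field mirrors the note that it is not considered part of the tool output.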