Results were reported at the SATE 2009 workshop on 6 November.
Static Analysis Tool Exposition (SATE) 2009
The NIST Software Assurance Metrics And Tool Evaluation (SAMATE) project conducted the second Static Analysis Tool Exposition (SATE) in 2009 to advance research in static analysis tools that find security defects in source code. The main goals of SATE were to enable empirical research based on large test sets, encourage improvements to tools, and promote broader and more rapid adoption of tools by objectively demonstrating their use on production software.
Briefly, participating tool makers ran their tool on a set of programs. Researchers led by NIST performed a partial analysis of tool reports. The results and experiences were reported at the SATE 2009 Workshop in Arlington, VA, in November, 2009.
"The Second Static Analysis Tool Exposition (SATE) 2009", Vadim Okun, Aurelien Delaitre, Paul E. Black, U.S. National Institute of Standards and Technology (NIST) Special Publication (SP) 500-287, June, 2010.
Download NIST SP 500-287.
The data includes tool reports in the SATE output format, analysis of the tool reports, and additional information submitted by teams.
Cautions on Interpreting and Using the SATE Data
SATE 2009, as well as its predecessor, SATE 2008, taught us many valuable lessons. Most importantly, our analysis should NOT be used as a basis for rating or choosing tools; this was never the goal of SATE.
There is no single metric or set of metrics that is considered by the research community to indicate or quantify all aspects of tool performance. We caution readers not to apply unjustified metrics based on the SATE data.
Due to the variety and different nature of security weaknesses, defining clear and comprehensive analysis criteria is difficult. While the analysis criteria have been improved since SATE 2008, refinements are necessary and are in progress.
The test data and analysis procedure employed have limitations and might not indicate how these tools perform in practice. The results may not generalize to other software because the choice of test cases, as well as the size of test cases, can greatly influence tool performance. Also, we analyzed a small subset of tool warnings.
The tools were used in this exposition differently from their use in practice. We analyzed tool warnings for correctness and looked for related warnings from other tools, whereas developers use tools to determine what changes need to be made to software, and auditors look for evidence of assurance. Also in practice, users write special rules, suppress false positives, and write code in certain ways to minimize tool warnings.
We did not consider the user interface, integration with the development environment, and many other aspects of the tools, which are important for a user to efficiently and correctly understand a weakness report.
Teams ran their tools against the test sets in late August - early September 2009. The tools continue to progress rapidly, so some observations from the SATE data may already be out of date.
Because of the stated limitations, SATE should not be interpreted as a tool testing exercise. The results should not be used to make conclusions regarding which tools are best for a particular application or the general benefit of using static analysis tools.
Download SATE 2009 data.
Test cases: C track
Download the C test cases.
- IRSSI: IRC client
- PVM3: Parallel virtual machine
Test cases: Java track
We thank Romain Gaucher for help with planning SATE 2009. David Lindsay and Romain Gaucher of Cigital are the security experts who quickly and accurately performed human analysis of the test cases. Drew Buttner and Steve Christey of the MITRE Corporation played an important role in the analysis of tool reports. All members of the NIST SAMATE team contributed to SATE 2009.
SATE is modeled on the Text REtrieval Conference (TREC): http://trec.nist.gov/
Bill Pugh first proposed organizing a TREC-like exposition for static analysis tools: http://www.cs.umd.edu/~pugh/JudgingStaticAnalysis.pdf (slides 48-50)
SATE 2009 plan (as of Fall 2009)
Static Analysis Tool Exposition (SATE) is designed to advance research in static analysis tools that find security-relevant defects in source code. Briefly, participating tool makers run their tools on a set of programs. Researchers led by NIST analyze the tool reports. The results and experiences are reported at a workshop. The tool reports and analysis are made publicly available later.
The goals of SATE are:
- To enable empirical research based on large test sets
- To encourage improvement of tools
- To speed adoption of tools by objectively demonstrating their use on real software
Our goal is neither to evaluate nor to choose the "best" tools.
Characteristics to be considered
- Relevance of warnings to security
- Correctness of warnings
- Prioritization of warnings
Note. A warning is an issue (usually, a weakness) identified by a tool. A (tool) report is the output from a single run of a tool on a test case. A tool report consists of warnings.
Steps in the SATE procedure
The following summarizes the steps in the SATE procedure. The dates are subject to change.
- Step 1 Plan and prepare
- Organizers choose test sets
- Teams sign up to participate (by 14 Aug)
- Step 2 Provide test sets to teams (19 Aug)
- Step 3 Teams run their tool on the test set(s) and return their report(s) (4 Sep)
- Teams can withdraw from the exposition prior to this deadline. In that case, their intention to participate and decision to withdraw will never be disclosed.
- Step 4 Organizers analyze the reports
- Step 4a Provide preliminary analysis to the teams (22 Sep)
- Step 4b (Optional) Teams return their corrections to the preliminary analysis (2 Oct)
- Step 4c Provide final analysis to the teams (16 Oct)
- Step 5 Organizers and teams report and discuss their experience and observations at a workshop to be held in conjunction with the DHS Software Assurance Forum (6 Nov)
- Step 6 Publish reports and data (between February and May 2010)
The exposition consists of 2 language tracks: C track and Java track.
- A test set for each language track
- A test set consists of 1 or 2 open source programs (or program components)
- Size of each program is at least several thousand lines of code
- Each program has aspects relevant to security
- We expect programs to have various kinds of security defects
- We expect the code to be representative of today’s state of practice
- Compilable on a Linux OS using a commonly available compiler
We will choose the test cases using some ideas from the discussion of selection of test cases in the Appendix of this document.
Conditions for tool runs and submissions
Teams run their tools and submit reports following specified conditions.
- Teams can participate in either language track or both
- Teams cannot modify the code of the test cases, except possibly for comments (e.g., annotations).
- For each test case, teams do one or more runs and submit the report(s).
- Teams are encouraged to do a run that uses the tool in default configuration.
- Teams may do custom runs (e.g., the tool is configured with custom rules). For a custom run, specify the affected settings (e.g., custom rules) in enough detail so that the run can be reproduced independently.
- Teams cannot do any hand editing of tool reports.
- Teams convert the reports to a common XML format. See SATE output format for description of the format.
- Teams are also encouraged to submit the original reports from their tools, in addition to the reports in the SATE output format.
- Teams specify the environment (including the operating system and version of compiler) in which they ran the tool.
Finding all weaknesses in a reasonably large program is impractical. Also, due to the likely high number of tool warnings, analyzing all warnings may be impractical. Therefore, we select subsets of tool warnings for analysis.
In general, the following procedure is used. First, select a set of issues for analysis. Second, find associated warnings from tools. This results in a subset of tool warnings. Analyze this subset of warnings.
We plan to use the following two complementary methods to select tool warnings. We will perform separate analysis and reporting for the two resulting subsets.
Method 1: Select a subset of tool warnings
Select the same number of warnings from each tool report, avoiding categories of warnings with low severity.
We will choose the warnings using some ideas from the discussion of selection of warnings in the Appendix of this document.
For the selected warnings, add associated warnings (that refer to the same weakness) from other tools and analyze the subset.
This selection method is useful to the tool users because it considers warnings from each tool.
Tool warning selection procedure
We selected 30 warnings from each tool report using the following procedure:
- Randomly select one warning from each warning class (identified by a warning name) with severities 1 through 4.
- While more warnings are needed, repeat:
  - Randomly select 3 of the remaining warnings (or all remaining warnings if there are fewer than 3 left) from each warning class with severity 1.
  - Randomly select 2 of the remaining warnings (or all remaining warnings if there are fewer than 2 left) from each warning class with severity 2.
  - Randomly select 1 of the remaining warnings from each warning class (if it still has any warnings left) with severity 3.
- If more warnings are still needed, select warnings from warning classes with severity 4, then select warnings from warning classes with severity 5.
If a tool did not assign severity, we assigned severity based on weakness names and our understanding of their relevance to security.
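The selection procedure above can be sketched in Python as follows. This is only an illustration, not the script used for SATE: the dictionary keys `id`, `name`, and `severity`, the function name `select_warnings`, and the choice to stop as soon as the target count is reached are all assumptions made for this example.

```python
import random
from collections import defaultdict


def select_warnings(warnings, target=30, seed=0):
    """Pick up to `target` warnings from one tool report.

    `warnings` is a list of dicts with hypothetical keys
    'id', 'name' (the warning class) and 'severity' (1 to 5).
    """
    rng = random.Random(seed)

    # Group warnings by (severity, class name) and shuffle each group once,
    # so that popping from a group is a random draw without replacement.
    remaining = defaultdict(list)
    for w in warnings:
        remaining[(w["severity"], w["name"])].append(w)
    for pool in remaining.values():
        rng.shuffle(pool)

    selected = []

    def take(severity, per_class):
        """Take up to `per_class` warnings from every class of `severity`."""
        for key in sorted(remaining):
            if key[0] != severity:
                continue
            pool = remaining[key]
            for _ in range(min(per_class, len(pool))):
                if len(selected) >= target:
                    return
                selected.append(pool.pop())

    # One warning from each class with severities 1 through 4.
    for severity in (1, 2, 3, 4):
        take(severity, 1)

    # Repeat rounds of 3 / 2 / 1 warnings per class for severities 1 / 2 / 3.
    quota = {1: 3, 2: 2, 3: 1}
    while len(selected) < target and any(
        remaining[key] for key in remaining if key[0] in quota
    ):
        for severity, per_class in quota.items():
            take(severity, per_class)

    # If still short, drain severity 4 classes, then severity 5 classes.
    for severity in (4, 5):
        for key in sorted(remaining):
            if key[0] == severity:
                pool = remaining[key]
                while pool and len(selected) < target:
                    selected.append(pool.pop())

    return selected
```

Fixing the seed makes a selection reproducible, which helps when a selection has to be reviewed later. If a report contains fewer warnings than the target, the sketch simply returns all of them.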
Method 2: Select tool warnings related to manually identified weaknesses
Manually analyze the test cases and identify the most important weaknesses (manual findings). Analyze for both design weaknesses and source code weaknesses focusing on the latter. Since manual analysis combines multiple weaknesses with the same root cause, we anticipate a small number of manual findings, e.g., 10-25 per test case. Take special care to confirm that the manual findings are indeed weaknesses. Tools may be used to aid human analysis, but static analysis tools cannot be the main source of manual findings.
Check the tool reports to find warnings associated with the manual findings (or provide information to fix them). For each manual finding, for each tool: find at least one true tool warning, or find at least one related true warning, or conclude that there are no true or related true warnings.
This selection method is useful to the tool users because it is largely independent of tools and thus includes weaknesses that may not be found by any tools. It also focuses analysis on weaknesses found most important by security experts.
Criteria for analysis of correctness
Assign one of the following categories to each warning analyzed.
- True weakness
- True but insignificant weakness. Examples: database tainted during configuration, or a warning that describes properties of a standard library function without regard to its use in the code.
- Weakness status unknown - unable to determine correctness
- Not a weakness - an invalid conclusion about the code
In the above categories, there are two dimensions: correctness and significance.
In the analysis of correctness, assume that:
- A tool has (or should have) perfect knowledge of control/data flow that is explicitly in the code.
- If a tool reports a weakness on an infeasible path, mark it as false (not a weakness).
- If a tool reports a weakness that is not present, mark it as false.
- For example, if a tool reports an error caused by unfiltered input, but in fact the input is filtered correctly, mark it as false.
- If the input is filtered, but the filtering is not complete, mark it as true. This is often the case for cross-site scripting weaknesses.
- If a warning says that a function can be called with a bad parameter, but in the test case it is always called with safe values, mark the warning as false.
- A tool does not know about context or environment and may assume the worst case.
- For example, if a tool reports a weakness that is caused by unfiltered input from command line or from local files, mark it as true (but it may be insignificant - see below). The reason is that the test cases are general purpose software, and we did not provide any environmental information to the participants.
In the analysis of significance of a warning, consider its possible effects on security (integrity, confidentiality, availability). Mark a warning as true but insignificant in these cases:
- A warning describes properties of a function (e.g., standard library function) without regard to its use in the code.
- A warning describes a property that may only lead to a security problem in unlikely and local (not caused by an external person) cases.
- For example, a warning about unfiltered input from a command that is run only by an administrator during installation is likely insignificant.
- If a warning about coding inconsistencies does not indicate a deeper problem, then it is insignificant.
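The two dimensions, correctness and significance, can be combined into a single category as in the following minimal Python sketch. The function name `categorize` and its three parameters are hypothetical; they stand for judgments that a human analyst makes using the criteria above, and the code does not automate those judgments.

```python
from enum import Enum


class Category(Enum):
    TRUE = "True weakness"
    INSIGNIFICANT = "True but insignificant weakness"
    UNKNOWN = "Weakness status unknown"
    FALSE = "Not a weakness"


def categorize(weakness_present, on_feasible_path=True, insignificant=False):
    """Map an analyst's judgments to one of the four categories.

    weakness_present: True, False, or None if correctness cannot be determined.
    on_feasible_path: False if the reported path can never execute.
    insignificant:    True if the warning is correct but has no realistic
                      security effect (e.g., it only describes a library
                      function without regard to its use).
    """
    # Correctness comes first: infeasible paths and absent weaknesses are false.
    if not on_feasible_path or weakness_present is False:
        return Category.FALSE
    if weakness_present is None:
        return Category.UNKNOWN
    # Significance only matters for warnings that are correct.
    if insignificant:
        return Category.INSIGNIFICANT
    return Category.TRUE
```

For instance, a warning about unfiltered command-line input used only by an administrator during installation would be passed as `categorize(True, insignificant=True)`.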
Criteria for associating warnings
For each tool warning in the list of selected warnings, find warnings from other tools that refer to the same (or related) weakness. For each selected warning instance, our goal is to find at least one related warning instance (if it exists) from each of the other tools. While there may be many warnings reported by a tool that are related to a particular warning, we do not need to find all of them.
If a warning is not in the list of selected warnings, but it was marked as associated with a selected warning, then its correctness needs to be determined.
There are several degrees of association:
- Equivalent – weakness names are the same or semantically similar; locations are the same, or in case of paths, the source and the sink are the same and the variables affected are the same.
- Strongly related – the paths are similar, where the sinks are the same conceptually (e.g., one tool may report a shorter path than another tool).
- Weakly related – warnings refer to different parts of a chain or composite; weakness names are different but related in some ways, e.g., one weakness may lead to the other, even if there is no clear chain; the paths are different but have a filter location or another important attribute in common.
More specifically, the following criteria apply to weaknesses that can be described using source-to-sink paths. A source is where user input can enter a program. A sink is where the input is used.
- If two warnings have the same sink, but the sources are two different variables, mark them as weakly related.
- If two warnings have the same source and sink, but paths are different, mark them as strongly related. However, if the paths involve different filters, mark them as weakly related.
- If one warning contains only the sink, the other contains a path, and the two warnings refer to the same sink and use a similar weakness name:
  - If there is no ambiguity as to which variable they refer to (and they refer to the same variable), mark them as strongly related.
  - If there are two or more variables affected and there is no way of knowing which variable the warnings refer to, mark them as weakly related.
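The source-to-sink rules above can be written as a small decision function. The sketch below is an assumption-laden illustration: the `PathWarning` record and its fields are invented for this example, the caller is assumed to have already checked that the weakness names are similar, and cases that the rules do not mention return `None` rather than a guess.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class PathWarning:
    """Hypothetical, simplified view of a warning about a source-to-sink path."""
    sink: str                         # e.g. "file1.c:98"
    source: Optional[str] = None      # None if the warning reports only the sink
    path: Tuple[str, ...] = ()        # intermediate nodes between source and sink
    filters: frozenset = frozenset()  # filter locations found on the path
    variable: Optional[str] = None    # None if the affected variable is ambiguous


def associate(a: PathWarning, b: PathWarning) -> Optional[str]:
    """Return the degree of association, or None if the rules do not apply."""
    if a.sink != b.sink:
        return None

    # One warning is sink-only, the other carries a path.
    if (a.source is None) != (b.source is None):
        if a.variable is None or b.variable is None:
            return "weakly related"    # cannot tell which variable is meant
        if a.variable == b.variable:
            return "strongly related"
        return None                    # known, different variables: not covered

    if a.source is None:
        return None                    # both sink-only: outside the path rules

    if a.source != b.source:
        return "weakly related"        # same sink, different sources
    if a.path == b.path:
        return "equivalent"            # same source, sink, and path
    if a.filters != b.filters:
        return "weakly related"        # paths go through different filters
    return "strongly related"          # same source and sink, different paths
```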
Additional criteria for analysis of warnings related to manual findings
Matching tool warnings to the manual findings is complicated by the fact that the tool warnings may be at a lower level. Due to the possibility of a large number of tool warnings per manual finding, do not attempt to find all associated tool warnings for each manual finding.
Tool warnings related to manual findings are in one of the following two categories:
- True - warning is the same or very similar to the manual finding.
- Related true - the warning is somewhat similar to the manual finding. For example, a tool may report the weakness from a different perspective.
Intended summary analysis
We plan to analyze the data collected and present the following in our report:
- Number of warnings by weakness category and weakness severity
- Summaries for the analyzed warnings, e.g., number of true tool warnings by weakness category
- For each manual finding: whether there were any reports with true warnings, any reports with related true warnings, or no reports with true or related true warnings.
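As a rough illustration of these summaries, the following Python sketch tallies analyzed warnings. The record layout (keys `category`, `severity`, `correctness`) and the function names are assumptions for this example and do not reflect the actual SATE analysis scripts or data files.

```python
from collections import Counter


def count_by_category_and_severity(warnings):
    """Number of warnings for each (weakness category, severity) pair."""
    return Counter((w["category"], w["severity"]) for w in warnings)


def true_warnings_by_category(warnings):
    """Number of analyzed warnings marked as true, per weakness category."""
    return Counter(
        w["category"] for w in warnings if w["correctness"] == "true"
    )


def manual_finding_status(labels_by_tool):
    """Summarize one manual finding across tool reports.

    `labels_by_tool` maps a tool name to the labels ('true' or
    'related true') of the warnings matched to the finding.
    A tool with no matched warnings maps to an empty list.
    """
    status = {}
    for tool, labels in labels_by_tool.items():
        if "true" in labels:
            status[tool] = "true"
        elif "related true" in labels:
            status[tool] = "related true"
        else:
            status[tool] = "none"
    return status
```

A finding for which every tool maps to `"none"` is one that no report covered with a true or related true warning.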
SATE output format
In devising the tool output format, we try to capture aspects reported textually by most tools. The output format is based on the format used for SATE 2008.
Summary of changes since last year
The proposed format has these changes compared to the SATE 2008 format:
- An optional attribute "id" for a location. If a tool produces several paths for a weakness, id can be used to differentiate between them. For example, the following describes two paths: the first consists of two nodes and the second consists of three nodes.
<location id="1" path="/dir/file1" line="232"/>
<location id="1" path="/dir/file1" line="98"/>
<location id="2" path="/dir/file1" line="342"/>
<location id="2" path="/dir/file2" line="65"/>
<location id="2" path="/dir/file1" line="98"/>
This is useful, e.g., in cases where a tool provides several source-to-sink paths in a single warning.
<location id="1" path="/dir/file1" line="232">
  <fragment>gets(str1);</fragment>
  <explanation>unbounded write to buffer str1</explanation>
</location>
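A consumer of a report can use the path id to regroup locations into separate paths. The Python sketch below does this with the standard library XML parser. The `<locations>` wrapper element exists only to make the fragment well-formed for this example and is not part of the SATE format; the function name is likewise hypothetical.

```python
import xml.etree.ElementTree as ET

# The two-path example, wrapped in a made-up root element so it parses.
EXAMPLE = """
<locations>
  <location id="1" path="/dir/file1" line="232"/>
  <location id="1" path="/dir/file1" line="98"/>
  <location id="2" path="/dir/file1" line="342"/>
  <location id="2" path="/dir/file2" line="65"/>
  <location id="2" path="/dir/file1" line="98"/>
</locations>
"""


def group_locations_by_path_id(xml_text):
    """Return {path id: [(pathname, line), ...]} in document order."""
    root = ET.fromstring(xml_text)
    paths = {}
    for loc in root.iter("location"):
        # The id attribute is optional; locations without it share the key None.
        node = (loc.get("path"), int(loc.get("line")))
        paths.setdefault(loc.get("id"), []).append(node)
    return paths
```

Calling `group_locations_by_path_id(EXAMPLE)` yields one list of nodes per path id, so the two-node path and the three-node path can be inspected separately.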
Description of the format
In the SATE tool output format, each warning includes:
- Id - a simple counter.
- (Optional) tool specific id.
- One or more locations, where each location has:
- (Optional) id - path id. If a tool produces several paths for a weakness, id can be used to differentiate between them.
- line - line number.
- path - pathname.
- (Optional) fragment - a relevant source code fragment at the location.
- (Optional) explanation - why the location is relevant or what variable is affected.
- Name (class) of the weakness, e.g., “buffer overflow”.
- (Optional) CWE id, where applicable.
- Weakness grade (assigned by the tool):
- Severity on the scale 1 to 5, with 1 - the highest.
- (Optional) probability that the problem is a true positive, from 0 to 1.
- (Optional) tool_specific_rank - tool specific metric - useful if a tool does not use severity and probability. If a team uses this field, it would have to separately provide definition, scale, and possible values.
- Output - original message from the tool about the weakness, either in plain text, HTML, or XML.
- (Optional) An evaluation of the issue by a human; not considered to be part of tool output, including:
- (Optional) correctness - human analysis of the weakness, one of four categories. This attribute should be used instead of the deprecated "falsepositive" attribute.
Download the SATE 2009 XML schema file.
The SATE 2009 format is backward compatible with the SATE 2008 format.
Teams are encouraged to use the schema file for validation, for example:
xmllint --schema sate_2009.xsd tool_report1.xml
Other tool types
Although dynamic tools are beyond the scope of SATE 2009, we intend to include them in future expositions. We invite makers of dynamic tools to participate in the workshop. If they choose to run their tool on the test cases, they may submit their tool output to us. We will not analyze the tool output, but will release it as part of SATE 2009 data.
Selection of test cases
Other ideas for test case selection that may improve analysis:
- Conduct a differential analysis. Select software with known weaknesses that were later fixed, and run the tools against both the earlier version and the fixed version. Focus analysis on the known weaknesses.
- Seed weaknesses, then focus the analysis on the tools’ ability to find the seeded weaknesses
- Select smaller applications
- Select fewer test cases
- Choose one test case (the latest beta version) for each track from the previous SATE, plus 1 or 2 new test cases
- Include a collection of very small synthetic test cases with known weaknesses
Selection of warnings
Other ways to choose the subset of tool warnings
- Randomly choose a fixed number of warnings from each tool report
- Choose a higher portion of the higher severity warnings
- Choose a higher portion of certain weakness classes (e.g., CWE Top 25)
- Analyze warnings for selected modules only
- Representative modules (how to choose representative modules?)
- Most exposed modules (e.g., choose the web front end and database back end, but not the calculation module)