Static Analysis Tool Exposition (SATE) V
The experience workshop will be Friday, March 14, 2014. The planning meeting for SATE V was held on Monday, March 4, 2013 at NIST.
Static Analysis Tool Exposition (SATE) is designed to advance research (based on large test sets) in, and improvement of, static analysis tools that find security-relevant defects in source code. Briefly, participating tool makers run their tools on a set of programs. Researchers led by NIST analyze the tool reports. The results and experiences are reported at a workshop. The tool reports and analysis are made publicly available later.
SATE's purpose is NOT to evaluate or to choose the "best" tools. Rather, it is aimed at exploring the following characteristics of tools: relevance of warnings to security, their correctness, and their prioritization. Its goals are:
- To enable empirical research based on large test sets,
- To encourage improvement of tools,
- To speed adoption of tools by objectively demonstrating their use on real software.
Note. A warning is an issue (usually, a weakness) identified by a tool. A (tool) report is the output from a single run of a tool on a test case. A tool report consists of warnings.
Call for participation
We invite participation from makers of static analysis tools that find weaknesses relevant to security. We welcome commercial, research, and open source tools. Participation is free.
If you want to participate or have any questions, please email Aurelien Delaitre (aure 'at' nist.gov).
Changes from SATE IV
- Tool outputs and our detailed analysis of tool warnings will not be released in order to encourage wider participation. Teams are free to release their own data.
- Teams should run their tools within the Software Assurance Marketplace (SWAMP).
- Teams should provide a Coverage Claims Representation (CCR) of the weaknesses their tool can find.
- We will recognize and encourage sound static analyzers (tools that in theory never report incorrect findings) by introducing the Ockham Criteria.
Steps in the SATE procedure
The following summarizes the steps in the SATE procedure. The dates are subject to change.
- Step 1: Plan and prepare
- Organizers choose test sets
- Teams sign up to participate
- Step 2: Provide test sets to teams (June 3, 2013)
- Step 3: Teams run their tool on the test set(s) within the SWAMP infrastructure
- Teams provide a sample tool report in the SATE output format. Organizers review it and suggest formatting corrections if needed. (July 1, 2013)
- Teams submit their report(s) and CCR (target date: August 1st, deadline: September 3, 2013)
- Teams can withdraw from the exposition prior to this deadline. If a team withdraws, their intention to participate and decision to withdraw will never be disclosed.
- Step 4: Organizers analyze the reports
- Organizers identify tool warnings that match the CVEs
- Organizers select a subset of tool warnings for analysis and share it with teams
- (Optional) Teams return their review of the selected warnings from their tool's reports
- Organizers provide preliminary analysis to the teams (September 16, 2013)
- (Optional) Teams return their corrections to the preliminary analysis (November 16, 2013)
- Organizers provide final analysis to the teams (December 1, 2013)
- Step 5: Organizers and teams report and discuss their experience and observations at a workshop (March 2014)
- Step 6: Teams are encouraged to submit a research report, to be published as part of a NIST Special Publication (SP), describing their experience running the tool, discussion of their tool's results, etc. (by April 2014)
- Step 7: Publish NIST SP, including our research report and research reports submitted by teams (by October 2014)
The exposition consists of 3 language tracks: C/C++, Java, and PHP.
- A test set for each language track, consisting of:
- CVE-selected test cases - pairs of open source programs: a vulnerable version with one or more publicly reported vulnerabilities (CVEs) and a fixed version. We will provide the list of CVEs to teams.
- The fixed version of the CVE-based test cases is used as a test case for the warning subset analysis.
- A set of synthetic test cases (for the C/C++ and Java tracks only).
SATE addresses different aspects of static analysis tools by using complementary kinds of test cases and analysis methods.
- CVE-based test cases focus analysis on real-world exploitable vulnerabilities
- Warning subset analysis helps understand what weaknesses are found by tools in real-world software
- Synthetic test cases contain precisely characterized weaknesses.
Conditions for tool runs and submissions
Teams run their tools and submit reports following specified conditions.
- Teams can participate in any or all of the language tracks.
- Teams cannot modify the code of the test cases, except possibly for comments (e.g., annotations).
- For each test case, teams do one or more runs and submit the report(s).
- Teams are encouraged to do a custom run (e.g., the tool is configured with custom rules). For a custom run, specify the affected settings (e.g., custom rules) in enough detail so that the run can be reproduced independently.
- Teams may do a run that uses the tool in default configuration.
- Teams cannot do any hand editing of tool reports.
- Teams convert the reports to a common XML format. See the SATE output format section for a description of the format.
- Teams are also encouraged to submit the original reports from their tools, in addition to the reports in the common output format.
- Teams specify the environment (including the operating system and version of compiler) in which they ran the tool.
Finding all weaknesses in a reasonably large program is impractical, and given the likely high number of tool warnings, analyzing every warning is impractical as well. Therefore, we select subsets of tool warnings for analysis.
Generally, the analyst first selects issues for analysis, then finds the warnings from tools associated with those issues; the resulting subset of tool warnings is then analyzed.
Methods 1 and 2 below apply to the general programs only. Method 3 applies to the CVE-selected programs. We will perform separate analysis and reporting for the resulting subsets.
Method 1: Statistical subset of tool warnings
Statistically select the same number of warnings from each tool report, assigning higher weight to categories of warnings with higher severity and avoiding categories of warnings with low severity.
This selection method is useful to the tool users because it considers warnings from each tool.
Tool warning selection procedure
In previous SATEs, we selected 30 warnings from each tool report using the following procedure:
- Randomly select one warning from each warning class (identified by a warning name or by CWE id) with severities 1 through 4.
- While more warnings are needed, repeat:
- Randomly select 3 of the remaining warnings (or all remaining warnings if there are fewer than 3 left) from each warning class with severity 1,
- Randomly select 2 of the remaining warnings (or all remaining warnings if there are fewer than 2 left) from each warning class with severity 2,
- Randomly select 1 of the remaining warnings from each warning class (if it still has any warnings left) with severity 3.
- If more warnings are still needed, select warnings from warning classes with severity 4, then from warning classes with severity 5.
If a tool did not assign severity, we assigned severity based on weakness names and our understanding of their relevance to security.
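The selection procedure above can be sketched in Python. This is an illustrative reading of the published steps, not the organizers' actual implementation; the warning representation (dicts with "class" and "severity" keys) and the tie-breaking details are assumptions.

```python
import random
from collections import defaultdict

def select_warnings(warnings, target=30, seed=0):
    """Sketch of the SATE statistical warning-selection procedure.

    `warnings` is a list of dicts with a 'class' (warning name or CWE id)
    and a 'severity' (1 = highest, 5 = lowest).
    """
    rng = random.Random(seed)
    # Group warnings into pools keyed by (class, severity);
    # shuffling each pool makes later pop() calls random picks.
    by_class = defaultdict(list)
    for w in warnings:
        by_class[(w["class"], w["severity"])].append(w)
    for pool in by_class.values():
        rng.shuffle(pool)

    selected = []

    def take(sev, n):
        # Take up to n warnings from each class with the given severity.
        for (cls, s), pool in by_class.items():
            if s == sev:
                for _ in range(min(n, len(pool))):
                    if len(selected) >= target:
                        return
                    selected.append(pool.pop())

    # Step 1: one warning from each warning class with severities 1-4.
    for sev in (1, 2, 3, 4):
        take(sev, 1)
    # Step 2: while more warnings are needed, take 3 / 2 / 1 from each
    # class with severity 1 / 2 / 3 respectively.
    while len(selected) < target:
        before = len(selected)
        take(1, 3)
        take(2, 2)
        take(3, 1)
        if len(selected) == before:   # severities 1-3 exhausted
            break
    # Step 3: fall back to severity 4, then severity 5, if still short.
    for sev in (4, 5):
        while len(selected) < target:
            before = len(selected)
            take(sev, 1)
            if len(selected) == before:
                break
    return selected
```

With fewer total warnings than the target, the procedure simply selects them all.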
Method 2: Select tool warnings related to manually identified weaknesses
Security experts manually analyze the test cases and identify the most important weaknesses (manual findings). They analyze for both design weaknesses and source code weaknesses, focusing on the latter. Since manual analysis combines multiple weaknesses with the same root cause, we anticipate a small number of manual findings, e.g., 10-25 per test case. Special care is taken to confirm that the manual findings are indeed weaknesses. Tools may be used to aid human analysis, but static analysis tools cannot be the main source of manual findings.
Check the tool reports to find warnings related to the manual findings. For each manual finding, for each tool: find at least one related warning, or conclude that there are no related warnings.
This method is useful because it is largely independent of tools and thus includes weaknesses that may not be found by any tools. It also focuses analysis on weaknesses found most important by security experts.
Method 3: Select tool warnings related to the CVEs
For each CVE-selected pair of test cases, check the tool reports to find warnings that identify the CVEs in the vulnerable version. Check whether the warnings are still reported for the fixed version.
This method is useful because it focuses analysis on exploited weaknesses.
SATE V will use the same guidelines as the past two SATEs. The detailed criteria for analysis of correctness and significance and criteria for associating warnings are at http://samate.nist.gov/SATE2010/resources/sate_analysis/AnalysisCriteria.pdf.
The value of a sound static analyzer is that all of its findings can be assumed to be correct, even if it cannot handle enormous programs or does not handle dozens of weakness classes. To recognize sound analyzers, we will report tools that satisfy the SATE V Ockham Sound Analysis Criteria. In brief, the criteria are:
- The tool is claimed to be sound.
- For at least one weakness class and one test case, the tool produces findings for a minimum of 60% of buggy sites OR of non-buggy sites.
- Even one incorrect finding disqualifies a tool for this SATE.
Definitions and details are in the SATE V Ockham Sound Analysis Criteria page.
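For the bug-reporting direction of the criteria, the check might be sketched as follows. This is a simplified illustration only: it treats a "finding" as a flagged site, ignores the symmetric non-buggy-site case, and the official criteria define sites and findings precisely.

```python
def satisfies_ockham(findings, buggy_sites, non_buggy_sites):
    """Simplified sketch of the Ockham coverage and correctness check
    for one weakness class on one test case (bug-reporting direction).

    `findings` is the set of sites the tool flagged; `buggy_sites` and
    `non_buggy_sites` partition the sites of interest.
    """
    # Even one incorrect finding (a flag at a non-buggy site) disqualifies.
    if findings & non_buggy_sites:
        return False
    # The tool must flag a minimum of 60% of the buggy sites.
    covered = len(findings & buggy_sites)
    return len(buggy_sites) > 0 and covered / len(buggy_sites) >= 0.60
```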
Analysis of correctness
Assign one of the following categories to each warning analyzed.
- True security weakness - a weakness relevant to security.
- True quality weakness - requires developer's attention; poor code quality, but may not be relevant to security. Example: buffer overflow where input comes from the user and the program is not run as SUID. Example: "locally true" - a function has a weakness, but the function may always be called with safe parameters.
- True but insignificant weakness. Example: database tainted during configuration. Example: a warning that describes properties of a standard library function without regard to its use in the code.
- Weakness status unknown - unable to determine correctness
- Not a weakness - false
For each tool warning in the list of selected warnings, find warnings from other tools that refer to the same (or related) weakness. For each selected warning instance, our goal is to find at least one related warning instance (if it exists) from each of the other tools. While there may be many warnings reported by a tool that are related to a particular warning, we do not attempt to find all of them. We will use the following degrees of association:
- Equivalent - weakness names are the same or semantically similar; locations are the same, or in case of paths, the source and the sink are the same and the variables affected are the same.
- Strongly related - the paths are similar, where the sinks are the same conceptually (e.g., one tool may report a shorter path than another tool).
- Weakly related - warnings refer to different parts of a chain or composite; weaknesses are different but related in some ways, e.g., one weakness may lead to the other, even if there is no clear chain; the paths are different but have a filter location or another important attribute in common.
Criteria for analysis of warnings related to manual findings
Mark tool warnings related to manual findings with one of the following:
- Same instance.
- Same instance, different perspective.
- Same instance, different paths. Example: different sources, but the same sink.
- Coincidental - tool reports a similar weakness (the same weakness type).
- Other instance - tool reports a similar weakness (the same weakness type) elsewhere in the code.
Intended summary analysis
We plan to analyze the data collected and present the following in our report:
- Number of warnings by weakness category and weakness severity
- Summaries for the analyzed warnings, e.g., number of true tool warnings by weakness category
SATE output format
The SATE V output format is the same as the SATE 2010 format. SATE 2008 and 2009 outputs are subsets and are therefore compliant with the latest version.
In the SATE tool output format, each warning includes:
- Id - a simple counter, unique within the report.
- (Optional) tool specific id.
- One or more locations, where each location has:
- (Optional) id - path id. If a tool produces several paths for a weakness, id can be used to differentiate between them.
- line - line number.
- path - file path.
- (Optional) fragment - a relevant source code fragment at the location.
- (Optional) explanation - why the location is relevant or what variable is affected.
- Name (class) of the weakness, e.g., "buffer overflow".
- (Optional) CWE id, where applicable.
- Weakness grade (assigned by the tool):
- Severity on a scale of 1 to 5, with 1 being the highest.
- (Optional) probability that the problem is a true positive, from 0 to 1.
- (Optional) tool_specific_rank - a tool-specific metric, useful if a tool does not use severity and probability. If a team uses this field, it must separately provide its definition, scale, and possible values.
- Output - original message from the tool about the weakness, either in plain text, HTML, or XML.
- (Optional) An evaluation of the warning by a human; not considered to be part of tool output, including:
- (Optional) correctness - human analysis of the weakness, one of several categories. Use this instead of the deprecated "falsepositive" attribute.
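Assembled, a single warning in this format might look roughly as follows. The element and attribute names here are illustrative guesses based on the field list above; the authoritative structure is defined by the SATE XML schema.

```xml
<weakness id="1" tool_specific_id="BO-42">
  <location id="1" line="102" path="src/http/parser.c">
    <fragment>strcpy(dst, src);</fragment>
    <explanation>dst is a fixed-size stack buffer</explanation>
  </location>
  <name cweid="121">buffer overflow</name>
  <grade severity="1" probability="0.8"/>
  <output>Copying unbounded input into dst may overflow the buffer.</output>
  <evaluation correctness="true security weakness"/>
</weakness>
```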
The latest SATE XML schema file is available for download.
Teams are encouraged to use the schema file for validation, for example:
xmllint --schema sate5.pathcheck.xsd tool_report1.xml