SAMATE Reference Dataset
This page discusses some of the design issues and decisions of the SAMATE Reference Dataset (SRD). The SRD user interface and its manual are on-line, and the SRD evolves quickly. We appreciate and acknowledge those who contributed test cases to the SRD.
The purpose of the SRD is to provide consumers, researchers, and developers with a set of known weaknesses. This will allow consumers to evaluate tools and developers to test their methods. The dataset will encompass a wide variety of flaws, languages, platforms, and compilers. We want the dataset to become a broad effort, with examples contributed by many sources.
Within the SRD are relatively small, explicit sets designated for specific use, like a minimum benchmark for Web penetration testers. The SRD is much more than a group of (vetted) test sets.
The dataset will eventually cover all phases of software: from initial concept, through design and implementation, to acceptance, deployment, and operation. Thus, although we talk mostly about source code, the dataset will eventually have models, designs, running programs, etc.
The SRD manual includes bug reports and suggestions for enhancements and improvements.
Conceptually the dataset has four parts.
- Test cases: samples of designs, source code, binaries, etc. with known flaws.
- Corresponding samples with the flaws fixed.
- Metadata to label each case, including the following (see the sketch after this list):
- Contributor, source code (for binary or executable cases), remediation, etc.
- Location(s) and type(s) of flaw(s), compiler or platform where it occurs, etc.
- Drivers, stubs, include files, declarations, "make" files, etc.
- Input demonstrating the flaw and expected result.
- Comments and observations
- Test suites consisting of sets of test cases, each suite selected for a particular purpose. For instance, one test suite might be a small set used to verify that the scripts, the tool, and the installation all work.
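To make these parts concrete, here is a hypothetical C test case with its metadata folded into a comment header. This is a sketch only: the field names and values are illustrative, not the SRD's actual schema, and the case is not an SRD entry. (The CWE name shown follows the older naming used around CWE version 2.1.)

```c
/*
 * Hypothetical SRD-style test case -- the fields below are illustrative,
 * not the SRD's actual metadata schema.
 *
 *   Contributor : (example contributor)
 *   Language    : C
 *   Weakness    : CWE-134: Uncontrolled Format String, at the printf() call
 *   Bad input   : a command line argument containing conversion
 *                 specifiers, e.g. "%s%s%s"
 *   Expected    : reads from (or, with %n, writes to) unintended memory
 *   Fixed by    : a companion case calling printf("%s", argv[1]) instead
 */
#include <stdio.h>

int main(int argc, char *argv[])
{
    if (argc > 1)
        printf(argv[1]);   /* flaw: attacker-controlled format string */
    return 0;
}
```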
The SRD Acknowledgements page lists many of the people and groups who contributed test cases. We appreciate them and acknowledge their work and generosity. These test cases represent considerable intellectual effort to reduce reported vulnerabilities to examples, classify them, generate elaborations of particular flaws, come up with corresponding correct examples, etc.
The Acknowledgements also have more details about the test cases, their sources, links to papers explaining them, and other information.
The SRD will be a huge repository of test cases. Where should a test case come from? What will provide more useful feedback? Although most test cases will be in someone's test suite, there is room for test cases that are just interesting or informative.
Source of Test Cases
- Wild Code: Sampling code with known bugs from industry and open source software allows SRD users to construct a test set that encompasses real bugs found in software. By comparing the version of the code with the known vulnerability to the "patched" version, we can generate a buggy and a corresponding correct example. Several studies have used open source software to compare the error detection rates of leading static analysis (SA) tools. Some of the resulting concerns are:
- Is the program too large to use in its entirety? If so, how much effort is required to extract the buggy components and the corresponding fixed components?
- What licensing restrictions apply?
- Artificial Code: Constructing code samples that illustrate possible bugs and vulnerabilities would be one effective way of creating a reference dataset. By selecting elements from a taxonomy of flaws and vulnerabilities, one could code examples and produce a set of source code that covers a wide range of bugs.
- How can we ensure that they are sufficiently complex?
- Are there any effective methods for generating test cases without having to write them individually?
- Academic Code: Code samples collected from computer science and programming courses constitute a large set of similar programs, which can be useful as a source of multiple correct and buggy programs. These could be extremely useful for analyzing false positives, since they provide many correct code samples that vary in structure.
- Generated Code: It has been argued that test cases should be generated on the fly. There are dozens or hundreds of minor variants that could be generated automatically: interprocedural flow, in-lined subroutines, flaws within loops, obfuscation techniques, etc. It might be best to have templates that can be elaborated into actual test cases, as in the sketch below. The advantage is that the templates and the generation techniques can be developed independently.
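As a sketch of what such generated variants might look like (hand-written here, not the output of any actual generator), one "unbounded copy" template can be elaborated both directly and interprocedurally:

```c
#include <string.h>

/* Variant 1: direct -- the unbounded copy happens where the data arrives. */
void variant_direct(const char *input)
{
    char buf[10];
    strcpy(buf, input);          /* flaw: no bound on the copy */
}

/* Variant 2: interprocedural -- the same flaw, reached through a helper,
 * which exercises a tool's cross-function data flow tracking. */
static void helper(char *dst, const char *src)
{
    strcpy(dst, src);            /* flaw: no bound on the copy */
}

void variant_interprocedural(const char *input)
{
    char buf[10];
    helper(buf, input);
}
```

A template-based generator could elaborate many more such variants from the one template: copies inside loops, in-lined helpers, obfuscated pointer arithmetic, and so on.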
Types of Cases
What types of cases should be included?
- Static Cases: code that is examined rather than executed.
- Source code: code in C, C++, Java, etc.
- "Binary" or executable code
- Intermediate: Java byte code, Pascal p-code, etc.
- Dynamic Cases: buggy running applications. These are needed for penetration testing, web interface testing, etc.
- The internal Malware Research Protocols page has ideas for reducing the chance of accidentally running an intentionally insecure application.
- Design and Higher Level Cases: Security must start at the highest levels: requirements and design. For processes, methodologies, design-checking tools, application generators, etc., there need to be example designs, models, and the like.
- What is an appropriate size for a test case? Should test cases be short and simply illustrate the vulnerability (e.g., an off-by-one index into a buffer; see the sketch below), or should they be longer and more complex?
- Would whole applications such as Mozilla or Apache be useful?
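At the "short and simple" end of the spectrum, a complete test case can be only a few lines. The following hypothetical example (not an SRD case) exists solely to exhibit one off-by-one write:

```c
/* Hypothetical minimal test case: its only purpose is the one
 * off-by-one write below. */
int main(void)
{
    int a[10];
    for (int i = 0; i <= 10; i++)   /* flaw: i == 10 writes past the end of a */
        a[i] = 0;
    return 0;
}
```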
Pristine vs. Dirty Samples
Many test cases are designed to demonstrate one error but have additional errors. For instance, test case 1777 illustrates a hard-coded password, but it also uses gets(), which allows a buffer overflow.
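Test case 1777 itself is not reproduced here, but a hypothetical fragment of the same shape shows how the intended weakness (the hard-coded password) can be accompanied by an unintended second one (the unbounded gets()). Note that gets() was removed in C11, so this fragment assumes an older compiler, which is itself the kind of portability wrinkle real code exhibits.

```c
#include <stdio.h>
#include <string.h>

/* Hypothetical fragment in the spirit of test case 1777, not the actual code. */
int authenticate(void)
{
    char entered[16];
    gets(entered);                              /* unintended extra flaw: unbounded read */
    return strcmp(entered, "mypassword") == 0;  /* intended flaw: hard-coded password */
}
```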
Should test cases be absolutely the cleanest, most excellent code with great style (except for the weakness)? This minimizes confounding concerns.
Or should the test case be pretty typical, with style faults, non-portable code, even other weaknesses or compile time errors? Real code looks like this.
We've decided the examples should at least compile, so they can be executed, and shouldn't have other security weaknesses. Poor style, design, or commenting is not forbidden.
Flawed and Fixed Samples
Test suites should encompass a variety of flawed code, but they should also have corresponding fixed code for each test case. These fixed versions are particularly important for testing false-positive rates. In most cases there will be many different possible solutions. How many different solutions would be appropriate?
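A minimal sketch of such a flawed/fixed pair (hypothetical, not an SRD entry) might look like this; the "fixed" version shown is only one of several defensible repairs:

```c
#include <stdio.h>
#include <string.h>

/* Flawed version: the copy into 'buf' is unbounded (stack buffer overflow). */
void greet_bad(const char *name)
{
    char buf[16];
    strcpy(buf, name);                   /* flaw: name may exceed 16 bytes */
    printf("Hello, %s\n", buf);
}

/* One possible fixed version: bound the copy and terminate explicitly.
 * Rejecting overlong input or allocating dynamically would be equally
 * valid fixes -- which is exactly the "how many solutions?" question. */
void greet_good(const char *name)
{
    char buf[16];
    strncpy(buf, name, sizeof buf - 1);
    buf[sizeof buf - 1] = '\0';
    printf("Hello, %s\n", buf);
}
```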
Naming Convention of Weakness - CWE Compatibility
A "bad" test case contains at least one weakness. The offending code is highlighted in the display of a test case. The name of weakness is shown in the form of CWE naming convention. That is a CWE Identification number followed by an associated Name, e.g. "CWE-121: Stack Based Buffer Overflow". Currently, the weakness names are based on CWE version 2.1. In the case the associated CWE full name is too long to be displayaed in a SRD screen, a shortened name will be chosen. However, CWE Identification number will keep intact.
In the SRD, the CWE weakness name appears everywhere a weakness needs to be shown, e.g., in the screen display of a test case, in the list of test cases, in the zipped output file of test cases, etc. The complete list of CWE weaknesses used in the SRD is provided on-line.
The CWE weakness name is also used as a search criterion to find test cases. The Extended Search page lists all the weaknesses present in the SRD. The user can fetch all the test cases with a specific weakness.
Search Test Cases via Weakness (CWE ID)
On the test case search screen, click on the Extended Search tab. Two ways to search for test cases using CWE identifiers are then presented:
On the right portion of the screen, expand the CWE weaknesses under "Weakness". A hierarchical tree of the CWE-identified software weaknesses that the SRD supports is displayed. Selecting one of the CWE identifiers adds that weakness as a search criterion; it appears in the "Weakness" search field on the left portion of the screen. Fill in the other search fields on this screen, if applicable. Clicking "Search Test Cases" returns a list of test cases that meet the search criteria.
- How complex should solutions be? For instance, some solutions could fix the code, but in the process make it very convoluted.
- How obfuscated should the bugs be?
- What is reasonable or unreasonable for current tools?
- Should examples be pristine or should poor code be allowed?
- If the poor practice is flagged as another weakness, it should probably be fixed. If it is just ugly code, it might be left.
Test Case Permanence
Who can change test cases or test suites? When? Why?
To have long-term value, the content of a test case is "write once". That is, once source code or a binary is added to the SRD, it keeps the same name and never changes. This permanence allows research work to refer to, say, SRD test case 1552, knowing that exactly that code can always be retrieved. Later work can reliably get exactly what was used before.
What if there is a mistake in the code, for instance, there is a second, unintended weakness? The test case can be marked Deprecated and a reference made to a corrected version. Deprecated test cases should not be used for new work. They remain in the SRD as a reference to recheck old work.
The metadata associated with a test case could change. For instance, the description could be expanded or corrected. It may be useful to have a history of such changes so users can see if metadata has changed, what the changes are, and who did them.
Test suites are similarly "write once". Once they are designated, they should not change. A test suite might be superseded by an improved test suite that refers to test cases conforming to the latest language standard or has better coverage.
The presumption is that test cases and test suites are examined by their authors before being submitted. If there is substantial doubt about the suitability or correctness of a test case, it should be resolved before submission. Test cases should be deprecated rarely. On the other hand, descriptive data about test cases, the "metadata", need not be vetted quite as rigorously.
Test Suite Selection
A test suite is a set of test cases explicitly picked for some purpose. For instance, "Minimum Pen Test Benchmark" might consist of cases 1053, 10277, 10278, ... (the case numbers are made up).
Using test suites in the SRD allows different people to pick different sets for different reasons. Test suite designers should consider many questions, for instance:
- Should there be mostly artificial test cases to cover a wide range of exploits, or should test cases be sampled from "the wild"? Would a mix of multiple sources be most appropriate?
- Should the distribution of bugs in the test suite correspond to the distribution of actual bugs found in software, or should more focus be put on covering as many types of bugs as possible?
- If we are only sampling from vulnerabilities that survived an internal debugging process, would they be valid for evaluating a tool's ability to catch all bugs or simply the "hardest"?
- Does performance on artificial test cases reflect a tool's ability to perform on real code?
- Here are some questions about using academic cases:
- Will this code provide an accurate means for evaluating SA tools?
- Are the errors and bugs introduced while learning the same as those in professionally produced software?
- Is the course work sufficiently complex to be used to evaluate tools that are intended for industry software?
- What range of concepts (threads, networking, cryptography, etc) does the course work cover compared to "real" software?
- There is a big advantage (and a disadvantage) to using dynamically generated cases.
- The advantage is that dynamically generated code is less susceptible to being "gamed". With a fixed test suite, a tool could have a table of all the cases and the "correct" responses; the tool would then always get 100% on the standard test. Even simple dynamic variable and function renaming would defeat such simple cheats.
- The disadvantage is that the generation technique itself would have to be qualified as part of the test methodology. Of course, qualifying each case generated might be infeasible, too.
- What types of vulnerabilities should be selected? What languages should be the primary focus? What compilers, platforms, and applications (servers, browsers, etc) would be most useful for testing purposes? If a piece of code is buggy only on certain compilers, should it be included or should errors introduced by a compiler be ignored?
Flaw Classes and Code Complexity
SRD Test Case Status - What it Means
What does it mean when an SRD test case is labeled "candidate"? What is the quality of that test case? Has it been reviewed or vetted in any way? What constitutes an "accepted" test case? And what if a test case is found to be incorrect or of poor quality? We provide information explaining what the status tag assigned to each test case tells you.
Other NIST Reference Datasets
NIST is developing Computer Forensic Reference Data Sets (CFReDS) for digital evidence. These reference data sets provide an investigator with documented sets of simulated digital evidence for examination.
Statistical Reference Datasets (StRD) are "reference datasets with certified values for a variety of statistical methods."
Other datasets, handbooks, and reference material are available to help "by mathematical modeling, design of methods, transformation of these methods into efficient numerical algorithms for high-performance computers and the implementation of these methods into high-quality mathematical software."