The SAMATE Project Department of Homeland Security

SARD Design Issues

This page discusses some of the design issues and decisions of the Software Assurance Reference Dataset (SARD). SARD access and its manual are on-line. The SARD changes often. We appreciate and acknowledge those who contributed test cases to the SARD.

Goals

The purpose of the SARD is to provide consumers, researchers, and developers with a set of known weaknesses. This will allow consumers to evaluate tools and developers to test their methods. The dataset will encompass a wide variety of flaws, languages, platforms, and compilers. We want the dataset to become a broad effort, with examples contributed by many sources.

Within the SARD are relatively small, explicit test suites designated for specific use, like a minimum benchmark for Web penetration testers.

The dataset will eventually cover all phases of software: from initial concept, through design and implementation, to acceptance, deployment, and operation. Thus, although we talk mostly about source code, the dataset will eventually have models, designs, running programs, etc.

The SARD manual includes bug reports and suggestions for enhancements and improvements.

Dataset Structure

Conceptually the dataset has three parts.

  • Test cases: samples of designs, source code, binaries, etc. with known flaws. Some have corresponding cases with the flaws fixed.
  • Metadata to label each case with
    • Contributor, source code (for binary or executable cases), remediation, etc.
    • Location(s) and type(s) of flaw(s), compiler or platform where it occurs, etc.
    • Drivers, stubs, include files, declarations, "make" files, etc.
    • Input demonstrating the flaw and expected result.
    • Comments and observations
  • Test suites consisting of sets of test cases, each selected for a particular purpose. For instance, one test suite might be a small set to test that the scripts and so forth work for the tool and installation.
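The three conceptual parts above can be sketched as plain records. This is only an illustration of the relationships, not the SARD's actual schema: the class names, field names, and ids below are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Flaw:
    cwe_id: int   # type of flaw, e.g. 121
    file: str     # file containing the flaw
    line: int     # location of the flaw in that file


@dataclass
class CaseRecord:
    """One test case together with its metadata (hypothetical layout)."""
    case_id: int
    files: Tuple[str, ...]            # source, binaries, drivers, stubs, make files
    contributor: str
    flaws: Tuple[Flaw, ...] = ()      # empty for a case with the flaws fixed
    platform: str = ""                # compiler or platform where the flaw occurs
    demo_input: str = ""              # input demonstrating the flaw
    expected_result: str = ""
    comments: List[str] = field(default_factory=list)
    fixed_by: Optional[int] = None    # id of the corresponding fixed case, if any


@dataclass
class SuiteRecord:
    """A set of test cases selected for a particular purpose."""
    suite_id: int
    purpose: str
    case_ids: Tuple[int, ...]


def cases_in_suite(suite, cases):
    """Return the test cases a suite selects, in the suite's order."""
    by_id = {c.case_id: c for c in cases}
    return [by_id[i] for i in suite.case_ids]


# Made-up example: a flawed case, its fixed counterpart, and a small suite.
bad = CaseRecord(1, ("bad.c",), "example contributor",
                 flaws=(Flaw(121, "bad.c", 12),), fixed_by=2)
good = CaseRecord(2, ("good.c",), "example contributor")
smoke = SuiteRecord(100, "check that scripts and installation work", (1, 2))
```

Keeping suites as lists of case ids, rather than copies of the cases, lets many suites share the same test cases.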

Acknowledging Contributors and Describing Groups of Test Cases

The SARD acknowledgments and test case description page lists many of the people and groups who contributed test cases. We appreciate them and acknowledge their work and generosity. These test cases represent considerable intellectual effort to reduce reported vulnerabilities to examples, classify them, generate elaborations of particular flaws, come up with corresponding correct examples, etc.

The acknowledgments also have more details about the test cases, their sources, links to papers explaining them, and other information.

Dataset Composition

The SARD is a huge repository of over 80 000 test cases.

Source of Test Cases

  • Wild Code: Sampling code with known bugs from industry and open source software lets SARD users construct a test set of real bugs found in software. Comparing the version of the code that has the known vulnerability with the "patched" version yields both a buggy and a correct example.
  • Artificial Code: Constructing code that illustrates bugs and vulnerabilities is one way to create test cases. By selecting elements from a taxonomy of flaws and vulnerabilities, researchers have produced sets of source code that cover a wide range of weaknesses.
  • Academic Code: Code collected from computer science and programming courses constitutes a large set of similar programs, which can be useful as a source of multiple correct and buggy programs. These could be extremely useful for analyzing false positives, since they provide many correct test cases that vary in structure.
  • Generated Code: It has been argued that test cases should be generated on the fly. Dozens or hundreds of minor variants could be generated automatically, for example with interprocedural flow, in-lined subroutines, flaws within loops, or obfuscation techniques. It might be best to have templates that can be elaborated into actual test cases. The advantage is that the templates and generation techniques can be developed independently.
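As a rough sketch of the template idea in the last item, the following elaborates one C template into three variants of the same flaw: a direct call, a call within a loop, and an interprocedural call. The template text, the variant names, and the generate_cases helper are all hypothetical; a real generator would need far more care.

```python
from string import Template

# Hypothetical C template. $helper holds optional helper functions and
# $call is the statement through which the flawed copy is reached.
TEMPLATE = Template("""\
#include <string.h>
$helper
void test(const char *src) {
    char buf[8];
    $call
}
""")

# Each variant describes one way of reaching the same unchecked strcpy().
VARIANTS = {
    "direct": {
        "helper": "",
        "call": "strcpy(buf, src);",
    },
    "in_loop": {
        "helper": "",
        "call": "for (int i = 0; i < 1; i++) { strcpy(buf, src); }",
    },
    "interprocedural": {
        "helper": "static void copy(char *d, const char *s) { strcpy(d, s); }",
        "call": "copy(buf, src);",
    },
}


def generate_cases(template, variants):
    """Elaborate the template once per variant; returns name -> C source text."""
    return {name: template.substitute(fields) for name, fields in variants.items()}


cases = generate_cases(TEMPLATE, VARIANTS)
```

Because the template and the variant table are separate, either one can be extended without touching the other.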

Types of Cases

What types of cases should be included?

  • Static Cases: This is code which is examined.
    • Source code: code in C, C++, Java, etc.
    • "Binary" or executable code
    • Intermediate: Java byte code, Pascal p-code, etc.
  • Dynamic Cases: Buggy applications. These are needed for penetration testing, web interface, etc.
    • The internal Malware Research Protocols page has ideas for reducing the chance of accidentally running an intentionally insecure application.
  • Design and Higher Level Cases: Security must start at the highest levels: requirements and design. For processes, methodologies, design checker tools, app generators, etc. there need to be example designs, etc.

Other Attributes

Code Size

  • What is an appropriate size for a test case? Short test cases clearly illustrate a vulnerability and are more practical for broad coverage. However, the SARD must also have large, complex cases to clearly represent production code and to investigate concerns of scaling.
  • Whole applications, such as Mozilla, Apache, or Wireshark, are useful.

Pristine vs. Dirty Samples

Many test cases are designed to demonstrate one error, but have additional errors. For instance, test case 1777 illustrates a hardcoded password, but it also uses gets(), which allows a buffer overflow.

Test cases that are absolutely the cleanest, most excellent code with great style (except for the weakness) minimize confounding concerns. These are useful for basic research and instruction.

Typical test cases, with style faults, non-portable code, even other weaknesses or compile time errors, most resemble real code.

We believe test cases should at least compile, so they can be executed, and shouldn't have extraneous weaknesses. Test Case Status has details of our review process. Poor style, design, or commenting is not forbidden.

Flawed and Fixed Samples

Test suites should encompass a variety of flawed code, but should also have corresponding fixed code for test cases. Fixed cases are important for measuring false-positive rates. Often a flaw can be fixed in many different ways, so a flawed case may have several fixed versions.
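To show why the fixed cases matter, here is a minimal sketch of scoring a tool against paired flawed and fixed cases. The function name, the case ids, and the tool reports are made up for illustration; this is not an official SARD metric.

```python
def score_tool(reported_ids, flawed_ids, fixed_ids):
    """Compute detection and false-positive rates for one weakness.

    reported_ids: ids of cases where the tool reported the weakness.
    flawed_ids:   ids of cases known to contain the weakness.
    fixed_ids:    ids of the corresponding cases with the weakness fixed.
    """
    if not flawed_ids or not fixed_ids:
        raise ValueError("need at least one flawed and one fixed case")
    true_positives = len(reported_ids & flawed_ids)
    false_positives = len(reported_ids & fixed_ids)
    return {
        "true_positive_rate": true_positives / len(flawed_ids),
        "false_positive_rate": false_positives / len(fixed_ids),
    }


# Made-up example: the tool flags three of four flawed cases
# and wrongly flags one of four fixed cases.
flawed = {1, 2, 3, 4}
fixed = {11, 12, 13, 14}
reports = {1, 2, 3, 11}
result = score_tool(reports, flawed, fixed)
```

Without the fixed cases, a tool that flags every file would appear to have a perfect detection rate.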

Naming Convention of Weakness - CWE Compatibility

A "bad" test case contains at least one weakness. The offending line(s) of code are highlighted in the display of a test case. The weakness is designated using Common Weakness Enumeration (CWE) entries, that is, a CWE identification number followed by an associated name, e.g. "CWE-121: Stack-based Buffer Overflow". Currently, the weakness names are based on CWE version 2.1. If the associated CWE full name is too long to be displayed, a shortened name is used. However, the CWE identification number is kept intact.

In the SARD, the CWE weakness name appears wherever a weakness is shown, e.g. in the screen display of a test case, in the list of test cases, and in the zipped output file of test cases. The complete list of CWE weaknesses used in the SARD is provided on-line.

The CWE weakness name is also used as a search criterion to find test cases. The Extended Search page lists all the weaknesses present in the SARD. The user can fetch all the test cases with a specific weakness.
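The naming convention can be sketched as follows. The actual rule the SARD uses to shorten long names is not described here, so the truncation below, the 40-character limit, and both function names are assumptions for illustration only. The one property taken from the text is that the CWE identification number is never shortened.

```python
def cwe_label(cwe_id, name, max_len=40):
    """Build a display label such as 'CWE-121: Stack-based Buffer Overflow'.

    If the label would exceed max_len characters, only the name is
    shortened (assumed rule: truncate and add '...'); the id stays intact.
    """
    prefix = f"CWE-{cwe_id}: "
    room = max_len - len(prefix)
    if len(name) > room:
        name = name[: room - 3].rstrip() + "..."
    return prefix + name


def cases_with_weakness(case_weaknesses, cwe_id):
    """Return sorted ids of the test cases that carry a given CWE id.

    case_weaknesses: mapping of test case id -> collection of CWE ids.
    """
    return sorted(cid for cid, cwes in case_weaknesses.items() if cwe_id in cwes)


short = cwe_label(121, "Stack-based Buffer Overflow")
long_label = cwe_label(
    78,
    "Improper Neutralization of Special Elements used in an OS Command "
    "('OS Command Injection')",
)
```

Searching by the numeric id rather than by the displayed name means a shortened name never affects which test cases are found.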

Test Case Permanence

Who can change test cases or test suites? When? Why?

To have long-term value, the content of a test case is "write once". That is, once source code or a binary is added to the SARD, it keeps the same name and never changes. This permanence allows research work to refer to, say, SARD test case 1552, knowing that the exact code can always be retrieved. Later work can reliably get exactly what was used before.

What if there is a mistake in the code, for instance, there is a second, unintended weakness? The test case can be marked Deprecated and a reference made to a corrected version. Deprecated test cases should not be used for new work. They remain in the SARD as a reference to recheck old work.

The metadata associated with a test case could change. For instance, the description could be expanded or corrected. It may be useful to have a history of such changes so users can see if metadata has changed, what the changes are, and who did them.

Test suites are similarly "write once". Once they are designated, they should not change. A test suite might be superseded by an improved test suite, which refers to test cases conforming to the latest language standard or has better coverage.
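The permanence rules above (content is write once, mistakes are handled by deprecation, metadata may change with a history) can be sketched as a small in-memory store. The class and method names are hypothetical and say nothing about how the SARD is actually implemented.

```python
class WriteOnceStore:
    """Toy store: case content never changes, metadata may, with a history."""

    def __init__(self):
        self._content = {}      # case id -> bytes, never overwritten
        self._replacement = {}  # deprecated case id -> corrected case id
        self.metadata = {}      # case id -> dict of descriptive data
        self.history = []       # (case id, key, old value, new value, who)

    def add(self, case_id, content):
        if case_id in self._content:
            raise ValueError(f"test case {case_id} exists and cannot change")
        self._content[case_id] = bytes(content)
        self.metadata[case_id] = {}

    def get(self, case_id):
        """Deprecated cases stay retrievable so old work can be rechecked."""
        return self._content[case_id]

    def deprecate(self, case_id, replacement_id):
        if case_id not in self._content or replacement_id not in self._content:
            raise KeyError("both cases must already be in the store")
        self._replacement[case_id] = replacement_id

    def replacement(self, case_id):
        """Return the corrected case id, or None if the case is not deprecated."""
        return self._replacement.get(case_id)

    def set_metadata(self, case_id, key, value, who):
        old = self.metadata[case_id].get(key)
        self.metadata[case_id][key] = value
        self.history.append((case_id, key, old, value, who))


store = WriteOnceStore()
store.add(1, b"code with an unintended second weakness")
store.add(2, b"corrected code")
store.deprecate(1, 2)
store.set_metadata(2, "description", "corrected version of case 1", "reviewer")
```

A correction is always a new test case with a new id; the deprecated case only gains a pointer to it.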

The presumption is that test cases and test suites are examined by their authors before being submitted. If there is substantial doubt about the suitability or correctness of a test case, it should be resolved before submission. Test cases should be deprecated rarely. On the other hand, descriptive data about test cases, the "metadata", need not be vetted quite as rigorously.

Test Suite Selection

A test suite is a set of test cases explicitly picked for some purpose. For instance, Test Suite #45 consists of 75 cases to test a "source code security analyzer based on functional requirements SCA-RM-1 through SCA-RM-5 specified in 'Source Code Security Analysis Tool Functional Specification'".

Using test suites in the SARD allows different people to pick different sets for different reasons. Test suite designers should consider many questions, for instance:

  • Should there be mostly artificial test cases to cover a wide range of exploits, or should test cases be sampled from "the wild"? Would a mix of multiple sources be most appropriate?
  • Should the distribution of bugs in the test suite correspond to the distribution of actual bugs found in software, or should more focus be put on covering as many types of bugs as possible?
  • If we are only sampling from vulnerabilities that survived an internal debugging process, would they be valid for evaluating a tool's ability to catch all bugs or simply the "hardest"?
  • Does performance on artificial test cases reflect a tool's ability to perform on real code?
  • Here are some questions about using academic cases:
    • Will this code provide an accurate means for evaluating SA tools?
    • Are the errors and bugs introduced while learning the same as those in professionally produced software?
    • Is the course work sufficiently complex to be used to evaluate tools that are intended for industry software?
    • What range of concepts (threads, networking, cryptography, etc) does the course work cover compared to "real" software?
  • There is a big advantage (and a disadvantage) to using dynamically generated cases.
    • The advantage is that dynamically generated code is less susceptible to being "gamed". With a fixed test suite, a tool could have a table of all the cases and the "correct" responses. Then the tool always gets 100% on the standard test. Even simple dynamic variable and function renaming would defeat simple cheats.
    • The disadvantage is that the generation technique itself would have to be qualified as part of the test methodology. Of course, qualifying each case generated might be infeasible, too.
  • What types of vulnerabilities should be selected? What languages should be the primary focus? What compilers, platforms, and applications (servers, browsers, etc) would be most useful for testing purposes? If a piece of code is buggy only on certain compilers, should it be included or should errors introduced by a compiler be ignored?
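The "gaming" concern raised in the list above can be illustrated with a toy example. The cheating tool below recognizes a published test case by a hash of its text and looks up the expected answer; renaming one variable is enough to make the lookup miss. The C snippet, the function names, and the regex-based renaming are all hypothetical, and the renaming is deliberately naive (it is not aware of C strings or comments).

```python
import hashlib
import re


def fingerprint(source):
    """Hash of the exact source text."""
    return hashlib.sha256(source.encode("utf-8")).hexdigest()


def rename_identifier(source, old, new):
    """Rename whole-word occurrences of an identifier (naive, text based)."""
    return re.sub(rf"\b{re.escape(old)}\b", new, source)


# A made-up published test case and the answer a cheater would memorize.
PUBLISHED_CASE = "void f(char *s) { char buf[8]; strcpy(buf, s); }"
ANSWER_TABLE = {fingerprint(PUBLISHED_CASE): "CWE-121"}


def cheating_tool(source):
    """'Analyze' code by table lookup only; returns None for unknown text."""
    return ANSWER_TABLE.get(fingerprint(source))


variant = rename_identifier(PUBLISHED_CASE, "buf", "dest")
known_answer = cheating_tool(PUBLISHED_CASE)   # found in the table
variant_answer = cheating_tool(variant)        # same flaw, but not in the table
```

The variant has exactly the same weakness, so a tool that really analyzes the code should report it in both versions.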

Flaw Classes and Code Complexity

Our current flaw classes and code complexities are still available as XML files. You can download the current XML Flaw Class Tree and the XML of code complexities.

SARD Test Case Status - What it Means

What does it mean when an SARD test case is labeled "candidate"? What is the quality of that test case? Has it been reviewed or vetted in any way? What constitutes an "accepted" test case? And what if a test case is found to be incorrect or of poor quality? We provide information explaining what the status tag assigned to each test case tells you.

Other NIST Reference Datasets

NIST is developing Computer Forensic Reference Data Sets (CFReDS) for digital evidence. These reference data sets provide an investigator with documented sets of simulated digital evidence for examination.

Statistical Reference Datasets (StRD) are "reference datasets with certified values for a variety of statistical methods."

Other datasets, handbooks, and reference material are available to help "by mathematical modeling, design of methods, transformation of these methods into efficient numerical algorithms for high-performance computers and the implementation of these methods into high-quality mathematical software."

Other (Non-NIST) Assurance Tool Test Collections

SAMATE is compiling a list of other assurance tool test suites and benchmarks.