Metrics and Measures
Metrics vs. Measures
The terms metric and measure overlap. We use measure for more concrete or objective attributes and metric for more abstract, higher-level, or somewhat subjective attributes. For instance, lines of code (LOC) is a measure: it is objective and concrete. (Unfortunately, LOC varies wildly across different implementations of the same algorithm [Schofield 2005]; function points (FP) are much better.) Robustness, quality (as in "high quality"), and effectiveness are important attributes for which we have some consistent feel, but which are hard to define objectively; thus they are metrics. Measures, such as faults/FP, are the bases for metrics: they help us approximate less tangible metrics.
Note that the above definitions differ from those in the International Vocabulary of Metrology ... (VIM), 3rd edition, which defines measurement as the
- process of experimentally obtaining one or more quantity values that can reasonably be attributed to a quantity
Bryce Ragland's Measure, Metric, or Indicator: What's the Difference? (CrossTalk, March(?) 1995) is helpful. We would call "body temperature" in his example a measure and "health" a metric. We would probably call body temperature trend a measure, too.
There is a white paper on Metrology for Information Technology (IT), NIST IR 6025, May 1997. In particular, it defines "measurement", "reference data", "reference implementation", "test", and "testing" (see Figure 3).
Martha Gray's Applicability of Metrology to Information Technology, J. Res. Natl. Inst. Stand. Technol. 104(6):567-578, Nov-Dec 1999 gives excellent background, examples, and references to standard definitions.
Jeff Nyman, in The Metric vs. The Measure (Globaltester, 2005), points out that a metric must be useful: roughly, we must be able to imagine process changes that would achieve better metric scores.
In the Sept 1999 version of "Methodology for Evaluation of Collaboration Systems", Section 5 Metrics and Measures lays out the difference quite clearly but assigns exactly opposite meanings: "Metrics ... are directly measurable and can often be collected automatically. ... Measures are derived from interpretations of one or more metrics."
When we talk about metrics and measurements, we usually think about those for mass (gram) and time (second). We can add, compare, and average values. But other scales can be useful.
The most basic scale is nominal, which is just classification. Some nominal scales are gender, feelings (happy, serious, angry), and language (C, Java, Ada). Nominal scales may be hierarchical. For example, Indo-European languages are Italic, Celtic, or Germanic (or others), and computer instruction sets are RISC or CISC.
An ordinal scale is a nominal scale which has a natural order. For instance, weather reports give hurricane intensity on the Saffir-Simpson hurricane scale, a discretization of wind speed. The Mohs scale grades relative hardness from 1 (talc) to 10 (diamond). The scale is not uniform: diamond (hardness 10) actually is 4 times harder than corundum (hardness 9) and 6 times harder than topaz (hardness 8).
An interval scale has equal distance between values, and a ratio scale is an interval scale with a zero.
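The distinctions above can be sketched in code. This is a minimal illustration (the names and examples are my own, not from any cited source) mapping each level of measurement to the operations that are meaningful on it:

```python
# Which operations are meaningful at each level of measurement.
# Example scales in comments are illustrative assumptions.
SCALE_OPERATIONS = {
    "nominal":  {"equality"},                                      # e.g. language: C, Java, Ada
    "ordinal":  {"equality", "ordering"},                          # e.g. Mohs hardness
    "interval": {"equality", "ordering", "difference"},            # e.g. Celsius temperature
    "ratio":    {"equality", "ordering", "difference", "ratio"},   # e.g. mass in grams
}

def allows(scale: str, operation: str) -> bool:
    """Return True if values on the given scale support the operation."""
    return operation in SCALE_OPERATIONS[scale]
```

For example, `allows("ordinal", "ordering")` is true (we can say diamond is harder than talc), but `allows("ordinal", "ratio")` is false (hardness 10 is not "twice" hardness 5).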
Some scales naturally report in logarithmic terms. The Richter scale was a logarithmic scale for earthquakes; it measured the size of earthquake waves, not an earthquake's total energy. Relative sound energy is often measured with a logarithmic unit, the bel.
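As a small worked example of a logarithmic unit, assuming the usual definition of the decibel (one tenth of a bel) as ten times the base-10 logarithm of a power ratio:

```python
import math

def relative_level_db(power: float, reference_power: float) -> float:
    """Relative level in decibels: 10 * log10(P / P_ref).
    A bel is 10 dB, i.e. a tenfold power ratio."""
    return 10.0 * math.log10(power / reference_power)
```

A tenfold increase in power is 10 dB (one bel), and a doubling is about 3 dB.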
Wikipedia's Level of measurement article is useful, too.
Good metrics depend on good measurement units. As mentioned above, lines of code (LOC) is a seductive measure: it is very objective and easily understood. Since LOC varies wildly between different implementations of the same operation [Schofield 2005], it is doubtful that high-level measurements can be based on it. However, faults often occur at a specific rate per LOC (in addition to higher-level errors in design, protocol, etc.).
Function points seem to map much better to total intellectual effort.
A unit of specification might be "shalls", that is, single statements of requirement. One "shall" yields very roughly 25 lines of code. This is similar to Mosaic, Inc.'s Testable Requirements.
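The rule of thumb above can be written out as a sketch; the function name is hypothetical, and the default of 25 LOC per "shall" is just the very rough figure from the text:

```python
LOC_PER_SHALL = 25  # very rough rule of thumb: one "shall" ~ 25 lines of code

def estimate_loc(shall_count: int, loc_per_shall: int = LOC_PER_SHALL) -> int:
    """Rough size estimate from a count of 'shall' statements
    in a specification. Treat the result as an order-of-magnitude guide."""
    return shall_count * loc_per_shall
```

So a specification with 40 "shalls" suggests on the order of a thousand lines of code.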
The Objects of Measurement
There are at least three related, but distinct, entities that might be measured:
- Algorithms,
- Programs, and
- Executions.
An algorithm is the abstract thing. For instance, one can analyze a cryptographic protocol (algorithm) to show that it is not vulnerable to cryptographic attack such as replay, man-in-the-middle, chosen plaintext, etc. Similarly, one can check that a quicksort algorithm is correct. Why even bother implementing an algorithm if the algorithm is flawed? Any (faithful) implementation will have the flaws of the algorithm, plus any flaws introduced in the implementation.
Algorithms must be instantiated: they must be rendered in some concrete form. We do not (and perhaps cannot) have a pure representation. Executable instantiations are programs: source code, implementations, etc. Acceptance and post-development analysis must be at this level; the algorithm is inferred from the program.
It is not clear whether byte code or binaries should be included with programs above or whether they should be a separate category. On one hand, they are below the programming language abstraction and are no longer subject to compilation (or interpretation of source). On the other hand, they are still static globs of bits which must be executed according to some semantics by some engine, whether a virtual machine, a processor, or something else.
Finally one may check an execution. What did the program actually do? (Dynamic) tests apply to executions, not programs, strictly speaking.
It follows that there should be measures or metrics that apply to algorithms, programs, or executions.
One might argue that the highest level of a system, above algorithms, is actually design. If so, there must be metrics for designs, too.
Requirements, Specifications, and Constraints
There is another class of entities: requirements, specifications, constraints, etc. These might be distinguished from the above in several ways. They are not (practically) executable. They are higher level or have some orthogonal or cross-cutting view to clearly express certain aspects. From this it follows that they are often incomplete. For instance, we may specify that a protocol has liveness and safety, but that is far from completely describing what it should do.
These might be analyzed for, say, consistency and some completeness, examined with regard to use cases, and so forth. So these also may be included as objects of measurement.
Hardware vs. Software
Is there a qualitative difference between a VCR and a web server program? Both are designed by people; both run on hardware; both have outputs to and inputs from the outside (including users); both serve certain purposes. But we tend to think of a VCR as hardware, with some software, and a web server as software, running on hardware.
Some function may be carried out by hardware, some by software, and some by a combination. A "black box" system is likely to have both, and some functionality may depend on both. For example, security may depend on encryption, which may be done either by a program or a special computer chip.
Classifying Metrics and Measures
For completeness and comparison, it is convenient to classify metrics and measures.
Intrinsic vs. Relative
Some properties are intrinsic to the target of measurement, while other properties are more properly relationships relative to the operational environment. For instance, the number of lines of code (LOC) or function points (FP) in a program, or the computational complexity of an algorithm, depends only on the thing itself, not the environment in which it is run. (A model of execution is given or assumed for most artifacts, be it C semantics, a Java Virtual Machine, or an instruction set architecture.)
In contrast, a vulnerability, or exploitable weakness, cannot be determined without knowing (or assuming) an operational context. A buffer overflow weakness is not a vulnerability if it can only be triggered from a configuration file that is (assumed to be) editable only by the (trusted) operator.
Basic or Primitive vs. Computed or Derived
This is similar to the measure/metric dimension.
Static vs. Dynamic
Lines of code can be counted statically. Number of tests failed can only be counted dynamically, by running the program.
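This contrast can be illustrated with a sketch (the helper names are my own, not from the text): the first function measures statically, by inspecting source text without executing anything; the second can only produce its count by actually running each test:

```python
def count_loc(source: str) -> int:
    """Static measure: count non-blank source lines.
    Nothing is executed; the text alone determines the value."""
    return sum(1 for line in source.splitlines() if line.strip())

def count_failures(tests) -> int:
    """Dynamic measure: number of failed tests.
    Each test must actually be run to observe its outcome."""
    failures = 0
    for run_test in tests:          # run_test is a callable returning pass/fail
        if not run_test():          # executing the program under test
            failures += 1
    return failures
```

The static measure is available before the program ever runs; the dynamic one is a property of executions, not of the program text.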
Objective vs. Subjective
An objective measure can typically be automated and is more repeatable than a subjective one. Objective measures are also often easier to model and analyze, which helps establish a theoretical basis.
Measures can also be classified by the general aspect they address.
- Size - number of elements, lines, classes
- Complexity - structural, computational, algorithmic, logical, functional
- Quality - composite, emergent, or high-level aspects: reliability, efficiency, resilience, usability
Metrics for Security Tools
Tsipenyuk and Chess [Chess and Tsipenyuk 2006] proposed a metric for static analysis tools. Given counts of true positives (t), false positives (p), and false negatives (n), the score is:
100 * t / (t + p + n), augmented by weights and penalties
- Results with different reported severities should be weighted differently
- False-negative penalties should differ per bug category, depending on whether the tool claims to detect that kind of bug
- Different weights depending on the perspective:
  - Developers tolerate false negatives
  - Auditors tolerate false positives
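A sketch of this scoring scheme follows. Only the base formula 100 * t / (t + p + n) is from [Chess and Tsipenyuk 2006]; the weighted variant, its parameter names, and the default penalty values are my own illustration of the bullet points above:

```python
def base_score(true_pos: int, false_pos: int, false_neg: int) -> float:
    """Unweighted score: 100 * t / (t + p + n), from [Chess and Tsipenyuk 2006]."""
    total = true_pos + false_pos + false_neg
    return 100.0 * true_pos / total if total else 0.0

def weighted_score(results, fn_penalty_claimed=1.0, fn_penalty_unclaimed=0.5):
    """Hypothetical weighted variant. `results` is a list of
    (kind, severity_weight, claimed) tuples, where kind is 'tp', 'fp', or 'fn'
    and `claimed` says whether the tool claims to detect that bug category."""
    t = p = n = 0.0
    for kind, weight, claimed in results:
        if kind == "tp":
            t += weight
        elif kind == "fp":
            p += weight
        else:  # false negative: penalty depends on whether detection is claimed
            n += weight * (fn_penalty_claimed if claimed else fn_penalty_unclaimed)
    total = t + p + n
    return 100.0 * t / total if total else 0.0
```

With all weights equal to 1 and every false negative in a claimed category, the weighted score reduces to the base formula; lowering `fn_penalty_unclaimed` softens the score for misses outside the tool's claimed scope, and raising severity weights makes high-severity results dominate either perspective.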