Metrics and Measures
Metrics vs. Measures
The terms metric and measure have some overlap. We use measure for more concrete or objective attributes and metric for more abstract, higher-level, or somewhat subjective attributes. For instance, lines of code (LOC) is a measure: it is objective and concrete. (Unfortunately LOC varies wildly for different implementations of the same algorithm [Schofield 2005]. Function points (FP) are much better.) Robustness, quality (as in "high quality"), and effectiveness are important attributes that we have some consistent feel for, but are hard to define objectively. Thus these are metrics. Measures, such as faults/FP, are bases for metrics. Measures help us approximate less tangible metrics.
We published these definitions in Paul E. Black, Karen A. Scarfone, and Murugiah P. Souppaya, Cyber Security Metrics and Measures, Wiley Handbook of Science and Technology for Homeland Security, 2008.
The above definitions are not incompatible with those in the International Vocabulary of Metrology ... (VIM) 3rd edition, 2012. Section 2.1 gives the following definition of measurement.
- process of experimentally obtaining one or more quantity values that can reasonably be attributed to a quantity
In the Sept 1999 version of "Methodology for Evaluation of Collaboration Systems", Section 5 Metrics and Measures lays out the difference quite clearly but assigns exactly opposite meanings: "Metrics ... are directly measurable and can often be collected automatically. ... Measures are derived from interpretations of one or more metrics."
When we talk about metrics and measurements, we usually think about those for mass (gram) and time (second). We can add, compare, and average values. But other scales can be useful.
In the taxonomy proposed by Stevens, the most basic "scale" is nominal, which is just classification. VIM calls them "nominal properties" (1.30). Some nominal properties are gender, feelings (happy, serious, angry), language (C, Java, Ada), and ISO two-letter country code. Nominal properties may be hierarchical. For example, Italic, Celtic, and Germanic are all Indo-European languages, and computer instruction sets are RISC or CISC.
An ordinal scale is a nominal scale that has a natural order. For instance, weather reports give hurricane intensity on the Saffir-Simpson hurricane scale, which is a discretization of wind speed. The Mohs scale grades relative hardness from 1 (talc) to 10 (diamond). The scale is not uniform: diamond (hardness 10) is 4 times harder than corundum (hardness 9) and 6 times harder than topaz (hardness 8).
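As a minimal sketch of how a continuous quantity is discretized into an ordinal scale, the following Python maps a sustained wind speed to a Saffir-Simpson category. The function name and the convention of returning 0 for winds below hurricane strength are assumptions made for this illustration. The thresholds are the commonly published lower bounds in miles per hour.

```python
from bisect import bisect_right

# Lower bounds, in mph of sustained wind, for Saffir-Simpson categories 1 to 5.
CATEGORY_LOWER_BOUNDS_MPH = [74, 96, 111, 130, 157]

def saffir_simpson_category(wind_mph: float) -> int:
    """Return the category 1-5, or 0 (an illustrative convention)
    for winds below hurricane strength."""
    return bisect_right(CATEGORY_LOWER_BOUNDS_MPH, wind_mph)

print(saffir_simpson_category(100))  # 2
print(saffir_simpson_category(160))  # 5

# Ordinal values can be compared, but their differences carry no fixed meaning.
print(saffir_simpson_category(160) > saffir_simpson_category(100))  # True
```

The categories support ordering and comparison, but the step from category 1 to 2 does not cover the same range of wind speed as the step from 4 to 5.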
An interval scale has equal distance between values. Values on an interval scale cannot be added: it makes no sense to ask what is the sum of 4 July 1776 and 23 December 1805. The number of days between them (difference) is reasonable (10 763 days).
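The date example can be checked with a short Python sketch using the standard datetime module. The variable names are illustrative only. Subtracting two dates is defined and yields a duration, while adding two dates is rejected.

```python
from datetime import date

start = date(1776, 7, 4)
end = date(1805, 12, 23)

# The difference of two interval-scale values is meaningful.
difference = end - start
print(difference.days)  # 10763

# The sum of two dates is not: Python raises TypeError.
try:
    start + end
except TypeError:
    print("dates cannot be added")
```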
A ratio scale is an interval scale with a true zero. 20 K (kelvins) is twice as hot as 10 K. (20°C is not twice as hot as 10°C: Celsius is an interval scale.)
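A small Python sketch, with an illustrative helper name, shows why ratios only make sense on the ratio scale. Converting the two Celsius readings to kelvins before dividing gives a ratio close to 1, not 2.

```python
def celsius_to_kelvin(celsius: float) -> float:
    """Convert a Celsius temperature to kelvins."""
    return celsius + 273.15

# On the Kelvin (ratio) scale the quotient is meaningful.
print(20.0 / 10.0)  # 2.0

# 20 degrees C against 10 degrees C, compared on the ratio scale.
true_ratio = celsius_to_kelvin(20.0) / celsius_to_kelvin(10.0)
print(round(true_ratio, 3))  # 1.035
```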
Considerations of Stevens' Taxonomy
Some scales naturally report in logarithmic terms. The Richter scale was a logarithmic scale for earthquakes, which measured the size of earthquake waves, not an earthquake's total energy. Relative sound energy is often measured with a logarithmic unit, the bel.
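As an illustration of a logarithmic unit, the sketch below expresses a power ratio in bels, the base-10 logarithm of the ratio, and in the more familiar decibels (tenths of a bel). The function name and the sample ratios are assumptions chosen for the example.

```python
import math

def power_ratio_in_bels(power: float, reference_power: float) -> float:
    """Return the base-10 logarithm of a power ratio, in bels."""
    return math.log10(power / reference_power)

# A hundredfold increase in power is 2 bels, or 20 decibels.
print(power_ratio_in_bels(100.0, 1.0))  # 2.0

# Doubling the power is roughly 3 decibels.
doubling_in_decibels = 10 * power_ratio_in_bels(2.0, 1.0)
print(round(doubling_in_decibels, 2))  # 3.01
```

Equal steps on such a scale correspond to equal ratios of the underlying quantity, not equal differences.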
Wikipedia's article "Level of measurement" explains which statistics make the most sense with which scales. However, transforming data from one scale to another may help. Suppose an uncalibrated thermometer has a non-linear (monotonic) readout. As is, the data is on an ordinal scale, but with experiments, we can transform the data into Kelvin, a ratio scale.
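The usual pairing of summary statistics with scales can be sketched with Python's standard statistics module: the mode for nominal data, the median for ordinal data, and the arithmetic mean for interval or ratio data. The sample values below are hypothetical and serve only to illustrate the idea.

```python
import statistics

# Hypothetical samples, one per kind of scale.
languages = ["C", "Java", "C", "Ada"]       # nominal
mohs_hardness = [3, 7, 7, 9, 10]            # ordinal
temperatures_k = [280.0, 290.0, 300.0]      # ratio

# Nominal: only counting is meaningful, so report the most common value.
print(statistics.mode(languages))        # C

# Ordinal: order is meaningful, so the median is a sensible center.
print(statistics.median(mohs_hardness))  # 7

# Interval or ratio: differences are meaningful, so the mean applies.
print(statistics.mean(temperatures_k))   # 290.0
```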
Linear scales may not always fit well. Color is better described in three-dimensional models, such as the hue, value, and chroma in the Munsell color model. Note there are many other color models. VIM uses color as an example of a nominal scale.
Percentages and probabilities are real numbers bounded by zero and one.
Good metrics depend on good measurement units. As mentioned above, lines of code (LOC) is a seductive measure: it is objective and easily understood. Since LOC varies wildly between different implementations of the same functionality [Schofield 2005], it is doubtful that high-level measurements can be based on it. However, faults often occur at a specific rate per LOC (in addition to higher-level errors in design, protocol, etc.).
Function points seem to map much better to total intellectual effort.
A unit of specification might be "shalls", that is, single statements of requirement. One "shall" yields very roughly 25 lines of code. This is similar to Mosaic, Inc.'s Testable Requirements.
The Objects of Measurement
There are at least three related, but distinct, entities that might be measured:
- Algorithms,
- Programs, and
- Executions.
An algorithm is the abstract thing. For instance, one can analyze a cryptographic protocol (algorithm) to show that it is not vulnerable to cryptographic attack such as replay, man-in-the-middle, chosen plaintext, etc. Similarly, one can check that a quicksort algorithm is correct. Why even bother implementing an algorithm if the algorithm is flawed? Any (faithful) implementation will have the flaws of the algorithm, plus any flaws introduced in the implementation.
Algorithms must be rendered in some concrete form. We do not (and cannot?) have pure representation. Executable instantiations are programs, source code, implementations, etc. Acceptance and post-development analysis must be at this level. The algorithm is inferred from the program.
It is not clear whether byte code or binaries should be included with programs above or whether they should be a separate category. On one hand, they are below the programming language abstraction and are not subject to compilation (or interpretation). On the other hand, they are still static globs of bits that must be executed according to some semantics by some engine, whether a virtual machine, a processor, or something else.
Finally one may check an execution. What did the program actually do? (Dynamic) tests apply to executions, not programs, strictly speaking. Executions include the system, such as library versions, file system permissions, and operator input.
It follows that there should be measures or metrics that apply to algorithms, programs, or executions.
One might argue that the highest level of a system, above algorithms, is actually design. If so, there must be metrics for designs, too.
Requirements, Specifications, and Constraints
There is another class of entities: requirements, specifications, constraints, etc. These might be distinguished from the above in several ways. They are not (practically) executable. They are higher level or have some orthogonal or cross-cutting view to clearly express certain aspects. From this it follows that they are often incomplete. For instance, we may specify that a protocol has liveness and safety, but that is far from completely describing what it should do.
These might be analyzed for, say, consistency and some completeness, examined with regard to use cases, and so forth. So these also may be included as objects of measurement.
Hardware vs. Software
Is there a qualitative difference between a VCR and a web server program? Both are designed by people; both run on hardware; both have outputs to and inputs from the outside (including users); both serve certain purposes. But we tend to think of a VCR as hardware, with some software, and a web server as software, running on hardware.
Some function may be carried out by hardware, some by software, and some by a combination. A "black box" system is likely to have both, and some functionality may depend on both. For example, security may depend on encryption, which may be done either by a program or a special computer chip.
Classifying Metrics and Measures
For completeness and comparison, it is convenient to classify metrics and measures.
Intrinsic vs. Relative
Some properties are intrinsic to the target of measurement, while other properties are more properly relationships relative to the operational environment. For instance, the number of lines of code (LOC) or function points (FP) in a program, or the computational complexity of an algorithm, depends only on the thing itself, not the environment in which it is run. (A model of execution is given or assumed for most artifacts, be it C semantics, a Java Virtual Machine, or an instruction set architecture.)
In contrast, a vulnerability, or exploitable weakness, cannot be determined without knowing (or assuming) an operational context. A buffer overflow weakness is not a vulnerability if it comes from the configuration file, which is (assumed to be) editable only by the (trusted) operator.
Basic or Primitive vs. Computed or Derived
This is similar to the measure/metric dimension.
Static vs. Dynamic
Lines of code can be counted statically. Number of tests failed can only be counted dynamically - by running the program.
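The distinction can be illustrated with a minimal Python sketch. The sample source text, the deliberately buggy is_even function, and the three checks are all hypothetical, invented for this example. The line count is obtained without running the code, while the failure count requires executing it.

```python
# Hypothetical program under measurement, held as text.
SOURCE = '''
def add(a, b):
    return a + b

def is_even(n):
    return n % 2 == 1   # deliberate bug for illustration
'''

def count_loc(source: str) -> int:
    """Static measure: count non-blank lines without running the code."""
    return sum(1 for line in source.splitlines() if line.strip())

def count_failed_tests(source: str) -> int:
    """Dynamic measure: execute the code and count failing checks."""
    namespace = {}
    exec(source, namespace)
    checks = [
        lambda: namespace["add"](2, 3) == 5,
        lambda: namespace["is_even"](4) is True,
        lambda: namespace["is_even"](3) is False,
    ]
    return sum(1 for check in checks if not check())

print(count_loc(SOURCE))           # 4
print(count_failed_tests(SOURCE))  # 2
```

The static count depends only on the text of the program, whereas the dynamic count also depends on which tests are run and on the environment in which they execute.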
Objective vs. Subjective
An objective measure typically can be automated and is more repeatable than subjective measures. Objective measures are often easier to model and analyze to establish a theoretical basis.
Measures can also be classified by the general aspect they address.
- Size - number of elements, lines, classes
- Complexity - structural, computational, algorithmic, logical, functional
- Quality - composite, emergent, or high-level aspects: reliability, efficiency, resilience, usability