
NAACCR Record Uniqueness

At-a-glance
Application: Tools
Level: Intermediate, Advanced
Registry Users: Steward, Research/Surveillance
Customers Users: Research
Link: http://www.naaccr.org/index.asp?Col_SectionKey=11&Col_ContentID=463#RecUniq

Overview:
The NAACCR Record Uniqueness Program is a risk-assessment tool that implements the k-anonymity measure of disclosure risk in microdata files [Steel PM, Disclosure Risk Assessment for Microdata, 2004]. For k=1, it counts the records that are unique for a given set of key variables. This count is useful for assessing the risk of revealing additional information about a known cancer patient. (It does not assess the risk of identifying a previously unknown cancer patient, since it does not estimate the number of unique records in the population.) For k>1, it counts the records whose combination of key values is shared by k or fewer records on the file. This count is useful for estimating the number of table cells that might be subject to suppression if the microdata file is aggregated by the key values.
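The counting rule described above can be illustrated with a minimal Python sketch. This is not the NAACCR program's code; the function name `count_at_risk`, the field names, and the five records are made up for illustration, and the sketch assumes that "k or fewer common records" means the key-value combination occurs at most k times on the file.

```python
from collections import Counter

def count_at_risk(records, keys, k=1):
    """Count records whose combination of key values occurs k or fewer times."""
    combos = Counter(tuple(rec[key] for key in keys) for rec in records)
    # Each combination with n <= k occurrences contributes all n of its records.
    return sum(n for n in combos.values() if n <= k)

# Hypothetical records; the field names are illustrative only.
records = [
    {"age_group": "60-64", "sex": "F", "county": "001", "site": "C50"},
    {"age_group": "60-64", "sex": "F", "county": "001", "site": "C50"},
    {"age_group": "60-64", "sex": "M", "county": "001", "site": "C61"},
    {"age_group": "35-39", "sex": "F", "county": "003", "site": "C50"},
    {"age_group": "35-39", "sex": "F", "county": "003", "site": "C18"},
]
keys = ["age_group", "sex", "county", "site"]

unique_records = count_at_risk(records, keys, k=1)   # 3 records are unique
at_most_two = count_at_risk(records, keys, k=2)      # all 5 records
print(unique_records, at_most_two)
```

With k=1, three of the five records have a key combination that appears only once; with k=2, the two identical records are counted as well.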

The documentation includes recommended threshold values for the percentage of unique records on a file, implying that a file is safe as long as the percentage of unique records is below these thresholds. These thresholds are, at best, general guidelines and should be used with caution. Any percentage of unique records greater than zero carries some risk, and the size of that risk depends on many factors, including the size of the sample and the resources available to the intruder. The threshold values may be either too low or too high for a given file and intrusion scenario. The record uniqueness values produced by this program are best used as relative values when comparing the tradeoff between risk and utility for various combinations of keys or key recoding options.
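As a rough illustration of using uniqueness as a relative value, the sketch below compares the percentage of unique records under three hypothetical age recodings. The data, the helper names `percent_unique` and `recode_age`, and the choice of age bands are assumptions made for this example; they do not come from the program or its documentation.

```python
from collections import Counter

def percent_unique(rows):
    """Percentage of rows whose key combination appears exactly once."""
    counts = Counter(rows)
    return 100.0 * sum(1 for row in rows if counts[row] == 1) / len(rows)

def recode_age(rows, width):
    """Replace exact age with the lower bound of an age band of the given width."""
    return [((age // width) * width, sex, county) for age, sex, county in rows]

# Hypothetical key values: (age, sex, county).
rows = [
    (61, "F", "001"), (63, "F", "001"), (67, "F", "001"),
    (62, "M", "001"), (68, "M", "001"), (36, "F", "003"),
]

options = {
    "single year": rows,
    "5-year bands": recode_age(rows, 5),
    "10-year bands": recode_age(rows, 10),
}
results = {name: percent_unique(option) for name, option in options.items()}
for name, pct in results.items():
    print(f"{name}: {pct:.1f}% unique")
```

Coarser bands lower the uniqueness percentage but also discard detail, so the numbers are most informative when read against each other and against the analytic value of the key, not against a fixed threshold.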

In addition to the number of unique records (or records with k or fewer common key values), the program generates an estimate of the relative contribution of each key variable to the total. This can be useful for identifying which key might yield the most benefit from recoding or removal.
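The source does not say how the program computes each key's relative contribution. One simple way to approximate the idea, shown here purely as an assumption, is a leave-one-out comparison: recount the unique records with each key removed and see how far the count drops. The names `count_unique` and `leave_one_out` and the records are hypothetical.

```python
from collections import Counter

def count_unique(records, keys):
    """Number of records whose key combination appears exactly once."""
    combos = Counter(tuple(rec[key] for key in keys) for rec in records)
    return sum(1 for n in combos.values() if n == 1)

def leave_one_out(records, keys):
    """Drop in the unique-record count when each key is removed in turn.

    This is an illustrative stand-in, not the program's documented method.
    """
    base = count_unique(records, keys)
    return {
        key: base - count_unique(records, [other for other in keys if other != key])
        for key in keys
    }

# Hypothetical records; the field names are illustrative only.
records = [
    {"age_group": "60-64", "sex": "F", "county": "001", "site": "C50"},
    {"age_group": "60-64", "sex": "F", "county": "001", "site": "C50"},
    {"age_group": "60-64", "sex": "M", "county": "001", "site": "C61"},
    {"age_group": "35-39", "sex": "F", "county": "003", "site": "C50"},
    {"age_group": "35-39", "sex": "F", "county": "003", "site": "C18"},
]
keys = ["age_group", "sex", "county", "site"]

contributions = leave_one_out(records, keys)
print(contributions)
```

In this toy file, removing `site` eliminates two of the three unique records, while removing any other single key changes nothing, which would point to `site` as the first candidate for recoding.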

There are two versions: one is a Windows executable suitable for analyzing a moderately sized input file, the other a SAS macro for larger files. At the time of this writing, the SAS macro version works correctly under Linux but reports five times more unique records than are on the file when run under Windows. This bug has been reported to the developer and a revised version is being tested.

[See also the μ-Argus program in the Computational Aspects of Statistical Confidentiality (CASC) Project entry]