Analysis and Data Improvement Tools

NAACCR Committees and members have worked collaboratively to develop tools and resources for use by central cancer registry analysts and researchers. Select one of the options below to learn more.

This algorithm combines NHIA and NAPIIA into a single SAS program.

This program is used with a NAACCR standard data exchange file that contains confidential information, including a census tract identifier. The program links the census tract identifier to the percentage of residents in the tract who live below the poverty level. This information is based on data from the 2000 U.S. Census and the American Community Survey; the census data most closely aligned with the diagnosis year are used. The program outputs two variables, attached to every registry record in the input: the tract's percent poverty (xx.x%), and a second variable that groups the exact percentages into four categories: less than 5% poverty, 5%-9.9% poverty, 10%-19.9% poverty, and 20% or higher poverty.
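The four-category grouping described above can be sketched in a few lines. This is an illustrative Python sketch, not the NAACCR SAS program: the function name, the label strings, and the use of `None` for a record without a linked tract are all assumptions made for the example.

```python
def poverty_category(pct_below_poverty):
    """Group a tract-level percent-below-poverty value into four categories.

    Hypothetical helper: the label strings and the None-for-unknown
    convention are assumptions, not the coding used by the SAS program.
    """
    if pct_below_poverty is None:
        return None  # no tract linked, so no category can be assigned
    if not 0 <= pct_below_poverty <= 100:
        raise ValueError("percent poverty must be between 0 and 100")
    if pct_below_poverty < 5:
        return "less than 5%"
    if pct_below_poverty < 10:
        return "5%-9.9%"
    if pct_below_poverty < 20:
        return "10%-19.9%"
    return "20% or higher"
```

For example, `poverty_category(12.3)` falls in the "10%-19.9%" group, and a value of exactly 20 falls in the highest group.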

Rural-Urban Data Items

Studies have shown that residents of rural areas have lower screening rates, lower rates of follow-up of abnormal screening tests, higher late-stage diagnosis rates, and differences in cancer treatment patterns.  Including tract-level indicators of rural-urban residence in the NAACCR data files will facilitate research in rural-urban disparities and allow researchers to control for rural-urban differences in model-based analysis of cancer risks and outcomes.

This SAS code creates 2 different measures of the rural-urban environment.  The URIC is a measure of the rural nature of the place of residence and can be an indicator of access to recreation, access to food stores, exposures to pollutants, crime levels, social cohesion, etc.  The USDA RUCA-based indicator is a measure of the proximity to large urban centers and can be an indicator of access to oncology specialists and cancer treatment facilities.  Both indicators have been tested for uniqueness and they do not allow the identification of individual census tracts as long as the county is not known.

Description of items:

Two indicators of the rural-urban environment based on the census tract of the diagnosis address:

  • Urban Rural Indicator Codes (URIC) is based on the Census Bureau’s identification of urban and rural areas
  • Rural Urban Commuting Areas Codes (RUCA) is based on the USDA’s Rural Urban Commuting Area (RUCA) codes

Cases diagnosed between 1995 and 2004 are assigned a code based on the 2000 U.S. Census. Cases diagnosed since 2005 are assigned a code based on the 2010 U.S. Census. 

Allowable values:

  • URIC :
    • 1: all urban – the percent of the population in an urban area = 100%
    • 2: mostly urban – the percent of the population in an urban area < 100% and ≥ 50%
    • 3: mostly rural – the percent of the population in a rural area < 100% and > 50%
    • 4: all rural – the percent of the population in a rural area = 100%
    • 9: unknown or not applicable – census tract not available or tract population was zero at the last decennial census
  • RUCA
    • 1: urban commuting area – RUCA codes 1.0, 1.1, 2.0, 2.1, 3.0, 4.1, 5.1, 7.1, 8.1, and 10.1
    • 2: not an urban commuting area – all other RUCA codes except 99
    • 9: unknown or not applicable – census tract not available or RUCA code = 99
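The allowable values above amount to a small set of recoding rules. The following Python sketch restates them under stated assumptions: the function names, the inputs (`pct_urban`, `tract_population`, `ruca_code`), and the handling of diagnosis years before 1995 are hypothetical and are not taken from the NAACCR SAS code.

```python
from decimal import Decimal

# RUCA codes the list above classifies as an urban commuting area.
URBAN_RUCA_CODES = {
    Decimal(c)
    for c in ("1.0", "1.1", "2.0", "2.1", "3.0", "4.1", "5.1", "7.1", "8.1", "10.1")
}


def census_vintage(diagnosis_year):
    """Pick the census used for coding, based on the diagnosis year."""
    if 1995 <= diagnosis_year <= 2004:
        return 2000
    if diagnosis_year >= 2005:
        return 2010
    return None  # assumption: the source does not cover years before 1995


def uric_code(pct_urban, tract_population):
    """Recode the percent of tract population in urban areas to URIC 1-4 or 9."""
    if pct_urban is None or not tract_population:
        return 9  # tract not available, or zero population at the census
    if pct_urban >= 100:
        return 1  # all urban
    if pct_urban >= 50:
        return 2  # mostly urban
    if pct_urban > 0:
        return 3  # mostly rural (rural share above 50% but below 100%)
    return 4      # all rural


def ruca_indicator(ruca_code):
    """Collapse a detailed RUCA code to the 1/2/9 indicator."""
    if ruca_code is None:
        return 9
    code = Decimal(str(ruca_code))  # Decimal avoids float comparison issues
    if code == 99:
        return 9
    return 1 if code in URBAN_RUCA_CODES else 2
```

Under these rules a tract that is 72% urban gets URIC 2, and a detailed RUCA code of 10.1 maps to indicator 1 while 10.0 maps to 2.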

Supplemental Area Based Social Measures (ABSM)

Epidemiologists cannot ignore the impact of social conditions on population health. Cancer registries currently collect the Krieger poverty codes, an area-based social measure (ABSM) derived from census data on the percentage of people living below the poverty level. These codes are available in U.S. cancer registry data at the county and census tract levels. They can be used to assess the impact of poverty at the individual level, using the poverty ABSM as a proxy, and at the community level, addressing both the compositional and contextual effects of the social environment on cancer. The Krieger codes have been the standard, but they were developed using New England census data. For other regions of the country with higher poverty rates, and for population groups with higher poverty rates, the Krieger cutpoints result in residual confounding and can mask real disparities, particularly when analyzing minority populations. Other social data, such as language isolation or housing security, may also be important to include in etiologic and public health planning research. Instead of relying on a single poverty measure of socioeconomic status (SES), using these additional fields gives researchers the flexibility to use different poverty cutpoints for research on minorities and to incorporate additional SES contextual variables into analyses as needed. Researchers can also create multifactorial socioeconomic indices, such as the Yost Index, to evaluate the potential impact of socioeconomic gradients on cancer burden.

The list of variables in a separate Excel spreadsheet is available here under ‘Supplemental Information’. This SAS code currently pulls the variables from 2 different time periods and appends the data to each record based on census tract. The tract is then stripped from the data.
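The append-then-strip step can be illustrated with a minimal Python sketch. It is not the NAACCR SAS code: the field name `census_tract`, the period labels, the variable names, and the choice to suffix each variable with its period are assumptions made for the example.

```python
def append_absm(records, period_tables, tract_field="census_tract"):
    """Attach tract-level variables from each period, then drop the tract.

    records       -- list of dicts, one per registry record
    period_tables -- {period_label: {tract_id: {variable: value}}}
    All names here are hypothetical.
    """
    output = []
    for record in records:
        merged = dict(record)                  # leave the input untouched
        tract = merged.pop(tract_field, None)  # strip the tract identifier
        for period, table in period_tables.items():
            for name, value in table.get(tract, {}).items():
                merged[f"{name}_{period}"] = value
        output.append(merged)
    return output
```

In this sketch a record whose tract is missing from a period table simply receives no variables for that period, and no output record retains the tract identifier.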

If you have any questions, please contact Recinda Sherman at

This tool describes and provides macro-driven formulae, in a Microsoft Excel workbook, to calculate completeness of case ascertainment based on observed cancer incidence, death rates, and a comparison of standard rates of incidence and mortality in the United States.

The Record Uniqueness Program was developed by Howe, Lake, and Shen to assess electronic data files for risk of confidentiality breach based on unique combinations of key variables.
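The general idea of counting unique combinations of key variables can be sketched as follows. This Python example is an illustration of the concept only and does not reproduce the Record Uniqueness Program; the function name and the field names in the usage note are hypothetical.

```python
from collections import Counter


def unique_record_share(records, key_fields):
    """Return the share of records whose key-variable combination is unique.

    Illustrative only: not the algorithm of the Record Uniqueness Program.
    """
    if not records:
        return 0.0
    combos = [tuple(rec.get(field) for field in key_fields) for rec in records]
    counts = Counter(combos)
    unique = sum(1 for combo in combos if counts[combo] == 1)
    return unique / len(records)
```

Calling `unique_record_share(records, ["sex", "age_group", "county"])` on a file would give the fraction of records that no other record matches on those three fields; a higher fraction suggests a higher re-identification risk.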

This is a software utility developed in MS Access to identify miscoded sex codes based on first name. Taking as input a data file in NAACCR v16 format, a query runs against a list of known sex/name pairs, and it produces a list of cases for manual review that have potential errors in sex. The utility is based on an algorithm initially created by the New York Cancer Registry in August 2011.
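The lookup described above can be illustrated with a short Python sketch. It is not the MS Access utility: the name list, the field names `first_name` and `sex`, and the function name are hypothetical. The sex codes follow the NAACCR convention of 1 for male and 2 for female.

```python
# Hypothetical reference list of first names with the usually associated sex code.
KNOWN_NAME_SEX = {"JOHN": "1", "ROBERT": "1", "MARY": "2", "LINDA": "2"}


def flag_sex_for_review(cases, name_sex=KNOWN_NAME_SEX):
    """Return cases whose recorded sex disagrees with the name list.

    Names not on the list are never flagged; flagged cases still need
    manual review, since the name itself may be misspelled or unusual.
    """
    flagged = []
    for case in cases:
        name = (case.get("first_name") or "").strip().upper()
        expected = name_sex.get(name)
        if expected is not None and case.get("sex") != expected:
            flagged.append(case)
    return flagged
```

A case recorded as `{"first_name": "Mary", "sex": "1"}` would be returned for review, while a name absent from the list passes through unflagged.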

In real-world registry settings, the number of potential errors flagged by the tool is extremely low, in the neighborhood of 0.25%. After careful review, users have reported that about 20-50% of the cases the tool flags for review are indeed in error. Higher percentages have been found when the tool is run on incoming registry data. For cases where the flagged sex turned out to be correct, a misspelling of the name was often identified instead. For male breast cancer, nearly all flagged cases were errors, a consequence of the highly skewed sex distribution of this cancer site. A published study on this tool is available here.

A list of tools which can import and export data in NAACCR Volume II format.



This version of the V21 SAS translation tool is designed to work with naaccr-xml-utility-8.6, which was posted to on October 12, 2021. This update supersedes the September 2021 update which made it easier to harness the SAS translation tools.

Changes since 8.4 include:

  • Upgraded all base dictionaries to specifications v1.5; added new dateLastModified attribute.
  • Added a new optional ‘cleanupcsv’ parameter (defaults to true) to allow the temporary CSV to not be automatically deleted.
  • Improved feedback messages the macros write to the logs.
  • Improved help written in the macros.

The three latter changes were added mainly to improve QC and/or debugging processes.


The SAS program and accompanying Word document are to assist the NAACCR community in reading and writing NAACCR XML V21 files using SAS.

The SAS program is to be used in conjunction with the Word document, “Instructions for ReadWrite_NAACCR_21_XML_tidy.sas_20210928.docx.” Read the instructions first. Really.


The SAS program harnesses SAS code and macros written by Fabian Depry, IMS, adds SAS labels, and removes fields from the SAS datasets that are 100% missing.
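The last step, removing fields that are 100% missing, can be sketched in Python as follows. This is an illustration of the idea rather than the SAS implementation; the function name and the rule that `None` and blank strings count as missing are assumptions.

```python
def drop_all_missing_fields(rows):
    """Remove every field that is missing in all rows.

    Assumption: a value counts as missing when it is None or blank.
    """
    def is_missing(value):
        return value is None or str(value).strip() == ""

    # Collect field names in first-seen order.
    fields = []
    for row in rows:
        for name in row:
            if name not in fields:
                fields.append(name)

    keep = [f for f in fields if any(not is_missing(r.get(f)) for r in rows)]
    return [{f: r.get(f) for f in keep} for r in rows]
```

A field survives as soon as one record carries a value for it, so sparsely populated items are kept and only entirely empty ones are dropped.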


The code template below can be used by proficient SAS programmers to efficiently and accurately access data in the NAACCR XML V21 format. Code to both read and write NAACCR XML V21 format is provided. Various sections and options are included; users simply comment out the sections that do not apply to their needs. The SAS code supports the three most often used record types: Incidence, Confidential, and Abstract (which includes text fields).


As you use the tool, we appreciate any feedback or comments you have. Contact with your thoughts.


v18 SAS Translation Tool

The code template below can be used by proficient SAS programmers to efficiently and accurately access data in the V18 format. Code to both read and write ASCII V18 format is provided. Various sections and options are included – users simply comment out sections which are not applicable for their specific needs. The code supports the three most often used record types (Incidence, Confidential and Text).

New for V18, there are two versions of the SAS code: one that uses NAACCR item numbers for the SAS variable names (as has been done in the past), and one that uses NAACCR XML names (NAACCR ID) for the SAS variable names. As NAACCR makes the transition from the “flat” ASCII file to XML, we encourage you to use the SAS code with NAACCR XML names. It will help prepare you for the future! As you use the tool, we appreciate any feedback or comments you have. Contact with your thoughts.

To find more XML tools developed by NAACCR members, visit

Note: Translation tools for V14, V15, and V16 include code that handles data elements which are part of the CDC’s Comparative Effectiveness Research (CER) and Patient-Centered Outcomes Research (PCOR) projects.



This MS Access database contains an import/export file specification for NAACCR v15 and v16 record layouts. It allows the user to import these types of files, perform operations on them, and then export them back out as a text file in the same format. Contact if you have any feedback on this tool.

Along with incidence and mortality data, information on population-based cancer survival is necessary to understand the full burden of cancer in our society. This SAS code is used to create the variables needed to conduct relative survival analysis for CiNA Volume 4: CiNA Survival. It is made available here for researchers to use on their own data and is currently updated for a study cutoff date of 2015.

Copyright © 2018 NAACCR, Inc. All Rights Reserved