Analysis and Data Improvement Tools

NAACCR Committees and members have worked collaboratively to develop tools and resources for use by central cancer registry analysts and researchers. Select one of the options below to learn more.

NAACCR Central Registry Analyst Handbook

Overview

While each registry is different, most analysts perform the same essential functions, and in turn, come across similar issues. The materials here provide information and guidance to new and established analysts at central cancer registries, who both utilize population-based cancer surveillance data and release it to other researchers for analysis. The goal is to provide a comprehensive set of resources to support all aspects of data use and research in central cancer registries.

This page will present a range of general guidance, from using the NAACCR data dictionary to navigating some of the more complex variables that have changed over time. Additional resources under development include information on obtaining population data, creating cancer surveillance statistics, geospatial analysis, and how to handle different data requests, from aggregate statistics to community cancer concerns. This page will also include tips and resources for running data queries and using some of the common registry software. While these resources will cover many of the functions an analyst performs, they are not an exhaustive list. Analysts should always follow the established protocols and best practices for their registry, where applicable.

Guidance, best practices, and other resources here are maintained by the RDU Research Analyst Handbook Taskforce (RAHT) in coordination with other RDU subcommittees, Central Registry Operations Standards (CROS), and the NAACCR Executive Office. Resources will be published as they are made available. We encourage analysts to reach out to RDU RAHT with suggestions for additional topics, as well as to share resources they may have developed at their registry.

Subsections:

Role of the Central Registry Analyst

NAACCR Data Dictionary

Data Quality from the Analyst Perspective

Generating Cancer Surveillance Statistics

How to Obtain National Cancer Statistics

Complex Variables

Obtaining Population Data

Research Data Requests

Running Data Queries

Data Analysis for Cancer Control Programs

Epidemiology

Geospatial Analysis

Confidentiality, Data Security, and Data Transmission Best Practices

Performing Data Linkages

Call for Data

Additional Resources

NAACCR supports a variety of variable recodes and area-based social measures. Many used to be supported via SAS programs. To ensure standardization and replicability, we now support these data items through the annual CFD tool, NAACCR*Prep, available here: https://www.naaccr.org/call-for-data/#datatools. These variables can also be calculated by any user with File*Pro, the NCI-funded tool created by IMS, available here: https://seer.cancer.gov/tools/filepro/.

VENUSCANCER: SAS CODE MAPPING NAACCR V22 TO VENUSCANCER DATA DICTIONARY

The VENUSCANCER Study is embedded in the CONCORD programme, a world-wide initiative designed to explain the global inequalities in patterns of care, short-term survival and trends in avoidable premature deaths from breast, cervical and ovarian cancers, the three most common cancers in women. This project aims to provide levers for health policy to reduce or eliminate avoidable differences in survival from these cancers. The SAS code posted here maps or translates NAACCR V22 to the VENUSCANCER data dictionary.

VENUSCANCER SAS Code for V22 – Updated June 2023

March 2023 Webinar

Tract-Estimated Congressional District Populations

Population data by race/ethnicity for congressional districts can be ascertained by aggregating census blocks, but these data are only available as part of the decennial census. Intercensal tract-level population estimates by race/ethnicity are available, but census tracts do not translate directly to congressional districts because some census tracts are split across two or more congressional districts.

An analysis was conducted to compare the populations of actual congressional districts with those of tract-estimated congressional districts, in order to determine the potential effect of calculating cancer rates by congressional district using tract-estimated congressional district populations. For the analysis, each tract was assigned to only one congressional district.

When a tract was split across multiple congressional districts, the tract was assigned to the congressional district containing the piece of the tract with the most people.
When the populations of the split pieces were equal, the tract was assigned to the lowest-numbered congressional district.
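
The two assignment rules above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used in the analysis: the function name `assign_tracts`, the input layout (one tuple per piece of a tract), and the tract IDs and populations in `example` are all hypothetical.

```python
from collections import defaultdict

def assign_tracts(pieces):
    """Assign each tract to exactly one congressional district.

    pieces: iterable of (tract_id, district_number, population) tuples,
    one per piece (or block) of a tract lying inside a district.
    Returns a dict mapping tract_id -> district_number.
    """
    # Total population of each tract within each district.
    pop = defaultdict(lambda: defaultdict(int))
    for tract_id, district, population in pieces:
        pop[tract_id][district] += population

    assignment = {}
    for tract_id, by_district in pop.items():
        # Largest population wins; ties go to the lowest-numbered district.
        assignment[tract_id] = min(
            by_district, key=lambda d: (-by_district[d], d)
        )
    return assignment

# Hypothetical data: the second tract is split evenly between districts 1 and 2.
example = [
    ("16001000100", 1, 3200),
    ("16001000100", 2, 800),
    ("16001000200", 2, 1500),
    ("16001000200", 1, 1500),
    ("16001000300", 2, 4100),
]
assignments = assign_tracts(example)
```

Sorting on the pair (negative population, district number) applies both rules in a single comparison, so the tie-break needs no separate branch.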

Block-level populations were aggregated to produce counts for both the actual congressional districts and the tract-estimated congressional districts by:

Sex: male, female
Race/ethnicity groups: non-Hispanic white, non-Hispanic black, non-Hispanic American Indian or Alaska Native, non-Hispanic Asian Pacific Islander, non-Hispanic Other, Hispanic
Age groups: 0-49, 50-64, 65+

To compare the actual congressional district populations with the tract-estimated congressional district populations, we converted the counts of each population subgroup to percentages and calculated the absolute value of the differences between the two percentages. We also summarized the percent difference in population for all the congressional districts in each state and for the US as a whole.
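
As a rough illustration of the subgroup comparison described above, the sketch below converts counts to percentages and takes the absolute difference for each group. The function names and the counts are hypothetical and are not taken from the analysis itself.

```python
def to_percentages(counts):
    """Convert subgroup counts to percentages of the district total."""
    total = sum(counts.values())
    return {group: 100.0 * n / total for group, n in counts.items()}

def abs_pct_differences(actual, estimated):
    """Absolute difference, in percentage points, for each subgroup."""
    pct_actual = to_percentages(actual)
    pct_estimated = to_percentages(estimated)
    return {g: abs(pct_actual[g] - pct_estimated[g]) for g in actual}

# Hypothetical age-group counts for one congressional district.
actual_counts = {"0-49": 600, "50-64": 250, "65+": 150}
estimated_counts = {"0-49": 610, "50-64": 240, "65+": 150}
differences = abs_pct_differences(actual_counts, estimated_counts)
```

Comparing percentages rather than raw counts keeps the measure comparable across districts of different sizes.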

After comparing differences in counts in the overall populations and populations by sex, race/ethnicity, and age groups, the tract-estimated congressional districts were found to have small differences compared to the actual congressional district population counts. With small differences in populations and subgroups of interest, these tract-estimated congressional district populations could be used to calculate congressional district cancer rates that would likely be close to the rates if the actual congressional district populations were available. Using these tract-estimated congressional districts to calculate cancer rates has the added advantage of aligning well with cancer case geocoding that is usually developed and verified at the census tract level.

We are currently supporting the 117th Congressional Districts using 2010 tract boundaries and the 118th Congressional Districts using both 2010 and 2020 tract boundaries.

Supplemental NAACCR Poverty Indicator Measures and Yost Index

Epidemiologists cannot ignore the impact of social conditions on population health. Cancer registries currently collect the Krieger poverty codes, an area-based social measure (ABSM) derived from census data on the percent of people living below the poverty level. These codes are available in US cancer registry data at the county and census tract levels and can be used to assess the impact of poverty at the individual level, using the poverty ABSM as a proxy, and at the community level, addressing both the compositional and contextual effects of the social environment on cancer.

The Krieger codes have been the standard, but the codes were developed using New England census data. For other regions of the country with higher poverty rates and for population groups with higher poverty rates, the Krieger cutpoints result in residual confounding, and using these cutpoints can mask real disparities, particularly when analyzing minority populations. Additionally, other social data may also be important to include in etiologic and public health planning research, such as language isolation or housing security. Instead of relying on just the single poverty measure of SES, researchers have developed a multifactorial socioeconomic index to evaluate the potential impact of socioeconomic gradients on cancer burden. This index, called the Yost Index because it was developed by Kathleen Yost, requires a number of area-based social measures that are available from the census.

With the above in mind, NAACCR is requesting supplemental, tract-level ABSMs to be submitted during call for data for evaluation. The variables are time dependent, based on diagnosis year, and the data are appended to the case using NAACCR*Prep. The additional variables requested are now minimal in number because we are linking to a predefined SES index, the Yost variable, instead of all the component parts of the index. We are also collecting SES quintiles by race. The Yost variable has been assessed and recoded to limit uniqueness of the data and ensure limited risk of disclosure of census tract. The SES index is identical to the SEER composite SES score. More information is available here: https://seer.cancer.gov/seerstat/databases/census-tract/index.html.

These data are still being evaluated by NAACCR. None of these variables are released for any research without the consent of the submitting registry. The ACS quintiles by race/ethnicity will be used to develop useful race/ethnicity-based cutpoints to enable research on poverty by minority group. During this evaluation period, NO supplemental data will be released to outside researchers.

Central registries have two ways to calculate these variables. After the variables are appended, using the combination of geocoded state, county, and tract, the tract can then be stripped from the data. This is an option in NAACCR*Prep. But a registry may also choose to maintain the tract for 2010 and 2020 and submit it to NAACCR for evaluation for CiNA Geographic. The tract data will be separated and stored separately from the other CiNA data. Access to tract data is currently limited to the NAACCR Program Manager of Data Use and Research, Dr. Recinda Sherman, and required IMS database administrators. Use is currently limited to evaluation and fitness-for-use assessment only, not research.

If you have any questions, please contact Recinda Sherman.

NCI/NAACCR Cancer Reporting Zones

NCI, in collaboration with NAACCR, is working with individual registries to develop a set of cancer reporting zones across the U.S. that are more suitable for cancer data reporting than counties. In each respective state, the zones will be custom crafted to represent areas that:

are meaningful to stakeholders in terms of cancer reporting and cancer interventions;
comprise adjacent census tracts and smaller counties (or portions of counties) that sum to population sizes that are sufficiently large to support stable rates;
collectively cover the entire population of the state;
are homogeneous with respect to important sociodemographic characteristics and are compact in size;
have large enough case counts for data reporting without compromising confidentiality; and
result in a relatively small proportion of areas with suppressed values, although for rarer cancer sites suppression will be inevitable, especially when producing rates stratified by sex and/or race.

These zones subdivide large-population urban counties and are collections of smaller rural counties (or portions of counties). They have a minimum population size of 50,000. Participation in the project is voluntary. If your registry is interested in participating, please contact Recinda Sherman. To learn more about the project and methodology, see the article Developing Geographic Areas for Cancer Reporting Using Automated Zone Design.

As of July 2024, NCI has developed and finalized zones associated with 22 registries and their catchment areas. A crosswalk from 2010 census tracts to cancer reporting zones is available for these areas and can be used to calculate cancer rates. The crosswalk includes an 11-character tractID based on 2010 census tract geographies, an 8-character ZoneIDOrig variable that identifies the unique cancer reporting zone within a registry’s catchment area, and a 10-character ZoneIDFull variable that identifies the unique cancer reporting zone across registries (the 2-digit state FIPS code followed by the 8-character ZoneIDOrig). In addition, the crosswalk includes a variable Zone_Tract_Certainty, a flag that indicates whether the census tract is needed in order to assign the zone. There are two codes for this field: 0 = can assign zone based on county; 1 = need tract to determine zone. This flag can be used in conjunction with census tract certainty: when Zone_Tract_Certainty = 1, only high-certainty census tracts should be used to assign the cancer reporting zone and subsequently calculate statistics at the zone level.
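
The sketch below shows one way the crosswalk fields described above might be used to assign a zone to a case. It is a minimal sketch with assumptions: the function names, the dictionary layout of the crosswalk, the example IDs, and the boolean `tract_is_high_certainty` argument (which the registry would derive from its own census tract certainty coding) are hypothetical and are not part of the published crosswalk.

```python
def zone_id_full(state_fips, zone_id_orig):
    """Build the 10-character ZoneIDFull from its two documented parts."""
    if len(state_fips) != 2 or len(zone_id_orig) != 8:
        raise ValueError("expected 2-digit state FIPS and 8-character ZoneIDOrig")
    return state_fips + zone_id_orig

def assign_zone(tract_id, crosswalk, tract_is_high_certainty):
    """Return ZoneIDFull for a case, or None if no zone should be assigned.

    crosswalk: dict keyed by the 11-character tractID; each value holds
    ZoneIDFull and Zone_Tract_Certainty (0 = county is enough, 1 = tract needed).
    """
    row = crosswalk.get(tract_id)
    if row is None:
        return None
    # When the tract is needed to pick the zone, require a high-certainty tract.
    if row["Zone_Tract_Certainty"] == 1 and not tract_is_high_certainty:
        return None
    return row["ZoneIDFull"]

# Hypothetical crosswalk rows; real IDs come from the published crosswalk.
crosswalk = {
    "16001000100": {"ZoneIDFull": zone_id_full("16", "A0000001"),
                    "Zone_Tract_Certainty": 1},
    "16003950100": {"ZoneIDFull": zone_id_full("16", "A0000002"),
                    "Zone_Tract_Certainty": 0},
}
zone = assign_zone("16003950100", crosswalk, tract_is_high_certainty=False)
```

Cases left unassigned here would need a registry-defined rule for how they are handled when zone-level statistics are calculated.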

NAACCR HISPANIC AND ASIAN/PACIFIC ISLANDER IDENTIFICATION ALGORITHM (NHAPIIA)

This algorithm combines NHIA and NAPIIA into a single SAS program.

NHAPIIA v19 – 10/15/2019

RUCA and URIC codes

Rural-Urban Data Items

Studies have shown that residents of rural areas have lower screening rates, lower rates of follow-up of abnormal screening tests, higher late-stage diagnosis rates, and differences in cancer treatment patterns. Including tract-level indicators of rural-urban residence in the NAACCR data files will facilitate research in rural-urban disparities and allow researchers to control for rural-urban differences in model-based analysis of cancer risks and outcomes.

This SAS code creates 2 different measures of the rural-urban environment. The URIC is a measure of the rural nature of the place of residence and can be an indicator of access to recreation, access to food stores, exposures to pollutants, crime levels, social cohesion, etc. The USDA RUCA-based indicator is a measure of the proximity to large urban centers and can be an indicator of access to oncology specialists and cancer treatment facilities. Both indicators have been tested for uniqueness and they do not allow the identification of individual census tracts as long as the county is not known.

Description of items:

Two indicators of the rural-urban environment based on the census tract of the diagnosis address:

Urban Rural Indicator Codes (URIC) is based on the Census Bureau’s identification of urban and rural areas
Rural Urban Commuting Areas Codes (RUCA) is based on the USDA’s Rural Urban Commuting Area (RUCA) codes

Cases diagnosed between 1995 and 2004 are assigned a code based on the 2000 U.S. Census. Cases diagnosed since 2005 are assigned a code based on the 2010 U.S. Census.

Allowable values:

URIC:
- 1: all urban – the percent of the population in an urban area = 100%
- 2: mostly urban – the percent of the population in an urban area < 100% and ≥ 50%
- 3: mostly rural – the percent of the population in a rural area < 100% and > 50%
- 4: all rural – the percent of the population in a rural area = 100%
- 9: unknown or not applicable – census tract not available or tract population was zero at the last decadal census
- C: the state + county + tract combination was not found in the lookup table
- D: either the state, county, or tract were blank or an unknown value (e.g., state was “ZZ”, county was “999”, etc.)

RUCA:
- 1: urban commuting area – RUCA codes 1.0, 1.1, 2.0, 2.1, 3.0, 4.1, 5.1, 7.1, 8.1, and 10.1
- 2: not an urban commuting area – all other RUCA codes except 99
- 9: unknown or not applicable – census tract not available or RUCA code = 99
- C: the state + county + tract combination was not found in the lookup table
- D: either the state, county, or tract were blank or an unknown value (e.g., state was “ZZ”, county was “999”, etc.)
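
The allowable values above can be read as a small set of recode rules. The Python sketch below is illustrative only and is not the NAACCR SAS program: the function names, the use of strings for RUCA codes, and the lookup-table layout are assumptions, and the "D" check covers only the two unknown values named above ("ZZ" and "999").

```python
# RUCA codes treated as urban commuting areas, as listed above.
URBAN_COMMUTING_RUCA = {"1.0", "1.1", "2.0", "2.1", "3.0",
                        "4.1", "5.1", "7.1", "8.1", "10.1"}

def census_vintage(dx_year):
    """Census used for coding: 2000 for 1995-2004 diagnoses, 2010 for 2005 onward."""
    if 1995 <= dx_year <= 2004:
        return 2000
    if dx_year >= 2005:
        return 2010
    return None  # years before 1995 are not covered by the description above

def uric_from_pct_urban(pct_urban, population):
    """URIC code from the percent of the tract population in an urban area."""
    if pct_urban is None or not population:
        return "9"  # tract not available or zero population
    if pct_urban == 100:
        return "1"  # all urban
    if pct_urban >= 50:
        return "2"  # mostly urban
    if pct_urban > 0:
        return "3"  # mostly rural (rural share above 50% but below 100%)
    return "4"      # all rural

def ruca_indicator(ruca_code):
    """Two-level RUCA-based indicator from a RUCA code given as a string."""
    if ruca_code is None or ruca_code == "99":
        return "9"
    return "1" if ruca_code in URBAN_COMMUTING_RUCA else "2"

def lookup_status(state, county, tract, table):
    """Return 'D' for blank/unknown geography, 'C' if not in the table, else None."""
    if not state or not county or not tract or state == "ZZ" or county == "999":
        return "D"
    if (state, county, tract) not in table:
        return "C"
    return None
```

A caller would check `lookup_status` first and apply the URIC and RUCA recodes only when it returns `None`.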

NAACCR METHOD TO ESTIMATE COMPLETENESS

Registries can use this worksheet for a reasonable assessment of current case ascertainment. Please keep in mind that adjustments were made to the method for diagnosis year 2020 and forward. But using the most recent diagnosis year spreadsheet will apply the most current population estimates. Therefore, your completeness monitoring will be closer to the official NAACCR estimate.

NAACCR Method to Estimate Completeness Workbook (2010-2022) (document updated 8/15/2025)

RECORD UNIQUENESS

The Record Uniqueness Program was developed by Howe, Lake, and Shen to assess electronic data files for risk of confidentiality breach based on unique combinations of key variables.

An Executable Record Uniqueness Program for Moderate-Sized Files (Zipped EXE)
A SAS Program for Record Uniqueness for Large Files (Zipped SAS Macro)
A description of the method
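
To make the idea of "unique combinations of key variables" concrete, here is a simplified Python sketch. It is not the Howe, Lake, and Shen program and does not reproduce their method; it only counts records whose combination of chosen variables appears exactly once. The function name, variable names, and example records are hypothetical.

```python
from collections import Counter

def unique_records(records, key_vars):
    """Return (records with a one-of-a-kind key combination, share unique)."""
    combos = [tuple(rec[var] for var in key_vars) for rec in records]
    counts = Counter(combos)
    flagged = [rec for rec, combo in zip(records, combos) if counts[combo] == 1]
    share = len(flagged) / len(records) if records else 0.0
    return flagged, share

# Hypothetical records described by three key variables.
records = [
    {"county": "001", "sex": "1", "age_group": "65+"},
    {"county": "001", "sex": "1", "age_group": "65+"},
    {"county": "003", "sex": "2", "age_group": "0-49"},
    {"county": "005", "sex": "2", "age_group": "50-64"},
]
flagged, share_unique = unique_records(records, ["county", "sex", "age_group"])
```

In general, adding more key variables or using finer categories increases the share of unique records, which is why coarsening variables is a common way to lower disclosure risk.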

SEX CODE VALIDATION UTILITY

This is a software utility developed in MS Access to identify miscoded sex codes based on first name. Taking as input a data file in NAACCR v16 format, a query runs against a list of known sex/name pairs, and it produces a list of cases for manual review that have potential errors in sex. The utility is based on an algorithm initially created by the New York Cancer Registry in August 2011.

Sex Edit for NAACCR V18

In real world registry settings, the number of potential errors flagged by the tool is extremely low – in the neighborhood of 0.25%. After careful review, users have reported that about 20-50% of the cases identified by the tool in need of review are indeed in error. Higher percentages have been found when the tool is run on incoming registry data. For cases where the edit flagged a sex that was correct, a misspelling of the name was often identified. For male breast cancer, nearly all flagged cases were errors, a consequence of the highly skewed sex distribution of this cancer site. A published study on this tool is available here.
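
The general approach of checking recorded sex against known name/sex pairs can be sketched as follows. This is not the MS Access utility or the New York algorithm; the function name, the field names, and the tiny name list are hypothetical. The sex values follow the NAACCR coding of 1 for male and 2 for female.

```python
def flag_sex_for_review(cases, name_sex):
    """Return cases whose recorded sex disagrees with the name/sex list.

    cases: list of dicts with 'first_name' and 'sex' ('1' = male, '2' = female).
    name_sex: dict mapping an upper-case first name to its expected sex code.
    Names missing from the list are never flagged.
    """
    flagged = []
    for case in cases:
        expected = name_sex.get(case["first_name"].strip().upper())
        if expected is not None and expected != case["sex"]:
            flagged.append(case)
    return flagged

# Hypothetical name list and cases, for illustration only.
name_sex = {"JOHN": "1", "MARY": "2"}
cases = [
    {"id": 1, "first_name": "John", "sex": "1"},
    {"id": 2, "first_name": "Mary ", "sex": "1"},  # flagged: expected '2'
    {"id": 3, "first_name": "Alex", "sex": "2"},   # not on the list, not flagged
]
for_review = flag_sex_for_review(cases, name_sex)
```

Flagged cases are candidates for manual review rather than confirmed errors, consistent with the review rates reported above.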

TRANSLATION TOOLS FOR VOLUME II (DATA DICTIONARY)

This page contains tools to import and export data in NAACCR Volume II format.

V25 SAS TRANSLATION TOOL (Updated September 2025)

Here are the latest SAS program and Word instructions for reading and writing NAACCR V23 XML files using SAS. We are grateful to Fabian Depry of IMS for updating these resources annually and to Chris Johnson for spearheading this resource.

The SAS program is to be used in conjunction with the Word document. Note: Use of this code requires proficiency in SAS.

Before starting, read the instructions first.

Really.

It will make your life better.

As you use the tool, we appreciate any feedback or comments you have (contact Chris Johnson of the Idaho registry).

Prior Versions: V24, V23, V22, V21 SAS Translation Tools