NAACCR Geocoder

This topic has 4 replies, 2 voices, and was last updated 7 years, 7 months ago by Francis Boscoe.

February 8, 2017 at 3:21 pm #5070

Recinda Sherman
Spectator

We are currently working on improving the NAACCR Geocoder, with a focus on the underlying street file data.

Please use this forum to discuss and report any potential issues with the NAACCR geocoder.

February 8, 2017 at 7:16 pm #5092

Francis Boscoe
Spectator

Here is the example that was discussed on today’s call (the house numbers are real, but are altered from any patient’s):

For the non-existent address 182 Washington Avenue, Albany, NY 12203

AGGIE currently gives

182 Washington Avenue Extension, Albany, NY 12203 (in the NAACCR version)
182 Washington Avenue Avenue, Albany, NY 12203 (in the SEER*DMS version – I suspect a small bug here)

with a match score of 100 (*update – based on the call, this result will be penalized in the future to have a score less than 100).

However, in this case, the correct address is 182 Washington Avenue, Albany, NY 12210

Other viable candidates would have been:

282 Washington Avenue, Albany, NY 12203
782 Washington Avenue, Albany, NY 12203
982 Washington Avenue, Albany, NY 12203
1082 Washington Avenue, Albany, NY 12203

Here are my own Bayesian prior probabilities for each of these possibilities:

182 Washington Avenue Extension, Albany, NY 12203 (0.35)
182 Washington Avenue, Albany, NY 12210 (0.35)
282 Washington Avenue, Albany, NY 12203 (0.05)
782 Washington Avenue, Albany, NY 12203 (0.05)
982 Washington Avenue, Albany, NY 12203 (0.05)
1082 Washington Avenue, Albany, NY 12203 (0.14)
None of the above (0.01)

So AGGIE is picking the most likely choice here (or at least one tied for most likely), and I think this would be the case most of the time, but it would still be incorrect 65% of the time.
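
The arithmetic behind that 65% figure can be checked with a short sketch. The probabilities are the ones listed above; the variable names are illustrative, and the assumption is that the geocoder always returns the single most probable candidate.

```python
# Prior probabilities copied from the list above. Variable names are
# illustrative only and are not part of AGGIE.
priors = {
    "182 Washington Avenue Extension, Albany, NY 12203": 0.35,
    "182 Washington Avenue, Albany, NY 12210": 0.35,
    "282 Washington Avenue, Albany, NY 12203": 0.05,
    "782 Washington Avenue, Albany, NY 12203": 0.05,
    "982 Washington Avenue, Albany, NY 12203": 0.05,
    "1082 Washington Avenue, Albany, NY 12203": 0.14,
    "None of the above": 0.01,
}

# The priors should form a complete probability distribution.
total = sum(priors.values())
assert abs(total - 1.0) < 1e-9

# Assumption: the geocoder returns the single most probable candidate.
# On a tie, max() keeps the first one encountered.
best_candidate = max(priors, key=priors.get)
p_correct = priors[best_candidate]
p_incorrect = 1.0 - p_correct

print(f"Chosen candidate: {best_candidate}")
print(f"P(correct) = {p_correct:.2f}, P(incorrect) = {p_incorrect:.2f}")
```

Even the best possible single guess is right only as often as its own prior probability, which is why tuning weights and penalties cannot push accuracy past the probability of the top candidate.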

I think this is a typical example, in that there will usually be a handful of possible alternatives for every typo. No matter how much we tweak the weights and penalties, I don’t see how AGGIE could ever guess correctly more than half the time. Certain kinds of analyses can tolerate having a few percent of the records geocoded to the wrong place. In New York, because we are legally mandated to publish small-area case counts, and because we do many small-area cancer investigations, we can’t. That is why we require a match score of 100.

February 22, 2017 at 3:48 pm #5215

Francis Boscoe
Spectator

Another example: Ocean Avenue is not the same as Ocean Parkway. There are tens of thousands more like this.

February 22, 2017 at 3:52 pm #5216

Francis Boscoe
Spectator

I am unable to edit the above post; I just get taken to a blank screen. Anyhow, it was a screenshot showing how the two streets are miles apart. 512 KB is a tiny file size limit, so you might want to rethink that.

August 10, 2017 at 1:23 pm #6076

Francis Boscoe
Spectator

After many rounds of improvements on AGGIE’s part, here is my assessment of how it compares with the previously existing data in the New York State Cancer Registry. It’s not as detailed as what New Jersey did, but should suffice.

94.9% – AGGIE returns same county
4.9% – AGGIE replaces known with unknown – this reflects our conservative approach (requiring high match score) and is similar to what we had in our old system. We can handle manual review of this many cases.
0.2% – AGGIE replaces unknown with known – I spot-checked a few and AGGIE looked good
0.02% – AGGIE replaces known with known – I spot-checked a few and AGGIE looked good
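
A minimal sketch of how a county comparison along these lines could be tallied. The function name, the sample records, and the use of `None` for an unknown county are all assumptions for illustration; the "unknown in both" category is also an assumption, since the breakdown above does not list it.

```python
from collections import Counter

def classify_county_change(registry_county, geocoder_county):
    """Label how the geocoder's county compares with the registry's.

    None stands for an unknown county (an assumption of this sketch).
    """
    if registry_county is None and geocoder_county is None:
        return "unknown in both"  # not listed in the post; added for completeness
    if registry_county is None:
        return "unknown replaced with known"
    if geocoder_county is None:
        return "known replaced with unknown"
    if registry_county == geocoder_county:
        return "same county"
    return "known replaced with known"

# Hypothetical (registry, geocoder) county code pairs, made up for the example.
sample_pairs = [
    ("36001", "36001"),
    ("36001", None),
    (None, "36045"),
    ("36001", "36045"),
    ("36045", "36045"),
]

counts = Counter(classify_county_change(r, g) for r, g in sample_pairs)
for label, n in counts.most_common():
    print(f"{100 * n / len(sample_pairs):5.1f}% - {label}")
```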

Latitude/longitude (restricting to where it is known on both databases)
96.1% – AGGIE is within 100 meters of existing registry value
3.7% – AGGIE is between 100 meters and 1 km from existing registry value
0.2% – AGGIE more than 1 km different
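
For anyone wanting to run a similar comparison, here is a rough sketch of bucketing coordinate differences into the three bands above. It uses the haversine great-circle formula with a mean Earth radius, which is an assumption of this sketch and not necessarily how the figures in this post were computed; the function names are hypothetical.

```python
import math

EARTH_RADIUS_M = 6_371_000  # mean Earth radius in meters (approximation)

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two points in decimal degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))

def distance_bucket(meters):
    """Assign a distance to one of the three bands used in the comparison."""
    if meters <= 100:
        return "within 100 m"
    if meters <= 1000:
        return "100 m - 1 km"
    return "more than 1 km"

# Illustrative coordinates only: two points 0.001 degrees of latitude apart.
d = haversine_m(42.650, -73.750, 42.651, -73.750)
print(f"{d:.0f} m -> {distance_bucket(d)}")
```

Both coordinate pairs need to be in the same datum for the comparison to be meaningful; at these distance bands, the spherical approximation is more than adequate.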

Most of the differences in the 1-3 km range seem to arise from choosing different points on the same road. Of the ones I’ve spot-checked, AGGIE is correct 63% of the time, the original registry value 27%, and neither 12%.

In the 20-30 km range, AGGIE was correct 3 times and the registry 6 times (only 9 examples total). Two of the AGGIE errors were where it replaced COUNTY ROUTE 2 with COUNTY ROUTE and placed a point on a seemingly random county route (matched to the county parcel layer with a score of 100). Maybe this can be fixed, but obviously it is an infrequent occurrence. Three of the errors were on Route 12 in Watertown, but the errors were inconsistent: the registry was right twice and AGGIE once.

All 18 differences of more than 50 km were typos by the registry.

On balance, AGGIE wins. Time to turn it back on for NY.

The forum ‘Research & Data Use’ is closed to new topics and replies.
