Here is the example that was discussed on today’s call (the house numbers are real, but have been changed so that they do not match any patient’s):
For the non-existent address 182 Washington Avenue, Albany, NY 12203
AGGIE currently gives
182 Washington Avenue Extension, Albany, NY 12203 (in the NAACCR version)
182 Washington Avenue Avenue, Albany, NY 12203 (in the SEER*DMS version – I suspect a small bug here)
with a match score of 100 (*update – based on the call, this result will be penalized in the future to have a score less than 100).
However, in this case, the correct address is 182 Washington Avenue, Albany, NY 12210
Other viable candidates would have been:
282 Washington Avenue, Albany, NY 12203
782 Washington Avenue, Albany, NY 12203
982 Washington Avenue, Albany, NY 12203
1082 Washington Avenue, Albany, NY 12203
Here are my own Bayesian prior probabilities for each of these possibilities:
182 Washington Avenue Extension, Albany, NY 12203 (0.35)
182 Washington Avenue, Albany, NY 12210 (0.35)
282 Washington Avenue, Albany, NY 12203 (0.05)
782 Washington Avenue, Albany, NY 12203 (0.05)
982 Washington Avenue, Albany, NY 12203 (0.05)
1082 Washington Avenue, Albany, NY 12203 (0.14)
None of the above (0.01)
So AGGIE is picking the most likely candidate here (or at least one tied for most likely), and I think that would be the case most of the time. Even so, with a probability of only 0.35, that pick would be incorrect 65% of the time.
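The arithmetic behind that 65% figure can be checked with a short sketch. The probabilities are the subjective estimates listed above; the variable names are illustrative only and have nothing to do with how AGGIE itself scores candidates.

```python
from fractions import Fraction

# Subjective prior probabilities from the list above (not AGGIE output).
priors = {
    "182 Washington Avenue Extension, Albany, NY 12203": Fraction("0.35"),
    "182 Washington Avenue, Albany, NY 12210": Fraction("0.35"),
    "282 Washington Avenue, Albany, NY 12203": Fraction("0.05"),
    "782 Washington Avenue, Albany, NY 12203": Fraction("0.05"),
    "982 Washington Avenue, Albany, NY 12203": Fraction("0.05"),
    "1082 Washington Avenue, Albany, NY 12203": Fraction("0.14"),
    "None of the above": Fraction("0.01"),
}

# The estimates should form a proper probability distribution.
total = sum(priors.values())

# Highest probability, and every candidate that attains it (ties included).
best_p = max(priors.values())
top_candidates = [addr for addr, p in priors.items() if p == best_p]

# Chance that a single best guess is wrong, even when it is a top candidate.
error_rate = 1 - best_p

print(f"total probability: {float(total):.2f}")
print(f"top candidates: {top_candidates}")
print(f"error rate of the best guess: {float(error_rate):.0%}")
```

Exact fractions are used instead of floats so that the check that the estimates sum to 1 is not affected by rounding.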
I think this is a typical example, in that there will usually be a handful of possible alternatives for every typo. No matter how much we tweak the weights and penalties, I don’t see how AGGIE could ever guess correctly more than half the time. Certain kinds of analyses can tolerate having a few percent of the records geocoded to the wrong place. In New York, because we are legally mandated to publish small-area case counts, and because we do many small-area cancer investigations, we can’t. That is why we require a match score of 100.