Data Cleaning Part 2 - Geocoding Addresses, Double The Performance By Cleaning | From Learning and Evolution to Data Science

Summary

This is my second post on topic of Data Cleaning.
Cleaning addresses format turned out to have a substantial positive impact on Geocoding performance.
Deep understandings of address format standard is needed to deal with all kinds of special cases.

Introduction

I discussed a lot of interesting findings I discovered in NYC Taxi Trip data in last post. However it was not clear whether the cleaning added much value to the analysis other than some anomaly records were removed, and you can always check the outliers for any calculation and remove them when appropriate.

Actually there are some times that the data cleaning can have great benefits. I was geocoding lots of addresses from public data recently, and found cleaning the addresses almost doubled the geocoding performance. This effect is not really mentioned anywhere as far as I know, and I only have a theory about how that is possible.

In short, I was feeding address strings to PostGIS Tiger Geocoder extension for geocoding.

Clean Addresses Have Much Better Geocoding Performance

Simple assembling the columns could have lots of dirty inputs which will interfere with the Geocoder parsing. I first did one pass Geocoding on 2010 data, then checked the geocoding results. I filtered many type of dirty inputs that caused problems and cleaned them up. Using the cleaning routine on other years’ data, the geocoding performance doubled.

NFIRS Data Year	Addresses Count	Time Used
2009	1,767,797	6.3 days
2010	1,829,731	14.28 days
2011	1,980,622	7.06 days
2012	1,843,434	6.57 days
2013	1,753,145	6.51 days

I didn’t find anybody mentioned this kind of performance gain in my thorough research on Geocoding performance tuning. Somebody suggested to normalize address first, but that didn’t help on performance because the Geocoder actually will normalize address input anyway, unless your normalize procedure is vastly better than the built-in normalizer. My theory about this performance gain is as follows:

Postgresql PostGIS server will try to cache all the data needed for geocoding in RAM. My Geocoding server can hold 1 ~ 2 states’ data in RAM, so I split the input addresses by states. Every input file are single state only. Ideally the server will not need to read from disk in most of time.
The problem is there are lots of addresses that have wrong zip code or city. The Geocoder can still process them but it will be much more slower because the Geocoder need to scan in a much broader range. It seemed that it will scan all states even if the state information is correct. I didn’t find a way to limit the scan range to a known state, and this was confirmed by the Geocoder author.
The problematic addresses are scattered in the input file. Every time when the Geocoder meet them, it will scan all states and mess up the perfect cache, which caused lots of performance drop on the good addresses followed.
With the cleaning procedure in use, the bad address are either removed from input or collected into a special input file, separated from the good addresses. Now the Geocoder can process the good addresses much faster.

All the format errors

Here are the cleaning procedures I used. In the end I filtered and cleaned about 14% of data in many types.

# loading data and preparing address string
data_year = '2010'
# create year directory, load original address data, change year number here.
load(paste0('data/address/', data_year, '_formated_addresses.Rdata')) 
setnames(address,'ZIP5', 'zip')
address[, row_seq := as.numeric(row.names(address))]
setkey(address, zip)
address[, address_type := 'a'] # type 1, 3,4,5 as addresses can be geocoded.
address[LOC_TYPE == '2', address_type := 'i']  # to be combined with intersections in type 1 as intersections input
address[LOC_TYPE %in% c('6', '7'), address_type := 'n'] # ignore 6,7
# original reference, change input string instead of original fields if possible
address[, original_address :=
                    paste0(NUM_MILE,' ', STREET_PRE,' ', STREETNAME,' ', STREETTYPE, 
                           ' ', STREETSUF, ' ',APT_NO, ', ', CITY, ', ', STATE_ID, ' ', zip)]

There are many manually inputed symbols for NA:

> head(str_subset(address$original_address, "N/A"))
[1] "55  Margaret ST  N/A, Monson, MA 01057"         "55  Margaret ST  N/A, Monson, MA 01057"        
[3] "1657  WORCESTER RD   N/A, FRAMINGHAM, MA 01701" "132  UNION AV   N/A, FRAMINGHAM, MA 01702"     
[5] "N/A  OAKLAND BEACH AV   , Warwick, RI 02889"    "00601  MERRITT 7 N/A  , NORWALK, CT 06850"  

> head(str_subset(address$original_address, "null"))
[1] "96  Walworth ST  null, Saratoga Springs, NY 12866" "197 S Broadway   null, Saratoga Springs, NY 12866"
[3] "640  West Broadway   , Conconully, WA 98819"       "58  W Fork Rd   , Conconully, WA 98819"           
[5] "  Mineral Hill Rd   , Conconully, WA 98819"        "225  Conconully ST  , OKANOGAN, WA 98840"

Because ‘NA’ or ‘na’ could be a valid part in address string, it’s better to clean them before concatenating fields into one address string.

> head(str_subset(address$original_address, "NA"))
[1] "7821 W CINNABAR AV  , PEORIA, AZ 00000"      "7818 W PINNACLE PEAK RD  , PEORIA, AZ 00000"
[3] "8828 W SANNA ST  , PEORIA, AZ 00000"         "8221 W DEANNA DR  , PEORIA, AZ 00000"       
[5] "2026 W NANCY LN  , PHOENIX, AZ 00000"        "3548 E HELENA DR  , PHOENIX, AZ 00000"

Once I finished cleaning on fields, I will prepare a cleaner address string and do the further cleaning in that concatenated string. That’s why I concatenated all original fields into original_address, which is for reference in case some fields changed in later process.

Most other cleaning process are better done in the whole string, because some input may go to wrong fields, like street number in street name instead of street number column. With the whole string this kind of error doesn’t matter any more.

# remove all kinds of input for NA
str_subset(address$original_address, "N/A")
for (j in seq_len(ncol(address)))
    set(address,which(is.na(address[[j]]) | 
                        (address[[j]] %in% c('N/A','n/a', 'NA','na', 'NULL', 'null'))),j,'')

Many addresses’ zip code are wrong.

> sample(address[!grep('\\d\\d\\d\\d\\d', zip), zip], 20)
 [1] ""     "06"   ""     ""     "625"  "021"  "33"   "021"  "461"  ""     "021"  "2008" "970"  ""     "11"   "021"  "021" 
[18] "9177" ""     "021"

The Geocoder can process address without zip code, but it have to be format like ‘00000’.

# ---- some zip are invalid ----
address[!grep('\\d\\d\\d\\d\\d', zip), ':=' (zip = '00000', address_type = 'az')]

After the above 2 steps of direct modifying address fields, I prepared the address string and will process the whole string in all later cleaning.

# ---- prepare address string (ignore apt_no) ---- 
address[, input_address :=
                    paste0(NUM_MILE,' ', STREET_PRE,' ', STREETNAME,' ', STREETTYPE, 
                           ' ', STREETSUF, ' ', ', ', CITY, ', ', STATE_ID, ' ', zip)]  
address[, input_address := str_trim(gsub("\\s+"," ",input_address))]

Some addresses are empty.

> head(address[STATE_ID == '' & STREETNAME == '', original_address])
[1] "     , ,  " "     , ,  " "     , ,  " "     , ,  " "     , ,  " "     , ,  "

# ---- ignore empty rows, most with empty state and zip ---- 
# may came from duplicate records for same event from 2 dept
address[STATE_ID == '' & STREETNAME == '', address_type := 'e']

There are lots of usage of speical symobls like /, @, &, * in input which will interfere with the Geocoder.

> sample(address[LOC_TYPE == '1' & str_detect(address$input_address, "[/|@|&]"), input_address], 10)
 [1] "743 CHENANGO ST , BINGHAMTON/FENTON, NY 13901"             "123/127 tennyson , highland park, MI 48203"               
 [3] "318 1/2 McMILLEN ST , Johnstown, PA 15902"                 "712 1/2 BURNSIDE DR , GARDEN CITY, KS 67846"              
 [5] "m/m143 W Interstate 16 , Ellabell, GA 31308"               "12538 Greensbrook Forest DR , Houston / Sheldon, TX 77044"
 [7] "F/O 1179 CASTLEHILL AVE , New York City, NY 10462"         "509 1/2 N Court , Ottumwa, IA 52501"                      
 [9] "7945 Larson , Hereford/Palominas, AZ 85615"                "1022 1/2 N Langdon ST , MITCHELL, SD 57301"

First I remove all the 1/2 since the Geocoder cannot recognize them, and removing them will not affect the Geocoding result accuracy.

address[str_detect(address$input_address, "1/2"), 
        input_address := str_replace_all(input_address, "1/2", "")]

Some used * to label intersections, which I will use different Geocoding scripts to process later.

> head(address[str_detect(input_address, "[a-zA-Z]\\*[a-zA-Z]"), input_address])
[1] "16 MC*COOK PL , East Lyme, CT 06333"             "1236 WAL*MART PLZ , PHILLIPSBURG, NJ 08865"     
[3] "0 GREENSPRING AV*JFX , BROOKLANDVILLE, MD 21022" "0 BELFAST RD*SHAWAN RD , COCKEYSVILLE, MD 21030"
[5] "0 SHAWAN RD*WARREN RD , COCKEYSVILLE, MD 21030"  "0 SHAWAN RD*BELFAST RD , COCKEYSVILLE, MD 21030"

address[str_detect(input_address, "[a-zA-Z]\\*[a-zA-Z]"), address_type := 'i_*']

Similarly filter other special symbols.

address[address_type == 'a' & str_detect(address$input_address, "[/|@|&]"),
        address_type := 'i_/@&']

Many addresses used milepost numbers, which is a miles count along highway. They are not street addresses and cannot be processed by the Geocoder. There are all kinds of usage to record this type of address.

> head(str_subset(address$input_address, "(?i)milepost"))
[1] "452.2E NYS Thruway Milepost , Angola, NY 14006" "447.4W NYS Thruway Milepost , Angola, NY 14006"
[3] "446W NYS Thruway Milepost , Angola, NY 14006"   "447.4 NYS Thruway Milepost , Angola, NY 14006" 
[5] "444.1W NYS Thruway Milepost , Angola, NY 14006" "I-94 MILEPOST 68 , Eau Claire, WI 54701"       

> head(str_subset(address$input_address, "\\bmile\\b|\\bmiles\\b"))
[1] "2.5 mile Schillinger RD , T8R3 NBPP, ME 00000"        "cr 103(2 miles west of 717) , breckenridge, TX 00000"
[3] "Interstate 93 south mile mark , WINDHAM, NH 03087"    "183 lost mile rd. , parsonfield, ME 04047"           
[5] "168 lost mile RD , w.newfield, ME 04095"              "20 mile stream rd , proctorsville, VT 05153"

Note it’s still possible to have some valid street address with mile as a word in address(my regular expression only check when mile is a whole word, not part of word), but it should be very rare and difficult to separate the valid addresses from the milepost usage. So I’ll just ignore all of them.

address[str_detect(paste0(NUM_MILE, STREETNAME), "\\bmile\\b|\\bmiles\\b"), address_type := 'm']
address[str_detect(address$input_address, "(?i)milepost"), address_type := 'm']

Another special format of address is grid style address. I decided to remove the grid number part and keep the rest of address. The Geocoder will get a rough location for that street or city, which is still helpful for my purpose. The Geocoding match score will separate this kind of rough match from the exact match of street addresses.

Grid-style Complete Address Numbers (Example: “N89W16758”). In certain communities in and around southern Wisconsin, Complete Address Numbers include a map grid cell reference preceding the Address Number. In the examples above, “N89W16758” should be read as “North 89, West 167, Address Number 58”. “W63N645” should be read as “West 63, North, Address Number 645.” The north and west values specify a locally-defined map grid cell with which the address is located. Local knowledge is needed to know when the grid reference stops and the Address Number begins.
Page 37, United States Thoroughfare, Landmark, and Postal Address Data Standard

Most are WI and MN addresses. Except the E003 NY address, I’m not sure what does that means. Since the Geocoder cannot handle it either, they can be removed.

> sample(address[str_detect(address$input_address, "^[NSWEnswe]\\d"), input_address], 10)
 [1] "W26820 Shelly Lynn DR , Pewaukee, WI 53072"      "E14 GATE , St. Paul, MN 55111"                  
 [3] "W5336 Fairview ROAD , Monticello, WI 53570"      "W22870 Marjean LA , Pewaukee, WI 53072"         
 [5] "E003 , New York City, NY 10011"                  "W15085 Appleton AVE , Menomonee Falls, WI 53051"
 [7] "N7324 Lake Knutson RD , Iola, WI 54945"          "N10729 Hwy 17 S. , Rhinelander, WI 54501"       
 [9] "N2494 St. Hwy. 162 , La Crosse, WI 54601"        "N2639 Cty Hwy Z , Palmyra, WI 53156"

address[str_detect(address$input_address, "^[NSWEnswe]\\d") & address_type == 'a', 
        address_type := 'ag']
address[address_type == 'ag', 
        input_address := str_replace(input_address, "^[NSWEnswe]\\d\\w*\\s", "")]

Some addresses have double quotes in it. Paired double quotes can be handled by the csv and Geocoder, but single double quote will cause problem for csv file.

> sample(address[str_detect(input_address, '"'), input_address], 10)
 [1] "317 IND \"C\" line at 14th ST , New York City, NY 10011" "750 W \"D\" AVE , Kingman, KS 67068"                    
 [3] "HWY \"32\" , SHEBOYGAN, WI 53083"                        "22796 \"H\" DR N , Marshall, MI 49068"                  
 [5] "5745 CR 631 \"C\" ST , Bushnell, FL 33513"               "CTY \"MM\" , HOWARDS GROVE, WI 53083"                   
 [7] "\"BB\" HWY , West Plains, MO 65775"                      "I-55 (MAIN TO HWY \"M\") , Imperial, MO 63052"          
 [9] "3400 Wy\"East RD , Hood River, OR 97031"                 "6555 Hwy \"D\" , parma, MO 63870"

# remove single double quote
address[str_detect(input_address, '(?m)(^[^"]*)"([^"]*$)') & 
          str_detect(address_type, "^a"), address_type := 'aq']
address[address_type == 'aq', 
        input_address := str_replace_all(input_address, '(?m)(^[^"]*)"([^"]*$)', "\\1\\2")]

Some addresses used (), which cause problems for the Geocoder. The stuff inside () can be removed.

> sample(address[str_detect(address$input_address, "\\(.*\\)"), input_address], 10)
 [1] "hwy 56 (side of beersheba mt) , beersheba springs, TN 37305"
 [2] "805 PARKWAY (DOWNTOWN) RD , Gatlinburg, TN 37738"           
 [3] "3409 JAMESWAY DR SW , Bernalillo (County), NM 87105"        
 [4] "96 Arroyo Hondo Road , Santa Fe (County), NM 87508"         
 [5] "3555 Dobbins Bridge RD , Anderson (County), SC 29625"       
 [6] "KARPER (12100-14999) RD , MERCERSBURG, PA 17236"            
 [7] "15.5 I-81 (10001-16000) LN N , Chambersburg, PA 17201"      
 [8] "30 Wintergreen DR , Beaufort (County), SC 29906"            
 [9] "305 Rosecrest RD , Spartanburg (County), SC 29303"          
[10] "1678 ROUTE 12 (Gales Ferry) , Gales Ferry, CT 06335"

# remove paired ()
address[str_detect(address$input_address, "\\(.*\\)"), address_type := 'a()']
address[address_type == 'a()', 
        input_address := str_replace_all(input_address, "\\(.*\\)", "")]

After this step, there are still some single ( cases.

> sample(address[str_detect(input_address, "\\("), input_address], 10)
 [1] "65 E Interstate 26 HWY , Columbus (Township o, NC 28722"   
 [2] "4496 SYCAMORE GROVE (4300-4799 RD , Chambersburg, PA 17201"
 [3] "AAA RD , Fort Hood (U.S. Army, TX 76544"                   
 [4] "2010 Catherine Lake RD , Richlands (Township, NC 28574"    
 [5] "285 Scott CIR NW , Calhoun (St. Address, GA 30701"         
 [6] "Highway 411 NE , Calhoun (St. Address, GA 30701"           
 [7] "2626 HILLTOP CT SW , Littlerock (RR name, WA 98556"        
 [8] "144 Tyler Ct. , Richland (Township o, PA 15904"            
 [9] "263 Farmington AVE , Farmington (Health C, CT 06030"       
[10] "12957 Roberts RD , Hartford (Township o, OH 43013"

address[str_detect(input_address, "\\("), address_type := 'a(']
# other than the case that ( in beginning, all content from ( to , to be removed.
address[str_detect(input_address, "^\\("), 
                        input_address := str_replace(input_address, "^\\(", "")]
address[str_detect(input_address, "\\("), 
                        input_address := str_replace(input_address, "\\(.*(,)", "\\1")]

Some used ; to add additional information, which will only cause trouble for the Geocoder.

> sample(address[str_detect(input_address, ";"), input_address], 10)
 [1] "1816 MT WASHINGTON AV #1; WHIT , Colorado Springs, CO 80906"
 [2] "3201 E PLATTE AV; WAL-MART STO , Colorado Springs, CO 00000"
 [3] "1511 YUMA ST #2; CONOVER APART , Colorado Springs, CO 80909"
 [4] "3550 AFTERNOON CR; MSGT ROY P , Colorado Springs, CO 80910" 
 [5] "805 S CIRCLE DR #B2; APOLLO PA , Colorado Springs, CO 00000"
 [6] "5590 POWERS CENTER PT; SEVEN E , Colorado Springs, CO 80920"
 [7] "715 CHEYENNE MEADOWS RD; DIAMO , Colorado Springs, CO 80906"
 [8] "3140 VAN TEYLINGEN DR #A; SIER , Colorado Springs, CO 00000"
 [9] "Meadow Rd; rifle clu , Hampden, OO 04444"                   
[10] "3301 E SKELLY DR;J , TULSA, OK 74105"

address[str_detect(input_address, ";"), address_type := 'a;']
address[address_type == 'a;', 
        input_address := str_replace(input_address, ";.*?(,)", "\\1")]

Some have *.

> sample(address[str_detect(address$input_address, "\\*") & address_type == 'a', input_address], 10)
 [1] "TAYLOR ST , *Holyoke, MA 01040"            "NORTHAMPTON ST , *Holyoke, MA 01040"      
 [3] "1*5* W Coral RD , Stanton, MI 48888"       "Cr 727 *26 , angleton, TX 77515"          
 [5] "378 APPLETON ST , *Holyoke, MA 01040"      "0 I195*I895 , ARBUTUS, MD 21227"          
 [7] "1504 NORTHAMPTON ST , *Holyoke, MA 01040"  "50 RIVER TER , *Holyoke, MA 01040"        
 [9] "BOOKER ST * CARVER ST , Palatka, FL 32177" "19 OCONNOR AVE , *HOLYOKE, MA 01040"

address[str_detect(address$input_address, "\\*") & address_type == 'a', address_type := 'a*']
address[address_type == 'a*', input_address := str_replace_all(input_address, "\\*", "")]

This looks like came from some program output.

> head(address[str_detect(address_type, "^a") & str_detect(input_address, "\\*"), input_address])
[1] "5280 Bruns RD , **UNDEFINED, CA 00000"         "6500 Lindeman RD , **UNDEFINED, CA 00000"     
[3] "5280 Bruns RD , **UNDEFINED, CA 00000"         "17501 Sr 4 , **UNDEFINED, CA 00000"           
[5] "5993 Bethel Island RD , **UNDEFINED, CA 00000" "1 Quail Hill LN , **UNDEFINED, CA 00000"

address[str_detect(address_type, "^a") & str_detect(input_address, "\\*"),
        input_address := str_replace(input_address, "\\*\\*UNDEFINED", "")]

Almost any special character that OK for human reading still cannot be handled by the Geocoder.

> sample(address[str_detect(input_address, "^#"), input_address], 10)
 [1] "# 6 HIGH , Marks, MS 38646"                          "#560 CR56 , MAPLECREST, NY 12454"                   
 [3] "#250blk Durgintown rd. , Hiram, ME 04041"            "#888 Durgintown Rd. , Hiram, ME 04041"              
 [5] "#15 LITTLE KANAWHA RIVER RD , PARKERSBURG, WV 26101" "# 12 HOLLOW RD , WELLSTON, OH 45692"                
 [7] "#10 I-24 , Paducah, KY 42003"                        "#10.5 mm St RD 264 , Yahtahey, NM 87375"            
 [9] "#1 CANAL RD , SENECA, IL 61360"                      "#08 N Ola DR , Yahtahey, NM 87375"

address[str_detect(input_address, "^#"), address_type := 'a#']
address[address_type == 'a#', input_address := str_replace_all(input_address, "^#", "")]

All these steps may look cumbersome. Actually I just check the Geocoding results on one year data raw input, find all the problems and errors, clean them by types. Then I apply same cleaning code to other years because they are very similar, and I got the Geocoding performance doubled! I think this cleaning is well worth the effort.

Version History

2016-02-03 : First version.
2016-05-11 : Added Summary.