Tech Tip: Address Verification - Parsing
The address is a key piece of information in anyone's database, but it can be extremely hard to validate, and due to free form input, even harder to search.
Several years ago, I had a client that wanted to use the Address as data for validation when checking for duplicate information. The client works in the building trades and keeps tracks of job sites. Since more than one of their customers could be working at any one job site, they wanted to make sure that duplicate job site records were not created.
The client wanted a way to search for exact and/or like address, then display the results to the end user for them to select from.
If this was a new a new database, then the easiest way to do this is to require the user to break out the common information like the street name and address number. The problem was that the database already existed. The client needed to retro-fit the solution into the existing data. They also wanted to keep the data entry as normal as possible, so the division of the address number and street name was not a viable solution.
An address can be input in several different formats, but they all follow a common pattern. Look at the following examples:
Wilson Street, #B 100 Wilson Street, #B 100 Wilson, Apt. b 100 Wilson Street
Pattern: {address number}{street name}{type of street}{apartment or suite}
Using this format as the guideline, a subroutine was created that would parse the address into the following pattern.
The first thing that the routine does is check to see if an address number has been input. Once it has found all of the address number, then it checks to see what the street name is.
Street names can be a little confusing. At times people input 'Wilson Street' when they really want to have 'Wilson Lane'. Other times, only 'Wilson' is input as the street name. Due to this, the routine separates the name of the street and the type of the street (road, lane, ave, etc) from each other. This allows matches on just 'Wilson' and then checks to see if the street type matches up to create a more exact match.
For the same reason that 'Wilson' and 'Street' are separated, the apartment number is also separated from the address.
As displayed in the examples of the 'Wilson Street' address, the routine has to be able know when a part of the address is missing and move onto the next part. Instead of starting with one part of the address and finding that before moving onto the next part, the routine looks for all parts of the address as it spans the address string.
The end result of each example are:
Wilson Street, #B = Wilson;;B;;Street 100 Wilson Street, #B = Wilson;100;B;;Street 100 Wilson, Apt. b = Wilson;100;B;; 100 Wilson Street = Wilson;100;;;Street
Now that the address has been broken into common components, it can be indexed and searched. It can even be validated to make sure that all the information needed has been input.
You can download the subroutine from the International Spectrum at intl-spectrum.com/s1042. is
Do you have a Tech Tip to share? E-mail it to editor@intl-spectrum.com