Matching Records

Andrew Lang

One problem we’ve encountered when dealing with messy data is a situation where we find multiple systems in use and we want to know if one person is held in multiple systems. The problem is that we don't have a single ID code that is used in all these systems. So we need to use the data we have in each system to figure out if someone is the same person across the different systems. This can be done through some rules devised with the client's input.

 

So we have three systems:

 

System 1:

 

Title

First Name

Surname

Email

Company

MR

John

Smith

This email address is being protected from spambots. You need JavaScript enabled to view it.

Smith & Co.

MR

James

Smith

This email address is being protected from spambots. You need JavaScript enabled to view it.

Smith & Co.

MRS

Mary

Bloggs

This email address is being protected from spambots. You need JavaScript enabled to view it.

 
 

J

Smith

This email address is being protected from spambots. You need JavaScript enabled to view it.

Smith & Co.

   

Bloggs

This email address is being protected from spambots. You need JavaScript enabled to view it.

Smith & Co.

 

System 2:



FirstName

LastName

Email

Telephone

Company

John

smith

This email address is being protected from spambots. You need JavaScript enabled to view it.

07700 965332

Smith & Co.

James

SMITH

This email address is being protected from spambots. You need JavaScript enabled to view it.

07700 965341

Smith & Co.

 

Bloggs

This email address is being protected from spambots. You need JavaScript enabled to view it.

   

Thomas

Bloggs

This email address is being protected from spambots. You need JavaScript enabled to view it.

020 7946 8854

Bloggs Family Bakery



System 3:



TITLE

INITIAL

FAMILY NAME

EMAIL

COMPANY

PHONE

MR

J

SMITH

This email address is being protected from spambots. You need JavaScript enabled to view it.

SMITH & CO

 

MR

J

SMITH

   

07700 965341

MRS

M

BLOGGS

This email address is being protected from spambots. You need JavaScript enabled to view it.

BLOGGS BAKERY

020 7946 8854

MR

T

BLOGGS

This email address is being protected from spambots. You need JavaScript enabled to view it.

BLOGGS BAKERY

020 7946 8854

 

So as you can see there are a few potential challenges with these tables, for a starter they don't all have the same Columns or Column Headers, so we will need to work on that, one table is entirely in capital letters so we need to be aware of this.

 

There are some also fundamental truths to this data, namely that two people can share an email address and phone number, and that the data has been imperfectly gathered so there are several potentially ambiguous entries that could match with multiple individuals, so we will need to create a rule to handle these ambiguous matches.

 

In addition, we received some rules for these matches:

 

  1. If the individual has the same firstname lastname and email they should match (but nulls should never match to one another)

  2. If the individual has the same initial last name and title they should match (but only if the first names are both blank and the titles match and are not blank)

  3. The same two rules as above but for phone numbers and company names instead of email. (Rule 2 Cannot apply to company however)

 

In addition to these, we recommend an additional rule, that empty or null values should not match one another. So two individuals sharing a last name and email in two systems but both missing a first name would never match, as we cannot be sure they are not two different individuals, so with that in mind we set about the design of the workflow.

 

There may be other constraints that we implement as we go through this task.

 

The first step is bringing the three systems together, creating a single view of all the contacts we have:

 

We do some renaming to ensure that we have common column names across the three input streams, then we can use the union tool to build a single table of all the contacts we have:



FIRST_NAME

SURNAME

EMAIL

PHONE

COMPANY

TITLE

INITIAL

John

smith

This email address is being protected from spambots. You need JavaScript enabled to view it.

07700 965332

Smith & Co.

   

James

SMITH

This email address is being protected from spambots. You need JavaScript enabled to view it.

07700 965341

Smith & Co.

   
 

Bloggs

This email address is being protected from spambots. You need JavaScript enabled to view it.

       

Thomas

Bloggs

This email address is being protected from spambots. You need JavaScript enabled to view it.

020 7946 8854

Bloggs Family Bakery

   

John

Smith

This email address is being protected from spambots. You need JavaScript enabled to view it.

 

Smith & Co.

MR

 

James

Smith

This email address is being protected from spambots. You need JavaScript enabled to view it.

 

Smith & Co.

MR

 

Mary

Bloggs

This email address is being protected from spambots. You need JavaScript enabled to view it.

   

MRS

 

J

Smith

This email address is being protected from spambots. You need JavaScript enabled to view it.

 

Smith & Co.

   
 

Bloggs

This email address is being protected from spambots. You need JavaScript enabled to view it.

 

Smith & Co.

   
 

SMITH

This email address is being protected from spambots. You need JavaScript enabled to view it.

 

SMITH & CO

MR

J

 

SMITH

 

07700 965341

 

MR

J

 

BLOGGS

This email address is being protected from spambots. You need JavaScript enabled to view it.

020 7946 8854

BLOGGS BAKERY

MRS

M

 

BLOGGS

This email address is being protected from spambots. You need JavaScript enabled to view it.

020 7946 8854

BLOGGS BAKERY

MR

T

 

So the next step is to do some cleaning, to get everything into the same case for matching purposes, and we can also add in a row id for our data (we will need this in the next step)

 

 

So we can also do a calculation to derive some entries in the initial column:

 

 

Here we find those cases where we are missing the initial, and we have a first name, and using that we can determine the initial.

 

Now we can start to think about the rules we want to apply, the final output we are aiming for is for every ID we have, we want to find the one or more ID’s that it matches from elsewhere in the list. At first, we thought about using something like a multi-row formula to move through the list and try and match the rows, that way using a cleaver sequence of sorts and re-sorts . But there is a much simpler way, we can simply join the list to itself using the join tool.

 

 

This does give the rough outputs that we are looking for with each row matched to the others when email, firstname and surname match, but it doesn't take into account our constraint, nulls should never match to each other, as you can see in the table below,

 

Here we see that the BLOGGS surname and This email address is being protected from spambots. You need JavaScript enabled to view it. are creating a lot of extra matches. So to add in these constraints we filter out the inputs that have an empty Firstname, or Surname, or Email, the results from the join then look something like the table below:

 

 

Now we still need to do a little something else, the ones that didn't join those with blank email, first name, or surname, still need to be dealt with, as the data is still valid, so what we can do is merge the L and R outputs from the join to get a complete picture of how the rows match under this rule.

One small point here is that the rows that did not match under this rule do still match to themselves so after we merge them together we can look for those missing a Matching ID and simply set the matching id to the ID. We will deal with duplicates later.

 

 

So now we can start on the next rule, Initial Last Name and Email, this is done in a very similar way to the first name surname and email, except there are some additional constraints, which we can add in using a formula after the J output from the join and before the union:

 

Here we check to see if the three constraints are satisfied, if all three are then the match is allowed, otherwise it is set to null or no match. Here we have added in two additional constraints, namely that the companies and phone numbers should match or one/both can be blank.

 

 

 

We can then do similar flows for Company Name and Phone number, just swapping out Email for one of these, and vice versa. Though for now and in the interests of keeping this somewhat interesting we will leave those for now.

 

Our Next task once all the tests are completed is to fist combine the outputs and generate a complete list of matches, now there may be some duplicate records so we can run these through the unique tool to find only the unique matches.

 

 

 

Now we can use the make group tool to create a lookup between the original ID and a new grouped person ID:

 

Here the Key is the ID field and Group is the newly identified person Code, so we can join these to the original data to see the newly grouped records and validate that they are as we might expect.

 

 

Looking through this these groupings are roughly what was expected. For messier data and more complex data, more constraints and tests might be required but this should give you a good idea of what you can do to match records across disparate systems. If you want to take a closer look at this workflow a refined version can be downloaded from the Alteryx Gallery.

 

In the next post, we will look at how we can consolidate these rows into a single record for an individual.

 

Free Alteryx Trial!

Free Tableau Trial!

Get in touch with us today!

Contact us
Cookies make it easier for us to provide you with our services. With the usage of our services you permit us to use cookies.
More information Ok