Is A. Mitchell McConnell, Jr. The Same Person as Mitch McConnell?

6 min read Original article ↗

Simple Record Linkage with Approximate String Matching

Daniel Liden

Names are among the most common ways of identifying people: you probably don’t know your best friend’s social security number, and you don’t use your fingerprint to claim your Starbucks order. But names can be messy. Different people may share the same name, and the same person may go by a first or middle name, a nickname, initials, or a title or honorific such as “Doctor.”

For these reasons, names are seldom the best option for matching records between datasets. Sometimes, however, we may not have a unique identifier that can be used to match records across sources. Matching names from one source to another may be the best available option. And it’s an important problem: thoughtful record linkage can add tremendous value to data by providing ways to interface with other public or proprietary data sources.

We encountered this issue while working on our article on congressional stock trades: we might be able to tell that “Mitch McConnell” is the same person as “A. Mitchell Mcconnell, Jr.” (especially when we know there’s only one Mitch McConnell in Congress), but we can’t just use a SQL join with those two names as keys. This analysis would have been impossible without a way to match the members of congress across datasets, so we needed a way to match their names.

Figure 1: Matching names across datasets is not a trivial problem when those names are formatted in different ways and when there are many potential matches to choose from.

In this post, we describe how we used a metric called the Levenshtein Distance to quickly identify the “best match” for each name between the two datasets and to identify the subset of names that required some manual review.

Want to try it yourself? We put together a Deepnote dashboard where you can enter two strings and receive a summary of several different popular string similarity metrics, including the Levenshtein distance, for those two strings.

Name Matching between Congressional Datasets

In our project on congressional stock trading, we needed to obtain committee assignment data from the ProPublica Congress API and stock trading data from the senate/house stock watchers sites. As there was no unique key for merging these datasets, we needed to match the members of congress by name.

This wasn’t a trivial problem: the names had different formats between the two datasets, and one member of congress could be represented by more than one name in the stock watchers data. Manually matching a list of 158 names from the stock watchers data against 549 distinct names from the ProPublica API would be tedious, time consuming, and susceptible to errors, but we lacked unique and matching keys that could directly link the two datasets.

The Levenshtein Distance

Instead of manually matching the names, for each name in the stock watchers list, we computed a metric called the Levenshtein Distance (sometimes referred to as the edit distance) to each name in the ProPublica data.

The Levenshtein Distance between two strings is the number of single-character insertions, deletions, and substitutions — the number of “edits” — needed to transform one of the strings into another. We normalized the Levenshtein distance between each pair of names to a score between zero and one, with one representing an identical match and zero representing no match.

Figure 2: Converting Harry to Harvey results in a Levenshtein distance of 2 (with a normalized similarity score of 0.67). It requires one replacement (r → v) followed by one insertion (e).

For example, matching “Ronald L Wyden” to “Ron Wyden” requires deleting five characters (including a space): “ald L”, giving a Levenshtein Distance of five. Normalizing this returns a similarity score of 0.64. The distance from Ron Estes to Ron Wyden, on the other hand, requires replacing “Est” and “s” with “Wyd” and “n,” for a distance of 4. Once normalized, we obtain a score of 0.56, leading us to (correctly) suppose that “Ron Wyden” is a better match to “Ronald L Wyden” than to “Ron Estes.”

Get Daniel Liden’s stories in your inbox

Join Medium for free to get updates from this writer.

Remember me for faster sign in

We found the best match — the name with the highest normalized similarity score — for each name and manually reviewed these matches for errors, paying special attention to those with comparatively low similarity scores.

The Levenshtein Similarity Correctly Matched All Senator Names

Figure 2 shows the normalized Levenshtein similarities between a selection of the senators whose names were the most similar to “A. Mitchell Mcconnell, Jr.” in the Stock Watchers Data. McConnell’s name, in this format, did not match particularly well to any of the names in the ProPublica data, but “Mitch McConnell” was still the best match. In fact, this approach successfully linked all of the senator names from the Stock Watchers dataset to their correct matches in the ProPublica data.

Figure 3: “A. Mitchell Mcconnell, Jr.” did not match any of the names in the ProPublica dataset particularly well, but “Mitch McConnell” was the best match with a normalized Levenshtein similarity of 0.54.

Some of the House of Representative Names Required Manual Matching

The Levenshtein Similarity approach was not able to correctly map all of House Stock Watchers names to their ProPublica counterparts. For the majority of names, we quickly confirmed that the name with the greatest Levenshtein similarity was the correct match. We paid particularly close attention to names with low Levenshtein similarity scores, and manually mapped the incorrect matches to their correct ProPublica data counterparts. Overall, this process took just a few minutes instead of the hours it likely would have taken to manually match each of the names.

Figure 3 shows several of the close matches for representative Neal Patrick Dunn, one of the misclassified members of the House.

Figure 4: “Sean Patrick Maloney” was incorrectly identified as the match for “Neal Patrick Dunn” because the perfect middle name match.

Even though the closest match wasn’t correct, it was very easy to look through the names ranked by Levenshtein similarity to locate the correct match quickly.

Try it Yourself

We chose to use the Levenshtein distance because it could straightforwardly handle names with different lengths, provided a useful similarity measure even when no characters were in the exact same positions between the two strings, and was conveniently implemented in a number of software packages (we recommend jellyfish in python, StringDistances.jl in Julia, and stringdist in R).

There are plenty of other methods for approximate string matching that vary in, for example, how they weight matches at the beginning of a string or how stringent they are about matching characters that appear in different positions between strings.

We put together a Deepnote dashboard where you can enter two strings and receive a summary of several different popular string similarity metrics for those two strings.

Compare More Strings with the Deepnote Dashboard.

We’ve also made the lists of congress member names, along with the mapping to their correct matches, available on bit.io. We correctly matched all of the Senate names and all but seven of the House names from the stock watcher data to their counterparts in the ProPublica datasets using the Levenshtein distance (a successful match rate of greater than 95%). Can you find a better approach?