A story went viral on X this week, thanks to a re-tweet from Elon Musk, claiming that the US Department of Justice (DoJ) is suing the State of Virginia’s election commissioner because the state has a “voter registration duplication rate of 33% in 2024”.
In 2024 we published our own analysis showing that across ~50 Million voter registrations from 7 States (AR, FL, GA, MI, NC, OH, and PA), there was an overall duplicate rate of 0.8%. We find it highly improbable that Virginia has a duplicate rate 41 times higher than the 7 States that we tested.
We did not analyse data from Virginia because, at the time of our analysis, the VA Voter Registration List (VRL) cost USD 6,000. The cost has since been reduced to USD 600.
We also note that the lawsuit states that the national average voter registration duplicate rate is 12.7%, which is already a suspiciously high number (15.8x higher than our figure).
Firstly, why do we (Tilores) care? Our technology specialises in identifying duplicate and related records (e.g. voter records) across one or multiple datasets (e.g. different State voter lists). In the world of data science, this is known as “entity resolution”. Our software is typically used for fraud detection and anti-money laundering in finance, but it can also be used to build single customer views (SCVs) for almost any use case.
The challenge in entity resolution is identifying these related or duplicate records when they are non-identical. A typo here, a nickname there, some missing or conflicting data — and data records that to the human eye obviously belong together, are, as far as a computer is concerned, non-identical.
Take my own name, Steven Renwick: it could easily be mis-spelt as Stephen Rennick. The two names are non-identical, but a human can see they could belong to the same person. Add a matching date of birth, mobile phone number, or postal address (all of which can themselves be messy and inconsistent) and you can be confident these two identities belong to the same real-world person.
Entity resolution is a field that flatters to deceive. It seems like a combination of a few similarity algorithms, such as Levenshtein distance, which measures how similar two strings of text are, will be sufficient to analyse a dataset. However, the bitter experience most have is that simple approaches don’t work well enough over the messy long tail of real-world data.
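To make that concrete, here is a minimal pure-Python sketch of Levenshtein distance (the classic Wagner–Fischer dynamic-programming algorithm), applied to the two spellings of my name from above:

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character edits (insertions,
    deletions, substitutions) needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (ca != cb),  # substitution
            ))
        prev = curr
    return prev[-1]

# "Steven Renwick" vs "Stephen Rennick": close, but not identical
print(levenshtein("Steven Renwick", "Stephen Rennick"))  # → 3
```

Three edits (v→p, insert h, w→n) separate the two names, which is exactly the kind of near-miss that a naive exact-match comparison would miss entirely.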
Furthermore, as a “quadratically scaling problem”, where every record needs to be compared to every other one for a full analysis, it can be difficult for a regular data scientist to make iterative improvements to their analysis when the dataset is any larger than a few thousand records.
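The quadratic growth is easy to see by counting the unique pairs of records in a dataset of size n, which is n(n−1)/2:

```python
# Every record compared against every other: n * (n - 1) / 2 pairs.
def pair_count(n: int) -> int:
    return n * (n - 1) // 2

for n in (1_000, 100_000, 6_000_000):
    print(f"{n:>9,} records -> {pair_count(n):,} pair comparisons")
```

At a thousand records this is half a million comparisons; at six million records it is roughly 1.8 × 10¹³, which is why full re-runs of an analysis at that scale are so painful.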
So what is going on? Why does the DoJ think there is such a high voter registration duplicate rate?
Well, we can actually find the original analysis that prompted the legal action. The work was carried out by an organisation calling itself the Electoral Process Education Corp (EPEC). At the time of writing, neither of its websites is working (epec.info and DigitalPollwatchers.org), but it does have an active X account, and its analysis is available on Substack, where the author goes into some detail about his methodology. Unfortunately, several references to previous analyses are not available, as they linked to the organisation’s websites.
The author primarily relies on the Levenshtein distance algorithm, applied across First Name + Middle Name + Last Name + Suffix + Full DoB. To his credit, he emphasises several times that he is identifying potential duplicates, and that increasing the fuzziness of the Levenshtein matching has the potential to identify more true duplicates, but will also create more false positives.
In entity resolution, one of the biggest challenges is balancing false positives (which hurt “precision” — records flagged as duplicates that are not really duplicates) against false negatives (which hurt “recall” — records that are duplicates but that the system did not detect).
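The standard way to quantify that trade-off is with the precision and recall formulas; the counts below are purely illustrative:

```python
def precision(tp: int, fp: int) -> float:
    # Of the pairs we flagged as duplicates, how many really are?
    return tp / (tp + fp)

def recall(tp: int, fn: int) -> float:
    # Of the true duplicate pairs out there, how many did we find?
    return tp / (tp + fn)

# Illustrative numbers only: 900 true duplicates found,
# 100 false alarms, 300 real duplicates missed.
print(precision(900, 100))  # → 0.9
print(recall(900, 300))     # → 0.75
```

Loosening a fuzzy-matching threshold typically raises recall while lowering precision, which is exactly the tension the EPEC author describes.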
He also acknowledges the tricky mathematics of entity resolution, calculating that across the ~6 Million records in the Virginia Voter Registration List (VRL) you would have to make ~3.8 × 10¹³ string comparisons, which he estimates amounts to 202.5 Quadrillion character comparisons. Phew!
So how does the EPEC author’s methodology differ from ours? Well, he appears to be using Levenshtein distance across all the potential attributes in one big analysis. No wonder he has so many string and character comparisons to make!
Tilores is a rule-based entity resolution system, where very granular matching can be defined. In the case of our voter analysis, we created 15 rules. If any single rule is triggered by two records, we would consider these two records to belong together and assign them the same entity-id.
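This is not Tilores itself, but a toy sketch of the “any single rule links two records” principle, using made-up fields, made-up rules, and a union-find structure to propagate the links into shared entity ids:

```python
from itertools import combinations

# Toy records; the field names and rules are illustrative only,
# not Tilores' actual schema or rule set.
records = [
    {"id": 0, "first": "Steven",  "last": "Renwick",  "dob": "1980-01-02", "phone": "555-0101"},
    {"id": 1, "first": "Stephen", "last": "Rennick",  "dob": "1980-01-02", "phone": "555-0101"},
    {"id": 2, "first": "Maria",   "last": "Gonzalez", "dob": "1975-06-30", "phone": "555-0199"},
]

# Each rule is a predicate over a pair of records;
# any single matching rule links the pair.
rules = [
    lambda a, b: a["last"] == b["last"] and a["dob"] == b["dob"],
    lambda a, b: a["phone"] == b["phone"] and a["dob"] == b["dob"],
]

# Union-find: linked records end up with the same root, i.e. entity id.
parent = list(range(len(records)))

def find(x: int) -> int:
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path compression
        x = parent[x]
    return x

for a, b in combinations(records, 2):
    if any(rule(a, b) for rule in rules):
        parent[find(a["id"])] = find(b["id"])

entity_ids = [find(r["id"]) for r in records]
print(entity_ids)  # records 0 and 1 share an entity id; record 2 stands alone
```

Records 0 and 1 fail the exact-surname rule (Renwick vs Rennick) but are linked by the phone + date-of-birth rule, so they receive the same entity id.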
Let’s run through these rules individually, so you get an idea of how an enterprise-level entity resolution system handles such data.
A matcher in Tilores can be configured however you want. For example, “Similar first_name” means that the first_name field was compared using the Metaphone phonetic algorithm, and if two strings were phonetically the same, they also had to match within a Levenshtein distance of 0 for strings under 5 characters (i.e. identical), 1 for strings of 5 to 7 characters, and 2 for strings of 8 or more. In other words, the longer the string, the more potential differences we allowed. Note that the EPEC author allowed a Levenshtein distance of up to 3.
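A sketch of such a matcher is below. Since Metaphone needs a third-party library, this uses classic Soundex as a stand-in phonetic algorithm, combined with the length-dependent Levenshtein thresholds described above; it is an illustration of the idea, not Tilores’ actual implementation:

```python
def levenshtein(a: str, b: str) -> int:
    # Compact Wagner–Fischer edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1,
                            prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]

def soundex(name: str) -> str:
    # Classic Soundex: keep the first letter, encode the rest as digits,
    # collapse runs of the same digit, skip vowels, pad to 4 characters.
    codes = {}
    for group, digit in [("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                         ("l", "4"), ("mn", "5"), ("r", "6")]:
        for ch in group:
            codes[ch] = digit
    name = name.lower()
    out = name[0].upper()
    prev = codes.get(name[0], "")
    for ch in name[1:]:
        code = codes.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "hw":  # h and w do not separate a run of equal codes
            prev = code
    return (out + "000")[:4]

def max_edits(s: str) -> int:
    # Length-dependent tolerance: 0 edits under 5 chars,
    # 1 for 5-7 chars, 2 for 8+ (the EPEC analysis allowed up to 3).
    return 0 if len(s) < 5 else (1 if len(s) <= 7 else 2)

def similar_first_name(a: str, b: str) -> bool:
    shorter = min(a, b, key=len)
    return soundex(a) == soundex(b) and levenshtein(a, b) <= max_edits(shorter)

print(similar_first_name("Jonathan", "Johnathan"))  # True
print(similar_first_name("John", "Joan"))           # False
```

“John” and “Joan” sound alike under Soundex, but at four characters the threshold is zero edits, so they are kept apart; the longer “Jonathan”/“Johnathan” pair passes both checks.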
We also drew on other resources, such as lists of nicknames, common names and rare names, and we use a fairly comprehensive ETL-like module to normalise the data before attempting to match it.
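As a flavour of that normalisation step, here is a tiny sketch with a purely illustrative nickname table (real lists contain thousands of entries):

```python
# Purely illustrative nickname table; production lists are far larger.
NICKNAMES = {
    "bill": "william", "liz": "elizabeth", "bob": "robert",
    "steve": "steven", "peggy": "margaret",
}

def normalise_first_name(name: str) -> str:
    # Lowercase, strip whitespace and stray punctuation, then map
    # any known nickname to its canonical form.
    cleaned = "".join(ch for ch in name.lower().strip() if ch.isalpha())
    return NICKNAMES.get(cleaned, cleaned)

print(normalise_first_name("  Bill "))  # → william
print(normalise_first_name("Peggy"))    # → margaret
```

Normalising before matching means the matcher compares “william” with “william”, rather than trying to fuzzy-match “Bill” against “William” (which no reasonable edit-distance threshold would catch).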
No doubt the EPEC author’s attempt is pretty solid for a non-specialist entity resolution analysis, but hopefully the above description gives you an idea of why it is so complicated to do this properly, and why you really need a specialised entity resolution system.
What confuses me is how they reached the 33.2% duplicate-rate figure. The EPEC author suggests a maximum of ~159,000 duplicates. In a voting population of 6 Million, that would represent a duplicate rate of ~2.6%, which is much more in line with our own analysis, which found duplicate rates of 0.32–1.1% across the 7 States. Still high, but nowhere near 33%.
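The arithmetic is straightforward, using the two numbers above:

```python
max_duplicates = 159_000       # EPEC author's stated maximum
registered_voters = 6_000_000  # approximate size of the Virginia list

rate = max_duplicates / registered_voters
print(round(rate * 100, 2))  # → 2.65 (percent)
```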
I can only assume that the denominator in this calculation, which does not come from the EPEC author’s analysis but is cited in the complaint document as coming from the Election Administration and Voting Survey 2024, is based on the number of new registrations. This would not be a very clever way to do the analysis, as the most common reason for (re)registering is a change of address. Especially if you have changed county, there is a strong chance that your new registration is created before your previous voter profile is removed.
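Working backwards from the two published numbers supports this: for ~159,000 duplicates to equal a 33.2% rate, the denominator must be far smaller than the full six-million-record list:

```python
max_duplicates = 159_000  # EPEC author's stated maximum
claimed_rate = 0.332      # the 33.2% figure in the complaint

# What denominator turns 159,000 duplicates into a 33.2% rate?
implied_denominator = max_duplicates / claimed_rate
print(round(implied_denominator))  # → 478916
```

A denominator of roughly 479,000 is consistent with a count of new registrations in a single cycle, not with the full registered-voter list.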
This is an administrative issue, rather than a sign of anything nefarious.
The EPEC author points out that if he had access to the voters’ social security numbers or driver’s license numbers, then he would be able to do a more comprehensive analysis. We reached the same conclusion: a multi-data-source approach is needed to perform this analysis correctly.
Well guess what?! There is an organisation in the US that does exactly that!
After posting our initial voter data analysis, we learned of the existence of an organisation called the Electronic Registration Information Center (ERIC), a non-partisan NGO that works directly with States to do exactly this analysis, using comprehensive data including social security and driver’s license numbers.
Their aim is to improve voter data and increase trust in the voting process, both by eliminating duplicates and by identifying people on other lists who are not registered to vote but could be.
The State of Virginia could simply work with ERIC to prove (and improve) the quality of their voter data. ERIC works with both “red” and “blue” states (21 States + DC), but what is crazy is that they have come under attack in recent times by people who think that their work somehow favours one side or the other. As if increasing voter participation could ever be a bad thing!
In summary, this feels like a politically motivated legal complaint (duh?!) based on a solid, but perhaps naïve, entity resolution analysis, amplified by a misrepresentation of the duplicate number to suggest that 33% of the voter profiles in Virginia are duplicates.
If you want to do this analysis using Tilores, we would be happy to work with you. Please get in touch.