Parsing Precinct Election Results PDFs Using LLMs
Derek Willis, OpenElections
- PDFs of official precinct results from counties in Texas, Mississippi, and Pennsylvania
- Each has unique PDF formats (image scans, multi-column layouts, varying structures)
- Can LLMs be better/faster than OCR + manual cleanup or custom Python parsers?
What We Tested
For Texas and Mississippi, we tested Claude Haiku 4.5, Claude Sonnet 4.5, Gemini 3 Flash, Gemini 2.5 Pro, and Gemini 3 Pro.
For Pennsylvania, we used Claude Sonnet 4.5 to write a custom Python parser.
| State | Sample Size | Baseline (Reference Data) |
|---|---|---|
| Texas | 8 counties from 2024 general | Web UI LLM OCR + Python parsers |
| Mississippi | 9 counties | OCR + manual data manipulation |
| Pennsylvania | Multiple counties from 2024 and 2025 | Custom Python parsers (Electionware) |
We're comparing LLM extraction to verified data we extracted last year, not to the raw PDFs alone. These reference files were validated through multiple methods. What we're measuring: can LLMs match results that took us weeks to produce?
Sample selection: we deliberately chose counties with different formats, complexity levels (4-47 precincts), and layout styles.
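The accuracy figures in the tables that follow come from row-level comparison against those reference files. Here's a minimal sketch of that kind of scoring, assuming both sides are OpenElections-style CSVs; the file names and column set are illustrative, not our production checker.

```python
# Sketch: score an LLM-extracted CSV against a previously verified reference file.
# Column names and file names are assumptions, not the actual pipeline.
import csv

KEY_COLS = ("precinct", "office", "district", "party", "candidate")

def load_results(path):
    """Map (precinct, office, district, party, candidate) -> vote string."""
    with open(path, newline="") as f:
        return {
            tuple(row.get(col, "").strip().lower() for col in KEY_COLS): row["votes"].strip()
            for row in csv.DictReader(f)
        }

def score(llm_path, reference_path):
    llm, ref = load_results(llm_path), load_results(reference_path)
    matched = sum(1 for key, votes in ref.items() if llm.get(key) == votes)
    missing = [key for key in ref if key not in llm]  # e.g. omitted zero-vote rows
    return {
        "vote_accuracy": matched / len(ref),
        "rows_missing_from_llm": len(missing),
        "extra_llm_rows": len(set(llm) - set(ref)),
    }

print(score("panola_llm.csv", "panola_reference.csv"))
```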
The Results - What Works
Best Performance Against Reference Data:
| Model | Accuracy | Sample | Baseline Method |
|---|---|---|---|
| Gemini 2.5 Pro | 99.1% | 9 MS counties | OCR + manual cleanup |
| Claude Haiku 4.5 | 100% | Scurry County, TX (11 precincts) | Google Gemini (Human Verified) |
| Claude Haiku 4.5 | 99.9% | Limestone County, TX (21 precincts) | Google Gemini (Human Verified) |
Takeaways
- In the right conditions, LLMs can do great work
- But they aren't perfect: 77.8% match across 8 TX counties
Best predictors of success:
- Clean, well-formatted PDFs (even if scanned)
- Consistent table structures
Texas Results: Claude Haiku 4.5
County-by-County Performance vs. Reference Data
| County | Precincts | Votes Checked | Vote Accuracy | Precinct Name Errors |
|---|---|---|---|---|
| Scurry | 11 | 321 | 100.0% | 0 |
| Limestone | 21 | 870 | 99.9% | 0 |
| San Saba | 6 | 72 | 91.7% | 0 |
| Foard | 4 | 146 | 84.9% | 0 |
| Lynn | 8 | 376 | 71.5% | 0 |
| Jones | 4 | 240 | 68.3% | 0 |
| Cottle | 4 | 106 | 64.2% | 0 |
| Panola | 19 | 364 | 16.2% | 0 |
High Success
- Standard table structures
- Consistent formatting
Low Success
- Panola: Many 0-vote candidates
- Same problems across models
What Doesn't Work
Common Failure Patterns When Comparing to Reference:
1. Missing zero-value rows (all models)
- Reference data includes candidates with 0 votes; LLMs consistently omit these
- Panola County, TX: 6 minor-party candidates with 0 votes were omitted
2. Incomplete extraction (default max tokens too small; see the sketch below)
- Reference has 3,936 rows; LLMs extracted only 3,131-3,697 rows
- Missing Supreme Court races in Jones, Lynn, Cottle counties
3. Vote count errors (PDF-specific)
- Panola County, TX: Claude Haiku 16.2% match to reference
- Same counties fail across different models
4. Precinct name OCR errors (vertical vs. horizontal text layouts)
- Mississippi: 2,898 precinct name errors vs. reference (Claude Haiku, 80 counties)
- Texas: 0 precinct errors across 2,495 verified votes
When models disagree with reference data, the same problem counties appear consistently.
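Failure #2 is the most mechanical one to prevent: default response limits silently truncate long tables. Below is a minimal sketch of an extraction call that raises the limit and checks for truncation, assuming the Anthropic Python SDK and its PDF document blocks; the model id, token limit, and prompt wording are illustrative, not our exact setup.

```python
# Sketch: extract a precinct results PDF with an explicit, generous max_tokens
# so long tables are not silently cut off. Assumes the Anthropic Python SDK;
# the model id and limits are illustrative.
import base64
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

with open("panola_2024_general.pdf", "rb") as f:
    pdf_b64 = base64.standard_b64encode(f.read()).decode()

prompt = (
    "Extract every precinct-level result as CSV with the columns "
    "precinct,office,district,party,candidate,votes. "
    "Include candidates with zero votes."
)

message = client.messages.create(
    model="claude-haiku-4-5",   # illustrative model id
    max_tokens=16000,           # raise this; small defaults truncate long tables
    messages=[{
        "role": "user",
        "content": [
            {"type": "document",
             "source": {"type": "base64",
                        "media_type": "application/pdf",
                        "data": pdf_b64}},
            {"type": "text", "text": prompt},
        ],
    }],
)

# A truncated response is a row-count failure waiting to happen.
if message.stop_reason == "max_tokens":
    print("WARNING: output truncated; raise max_tokens or split the PDF")
print(message.content[0].text)
```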
Pennsylvania Custom PDF Parser Software
Baseline Method: Extract Text and Parse
Electionware system (used by many PA counties):
- We documented significant issues while building the parsers
- Result: High-quality, reliable extraction... eventually
Issues discovered building the reference parsers:
- Hard-coded values (offices, parties, etc.)
- Format variations between election years (small changes break code)
- Missing party codes (DAR = cross-filing)
- Loop control bugs (candidates attributed to the wrong offices)
- Office header variations (ALL-CAPS vs. Mixed-Case)
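As one concrete example of that last item: a parser that compares raw header strings breaks when "STATE REPRESENTATIVE" becomes "State Representative" the next year. Here's a minimal sketch of the kind of normalization the reference parsers ended up needing; the office list and helper names are illustrative, not the actual parser code.

```python
import re

# Canonical office names expected in Electionware-style text reports
# (illustrative subset, not the full list the real parsers carry).
KNOWN_OFFICES = {
    "PRESIDENT OF THE UNITED STATES",
    "UNITED STATES SENATOR",
    "REPRESENTATIVE IN CONGRESS",
    "STATE REPRESENTATIVE",
}

def normalize_header(line):
    """Collapse whitespace and case so ALL-CAPS and Mixed-Case headers compare equal."""
    return re.sub(r"\s+", " ", line).strip().upper()

def is_office_header(line):
    return normalize_header(line) in KNOWN_OFFICES

assert is_office_header("State  Representative")
assert is_office_header("STATE REPRESENTATIVE")
```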
The old parser works. But:
- Required Python expertise
- Not portable to other states
- Ongoing maintenance for format changes
Don't Trust, Verify
You can't just trust LLM output. Here's how we validate against reference:
1. Direct comparison to reference data
- Vote count matching (what we measured)
- Precinct name validation
2. County-level total checks
- Extract precinct data with LLM
- Compare sums to official county totals (see the sketch after this list)
3. Multi-model extraction on samples
- Run 2-3 models on representative counties
- Where they agree, confidence is high
4. Automated validation patterns
- Row count checks (expected vs. actual)
- Zero-vote pattern detection
5. Targeted manual review
- Focus on counties with low match rates (<80%)
- Spot-check high-confidence extractions (>95%)
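Here's a minimal sketch of check #2, county-level total validation: sum the LLM-extracted precinct rows and compare against the official county totals. The file names and column names are assumptions, not our production pipeline.

```python
# Sketch: sum LLM-extracted precinct votes per (office, candidate) and
# compare against official county-level totals. Column/file names are illustrative.
import csv
from collections import defaultdict

def county_totals_from_precincts(path):
    totals = defaultdict(int)
    with open(path, newline="") as f:
        for row in csv.DictReader(f):
            totals[(row["office"].strip(), row["candidate"].strip())] += int(row["votes"])
    return totals

def compare_to_official(llm_precinct_csv, official_totals_csv):
    llm = county_totals_from_precincts(llm_precinct_csv)
    mismatches = []
    with open(official_totals_csv, newline="") as f:
        for row in csv.DictReader(f):
            key = (row["office"].strip(), row["candidate"].strip())
            official = int(row["votes"])
            if llm.get(key, 0) != official:
                mismatches.append((key, llm.get(key, 0), official))
    return mismatches

for key, extracted, official in compare_to_official("jones_llm.csv", "jones_official_totals.csv"):
    print(f"{key}: extracted {extracted}, official {official}")
```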
Recommendations
Replace existing extraction methods (for clean formats)
- PDFs like Limestone, Scurry, San Saba (TX)
- Use LLMs instead of OCR + manual cleanup
- Validate with county totals
LLM as first pass (for more complex formats)
- Faster than manual, but needs verification
- Multi-model extraction (see the sketch below)
- Systematic spot-checking
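For the multi-model step, a short sketch of how disagreement flagging can work once each model's output is normalized to the same CSV layout; the key columns and file names here are illustrative.

```python
# Sketch: flag rows where two models' extractions disagree, so manual review
# can focus there. Assumes both outputs are CSVs with the same columns.
import csv

def rows(path):
    with open(path, newline="") as f:
        return {
            (r["precinct"], r["office"], r["candidate"]): r["votes"]
            for r in csv.DictReader(f)
        }

def disagreements(path_a, path_b):
    a, b = rows(path_a), rows(path_b)
    return sorted(key for key in set(a) | set(b) if a.get(key) != b.get(key))

for key in disagreements("lynn_haiku.csv", "lynn_gemini.csv"):
    print("review:", key)
```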
LLMs will not work well for some PDFs.
Figuring out which ones is critical.
How You Should Use LLMs
- Some PDF formats: LLMs match days/weeks of manual work
- Zero-value rows problematic across most models
- Format and prompts matter more than models, although models do matter
A sampling approach:
- Test LLMs on 3-5% of your counties
- Compare to reference data (or build reference from LLMs + validation)
- Identify which formats work well
Workflow:
- Claude Code
- GitHub Copilot (academics get expanded access to models)
Questions?
- openelections@gmail.com
- Links to code/data at OpenElections.net