Show HN: One-Click CSV Deduplication (open-source)
app.dedupe.it

I made an app to fuzzy-deduplicate my Google Sheets and CRM records
- No manual configuration required
- Works out-of-the-box on most data types (ex. people, companies, product catalog)
Implementation details:
- Embeds records using an E5 model
- Performs similarity search using DuckDB w/ vector similarity extension
- Does last-mile comparison and merges duplicates using Claude
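
For anyone who wants the shape of that pipeline without reading the repo, here is a rough, stdlib-only sketch. It is not the code from dedupe_it, and every name in it is hypothetical: `embed` uses character trigrams as a stand-in for an E5 model, `candidate_pairs` does a brute-force cosine scan as a stand-in for DuckDB's vector similarity search, and `same_entity` is a token-overlap placeholder for the Claude comparison. It only groups duplicates and does not merge their fields.

```python
import math
from collections import Counter
from itertools import combinations


def embed(text, n=3):
    """Stand-in for an E5 embedding: a unit-length character-trigram count vector."""
    s = f"  {text.lower().strip()} "
    counts = Counter(s[i:i + n] for i in range(len(s) - n + 1))
    norm = math.sqrt(sum(c * c for c in counts.values()))
    return {gram: c / norm for gram, c in counts.items()}


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (dicts)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(gram, 0.0) for gram, w in a.items())


def candidate_pairs(records, threshold=0.5):
    """Stand-in for the vector similarity search: brute-force scan over all pairs."""
    vecs = [embed(r) for r in records]
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if cosine(vecs[i], vecs[j]) >= threshold
    ]


def same_entity(a, b):
    """Placeholder for the LLM last-mile check: Jaccard overlap of word tokens."""
    def tokens(s):
        return set(s.lower().replace(",", " ").replace(".", " ").split())
    ta, tb = tokens(a), tokens(b)
    return len(ta & tb) / len(ta | tb) >= 0.5


def dedupe(records, threshold=0.5):
    """Group records that pass both the similarity filter and the last-mile check."""
    parent = list(range(len(records)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for i, j in candidate_pairs(records, threshold):
        if same_entity(records[i], records[j]):
            parent[find(j)] = find(i)

    groups = {}
    for idx, record in enumerate(records):
        groups.setdefault(find(idx), []).append(record)
    return list(groups.values())
```

Calling `dedupe(["Acme Corp", "ACME Corp.", "Globex Inc"])` puts the two Acme rows in one group and leaves Globex on its own. The point of the two-stage design is that the cheap similarity filter decides which pairs are worth sending to the expensive comparison step.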
Demo video: https://youtu.be/7mZ0kdwXBwM
GitHub repo (Apache 2.0 licensed): https://github.com/SnowPilotOrg/dedupe_it
Lmk any feedback on how to make this better!

Curious how this scales. Just tried this with the test dataset and it was probably the slickest deduplication experience I’ve had.

Appreciate the kind words! Linear scaling in terms of speed and cost. We haven't yet optimized the prompts & choice of model to minimize token usage, so I'd recommend emailing us for advice if you want to run this on a large dataset.