Show HN: One-Click CSV Deduplication (open-source)

app.dedupe.it

4 points by remolacha a year ago · 2 comments · 1 min read

I made an app to fuzzy-deduplicate my Google Sheets and CRM records

- No manual configuration required
- Works out-of-the-box on most data types (e.g. people, companies, product catalogs)

Implementation details:

- Embeds records using an E5 model
- Performs similarity search using DuckDB w/ vector similarity extension
- Does last-mile comparison and merges duplicates using Claude
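
For anyone who wants to see the shape of the three steps above, here is a minimal Python sketch. It is not the code from the repo, and everything in it is a stand-in: `embed` uses character trigrams instead of an E5 model, the brute-force pairwise loop replaces the DuckDB vector search, and the hypothetical `judge` callback marks where an LLM comparison (Claude in the real app) would go. The function names and the 0.8 threshold are assumptions for illustration only.

```python
import math
from collections import Counter
from itertools import combinations


def embed(text: str) -> dict[str, float]:
    """Stand-in for a real embedding model: L2-normalized character trigram counts."""
    padded = f" {text.lower().strip()} "
    grams = Counter(padded[i:i + 3] for i in range(len(padded) - 2))
    norm = math.sqrt(sum(c * c for c in grams.values()))
    if norm == 0:
        return {}
    return {g: c / norm for g, c in grams.items()}


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    """Cosine similarity of two already-normalized sparse vectors."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(g, 0.0) for g, w in a.items())


def accept_all(a: str, b: str) -> bool:
    """Placeholder for the last-mile check; a real system would ask an LLM here."""
    return True


def dedupe(records, threshold=0.8, judge=accept_all):
    """Group record indices into clusters of likely duplicates."""
    vectors = [embed(r) for r in records]
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Brute-force candidate search (quadratic); a vector index would replace this.
    for i, j in combinations(range(len(records)), 2):
        if cosine(vectors[i], vectors[j]) >= threshold and judge(records[i], records[j]):
            parent[find(j)] = find(i)

    clusters = {}
    for i in range(len(records)):
        clusters.setdefault(find(i), []).append(i)
    return list(clusters.values())


if __name__ == "__main__":
    rows = [
        "Jane Doe, jane@example.com",
        "jane doe, JANE@EXAMPLE.COM",
        "Globex Inc, Springfield",
    ]
    for cluster in dedupe(rows):
        print([rows[i] for i in cluster])
```

The split matters for cost: the cheap similarity pass narrows things down to a few candidate pairs, so the expensive comparison only runs on those rather than on every pair of rows.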

Demo video: https://youtu.be/7mZ0kdwXBwM

Github repo (Apache 2.0 licensed): https://github.com/SnowPilotOrg/dedupe_it

Lmk any feedback on how to make this better!

OliverGilan a year ago

Curious how this scales. Just tried this with the test dataset and it was probably the slickest deduplication experience I’ve had.

  • remolachaOP a year ago

    Appreciate the kind words! It scales linearly in both speed and cost. We haven't yet optimized the prompts & choice of model to minimize token usage, so I'd recommend emailing us for advice if you want to run this on a large dataset.
