Show HN: I Built a Semantic De-Duplicator

2 points by gkamradt 2 years ago · 2 comments · 2 min read

Hey HN Crew!

We all have lists...and they can be annoying to de-duplicate.

* User feedback

* Groceries

* Employee surveys

* Bug reports

* You name it

Most ways to consolidate like items work off of keywords or, worse, exact phrases (Sheets/Excel).

But LLMs are much better at understanding an item's semantic meaning and determining whether two items should be combined.
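To make the exact-phrase limitation concrete, here is a tiny sketch in plain Python. The helper name `dedupe_exact` and the sample strings are my own illustration, not part of the package. Only identical strings collapse, so two requests for the same thing in different words both survive:

```python
# Exact-phrase de-duplication: only identical strings are merged.
items = [
    "We need more sparkling water",
    "We need more sparkling water",
    "Can we get more carbonated water please?",
]


def dedupe_exact(values):
    """Drop exact repeats while keeping first-seen order (hypothetical helper)."""
    return list(dict.fromkeys(values))


unique_items = dedupe_exact(items)
print(unique_items)
```

The verbatim repeat is removed, but the "carbonated water" request stays as a separate entry even though it asks for the same thing.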

I decided to build my first Python package, The Semantic Deduplicator, to help me consolidate items based on their meaning, not keywords.

For example, on groceries:

['We need more berries', 'I want more more milk', 'Can we get more carbonated water please?', 'We need more sparkling water']

...deduplicated...

['Berries', 'Milk', 'Sparkling Water']

How it works:

1. Start with an empty list ready to populate

2. The first item you add will get 1) transformed into a clean name (user feedback > product request) and 2) added to the list

3. While you're adding more items:

* Check to see if your new item's embedding is close to any existing item

* If so, ask the LLM to compare your two items to see if they should be combined

* If so, combine them
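Here is a rough sketch of that loop. It is not the package's actual code or API. Everything named here (`toy_embed`, `toy_clean_name`, `toy_should_combine`, `deduplicate`, the 0.4 threshold) is a hypothetical stand-in so the example runs offline: the "embedding" is just word counts, and the "LLM" calls are replaced with simple rules. In a real setup those three functions would call an embedding model and an LLM.

```python
import math
import re
from collections import Counter

# Toy stand-ins (assumptions): a real setup would call an embedding model
# and an LLM instead of these three functions.
FILLER = {"we", "need", "more", "i", "want", "can", "get", "please"}
SYNONYMS = {"carbonated": "sparkling"}


def toy_embed(text):
    """Stand-in embedding: a bag-of-words count vector."""
    return dict(Counter(re.findall(r"[a-z]+", text.lower())))


def toy_clean_name(text):
    """Stand-in for the LLM step that turns raw text into a clean name."""
    words = [w for w in re.findall(r"[a-z]+", text.lower()) if w not in FILLER]
    return " ".join(words).title()


def toy_should_combine(name_a, name_b):
    """Stand-in for the LLM step that decides whether two items match."""
    def canon(name):
        return {SYNONYMS.get(w, w) for w in name.lower().split()}
    return canon(name_a) == canon(name_b)


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def deduplicate(items, embed=toy_embed, clean_name=toy_clean_name,
                should_combine=toy_should_combine, threshold=0.4):
    kept = []  # step 1: start with an empty list
    for raw in items:
        name = clean_name(raw)  # step 2: transform into a clean name
        embedding = embed(raw)
        match = None
        for entry in kept:
            # Step 3a: cheap check, is the embedding close to an existing item?
            close = cosine(embedding, entry["embedding"]) >= threshold
            # Step 3b: only then ask the (stand-in) LLM whether to combine.
            if close and should_combine(name, entry["name"]):
                match = entry
                break
        if match is None:
            kept.append({"name": name, "embedding": embedding, "members": [raw]})
        else:
            match["members"].append(raw)  # step 3c: combine them
    return kept


groceries = [
    "We need more berries",
    "I want more more milk",
    "Can we get more carbonated water please?",
    "We need more sparkling water",
]
result = deduplicate(groceries)
names = [entry["name"] for entry in result]
print(names)
```

With these toy stand-ins the two water requests end up in one entry, named after whichever arrived first ("Carbonated Water" here, since this sketch never renames an entry on merge). The embedding check is only a cheap filter: "We need more sparkling water" also scores as close to "We need more berries" on word overlap, and it is the second comparison step that rejects that pair.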

This package is more of an exploration and POC, so be careful with it. I'd love to hear any feedback.

All the links:

* YT Explainer Video: https://www.youtube.com/watch?v=etLsNgkGbeM

* Twitter Thread: https://twitter.com/GregKamradt/status/1719760658936545336

* PyPI: https://pypi.org/project/semantic-deduplicator/

* GitHub: https://github.com/gkamradt/SemanticDeduplicator

skeptrune 2 years ago

This is smart and solid work.

We had the same idea and made it a core product feature - https://docs.arguflow.ai/duplicate_detection

nbbaier 2 years ago

Really cool stuff! Definitely going to try to fit this into a project.
