Show HN: I Built a Semantic Deduplicator
Hey HN Crew!
We all have lists...and they can be annoying to de-duplicate.
* User feedback
* Groceries
* Employee surveys
* Bug reports
* You name it
Most ways to consolidate like items work off of keywords or, worse, exact phrases (Sheets/Excel).
But LLMs are much better at understanding an item's semantic meaning and determining whether two items should be combined.
I decided to build my first Python package, The Semantic Deduplicator, to help me consolidate items based on their meaning, not keywords.
For example, on groceries:
['We need more berries', 'I want more more milk', 'Can we get more carbonated water please?', 'We need more sparkling water']
...deduplicated...
['Berries', 'Milk', 'Sparkling Water']
How it works:
1. Start with an empty list ready to populate
2. The first item you add will get 1) transformed into a clean name (user feedback > product request) and 2) added to the list
3. While you're adding more items:
   * Check to see if your new item's embedding is close to any existing item
   * If so, ask the LLM to compare the two items to see if they should be combined
   * If the LLM says yes, combine them
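Here's a minimal sketch of that loop, to make the steps concrete. It is not the package's actual code or API: `toy_embed`, `clean_name_stub`, and `should_combine_stub` are hypothetical stand-ins I made up so the example runs offline. In a real setup the embedding would come from an embedding model and the other two would be LLM calls. The 0.3 threshold is also just an assumption.

```python
import math
import re
from collections import Counter
from typing import Callable, List, Tuple


def toy_embed(text: str) -> Counter:
    """Hypothetical stand-in for an embedding model: bag-of-words counts."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def dedupe(
    items: List[str],
    clean_name: Callable[[str], str],
    should_combine: Callable[[str, str], bool],
    threshold: float = 0.3,  # assumed value, tune for a real embedding model
) -> List[str]:
    kept: List[Tuple[str, Counter]] = []  # step 1: start with an empty list
    for raw in items:
        name = clean_name(raw)  # step 2: turn the raw item into a clean name
        emb = toy_embed(name)
        merged = False
        for existing_name, existing_emb in kept:
            # step 3: cheap embedding check first, then the (expensive) comparison
            if cosine(emb, existing_emb) >= threshold and should_combine(existing_name, name):
                merged = True  # combine by keeping the name already in the list
                break
        if not merged:
            kept.append((name, emb))
    return [name for name, _ in kept]


# Stubs standing in for LLM calls, so the sketch runs without any API key.
FILLER = {"we", "need", "more", "i", "want", "can", "get", "please"}


def clean_name_stub(raw: str) -> str:
    words = [w for w in re.findall(r"[a-z]+", raw.lower()) if w not in FILLER]
    return " ".join(words).title()


def should_combine_stub(a: str, b: str) -> bool:
    # Pretend the LLM says "combine" whenever the two names share a word.
    return bool(set(a.lower().split()) & set(b.lower().split()))


groceries = [
    "We need more berries",
    "I want more more milk",
    "Can we get more carbonated water please?",
    "We need more sparkling water",
]
result = dedupe(groceries, clean_name_stub, should_combine_stub)
print(result)
```

Note that this sketch keeps whichever name landed in the list first, so the merged water item comes out as 'Carbonated Water' rather than 'Sparkling Water'. Picking the better name for a merged item is another place an LLM call could go. The embedding check is there so the slower pairwise comparison only runs on plausible matches.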
This package is more of an exploration and proof of concept, so be careful with it. I'd love to hear any feedback.
All the links:
* YT Explainer Video: https://www.youtube.com/watch?v=etLsNgkGbeM
* Twitter Thread: https://twitter.com/GregKamradt/status/1719760658936545336
* PyPI: https://pypi.org/project/semantic-deduplicator/
* GitHub: https://github.com/gkamradt/SemanticDeduplicator

This is smart and solid work. We had the same idea and made it a core product feature - https://docs.arguflow.ai/duplicate_detection

Really cool stuff! Definitely going to try to fit this into a project.