Show HN: I made a dataset for finetuning embedding models

1 points by mihaich 2 years ago · 0 comments · 2 min read

I made an alternative to STSB, but with dialog/assistant samples.

I couldn't find anything similar online (!), so I built it.

I did it because I needed a very small model that would work well with my React component, and none of the existing 17M-parameter models performed adequately.

The one I created with this dataset does.

Embedding models, like other types of models, can be task-specific, and I didn't have any officially recognized task for my needs.

The closest is the "sentence similarity" task, but one of its most recognized benchmarks is STSB, and I find STSB quite strange.

Here is a 5 out of 5 scored example from STSB: "A person cuts an onion." and "A person is cutting an onion."

Here is a 1 out of 5 scored example from STSB: "A man is playing the flute" and "A man is playing the guitar".

STSB isn't what I need for my "real world" task. What I need is a way to find the paragraphs that best answer the question the user asks. This is why I made the dataset and why I fine-tuned an embedding model on it. It was a fun experience and the model works really well! :)
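
For readers unfamiliar with the retrieval step described above, here is a minimal sketch, assuming the question and the candidate paragraphs have already been turned into vectors by some embedding model. The vectors below are toy placeholders rather than output from the fine-tuned model, and the names `cosine_similarity` and `rank_paragraphs` are hypothetical helpers made up for illustration. Cosine similarity is one common way to score such matches; the post does not say which measure was actually used.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction, so treat it as unrelated
    return dot / (norm_a * norm_b)


def rank_paragraphs(question_vec, paragraph_vecs, top_k=3):
    """Return (paragraph_index, score) pairs, best match first."""
    scored = [
        (idx, cosine_similarity(question_vec, vec))
        for idx, vec in enumerate(paragraph_vecs)
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


# Toy 3-dimensional "embeddings" (assumption: real ones come from a model).
question = [1.0, 0.0, 1.0]
paragraphs = [
    [0.0, 1.0, 0.0],  # points in an unrelated direction
    [2.0, 0.0, 2.0],  # same direction as the question
    [1.0, 1.0, 0.0],  # partially overlapping
]

ranking = rank_paragraphs(question, paragraphs, top_k=2)
for idx, score in ranking:
    print(f"paragraph {idx}: {score:.2f}")
```

With these toy vectors, paragraph 1 ranks first and paragraph 2 second. In a real setup the embedding model decides how well this ordering matches "which paragraph answers the question", which is the property a dialog/assistant-style dataset is meant to train for.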
