Show HN: Curator – an open-source library for synthetic data generation

13 points by madiator a year ago · 8 comments · 1 min read

Synthetic data generation is an essential step in training and evaluating LLMs/Agents/RAG pipelines, but tooling around this is still lacking. We're introducing Curator, an open-source library designed to streamline the data curation process.

While there are many libraries for prompting LLMs, generating synthetic data has different semantics from prompting. For example, we need to process a large number of prompts (sometimes millions or more) while accepting some failures, chain several stages of prompting, incorporate human feedback, and filter out bad data using verifiers and heuristics.

Curator addresses these challenges:
1. It supports efficient data generation with several API providers and local models.
2. It recovers from failures and caches previous output.
3. It uses structured outputs, which makes it possible to program complex data generation pipelines.
4. It lets you visualize your data generation in real time.
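
To make points 2 and 3 concrete, here is a rough sketch of the general pattern in plain Python. This is not Curator's API: `generate_all`, `cache_key`, `fake_model`, and the `"answer"` field are all hypothetical names made up for illustration, and the in-memory dict stands in for whatever persistent cache a real tool would use.

```python
import hashlib
import json


def cache_key(prompt: str) -> str:
    """Stable key for a prompt, so a rerun can find earlier output."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()


def generate_all(prompts, call_model, cache, max_retries=2):
    """Run every prompt, reusing cached output and tolerating failures.

    call_model is any function that takes a prompt and returns a JSON string.
    Returns (results, failed): parsed records keyed by prompt, plus the list
    of prompts that still failed after all retries.
    """
    results, failed = {}, []
    for prompt in prompts:
        key = cache_key(prompt)
        if key in cache:  # cached from a previous run: skip the model call
            results[prompt] = cache[key]
            continue
        for _ in range(max_retries + 1):
            try:
                record = json.loads(call_model(prompt))
                # Structured output check: expect an object with an "answer" field.
                if not isinstance(record, dict) or "answer" not in record:
                    raise ValueError("unexpected output shape")
                cache[key] = record
                results[prompt] = record
                break
            except (ValueError, RuntimeError):
                continue  # malformed output or call failure: try again
        else:
            failed.append(prompt)  # give up on this prompt, keep the rest
    return results, failed


def fake_model(prompt: str) -> str:
    """Stand-in for a real LLM call (assumption: fails on prompts containing 'bad')."""
    if "bad" in prompt:
        raise RuntimeError("simulated provider error")
    return json.dumps({"answer": prompt.upper()})
```

Calling `generate_all(["a", "bad one", "b"], fake_model, {})` returns records for `"a"` and `"b"` and reports `"bad one"` as failed instead of aborting the whole run. Passing the same cache dict to a second call skips the prompts that already succeeded.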

We are working on many more features (such as adding verifiers, diversity and data quality indicators, calling external tools to generate data, etc.). We hope to help the community create high-quality datasets to train great bespoke models!

trungtvu a year ago

hey one of the creators of the library here! would love to hear your feedback on our library :)

athena_research a year ago

would love to see the data quality & diversity metrics. It's kind of hard for me to go through tens of thousands of examples to understand the quality of my data sometimes.
- madiator (OP) a year ago
  
  We are working on it!

overu589 a year ago

How is this not LSD for LLMs?

madiator (OP) a year ago

Synthetic data got a bad reputation last year, but it is now an important component of all modern LLMs! In fact, we also trained a model for, ironically, detecting hallucinations, and it was trained on synthetic data too.
Say you have some PDFs and want to generate questions and answers to test your RAG pipeline: that's synthetic data! Distillation is mostly synthetic data, and it works great as well!
Our hope is that this becomes steroids rather than LSD for LLMs :)

athena_research a year ago

lol please elaborate ;)
