Building Text Classifiers: A New Approach with Less Data

mazaal.ai

12 points by enqush 2 years ago · 13 comments

deepsquirrelnet 2 years ago

Reading between the lines here, it seems like the “new” approach is just an old approach combined with an advertisement for a platform.

Unless you’ve forgotten that there are numerous other ways to do classification beyond using an LLM, I don’t see much substantive content here.

janalsncm 2 years ago

What model specifically are you training or fine-tuning? This article criticizes OpenAI for providing a black box only, but I don’t see any details about your model either.

Can I download the model I’ve trained? If not, we can ignore the pricing comparison since there’s no guarantee it will be the same tomorrow.

Second, as far as I can tell, your metrics compare zero-shot GPT-3.5 performance with your fine-tuned model's performance. If you want a fair comparison, you need to compare against fine-tuned GPT-3.5 performance.

aleph4 2 years ago

So what actual model are they using? The details are super vague.

gremlinunderway 2 years ago

When you click on "AI models" on the docs page, it looks like they left Docusaurus boilerplate in there. In fact, all of the pages under "Train your model" are also Docusaurus boilerplate pages, lol

https://docs.mazaal.ai/guides-and-concepts/AI%20models/manag...

https://docs.mazaal.ai/guides-and-concepts/AI%20models/trans...

https://docs.mazaal.ai/category/train https://docs.mazaal.ai/guides-and-concepts/Train/create-a-pa...

tannhaeuser 2 years ago

What's the consensus on building state-of-the-art classifiers? Is using Llama 2, Mistral, and co. really better than BERT? Fine-tuning vs prompt engineering (which is what I understand the article to be about)?

Text classification can also range from mere sentiment analysis, to genre/form classification, to content classification, and there are conflicting results about the relative merits of LLMs for genre vs content classification at least, i.e. arguing that pretraining on large data sets can't effectively be undone and specialized using just ~1000 texts on a given knowledge vertical, as agencies and GPU cloud hosters want to make you believe.

  • bugglebeetle 2 years ago

    GPT-4 with JSON function calling is about on par with BERT and company for simple classification problems, but is slower and more expensive. For anything more complicated than that, you’re better off using standard ML models.
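A minimal sketch of the function-calling classification setup described above, assuming the openai v1 Python client; the model name, label set, and prompt are illustrative, not anything from the thread:

```python
# Sketch: constrain an LLM to emit a structured label via function calling.
# Assumes OPENAI_API_KEY is set; labels and prompt are illustrative.
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "classify",
        "description": "Assign a sentiment label to the text.",
        "parameters": {
            "type": "object",
            "properties": {
                "label": {"type": "string",
                          "enum": ["positive", "negative", "neutral"]},
            },
            "required": ["label"],
        },
    },
}]

resp = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user",
               "content": "Classify: 'The battery died after a week.'"}],
    tools=tools,
    # Force the model to answer through the function's JSON schema.
    tool_choice={"type": "function", "function": {"name": "classify"}},
)
print(resp.choices[0].message.tool_calls[0].function.arguments)
# e.g. {"label": "negative"}
```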

    One way they can be used to build on each other, however, is synthetic training data generation from LLMs to train other kinds of models. In my experience, this works best when you already have a fairly robust dataset for it to create permutations from.
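As a rough illustration of that synthetic-data idea (a sketch under the same assumptions; the paraphrase prompt and `augment` helper are hypothetical):

```python
# Sketch: expand a small labeled set by asking an LLM for label-preserving
# paraphrases of existing examples. Prompt and helper are hypothetical.
from openai import OpenAI

client = OpenAI()

def augment(text: str, label: str, n: int = 3) -> list[tuple[str, str]]:
    """Return up to n paraphrases of a labeled example, keeping its label."""
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{
            "role": "user",
            "content": f"Rewrite this {label} review {n} different ways, "
                       f"one per line:\n{text}",
        }],
    )
    lines = [l.strip() for l in resp.choices[0].message.content.splitlines()
             if l.strip()]
    return [(line, label) for line in lines[:n]]

seed = [("The battery died after a week.", "negative")]
augmented = [pair for text, label in seed for pair in augment(text, label)]
```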

nyadesu 2 years ago

This feels more like an ad than an article with actual content.

denimboy 2 years ago

I think they are using the LLM as a few-shot learner, then using it to label the rest of the training data, and finally using the now fully labeled data to train a more traditional supervised classifier like DistilBERT.
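If that reading is right, the pipeline would look roughly like this sketch, which uses a Hugging Face zero-shot pipeline as a stand-in for the few-shot LLM labeler; the label set and texts are illustrative:

```python
# Sketch: pseudo-label unlabeled text with a general model, then fine-tune
# DistilBERT on the result. Labels and texts are illustrative.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments, pipeline)

labels = ["positive", "negative"]
unlabeled = ["Great product, would buy again!", "Broke on day two."]

# Step 1: label the data with a general-purpose model (stand-in for the LLM).
zero_shot = pipeline("zero-shot-classification",
                     model="facebook/bart-large-mnli")
def pseudo_label(text: str) -> int:
    return labels.index(zero_shot(text, candidate_labels=labels)["labels"][0])

data = Dataset.from_dict({
    "text": unlabeled,
    "label": [pseudo_label(t) for t in unlabeled],
})

# Step 2: fine-tune DistilBERT on the now fully labeled set.
tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
data = data.map(lambda b: tok(b["text"], truncation=True, padding=True),
                batched=True)
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased", num_labels=len(labels))
Trainer(model=model, args=TrainingArguments(output_dir="out"),
        train_dataset=data).train()
```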

visarga 2 years ago

How does "Summerization" work? Cuneiform?

enqushOP 2 years ago

TL;DR: In this article, we demonstrate a no-code approach to building a text classifier that not only outperforms Large Language Models (LLMs) such as those from OpenAI and Cohere, but also does so with just a handful of labeled examples. If you're eager to see the results, feel free to jump straight to the Experiments section.

  • PaulHoule 2 years ago

    So it sounds like you have a UI for building a training set and also a model trainer, right?

    • enqushOP 2 years ago

      That is correct! Our platform provides a seamless UI for constructing training sets and training models. Behind the scenes, there's a lot going on:

      - Data ingestion: from various platforms, starting with Google Drive and soon expanding to OneDrive and Dropbox.

      - A robust labeling backend: native integration with LabelStudio.

      - Range of GPU selection: for training, from the affordable RTX A2000 to the powerful H100, starting at just $0.17/hour. You can save up to 60% compared to AWS, depending on the GPU.

      - Model deployment & optimization: all trained models are optimized with NVIDIA TensorRT and served via Triton for maximum efficiency.
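The thread doesn't spell out the TensorRT/Triton step, but a common path is exporting the fine-tuned model to ONNX first; a minimal sketch (the model name and shapes are illustrative, not Mazaal's actual pipeline):

```python
# Sketch: export a fine-tuned classifier to ONNX. TensorRT can then optimize
# the ONNX graph (e.g. `trtexec --onnx=model.onnx`) and Triton can serve it.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

tok = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained(
    "distilbert-base-uncased")
model.eval()

dummy = tok("example input", return_tensors="pt")
torch.onnx.export(
    model,
    (dummy["input_ids"], dummy["attention_mask"]),
    "model.onnx",
    input_names=["input_ids", "attention_mask"],
    output_names=["logits"],
    # Allow variable batch size and sequence length at inference time.
    dynamic_axes={"input_ids": {0: "batch", 1: "seq"},
                  "attention_mask": {0: "batch", 1: "seq"},
                  "logits": {0: "batch"}},
)
```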
