Fine-Tuning Qwen3 Embeddings for product category classification on the Large-Scale Product Corpus


Ivan

Language models such as GPT, Llama, DeepSeek, and Qwen are trained on a filtered slice of Common Crawl. For e-commerce work, though, we can start with Web Data Commons (WDC), a project by the University of Mannheim. It extracts web pages that carry structured product metadata and publishes the result as the Large-Scale Product Corpus (LSPC). Follow-ups to this post are the Qwen3 SFT and DPO fine-tuning tutorial and a post on training a PyTorch embedding classifier from scratch.

Search engines like Google reward pages that include detailed product markup, so merchants already populate their sites with SEO-friendly fields such as title, brand, GTIN, price and, crucially, category labels. Thanks to these built-in annotations, the WDC Large-Scale Product Corpus arrives almost fully self-labelled. I used those labels to fine-tune Qwen3 Embedding with Low-Rank Adaptation (LoRA); the code is available on GitHub. The resulting 615-million-parameter checkpoint fits comfortably in limited GPU memory yet reshapes the model's representation space, mapping raw product titles to six top-level categories with a macro-F1 of 0.836 (83.6 %).

Understanding Product Data

At the heart of any classification system is its data. This project uses the Large-Scale Product Corpus (LSPC) V2020, which is derived from the Common Crawl by filtering web-page metadata to extract product records. Think of it as a massive collection of product information scraped from the internet.

For this specific classification task, the project focuses on the six most represented categories: Automotive, Baby, Books, Clothing, Jewelry, Shoes.

To prepare this vast amount of raw data, a dedicated script, build_lspc_dataset.py, is used, working with the lspcV2020.zip file downloaded from Web Data Commons. This ensures the data is in the right format for the models to learn from. More insights are available from the data discovery script.
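As a rough illustration of what that preparation step does, the sketch below filters raw records down to the six target categories and maps each category name to an integer label. The field names `category` and `title` are assumptions for illustration, not the actual LSPC schema or the real logic of build_lspc_dataset.py.

```python
# Hypothetical sketch of the dataset-building step: keep only records from the
# six most frequent top-level categories and reduce each to a (title, label) pair.
KEEP = {"Automotive", "Baby", "Books", "Clothing", "Jewelry", "Shoes"}
LABEL2ID = {name: i for i, name in enumerate(sorted(KEEP))}

def build_examples(records):
    """Yield (title, label_id) pairs for the six target categories."""
    for rec in records:
        category = rec.get("category")
        title = (rec.get("title") or "").strip()
        if category in KEEP and title:       # drop other categories and empty titles
            yield title, LABEL2ID[category]
```

Records outside the six categories, or with empty titles, are simply dropped before training.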

Qwen3 Embedding Models

The classification pipeline relies on the Qwen3 Embedding family of models. As documented in its accompanying paper (see link), Qwen3 currently occupies the top position on the Hugging Face MTEB leaderboard, a benchmark that measures how well language models perform on a wide range of retrieval and clustering tasks.

Qwen3 Embedding models focus on two tasks: text-embedding extraction and reranking.

Because Qwen 3 is a large language model trained on diverse text, its embeddings capture subtleties such as brand references, synonyms, and domain‑specific jargon. This depth of representation enables the classifier to interpret even short or ambiguous product titles accurately.

How Embedding Models Work for Text Classification

For generating text embeddings, the Qwen3 models employ causal attention. When a product title is fed into the model, a special [EOS] (End of Sequence) token is appended to the end of the input sequence. This token signals to the model that the input has concluded.

The final embedding, which is essentially a numerical representation of the product title’s meaning, is then derived from the hidden state of the last layer corresponding to this [EOS] token. This numerical embedding captures the semantic relationships of the text, allowing the model to understand what the product title is truly about. To ensure these embeddings are useful for specific tasks, the system can even concatenate instructions with the product query into a single input context, allowing the model to follow specific guidelines.
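A minimal sketch of these two ideas, using plain Python lists in place of real tensors: the embedding is read off at the last non-padding position (the [EOS] token), and the instruction is concatenated with the query into a single input. The `Instruct:`/`Query:` shape follows the format described for Qwen3 Embedding, but treat the exact wording here as illustrative.

```python
def format_input(task: str, text: str) -> str:
    """Concatenate the task instruction and the product query into one context."""
    return f"Instruct: {task}\nQuery: {text}"

def eos_pool(hidden_states, attention_mask):
    """Return the last layer's hidden state at the last non-padding position,
    i.e. the [EOS] token, assuming right-padded inputs."""
    last = max(i for i, m in enumerate(attention_mask) if m == 1)
    return hidden_states[last]
```

With left-padded inputs the [EOS] token is simply the final position, so the pooling reduces to taking the last hidden state.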

The Qwen3 Embedding series offers a range of model sizes, including 0.6 billion, 4 billion, and 8 billion parameters, allowing for flexibility depending on the balance needed between efficiency and effectiveness.

Training for Precision: A Multi-Stage Approach

Qwen3 Embedding turns the dense Qwen3 LLM backbone (0.6 B, 4 B and 8 B parameters; 32 k context) into state-of-the-art sentence-representation engines by running three carefully staged steps.

Large-scale weak-supervision pre-training (150 M pairs).
Using Qwen3-32B itself as a data generator, the authors synthesize roughly 150 million multilingual pairs that cover retrieval, semantic textual similarity, bitext mining and classification. Prompts control query type, language, difficulty and persona, so the unsupervised corpus is both broad and balanced.


Supervised fine-tuning with high-quality data (≃ 19 M pairs).
A second pass polishes the model on (i) ~7 million human-labelled instances drawn from MS MARCO, NQ, MIRACL, CodeSearchNet, etc., and (ii) ~12 million synthetic pairs filtered from the step-1 pool by requiring a cosine similarity ≥ 0.7. This sharpened subset gives the model task-specific precision without sacrificing coverage.
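That filtering step amounts to scoring each synthetic pair and dropping anything below the similarity threshold; a minimal sketch over raw embedding vectors (the paper's actual scoring pipeline may differ in details):

```python
import math

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def filter_pairs(pairs, threshold=0.7):
    """Keep only (query_vec, doc_vec) pairs whose cosine similarity
    meets the threshold used for the stage-2 synthetic subset."""
    return [p for p in pairs if cosine(p[0], p[1]) >= threshold]
```

Low-similarity synthetic pairs are discarded, which is what sharpens the subset without shrinking its topical coverage much.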

Checkpoint model-merging (slerp).
After fine-tuning, several intermediate checkpoints are merged with spherical linear interpolation; this consistently boosts robustness and transfer to unseen distributions.
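Spherical linear interpolation keeps the merged weights on the arc between two checkpoints rather than cutting through the chord, which preserves weight norms better than plain averaging. A minimal sketch over flattened weight vectors (real merges operate per tensor, and the authors' exact procedure may differ):

```python
import math

def slerp(a, b, t):
    """Spherical linear interpolation between weight vectors a and b, t in [0, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    cos_theta = max(-1.0, min(1.0, dot / (na * nb)))
    theta = math.acos(cos_theta)
    if theta < 1e-6:                      # nearly parallel: fall back to lerp
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    s = math.sin(theta)
    wa = math.sin((1 - t) * theta) / s    # weight on checkpoint a
    wb = math.sin(t * theta) / s          # weight on checkpoint b
    return [wa * x + wb * y for x, y in zip(a, b)]
```

At t = 0.5 on two unit-norm vectors, the result stays unit-norm, unlike the midpoint of a straight line between them.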

Our Experiments and Performance

For product title classification specifically, a 6-class categorisation task, the project achieved the following results:

Macro F1: 0.8360 (83.60%).

Accuracy: 0.8791 (87.91%).
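Macro-F1 is the unweighted mean of per-class F1 scores, so each of the six categories counts equally regardless of how many titles it contains; that is why it is the stricter of the two numbers above. A small sketch for clarity:

```python
def macro_f1(y_true, y_pred, num_classes):
    """Unweighted mean of per-class F1 scores: every class counts equally."""
    f1s = []
    for c in range(num_classes):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1s.append(2 * tp / denom if denom else 0.0)
    return sum(f1s) / num_classes
```

Accuracy, by contrast, weights every title equally, so it is pulled up by the larger categories.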

The github repository highlights specific optimisations that contribute to its efficiency and performance:

LoRA Fine-tuning: The models were fine-tuned using LoRA (Low-Rank Adaptation), a technique that makes fine-tuning large models more efficient. The specific configuration for LoRA was r=16, alpha=32.
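As a reminder of what that configuration means: LoRA freezes the original weight matrix W and trains two small matrices B (d_out × r) and A (r × d_in); the effective update is ΔW = (alpha / r) · BA, so with r=16 and alpha=32 the low-rank update is scaled by 2. A toy sketch of the update itself:

```python
def lora_delta(B, A, alpha, r):
    """Low-rank weight update ΔW = (alpha / r) * B @ A.
    B is d_out x r, A is r x d_in; only A and B receive gradients."""
    scale = alpha / r
    d_out, d_in = len(B), len(A[0])
    return [[scale * sum(B[i][k] * A[k][j] for k in range(r))
             for j in range(d_in)] for i in range(d_out)]
```

Because only A and B are trained, the number of trainable parameters per layer drops from d_out · d_in to r · (d_out + d_in).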

Optimiser and Learning Rate: The adamw_torch optimizer was employed for training, with a learning rate of 5e-5 over a single epoch. This configuration proved best among the optimizers and learning rates tested, indicating an efficient training process.
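For reference, the per-parameter update that adamw_torch performs combines decoupled weight decay with a bias-corrected Adam step. A scalar sketch (the default betas, epsilon, and weight decay shown here are standard PyTorch defaults, not necessarily the repository's exact settings):

```python
import math

def adamw_step(p, g, m, v, t, lr=5e-5, b1=0.9, b2=0.999, eps=1e-8, wd=0.01):
    """One AdamW update for a single scalar parameter p with gradient g,
    moment estimates m and v, at step t (1-indexed)."""
    p -= lr * wd * p                           # decoupled weight decay
    m = b1 * m + (1 - b1) * g                  # first-moment EMA
    v = b2 * v + (1 - b2) * g * g              # second-moment EMA
    m_hat = m / (1 - b1 ** t)                  # bias correction
    v_hat = v / (1 - b2 ** t)
    p -= lr * m_hat / (math.sqrt(v_hat) + eps)
    return p, m, v
```

The decay term acts directly on the weights rather than being folded into the gradient, which is what distinguishes AdamW from Adam with L2 regularisation.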

Dependency Management: Poetry is recommended for managing project dependencies, ensuring reproducible builds and virtual environments. The repository targets an NVIDIA RTX 5090 (CUDA) GPU.

Practical inference performance on CUDA GPUs.

We wanted to make sure our fine-tuned model wasn’t just accurate, but also ready for real-world use. So, we put it to the test with our measure_lora_latency.py benchmark on a single 32 GB-VRAM NVIDIA RTX 5090 GPU, using half-precision. After a quick five-batch warm-up, we processed 100,000 new product titles. The 615 M-parameter Qwen3-Embedding LoRA checkpoint consistently delivered impressive latency: 3.3–3.9 ms per title across batch sizes 16, 32, 64, and 128. Batch size 32 gave us the best results, achieving a throughput of ≈ 299 titles · s⁻¹ while keeping per-item latency under 4 ms. Even with the largest batch size we tested (128), latency stayed under 4 ms, showing excellent GPU utilization and almost linear scaling. These numbers mean our classifier can easily handle large-scale batch processing or real-time ingestion pipelines on a single, affordable GPU, with plenty of room for any extra pre- or post-processing.
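The shape of such a benchmark is simple: warm up for a few batches, then time the full pass and derive per-item latency and throughput. A hedged sketch (the real measure_lora_latency.py may differ in details such as CUDA synchronisation and half-precision setup):

```python
import time

def benchmark(fn, items, batch_size, warmup_batches=5):
    """Time a batched inference function `fn` over `items`.
    Returns (seconds per item, items per second), excluding warm-up."""
    batches = [items[i:i + batch_size] for i in range(0, len(items), batch_size)]
    for b in batches[:warmup_batches]:   # warm-up batches are not timed
        fn(b)
    start = time.perf_counter()
    for b in batches:
        fn(b)
    elapsed = time.perf_counter() - start
    return elapsed / len(items), len(items) / elapsed
```

On a GPU, a `torch.cuda.synchronize()` before reading the clock is needed so the timer does not stop while kernels are still running asynchronously.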

Conclusion: A Step Forward in Product Understanding

These experiments show how far you can push an off-the-shelf, open-source embedding model with nothing more than a large, self-labelled product corpus and a lightweight LoRA patch:

  • Massive fine-tuning pay-off. Training on tens of millions of titles from the Web Data Commons LSPC vault reshaped Qwen3-Embedding into a domain-aware vectoriser that hits 0.836 macro-F1 on six broad categories — without touching the full model weights.
  • Real-time speed on a single GPU. The 615 M-parameter LoRA head processes ~300 titles · s⁻¹ at 3–4 ms latency on an RTX 5090, so one card can handle both historic catalogue clean-ups and live ingestion.
  • Plug-and-play for new teams. Checkpoints, training scripts, and benchmarking code are all public, so anyone can replicate, extend, or repurpose the workflow — swapping in their own categories, languages, or product feeds — without rebuilding a model from scratch.

In short, leveraging existing open-source LLM embeddings plus a very large, readily available dataset delivers production-grade product classification with minimal compute and zero licensing friction.