GitHub - mburaksayici/smallevals: smallevals — CPU-fast, GPU-blazing fast offline retrieval evaluation for RAG systems with tiny QA models.


smallevals: Local LLM Evaluation Framework with Tiny 0.6B Models

A lightweight evaluation framework powered by tiny (really tiny) 0.6B models. It runs 100% locally on CPU/GPU/MPS: attach any vector DB connection and run, fast and free.

Evaluation tools that require LLM-as-a-judge or external APIs are costly and don't scale easily. smallevals evaluates in seconds on a GPU and in minutes on any CPU!

Evaluate Retrieval

Evaluating a RAG system covers both the retrieval stage and the generation stage; smallevals currently tests retrieval, with RAG answer evaluation coming in the near future!

Models

Model Name | Task | Status | Link
QAG-0.6B | Generate golden Q/A from chunks or docs (synthetic evaluation data) | Available | 🤗
CRC-0.6B | Context relevance classifier (question ↔ retrieved chunk) | Incoming |
GJ-0.6B | Groundedness / faithfulness judge (answer ↔ context) | Incoming |
ASM-0.6B | Answer correctness / semantic similarity | Incoming |

Current Focus: retrieval evaluation (QAG-0.6B). Once the model reliably generates correct answers and better questions for RAG (it does, but there is still room for improvement), it will become the first model in a pipeline leading to the (RAG) generation evaluation models (CRC-0.6B, GJ-0.6B, ASM-0.6B), which are future work.

Installation
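Assuming the package is published on PyPI under the same name as the import (an assumption; check the repository for the authoritative instructions), installation should be a standard pip install:

pip install smallevals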

Quick Start

Evaluate Retrieval Quality (Python)

Connect to your favourite vector DB (Milvus, Elasticsearch, pgvector, Chroma, Pinecone, Qdrant, FAISS, Weaviate), attach your favourite embedding model, generate questions, and visualise the results!

Under the hood, smallevals generates a question per chunk, queries the vector DB with that question, checks whether the source chunk comes back as the top relevant document, and calculates retrieval scores.

from smallevals import evaluate_retrievals, SmallEvalsVDBConnection

vdb = SmallEvalsVDBConnection(
    connection=chroma_client, # or elastic, milvus, pgvector, pinecone, qdrant, faiss, weaviate
    collection="my_collection",
    embedding=embedding # hf embedding model 
)

# Run evaluation
result = evaluate_retrievals(connection=vdb, top_k=10, n_chunks=200) # Generate a question for each of 200 chunks and test whether they can be retrieved!

Then launch the dashboard and visualise the results!

smallevals dash --host 0.0.0.0 --port 8050 --debug

[smallevals dashboard demo]

ChromaDB
from sentence_transformers import SentenceTransformer
import chromadb
from smallevals import SmallEvalsVDBConnection, evaluate_retrievals

embedding = SentenceTransformer("intfloat/e5-small-v2")

# Connect to your existing ChromaDB (already populated with chunks)
client = chromadb.PersistentClient(path="path/to/chroma")

vdb = SmallEvalsVDBConnection(
    connection=client,
    collection="your_collection_name",
    embedding=embedding,
)

result = evaluate_retrievals(connection=vdb, top_k=10, n_chunks=200)
Elasticsearch
from sentence_transformers import SentenceTransformer
from elasticsearch import Elasticsearch
from smallevals import SmallEvalsVDBConnection, evaluate_retrievals

embedding = SentenceTransformer("intfloat/e5-small-v2")

# Elasticsearch should already have an index with your chunks + dense_vector field
es = Elasticsearch("http://localhost:9200", verify_certs=False)

vdb = SmallEvalsVDBConnection(
    connection=es,
    collection="your_index_name",
    embedding=embedding,
)

result = evaluate_retrievals(connection=vdb, top_k=10, n_chunks=200)
Milvus
from sentence_transformers import SentenceTransformer
from pymilvus import connections, Collection
from smallevals import SmallEvalsVDBConnection, evaluate_retrievals

embedding = SentenceTransformer("intfloat/e5-small-v2")

# Connect to an existing Milvus instance and collection
connections.connect(alias="default", host="localhost", port="19530")
collection = Collection("your_collection_name", using="default")

vdb = SmallEvalsVDBConnection(
    connection=collection,
    collection="your_collection_name",
    embedding=embedding,
)

result = evaluate_retrievals(connection=vdb, top_k=10, n_chunks=200)
pgvector (PostgreSQL)
from sentence_transformers import SentenceTransformer
from sqlalchemy import create_engine
from smallevals import SmallEvalsVDBConnection, evaluate_retrievals

embedding = SentenceTransformer("intfloat/e5-small-v2")

# PostgreSQL must have the pgvector extension enabled and a table with a vector(...) column
engine = create_engine("postgresql://user:password@localhost:5432/dbname")

vdb = SmallEvalsVDBConnection(
    connection=engine,           # or a psycopg2 connection
    collection="your_table_name",
    embedding=embedding,
)

result = evaluate_retrievals(connection=vdb, top_k=10, n_chunks=200)
Qdrant
from sentence_transformers import SentenceTransformer
from qdrant_client import QdrantClient
from smallevals import SmallEvalsVDBConnection, evaluate_retrievals

embedding = SentenceTransformer("intfloat/e5-small-v2")

client = QdrantClient(host="localhost", port=6333)

vdb = SmallEvalsVDBConnection(
    connection=client,
    collection="your_collection_name",
    embedding=embedding,
)

result = evaluate_retrievals(connection=vdb, top_k=10, n_chunks=200)
Weaviate
from sentence_transformers import SentenceTransformer
import weaviate
from smallevals import SmallEvalsVDBConnection, evaluate_retrievals

embedding = SentenceTransformer("intfloat/e5-small-v2")

client = weaviate.connect_to_custom(
    http_host="localhost",
    http_port=8080,
    http_secure=False,
    grpc_host="localhost",
    grpc_port=50051,
    grpc_secure=False,
)

vdb = SmallEvalsVDBConnection(
    connection=client,
    collection="your_collection_name",
    embedding=embedding,
)

result = evaluate_retrievals(connection=vdb, top_k=10, n_chunks=200)
FAISS (in-memory)
from sentence_transformers import SentenceTransformer
from smallevals import SmallEvalsVDBConnection, evaluate_retrievals
from smallevals.vdb_integrations.faiss_con import FaissConnection

embedding = SentenceTransformer("intfloat/e5-small-v2")
embedding_dim = embedding.get_sentence_embedding_dimension()

# Create a FAISS-backed connection and populate it with your vectors
faiss_conn = FaissConnection(
    embedding_model=embedding,
    dimension=embedding_dim,
    index_type="Flat",
    metric="L2",
)

# Populate faiss_conn with your vectors and metadata (see tests/test_faiss.py for a full example)

vdb = SmallEvalsVDBConnection(
    connection=faiss_conn,
    collection=None,  # FAISS has no collection name
    embedding=embedding,
)

result = evaluate_retrievals(connection=vdb, top_k=10, n_chunks=200)

Generate QA from Documents (CLI)

smallevals generate_qa --docs-dir ./documents --num-questions 100

QAG-0.6B

The model was trained on TriviaQA, SQuAD 2.0, and hand-curated synthetic data generated with Qwen-70B, learning to generate a question/answer pair from a chunk or document.

Given the passage below, extract ONE question/answer pair grounded strictly in a single atomic fact.

PASSAGE:
"Eiffel tower is built at 1989"

Return ONLY a JSON object.
{
  "question": "When was the Eiffel Tower completed?",
  "answer": "1889"
}
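
For illustration, here is a rough sketch of how one might call a model like QAG-0.6B directly with Hugging Face transformers, assuming it is published as a plain causal LM on the Hub. The repo id, prompt handling, and generation settings below are assumptions, not the documented smallevals API (which wraps all of this for you):

from transformers import AutoModelForCausalLM, AutoTokenizer
import json

# Hypothetical repo id -- check the 🤗 link in the Models table for the real one.
MODEL_ID = "mburaksayici/QAG-0.6B"

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID)

chunk = "The Eiffel Tower was completed in 1889 for the Exposition Universelle."
prompt = (
    "Given the passage below, extract ONE question/answer pair grounded "
    "strictly in a single atomic fact.\n\n"
    f'PASSAGE:\n"{chunk}"\n\n'
    "Return ONLY a JSON object."
)

inputs = tokenizer(prompt, return_tensors="pt")
output_ids = model.generate(**inputs, max_new_tokens=128)
completion = tokenizer.decode(
    output_ids[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
)
qa_pair = json.loads(completion)  # expected: {"question": "...", "answer": "..."}
print(qa_pair)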

How does it work?

The question generator model reads your chunk, generates a question that the chunk answers, and then tries to match the chunk back via a vector DB query.

This lets you directly test the retrieval pipeline behind your RAG system. Whatever the complexity of your RAG system, you can be sure your vector queries work fine.
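
Conceptually, the retrieval score boils down to something like the sketch below (illustrative only, not the smallevals internals; retrieve_top_k stands in for whatever query your vector DB connection runs):

# For each generated question, check whether its source chunk comes back in the top-k results.
def score_retrieval(questions_with_source_ids, retrieve_top_k, k=10):
    hits, reciprocal_ranks = 0, []
    for question, source_id in questions_with_source_ids:
        retrieved_ids = retrieve_top_k(question, k)  # ids of the top-k retrieved chunks
        if source_id in retrieved_ids:
            hits += 1
            reciprocal_ranks.append(1.0 / (retrieved_ids.index(source_id) + 1))
        else:
            reciprocal_ranks.append(0.0)
    n = len(questions_with_source_ids)
    return {"hit_rate": hits / n, "mrr": sum(reciprocal_ranks) / n}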

Why is this needed?

Other frameworks that rely on external APIs are costly and hard to scale, although they are still more accurate (for now).

Known issues:

  • The model is trained on text/wiki data, so it is biased towards well-structured text.
  • The training dataset contains some overly generic questions; it will be more carefully crafted in v3.
  • Some questions may be generic in this first version, leading to a small decrease in scores. $25 got me this model. Let's see what I can do with more!

Other Models:

Additional models will be trained to eliminate the need for external LLMs:

CRC-0.6B : Context relevance classifier (question ↔ retrieved chunk)
GJ-0.6B : Groundedness / faithfulness judge (answer ↔ context)
ASM-0.6B : Answer correctness / semantic similarity