TeletextSignals - Embeddings & RAG
TL;DR
This is a proof of concept for a local RAG and LLM setup to query your documents. See the final examples in B_2step_rag_example.ipynb and C_agentic_rag_example.ipynb.
Swiss Teletext articles from the last 25 years are used as the example corpus.
Because this runs completely locally, none of your data (neither your queries nor your documents) leaves your device, so this setup could be used to process confidential information.
Intention
This repo is intended as a working proof of concept (PoC) for two things:
- a completely local RAG architecture to query a (relatively large) body of text documents on-device
- some experiments in extracting temporal news signals (work in progress)
Text documents
Swiss news teletext articles in German are used as the underlying source. They are downloaded from teletext.ch. The archive reaches back 25 years and contains >500k entries; see 0_teletext_key_figures.ipynb for details.
The advantage of this corpus is that the articles are relatively short, on-point summaries of recent events.
Retrieval-Augmented Generation (RAG)
Document Preparation (Embedding and Vector Store)
Texts are chunked. (Note: the Teletext articles are usually short enough not to require chunking, but this step keeps the workflow extensible to longer documents, e.g. from other sources.)
Embedding transforms each text into a point in a multidimensional vector space, with the intention that documents with similar semantic meaning cluster close together. We need a model that a) is optimized for query retrieval, b) has multilingual coverage, and c) works well with news articles.
Postgres with pgvector is used as the vector database to efficiently store and query the vector space. A minimal sketch of this preparation step follows the model details below.
- Embedding Model: intfloat/multilingual-e5-large
- Purpose:
  - High-recall semantic retrieval
  - Fine-grained chunk matching
  - RAG-style downstream usage
- Characteristics:
  - Text is chunked (≈800 characters with overlap)
  - Uses the required E5 query: / passage: prefixes
  - 1024-dimensional vectors
  - Stored in PostgreSQL via pgvector
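The following is a minimal sketch of the preparation step, assuming an emb_teletext_chunk table with doc_id, chunk_index, chunk_text and a vector(1024) embedding column, and the pgvector Python package for registering the vector type with psycopg2. Table/column names and helpers are illustrative; the repository's actual schema and scripts may differ.

```python
# Hypothetical chunk-and-embed pipeline; schema and connection details are assumptions.
from sentence_transformers import SentenceTransformer
import psycopg2
from pgvector.psycopg2 import register_vector

model = SentenceTransformer("intfloat/multilingual-e5-large")

def chunk_text(text: str, size: int = 800, overlap: int = 100) -> list[str]:
    """Naive fixed-size character chunking with overlap."""
    chunks, start = [], 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap
    return chunks

conn = psycopg2.connect(host="localhost", port=5433, dbname="teletext", user="postgres")
register_vector(conn)  # lets psycopg2 send numpy arrays as pgvector values

articles = [(1, "Beispielartikel ...")]  # (doc_id, full text) pairs from docs_teletext
with conn, conn.cursor() as cur:
    for doc_id, text in articles:
        chunks = chunk_text(text)
        # E5 models expect the "passage:" prefix on the document side.
        vectors = model.encode([f"passage: {c}" for c in chunks], normalize_embeddings=True)
        for i, (chunk, vec) in enumerate(zip(chunks, vectors)):
            cur.execute(
                "INSERT INTO emb_teletext_chunk (doc_id, chunk_index, chunk_text, embedding) "
                "VALUES (%s, %s, %s, %s)",
                (doc_id, i, chunk, vec),
            )
```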
Semantic Retrieval
Bi-encoder retrieval: starting with a query in natural language, the same embedding model is used to generate the vector representation of the query and retrieve the closest k vectors from the database.
Full-text search: for short (e.g. single-word) queries, the bi-encoder retrieval will fail (see A_retrieval_examples.ipynb for examples). This is why a full-text search (based on PostgreSQL's tsvector) is performed as an additional source of documents.
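The sketch below shows both retrieval paths under the same assumed schema as above, plus a docs_teletext table with id, title, and body columns; it is illustrative rather than the repository's exact code.

```python
# Bi-encoder nearest-neighbour search plus a tsvector full-text fallback.
from sentence_transformers import SentenceTransformer
import psycopg2
from pgvector.psycopg2 import register_vector

model = SentenceTransformer("intfloat/multilingual-e5-large")
conn = psycopg2.connect(host="localhost", port=5433, dbname="teletext", user="postgres")
register_vector(conn)

def semantic_search(query: str, k: int = 10) -> list[dict]:
    # E5 requires the "query:" prefix on the search side.
    qvec = model.encode(f"query: {query}", normalize_embeddings=True)
    with conn.cursor() as cur:
        cur.execute(
            "SELECT d.title, e.chunk_text "
            "FROM emb_teletext_chunk e JOIN docs_teletext d ON d.id = e.doc_id "
            "ORDER BY e.embedding <=> %s LIMIT %s",  # <=> is pgvector's cosine distance
            (qvec, k),
        )
        return [{"title": t, "chunk_text": c} for t, c in cur.fetchall()]

def fulltext_search(query: str, k: int = 10) -> list[dict]:
    # Fallback for short (e.g. single-word) queries where the bi-encoder struggles.
    with conn.cursor() as cur:
        cur.execute(
            "SELECT title, body FROM docs_teletext "
            "WHERE to_tsvector('german', title || ' ' || body) "
            "      @@ plainto_tsquery('german', %s) LIMIT %s",
            (query, k),
        )
        return [{"title": t, "chunk_text": b} for t, b in cur.fetchall()]
```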
Cross-ranking: to increase the quality of the results, a cross-encoder takes the full query together with the full document text and assigns a cross-ranking score. This is repeated for each document retrieved by the bi-encoder retrieval, and the final output is sorted by the cross-ranking score (a sketch follows the model details below).
- Cross Encoder Model: cross-encoder/mmarco-mMiniLMv2-L12-H384-v1
- Input pairs: (query, title + chunk_text)
- Produces higher-precision final ordering
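A sketch of the re-ranking step, reusing the candidate dictionaries from the retrieval sketch above; the pairing of the query with title + chunk_text follows the description in this section, while the helper names are assumptions.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/mmarco-mMiniLMv2-L12-H384-v1")

def rerank(query: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    # Score every (query, title + chunk_text) pair and sort by descending score.
    pairs = [(query, f"{c['title']} {c['chunk_text']}") for c in candidates]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [c for c, _ in ranked[:top_n]]

# Usage: docs = rerank(query, candidates)  # candidates from the retrieval step
```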
Note
Semantic retrieval is already quite a powerful approach for querying a large document corpus; see A_retrieval_examples.ipynb for examples. To have LLMs specifically answer your queries, two approaches are available: a) two-step RAG and b) agentic RAG.
Two-Step RAG (retrieve → generate)
The query is sent directly to the Semantic Retrieval module (as described above), and the resulting documents are passed as context to an LLM along with the query. The LLM is instructed to only use the documents available in the context and to properly cite the sources in its answer; this mitigates hallucinations and makes the sources traceable (see the sketch after the model details below).
See B_2step_rag_example.ipynb for examples.
- LLM model: gemma3:4b-it-qat
- Quantization-aware trained 4B model
- Using ollama
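A minimal two-step RAG sketch using the ollama Python client. The prompt wording and the combination with the retrieval helpers from the previous sketches are illustrative assumptions, not the notebook's exact code.

```python
import ollama

def answer(query: str, docs: list[dict]) -> str:
    # Build a numbered context block so the model can cite sources as [n].
    context = "\n\n".join(
        f"[{i + 1}] {d['title']}\n{d['chunk_text']}" for i, d in enumerate(docs)
    )
    prompt = (
        "Answer the question using ONLY the numbered sources below and cite them as [n]. "
        "If the sources do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {query}"
    )
    response = ollama.chat(
        model="gemma3:4b-it-qat",
        messages=[{"role": "user", "content": prompt}],
    )
    return response["message"]["content"]

# Usage with the retrieval/re-ranking sketches above (assumed combination):
# docs = rerank(query, semantic_search(query) + fulltext_search(query))
# print(answer(query, docs))
```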
Agentic RAG
In contrast to the two-step RAG, the agentic RAG uses the Semantic Retrieval module as a tool that the LLM queries autonomously. The exact search query is generated by the LLM and, if necessary (e.g. if no matching results are returned), can be repeated with a modified query (see the sketch after the model details below).
See C_agentic_rag_example.ipynb for examples.
- LLM model: qwen2.5:7b-instruct
- Quantization-aware trained 7.6B model
- Tool aware
- Using ollama
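A sketch of the agentic loop with Ollama tool calling: the model decides when to call the retrieval tool and can re-query with a reformulated search string. The tool schema and the search_teletext() wrapper are illustrative assumptions built on the earlier sketches.

```python
import json
import ollama

def search_teletext(query: str) -> str:
    # Wraps semantic retrieval + re-ranking from the sketches above.
    docs = rerank(query, semantic_search(query) + fulltext_search(query))
    return json.dumps(docs, ensure_ascii=False)

tools = [{
    "type": "function",
    "function": {
        "name": "search_teletext",
        "description": "Search the Swiss Teletext news archive and return matching articles.",
        "parameters": {
            "type": "object",
            "properties": {"query": {"type": "string", "description": "search query"}},
            "required": ["query"],
        },
    },
}]

messages = [{"role": "user", "content": "Wann wurde der Gotthard-Basistunnel eröffnet?"}]
while True:
    response = ollama.chat(model="qwen2.5:7b-instruct", messages=messages, tools=tools)
    messages.append(response["message"])
    tool_calls = response["message"].get("tool_calls") or []
    if not tool_calls:
        break  # no further tool call requested -> final answer produced
    for call in tool_calls:
        result = search_teletext(**call["function"]["arguments"])
        messages.append({"role": "tool", "content": result})

print(messages[-1]["content"])
```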
Requirements
Hardware
- GPU: NVIDIA GPU with ≥ 4 GB VRAM
- Tested successfully on Quadro T2000 Max-Q
- CPU: Any modern x86_64 CPU
- RAM: ≥ 16 GB recommended
- Disk: SSD strongly recommended (vector indexes)
Notes
- multilingual-e5-large is GPU-heavy
- For lower-VRAM environments, multilingual-e5-small might be an alternative (not tested)
Software Stack
- Python 3.10+
- PostgreSQL 14+ with pgvector
- PyTorch
- Hugging Face Transformers
- Sentence-Transformers
- LangChain (HuggingFaceEmbeddings wrapper)
- See pyproject.toml for all libraries
Docker
- Postgres on port 5433
- Ollama on port 11434
- Pull gemma3:4b-it-qat and qwen2.5:7b-instruct once on first run (see the sanity-check sketch below)
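A small Python sanity check after the containers are up, assuming default credentials and a database named teletext (adjust to your compose file):

```python
import ollama
import psycopg2

# Pull the two models through the Ollama container on port 11434.
client = ollama.Client(host="http://localhost:11434")
for model in ("gemma3:4b-it-qat", "qwen2.5:7b-instruct"):
    client.pull(model)  # already-downloaded layers are skipped

# Verify Postgres on port 5433 and the pgvector extension.
conn = psycopg2.connect(host="localhost", port=5433, dbname="teletext",
                        user="postgres", password="postgres")
with conn.cursor() as cur:
    cur.execute("SELECT extversion FROM pg_extension WHERE extname = 'vector'")
    print("pgvector version:", cur.fetchone())
```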
Scripts and notebooks
Scripts (/scripts)
- scripts/1_fetch_teletext.py: Fetch Swiss Teletext articles from the API and upsert them into docs_teletext.
- scripts/2a_embed_articles_E5.py: Chunk and embed articles with intfloat/multilingual-e5-large and store chunk vectors in emb_teletext_chunk.
Notebooks
- 0_teletext_key_figures.ipynb: Corpus size and summary statistics for the Teletext archive.
- A_retrieval_examples.ipynb: Semantic vs. full-text retrieval examples and failure modes.
- B_2step_rag_example.ipynb: Two-step RAG pipeline example.
- C_agentic_rag_example.ipynb: Agentic RAG workflow example.

