I work on the Search and Browse team at Zalando, and I recently published a post detailing Zalando's search system. Modern web search isn't magic; it's a massive engineering feat. Crawlers scour billions of pages, hybrid indexes organise that data, query processors infer user intent, ranking cascades order candidates in milliseconds, and relevance feedback loops continually refine the models. In this article, we go inside that complete pipeline, from crawl to ranking to AI-generated answers and back again, to explain how engines like Google, Bing and Yandex deliver relevant results and learn from user behaviour to keep getting smarter.
Crawl and Canonicalise
Everything starts with polite crawlers (Googlebot, Bingbot, YandexBot) exploring the web. They fetch pages, respect robots.txt directives and render JavaScript when needed. Crawled content is canonicalised (URL normalisation, removing tracking parameters, etc.) and duplicate or near‑duplicate pages are removed using techniques like MinHash and SimHash. This deduplication ensures that identical content doesn't appear multiple times in search results and improves indexing efficiency. Google's "How Search works" documentation notes that their ranking systems evaluate hundreds of billions of pages, so deduplication and canonicalisation are essential to avoid clutter.
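To make the near‑duplicate step concrete, here is a minimal SimHash sketch. It uses word-level tokens and MD5 as a stand-in hash function; production crawlers use richer features (shingles, weights) and a tuned Hamming-distance threshold, so treat this as an illustration of the idea rather than a real implementation.

```python
import hashlib

def simhash(text, bits=64):
    """SimHash: each token hashes to a bit pattern and votes per bit;
    the sign of each bit's vote total forms the fingerprint, so texts
    sharing most tokens end up with nearby fingerprints."""
    votes = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")

original = simhash("the quick brown fox jumps over the lazy dog")
near_dup = simhash("the quick brown fox jumped over the lazy dog")
# Near-duplicates differ in only a few of the 64 bits, so a small
# Hamming-distance threshold flags them as duplicates.
```

Because the fingerprint is a fixed-width integer, billions of pages can be deduplicated by bucketing fingerprints rather than comparing full texts pairwise.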
Indexing: Building Searchable Structures
After fetching and cleaning, the content is transformed into structures optimised for retrieval and ranking:
- Inverted index (BM25/BM25F) — maps terms to documents, enabling fast keyword look‑ups.
- Vector index (HNSW/ANN) — stores dense vector embeddings computed from document contents to support semantic search.
- Link graph and structured metadata — collects link information (used by PageRank and similar algorithms) and metadata like language, location or entity annotations.
These structures allow engines to mix traditional lexical search with neural search. Modern open‑source search stacks (Lucene, Solr, Elasticsearch) now ship hybrid retrieval (sparse+dense) out of the box.
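As a sketch of the sparse side, here is a toy inverted index with BM25 scoring. The three-document corpus, whitespace tokenisation and parameter values (k1, b) are illustrative defaults, not what any production engine uses.

```python
import math
from collections import defaultdict

docs = {
    "d1": "search engines crawl and index the web",
    "d2": "neural networks embed queries into vectors",
    "d3": "hybrid search combines keyword and vector retrieval",
}

# Build the inverted index: term -> {doc_id: term frequency}.
index = defaultdict(dict)
doc_len = {}
for doc_id, text in docs.items():
    tokens = text.lower().split()
    doc_len[doc_id] = len(tokens)
    for t in tokens:
        index[t][doc_id] = index[t].get(doc_id, 0) + 1

avg_len = sum(doc_len.values()) / len(doc_len)

def bm25(query, k1=1.5, b=0.75):
    """Score every matching document against the query with BM25."""
    scores = defaultdict(float)
    n_docs = len(docs)
    for term in query.lower().split():
        postings = index.get(term, {})
        if not postings:
            continue
        idf = math.log(1 + (n_docs - len(postings) + 0.5) / (len(postings) + 0.5))
        for doc_id, tf in postings.items():
            denom = tf + k1 * (1 - b + b * doc_len[doc_id] / avg_len)
            scores[doc_id] += idf * tf * (k1 + 1) / denom
    return sorted(scores.items(), key=lambda kv: -kv[1])

ranked = bm25("vector search")  # d3 matches both terms and ranks first
```

The vector index side replaces the term lookup with an ANN search over embeddings; the rest of the plumbing (candidate lists with scores) looks the same.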
Query understanding: intent, entities and contextual filters
Modern search engines don’t simply match strings; they pre‑process every query to understand what you really mean. Deep‑learning systems first normalise and tokenise the input — splitting it into meaningful pieces and correcting misspellings or odd phrasing. Google notes that roughly 15 % of searches each day are brand‑new, so this step is critical for handling unfamiliar queries.
Next, AI models recognise entities and concepts within the query. RankBrain and Neural Matching map words to real‑world concepts and find broader connections beyond exact keyword matches. BERT examines word order and context, ensuring small but crucial words (like prepositions) aren't ignored. For more details on how these AI systems work, read Google's blog post "How AI powers great search results".
After the query is tokenised and entities are identified, the engine classifies the intent — is the user looking for general information, a specific site, a product, or something local? — and routes the query to the most appropriate vertical (web, images, news, maps, or shopping). It then applies contextual filters: language settings, geographic location and device type all influence which results are surfaced. Google's multilingual‑search guide explains that Search automatically determines which languages to show results in based on the user's browser, device and language settings rather than forcing results into a single language.
Because the engine tracks how many documents are retrieved at each step, it can relax or broaden filters when very few results are found (for example, dropping strict locale constraints or expanding synonym matches) to avoid zero‑result pages. Light personalisation, when enabled, further tweaks the ordering using signals from past searches or clicks. By capturing entities, intent, and context, and by dynamically adjusting filters as needed, the query‑understanding layer transforms raw input into structured metadata that guides retrieval and ranking.
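The relax-when-sparse behaviour can be sketched in a few lines. Everything here is hypothetical — the mini-corpus, the `retrieve` function and the locale filter are made up for illustration — but the control flow (strict filters first, broaden on low recall) mirrors the idea described above.

```python
def retrieve(query, locale=None, min_results=3):
    """Hypothetical sketch: apply strict contextual filters first,
    then relax them if too few documents match."""
    corpus = [  # toy corpus, purely illustrative
        {"id": 1, "text": "best pizza near me", "locale": "en-US"},
        {"id": 2, "text": "pizza dough recipe", "locale": "en-GB"},
        {"id": 3, "text": "pizza margherita history", "locale": "it-IT"},
    ]
    terms = set(query.lower().split())
    matches = [d for d in corpus if terms & set(d["text"].split())]
    strict = [d for d in matches if locale is None or d["locale"] == locale]
    if len(strict) >= min_results:
        return strict
    # Too few hits: drop the locale constraint to avoid a zero-result page.
    return matches

results = retrieve("pizza", locale="en-US")  # falls back to all matches
```

Real systems track retrieved-document counts at every stage and have many more relaxation levers (synonym expansion, spelling variants), but the principle is the same.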
Hybrid Retrieval: Sparse and Dense Candidates
Once the query is parsed, the engine searches for documents in two complementary ways:
- Sparse retrieval (keyword‑based) — an inverted index (e.g., BM25/BM25F) returns documents containing the terms and ranks them by term frequency, document length and saturation.
- Dense retrieval (neural) — bi‑encoder or late‑interaction models (e.g., DSSM, ANCE, ColBERT) embed queries and passages into a shared vector space and use approximate nearest‑neighbour (ANN) search (HNSW or IVF‑PQ) to find semantically similar candidates.
See Figure 2 for the four neural matching paradigms: (a) representation‑based bi‑encoders (e.g., DSSM/ANCE) for first‑stage retrieval; (b) query–document interaction; (c) all‑to‑all cross‑encoders (e.g., BERT) for top‑K re‑ranking; and (d) late interaction (e.g., ColBERT), which engines typically fuse with BM25 in hybrid systems.
The engine then pools/fuses candidates from both methods (e.g., weighted union or RRF) and passes the set to the ranking cascade — boosting recall for concept‑level matches while preserving precision and latency via fast lexical search.
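Reciprocal Rank Fusion (RRF), one of the fusion options mentioned above, is simple enough to show in full. The document IDs and list contents below are illustrative.

```python
def rrf(rankings, k=60):
    """Reciprocal Rank Fusion: each ranked list contributes 1/(k + rank)
    per document, so items ranked well by either retriever rise to the
    top. k = 60 is the constant used in the original RRF paper."""
    scores = {}
    for ranking in rankings:
        for rank, doc in enumerate(ranking, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

sparse_hits = ["d1", "d2", "d3"]  # e.g. BM25 order
dense_hits = ["d3", "d4", "d1"]   # e.g. ANN order
fused = rrf([sparse_hits, dense_hits])
# d1 and d3 appear in both lists, so they lead the fused ranking.
```

RRF is popular precisely because it needs no score calibration between the sparse and dense retrievers — only ranks matter.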
Ranking Cascade and Whole‑Page Layout
Modern engines use a multi‑stage ranking cascade that progressively reorders and filters candidates to produce a concise, diverse and safe result set. A typical pattern is:
- Pre‑rank (GBDT) — light gradient‑boosted trees (MatrixNet/CatBoost at Yandex; LambdaMART/LightGBM at Google/Microsoft) combine hundreds of inexpensive features: BM25 signals, link/authority, freshness, click priors, dwell time, and geo relevance.
- Neural re‑rank (cross‑encoders / late‑interaction) — transformer models rescore the top‑K by computing deep query–document similarity.
- Diversity & policy controls — systems reduce repetition and enforce safety/policy. Google publicly documents a site diversity system that generally limits the same site to no more than two listings in top results (with exceptions when especially relevant), and describes deduplication to avoid near‑duplicate results; see Google’s A guide to Search ranking systems.
- Whole‑page layout — composes the SERP: blends web results with verticals (news, images, maps, shopping), chooses block order, and integrates ads. Bing has described this as optimising whole‑page relevance; Google enumerates the ranking systems that drive these decisions in its guide.
The cascade is like a factory line: cheap models quickly trim the pool; heavy models refine ordering; and final systems ensure variety, safety and business requirements are met.
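That factory line can be sketched as three stages. The scoring callables below are placeholders for the GBDT pre-ranker and neural re-ranker (real systems score with trained models over hundreds of features); the per-site cap mirrors the site diversity idea described above.

```python
def cascade(candidates, cheap_score, expensive_score, k=4, per_site=2):
    """Illustrative three-stage ranking cascade."""
    # Stage 1: cheap pre-rank trims the candidate pool to the top k.
    pool = sorted(candidates, key=cheap_score, reverse=True)[:k]
    # Stage 2: the expensive model rescores only the survivors.
    pool.sort(key=expensive_score, reverse=True)
    # Stage 3: diversity control caps listings per site.
    kept_per_site, final = {}, []
    for doc in pool:
        n = kept_per_site.get(doc["site"], 0)
        if n < per_site:
            final.append(doc)
            kept_per_site[doc["site"]] = n + 1
    return final

docs = [{"id": i, "site": f"s{i % 2}", "bm25": i, "neural": -i}
        for i in range(6)]
results = cascade(docs, lambda d: d["bm25"], lambda d: d["neural"])
# Pre-rank keeps ids 5,4,3,2; the re-ranker reorders them to 2,3,4,5.
```

The key economic point is that the expensive model only ever sees `k` documents, which is what keeps end-to-end latency bounded.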
LLM Answers: Retrieval‑Augmented Generation (RAG)
Since 2024, search engines have been overlaying AI‑generated answers on the SERP. Google’s AI Overviews and Bing’s Copilot Search use retrieval‑augmented generation (RAG):
- The engine runs the hybrid retrieval pipeline to fetch relevant passages.
- Selected passages are fed into a large language model (LLM) as context.
- The LLM generates a concise answer or overview, citing the sources.
This reduces hallucinations because the LLM is grounded in retrieved documents. Search engines decide when to show AI answers based on query intent and freshness signals, and they provide citations so users can verify the information. Ads can appear above, below or within the AI answer when relevant; policy rules govern their inclusion and labelling.
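The grounding step boils down to prompt assembly. This is a generic sketch, not any engine's actual prompt: the function name, instruction wording and passage format are invented for illustration.

```python
def build_rag_prompt(query, passages, max_passages=3):
    """Assemble a grounded prompt: retrieved passages become numbered
    context the LLM is instructed to cite, keeping the generated
    answer tied to verifiable sources."""
    context = "\n".join(
        f"[{i}] {p['text']} (source: {p['url']})"
        for i, p in enumerate(passages[:max_passages], start=1)
    )
    return (
        "Answer the question using ONLY the passages below, and cite\n"
        "each claim with its passage number like [1].\n\n"
        f"Passages:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

passages = [  # illustrative retrieved passages
    {"text": "BM25 ranks documents by term statistics.", "url": "example.com/a"},
    {"text": "Dense retrieval uses vector embeddings.", "url": "example.com/b"},
]
prompt = build_rag_prompt("What is hybrid retrieval?", passages)
```

The citation markers the model emits are then mapped back to the source URLs, which is how the SERP can render clickable references under the AI answer.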
Feedback, Evaluation and Training
Search engines continuously improve by closing the loop between user signals and model updates:
- Signals — clicks, dwell time, reformulations and explicit ratings indicate satisfaction or dissatisfaction.
- Judgments — human assessors provide relevance labels following detailed guidelines; small randomised buckets or interleaving tests collect unbiased click data.
- Evaluation — offline quality is typically tracked with NDCG@k and Precision/Recall@k. NDCG measures ranked‑list quality with graded relevance: DCG sums each result's gain discounted by its rank, and NDCG normalises by the ideal DCG so scores lie in [0, 1] and are comparable across queries. Precision is the fraction of retrieved items that are relevant, while Recall is the fraction of all relevant items that were retrieved.
- Training & deployment — models are retrained on fresh signals and editorial judgments, then must beat a strong baseline in offline evaluation before moving to online tests. Only candidates that pass offline checks go to controlled online evaluation (typically interleaving or A/B tests on a small share of traffic) to verify real user impact. To reduce presentation bias and collect debiased signals for future training, a small randomised traffic slice (exploration buckets) continuously introduces position shuffles or uniform sampling, enabling propensity‑scored learning from clicks.
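The NDCG definition above translates directly into code. Graded relevance labels (here 0–3) and the log2 discount are the standard formulation; the example lists are illustrative.

```python
import math

def dcg(gains):
    """Discounted cumulative gain: each result's gain is discounted
    by log2 of its (1-based) rank plus one."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(relevances, k):
    """NDCG@k = DCG of the ranked list divided by DCG of the ideal
    (descending) ordering, so scores lie in [0, 1]."""
    ideal_dcg = dcg(sorted(relevances, reverse=True)[:k])
    return dcg(relevances[:k]) / ideal_dcg if ideal_dcg > 0 else 0.0

perfect = ndcg_at_k([3, 2, 1, 0], k=4)   # ideally ordered list -> 1.0
inverted = ndcg_at_k([0, 1, 2, 3], k=4)  # worst ordering -> below 1.0
```

Because the discount shrinks with rank, mistakes near the top of the list cost far more NDCG than the same mistakes further down — which matches how users actually scan results.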
Comparing Google, Bing and Yandex
While all major engines share this high‑level architecture, their emphasis and implementations differ:
- Google — public documentation emphasises a suite of ranking systems that process hundreds of billions of pages using many signals. Google lists systems like BERT, neural matching, RankBrain and the helpful content system, and notes that ranking happens at the page level with continuous testing and improvements. Mobile‑first indexing and Core Web Vitals influence rankings. Ads may appear within AI overviews.
- Bing/Microsoft — Bing’s retrieval and ranking are powered by the Turing model family. The company discusses whole‑page relevance and uses the Copilot interface to blend generative answers with search results. Ads are context‑sensitive and appear alongside AI summaries.
- Yandex — historically uses MatrixNet (GBDT) and CatBoost for ranking and integrates a transformer ranker called YATI for semantic matching. Yandex places strong emphasis on regional signals and provides localised results. Public announcements highlight improvements in ranking quality rather than generative answers.
Why This Matters
Understanding the architecture behind search engines helps users and developers appreciate why results look the way they do and why certain queries trigger AI responses while others do not. It also highlights the balance these systems must strike: retrieving comprehensive information, ranking it by relevance and quality, injecting diversity and safety, blending verticals and adverts, and now generating summarised answers, all under strict latency constraints.
As LLMs become more capable, the retrieval‑ranking‑generation stack will only get deeper: models will rewrite queries, plan multi‑hop retrieval, generate structured responses and even evaluate their own output. The research community is actively exploring new learning paradigms, including LLM‑as‑judge evaluation and counterfactual learning, to scale training and evaluation across billions of queries.
Conclusion
Modern web search engines are sophisticated pipelines combining information retrieval, machine learning, neural ranking and generative AI. They start with crawling and canonicalising web content, build rich indices, understand user queries through intent and entity recognition, retrieve candidates via hybrid methods, rank and blend results through cascades of models and rules, and increasingly provide AI‑generated overviews grounded in retrieved passages. Continuous feedback from users and human judges keeps these systems improving.