Information Retrieval & Search

Search is so ordinary we forget how remarkable it is: you type a few words and, from a corpus of billions of documents, the most relevant ones come back ranked, in milliseconds. Information retrieval (IR) is the discipline behind it — the science of finding and ranking documents by relevance to a query — and it's one of the oldest and most refined areas of computing. It's also having a renaissance: the same retrieval machinery, upgraded with semantic embeddings, is how modern AI systems fetch the knowledge they reason over.

This page is the working landscape: how search is made fast, how classic ranking works (and the BM25 function that still quietly powers most search bars), how you measure whether results are good, and the shift to semantic search that closed the gap classic methods could never reach. It builds on the NLP page and ties to recommenders — both are ranking problems.

Finding the needle

The IR problem: given a query and a huge corpus of documents, return the documents ranked by relevance to that query. Two things make it hard. First, scale — you can't read every document for every query, so you need a structure that finds candidates fast. Second, relevance — "relevant" is fuzzy and human, and turning a few query words into a good ranking is the whole art. The field tackles these in two stages: a fast structure to retrieve candidates, then a ranking function to order them.

The inverted index: why search is fast

The structure that makes search possible is the inverted index. Instead of storing "document → its words" (which would mean scanning every document), it flips it to "word → the documents containing it." For each term in the vocabulary, the index keeps a list of every document that contains it.

Now a query is fast: look up each query term, grab its (pre-computed) list of documents, and intersect or union the lists — you instantly have the small set of candidate documents worth scoring, without touching the billions that don't contain any query word. It's the same idea as a book's index versus reading the whole book, and it's what turns an impossible scan into a millisecond lookup. The ranking happens only on that candidate set.

TF-IDF & the vector space model

Once you have candidates, how do you score relevance? The classic answer represents both query and document as vectors of word weights — the vector space model — using TF-IDF, the same weighting from the NLP page. The intuition is two-part:

Term frequency (TF) — a word appearing often in a document signals that document is about it.
Inverse document frequency (IDF) — a word appearing in few documents is more discriminating; "the" is everywhere and useless, a rare term is informative. IDF down-weights the common, up-weights the rare.

Multiply them and a document scores highly when it contains the query's rare, distinctive terms often. It's simple, interpretable, and surprisingly effective — but it has rough edges that the next refinement smooths out.

BM25: the workhorse that still wins

The ranking function that has powered search engines for decades — and remains a tough baseline today — is BM25. It's a refinement of TF-IDF that fixes two of its flaws, and its score for a document $D$ given query $Q$ sums over the query terms:

\text{BM25}(D, Q) = \sum_{t \in Q} \text{IDF}(t) \cdot \frac{f(t, D)\,(k_1 + 1)}{f(t, D) + k_1\!\left(1 - b + b\,\frac{|D|}{\text{avgdl}}\right)}

The two fixes, both visible in that formula, are what make it better than raw TF-IDF:

Term-frequency saturation — a word appearing 100 times isn't 100× more relevant than once. The $k_1$ term makes the contribution saturate, so extra repetitions add less and less. (Plain TF-IDF grows linearly forever.)
Length normalisation — long documents naturally contain more words, which would unfairly inflate their scores. The $b$ term, dividing by document length over the average, corrects for that.

BM25 is fast, interpretable, needs no training, and is genuinely hard to beat — which is why it's still the default in most production search and a key half of modern hybrid systems.

Measuring relevance

How do you know your ranking is good? IR has its own evaluation metrics, and the key shift from classification is that order matters — a relevant result at position 1 is worth far more than the same result at position 50. Beyond plain precision and recall at the top-k, the workhorses are MAP (mean average precision) and NDCG (normalised discounted cumulative gain), which reward putting the most relevant results highest — the same ranking-quality idea as in recommender systems, because both are fundamentally "order the list well" problems.

The lexical gap

For all its strengths, classic keyword search (BM25 included) has one fundamental blind spot: it matches words, not meaning. Search for "car" and a document that only says "automobile" scores zero — no shared term, no match, despite identical meaning. This lexical gap (vocabulary mismatch) is the ceiling on lexical methods: synonyms, paraphrases, and related concepts are invisible to a system that only counts exact word overlaps. Closing it requires understanding meaning, not matching strings — which is what the embedding revolution finally delivered.

Semantic search: dense retrieval

The modern answer is dense retrieval. Instead of sparse word-count vectors, encode the query and every document as dense embeddings — vectors that capture meaning, so "car" and "automobile" land close together. Retrieval becomes finding the document vectors nearest the query vector in that semantic space, which leaps the lexical gap: it matches on concept, not keyword.

Sparse vs dense retrieval. Sparse (BM25) matches exact query terms via the inverted index — fast and precise, but blind to synonyms. Dense matches meaning via embedding similarity (nearest-neighbour search) — catches paraphrases, but heavier. Hybrid runs both and fuses the rankings.

Dense retrieval needs approximate nearest-neighbour search (algorithms like HNSW, libraries like FAISS) to find close vectors among millions quickly, and it's heavier and less precise on exact terms (names, codes, rare jargon) than BM25. So the modern best practice is hybrid search: run sparse and dense, fuse the rankings, optionally add a neural re-ranker on the top results. This retrieve-then-rank pipeline is exactly the R in RAG — the retrieval step that grounds a large language model in real documents instead of its memory, which is why decades-old IR is suddenly at the centre of modern AI.

Where it shows up in my work

Searching the haystack well

Much intelligence and analytical work begins with the same problem IR solves: finding the relevant documents in a very large collection. Knowing how search actually ranks changes how I use it — understanding BM25 (it matches exact terms and rewards rare, distinctive words) explains why a query succeeds or fails, and knowing the lexical gap explains why a keyword search misses a document that used a synonym, which is exactly the blind spot that loses important material.

It's increasingly practical, too: semantic/dense retrieval and hybrid search are how you find conceptually-related material that keyword search can't, and the retrieve-then-rank pipeline is the engine under the LLM-powered tools entering the toolkit (RAG). It ties straight to NLP (the embeddings), recommenders (ranking), and OSINT (searching the open-source haystack).

Refresh in 60 seconds

The inverted-index/BM25 foundations, the sparse-vs-dense framing, and hybrid-search/RAG practice reflect current information-retrieval references alongside coursework.