When we talk about information retrieval, as SEO pros, we tend to focus heavily on the information collection stage – the crawling.
During this phase, a search engine discovers and crawls the URLs it has access to (the volume and breadth depending on other factors we colloquially refer to as crawl budget).
The crawl phase isn’t something we’re going to focus on in this article, nor am I going to go in-depth on how indexing works.
If you want to read more on crawl and indexing, you can do so here.
In this article, I will cover some of the basics of information retrieval, which, when understood, could help you better optimize web pages for ranking performance.
It can also help you better analyze algorithm changes and search engine results page (SERP) updates.
To understand and appreciate how modern search engines handle information retrieval in practice, we need to understand the history of information retrieval on the internet, particularly how it relates to search engine processes.
Regarding digital information retrieval and the foundation technologies adopted by search engines, we can go back to the 1960s and Cornell University, where Gerard Salton led a team that developed the SMART Information Retrieval System.
Salton is credited with developing and using vector space modeling for information retrieval.
Vector Space Models
Vector space models are accepted in the data science community as a key mechanism in how search engines “search” and platforms such as Amazon provide recommendations.
This method allows a processor, such as Google, to compare different documents with queries when both documents and queries are represented as vectors.
Google has referred to this in its documents as vector similarity search, or “nearest neighbor search,” defined by Donald Knuth in 1973.
In a traditional keyword search, the processor would use keywords, tags, labels, etc., within the database to find relevant content.
This is quite limited: it narrows the search field within the database because each match is a binary yes or no, and it struggles with synonyms and related entities.

To combat this, and to provide results for queries with multiple common interpretations, Google uses vector similarity to tie various meanings, synonyms, and entities together. The closer two entities are in vector space, the smaller the distance between their vectors, and the more similar they are deemed to be.
A good example of this is when you Google my name.
To Google, [dan taylor] can be:
- I, the SEO person.
- A British sports journalist.
- A local news reporter.
- Lt Dan Taylor from Forrest Gump.
- A photographer.
- A model-maker.
Using a traditional keyword search with binary yes/no matching, you wouldn’t get this spread of results on page one.
With vector search, the processor can produce a search results page based on similarity and relationships between different entities and vectors within the database.
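As an illustration, a nearest neighbor search over term-count vectors can be sketched in a few lines of Python. This is a toy model with a made-up vocabulary and document set, not Google’s actual implementation:

```python
import math

# Hypothetical toy corpus: each "document" is a vector of term counts
# over a shared vocabulary ["dan", "taylor", "seo", "journalist", "photo"].
documents = {
    "seo-consultant":    [1, 1, 3, 0, 0],
    "sports-journalist": [1, 1, 0, 3, 0],
    "photographer":      [1, 1, 0, 0, 3],
}

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def nearest_neighbors(query_vector, docs):
    """Rank every document by similarity to the query vector."""
    scored = [(name, cosine_similarity(query_vector, vec))
              for name, vec in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# The query [dan taylor seo] expressed as counts over the same vocabulary.
query = [1, 1, 1, 0, 0]
results = nearest_neighbors(query, documents)
```

Here, the query scores highest against the SEO document even though every document mentions “dan taylor,” because similarity is graded rather than a binary yes/no.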
You can read the company’s blog here to learn more about how Google uses this across multiple products.
When comparing documents in this way, search engines likely use a combination of Query Term Weighting (QTW) and the Similarity Coefficient.
QTW applies a weighting to specific terms in the query; the weighted query is then compared against documents in the vector space model, with similarity calculated using the cosine coefficient.
The cosine similarity measures the similarity between two vectors and, in text analysis, is used to measure document similarity.
This is a likely mechanism in how search engines determine duplicate content and value propositions across a website.
Cosine similarity is measured between -1 and 1. Because document vectors built from term frequencies contain no negative values, in practice it is plotted between 0 and 1, with 0 being maximum dissimilarity (orthogonal vectors) and 1 being maximum similarity.
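The weighting-and-cosine idea can be sketched as follows. The term weights here are invented purely for illustration; real systems derive them from signals we don’t have visibility into:

```python
import math

def weighted_cosine(query_weights, doc_weights):
    """Cosine similarity over sparse term -> weight mappings."""
    shared = set(query_weights) & set(doc_weights)
    dot = sum(query_weights[t] * doc_weights[t] for t in shared)
    norm_q = math.sqrt(sum(w * w for w in query_weights.values()))
    norm_d = math.sqrt(sum(w * w for w in doc_weights.values()))
    return dot / (norm_q * norm_d)

# Hypothetical query term weighting: "coefficient" matters more than "the".
query = {"similarity": 1.0, "coefficient": 2.0, "the": 0.1}
doc_a = {"similarity": 0.8, "coefficient": 1.5, "measures": 0.4}
doc_b = {"recipe": 1.2, "chocolate": 2.0}

score_a = weighted_cosine(query, doc_a)  # shares terms, so scores above 0
score_b = weighted_cosine(query, doc_b)  # no shared terms, so scores 0.0
```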
The Role Of An Index
In SEO, we talk a lot about the index, indexing, and indexing problems – but we don’t actively talk about the role of the index in search engines.
The purpose of an index is to store information, which Google does through tiered indexing systems and shards, to act as a data reservoir.
That’s because it’s unrealistic, unprofitable, and a poor end-user experience to remotely access (crawl) webpages, parse their content, score it, and then present a SERP in real time.
Typically, a modern search engine index doesn’t contain a complete copy of each document; it is more a database of key points and data that has been tokenized, with the document itself living in a separate cache.
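A minimal version of this idea is an inverted index, which stores, for each token, the IDs of the documents containing it rather than the documents themselves. This is a simplified sketch, not Google’s tiered and sharded architecture:

```python
from collections import defaultdict

# Toy document store; in practice full copies would live in a separate cache.
documents = {
    1: "search engines crawl and index pages",
    2: "vector search compares documents as vectors",
    3: "pages are stored in an index",
}

def build_inverted_index(docs):
    """Map each token to the set of document IDs that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            index[token].add(doc_id)
    return index

index = build_inverted_index(documents)

# Lookup becomes a cheap set operation rather than re-crawling every page.
matches = index["search"] & index["vectors"]  # docs containing both terms
```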
While we don’t know exactly which processes search engines such as Google go through as part of their information retrieval systems, they will likely have stages of:
- Structural analysis – Text format and structure, lists, tables, images, etc.
- Stemming – Reducing variations of a word to its root. For example, “searched” and “searching” would be reduced to “search.”
- Lexical analysis – Conversion of the document into a list of words, then parsing to identify important factors such as dates, authors, and term frequency. (Note: this is not the same as TF*IDF.)
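The stemming and lexical analysis steps above can be sketched crudely in Python. The suffix-stripping rule here is deliberately naive and purely illustrative; production systems use proper algorithms such as Porter stemming:

```python
import re
from collections import Counter

def crude_stem(word):
    """Naive suffix stripping: reduces "searched" and "searching" to "search"."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def lexical_analysis(text):
    """Tokenize the text, stem each token, and count term frequency."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(crude_stem(t) for t in tokens)

counts = lexical_analysis("Searched and searching both reduce to search.")
```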
We’d also expect that, during this phase, other considerations and data points are taken into account, such as backlinks, source type, whether the document meets the quality threshold, internal linking, and main content versus supporting content.
Accuracy & Post-Retrieval
In 2016, Paul Haahr gave great insight into how Google measures the “success” of its process and also how it applies post-retrieval adjustments.
You can watch his presentation here.
In most information retrieval systems, there are two primary measures of how successful the system is in returning a good results set.
These are precision and recall.
Precision: the number of relevant documents returned versus the total number of documents returned.
Many websites have seen drops in the total number of keywords they rank for in recent months (particularly weird, edge-case keywords they arguably had no right to rank for). We can speculate that search engines are refining their information retrieval systems for greater precision.
Recall: the number of relevant documents returned versus the total number of relevant documents that exist.
Search engines lean toward precision over recall, as precision leads to better search results pages and greater user satisfaction. It is also less system-intensive than returning and processing more documents than required.
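Both measures are simple ratios and can be computed directly. The figures below are hypothetical, chosen only to show the calculation:

```python
def precision(returned, relevant):
    """Share of the returned documents that are relevant."""
    if not returned:
        return 0.0
    return len(returned & relevant) / len(returned)

def recall(returned, relevant):
    """Share of all relevant documents that were actually returned."""
    if not relevant:
        return 0.0
    return len(returned & relevant) / len(relevant)

# Hypothetical result set: 10 docs returned, 8 of them relevant,
# out of 20 relevant docs in the whole index.
returned = set(range(10))        # doc IDs 0-9
relevant = set(range(2, 22))     # doc IDs 2-21

p = precision(returned, relevant)  # 8 / 10 = 0.8
r = recall(returned, relevant)     # 8 / 20 = 0.4
```

A high-precision, lower-recall system (like the one above) shows users mostly good results at the cost of leaving some relevant documents unreturned.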
The practice of information retrieval can be complex due to the different formulas and mechanisms used.
As we don’t fully know or understand how this process works inside search engines, we should focus on the basics and the official guidelines provided, rather than trying to game metrics like TF*IDF that may or may not be used (and that vary in how much they weigh into the overall outcome).
Featured Image: BRO.vector/Shutterstock