Token and Embedding: AI and LLM concepts that are used in SEO

Tokens and embeddings are concepts that are very much in focus today, whether in the study and application of Artificial Intelligence (AI) and Large Language Models (LLMs), or in the development of agents, applications, tools, and businesses, but they have been present in SEO for a long time! You might say to me: "But Alex, how so? I only started hearing about this after GPTs and similar technologies took over everything!"

So stick with me, I'll show you how this used to work, and how it works today, in the search engine landscape. Let's start from the beginning with tokens.

What are tokens?

A token is an individual unit of text. Imagine a sentence broken down into its smallest meaningful parts; these parts are the tokens. To illustrate, let's take a simple sentence:

In the sentence: "Semantic Search improves search quality."

When we apply basic tokenization (separation by spaces and punctuation), the tokens would be:

[“Semantic”, “Search”, “improves”, “search”, “quality”, “.”]

In this example, each word and the period are considered distinct tokens, given that this system performs basic tokenization. More sophisticated systems could, for example, treat "Semantic Search" as a single token if it were a named entity or a frequently searched concept, especially if it appeared often in the corpus of texts used for training.
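To make this concrete, here is a minimal sketch in Python of this kind of basic tokenization (separation by spaces and punctuation). The regular expression is just one simple way to do it, not how any particular search engine does it:

```python
import re

def basic_tokenize(text: str) -> list[str]:
    # Keep runs of word characters as tokens, and keep punctuation
    # marks as separate, standalone tokens.
    return re.findall(r"\w+|[^\w\s]", text)

print(basic_tokenize("Semantic Search improves search quality."))
# ['Semantic', 'Search', 'improves', 'search', 'quality', '.']
```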

In the past, in the context of traditional search, or token-based search, the system worked by dividing the text into these tokens. These were then used to create a type of representation called a sparse embedding. Think of this type of embedding as a long list that shows how many times each word or subword appears in a text.

The main characteristic here is that sparse embeddings don't consider the meaning of words, only the frequency of their appearances. It's like a library index, where you look things up by exact keywords.

To illustrate again, let's imagine we have a phrase like the one in the example above. It is then "tokenized" (divided into tokens) so that the system can index it and compare it with the exact words in your query. There are classic algorithms used to generate sparse embeddings, such as TF-IDF (Term Frequency-Inverse Document Frequency), BM25, or SPLADE.

TF-IDF, for example, gives more weight to words that are frequent in a specific document but rare in the overall corpus, highlighting their importance to that document. But in general, all of these algorithms rely on word frequency, not meaning.
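If you want to see a sparse embedding take shape, here is a small sketch using scikit-learn's TfidfVectorizer (an assumed dependency; any TF-IDF implementation would do, and the corpus here is made up). Note how most of the resulting matrix is zeros:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    "Semantic Search improves search quality.",
    "Keyword search counts exact word matches.",
    "Search engines index documents as tokens.",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(corpus)  # sparse matrix: documents x vocabulary

print(vectorizer.get_feature_names_out())  # one dimension per vocabulary term
print(tfidf.toarray().round(2))            # mostly zeros: a *sparse* embedding
```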

Historically, search was more deterministic, meaning content was indexed exactly as it was received, without much interpretation by the algorithms. Documents were "decomposed" in a way we call "lexical," basically counting the distribution of words. This contrasts with the information itself, which is semantic. And to arrive at semantics, another concept is essential: embeddings!

What are (dense) embeddings?

An embedding, on the other hand, is a numerical representation of words or text, specifically as numerical vectors, that captures semantic relationships and contextual information. Imagine each word or text as a point on a multidimensional "map," where proximity between points indicates similarity of meaning. The distance and direction between these vectors encode the degree of semantic similarity between the words.

[Image: word embeddings in a 3D vector space, an artistic representation of the vectors used in the embedding process.]

So why do we need this, you might ask?

Because most machine learning algorithms cannot process raw text directly: they need numbers as input. This is where embeddings come in.
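As a toy illustration (with made-up 3-dimensional vectors; real embedding models use hundreds or thousands of dimensions), this is what "proximity on the map" looks like as a computation:

```python
import numpy as np

# Hypothetical 3-dimensional embeddings, invented for illustration only.
film   = np.array([0.9, 0.1, 0.3])
cinema = np.array([0.8, 0.2, 0.4])
banana = np.array([0.1, 0.9, 0.0])

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    # Cosine similarity: close to 1.0 means the vectors point the same
    # way (similar meaning); close to 0.0 means unrelated.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine(film, cinema))  # ~0.98: close in meaning
print(cosine(film, banana))  # ~0.21: far apart in meaning
```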

So, these embeddings are created by embedding models (which deserve an article of their own), trained by scanning huge volumes of text, such as the whole of Wikipedia, and that's where the term LLM comes from: Large Language Model. This enormous volume of text is what allows these models to learn the relationships between words and their contexts.

This process involves:

  1. Pre-processing: tokenization and removal of "stop words" (common words like "the", "a", "and") and punctuation.
  2. Sliding context window: the model identifies target words and their surrounding context words so that it can learn the relationships between them (see the sketch after this list).
  3. Training: the model is trained to predict words based on their context, placing semantically similar words close to each other in the vector space. The model's parameters are adjusted to minimize prediction errors.
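Here is a minimal sketch of step 2, the sliding context window. The window size and the tokens are hypothetical; the point is the shape of the (target, context) pairs a Word2Vec-style model learns from:

```python
def context_pairs(tokens: list[str], window: int = 2):
    """Yield (target, context) pairs the way a Word2Vec-style model sees them."""
    for i, target in enumerate(tokens):
        lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                yield target, tokens[j]

tokens = ["semantic", "search", "improves", "search", "quality"]
for target, context in context_pairs(tokens):
    print(target, "->", context)
```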

These embeddings are also known as dense embeddings, and they are named as such because the vectors that represent them contain mainly non-zero values, unlike sparse embeddings. It took me a long time to understand this concept, but simplifying the story to my understanding: instead of a huge, mostly-zero list of word counts, the information is compressed into far fewer dimensions, almost all of which carry meaningful values, which helps in several aspects, including performance. From what I understand, I eliminate the 0s and keep only the meaningful values.

Please correct me in the comments if I've said anything wrong here.

But what matters for our article is that they are extremely effective at giving models an understanding of the meaning and context of words.

For example, in a system that uses one of these models, a search for "film" might also return relevant results with "cinema" or "feature film," because the embedding model understands that these words have similar meanings. This significantly improves the quality of the search.
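As a sketch of this behavior, here is how you could measure it yourself with the open-source sentence-transformers library and one publicly available model (assumed dependencies; this is an illustration, not the model Google uses):

```python
# pip install sentence-transformers
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

query = "film"
candidates = ["cinema", "feature film", "banana bread recipe"]

query_vec = model.encode(query)
cand_vecs = model.encode(candidates)

# Cosine similarity between the query and each candidate.
scores = util.cos_sim(query_vec, cand_vecs)[0]
for text, score in zip(candidates, scores):
    print(f"{text}: {score.item():.2f}")  # "cinema" scores far above the recipe
```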

Embeddings in information retrieval: an old topic

Google has been incorporating this technology into its search engine for years!

RankBrain, launched back in 2015, was the first deep learning system implemented in search, and at that time it already helped to understand how words relate to concepts.

In 2018, Neural Matching made it possible to understand how queries relate to pages by looking at the entire query or page, and not just keywords.

BERT, in 2019, was a major breakthrough in understanding natural language, helping to understand how combinations of words express different meanings and intentions.

And MUM, from 2021, was launched as a breakthrough: a thousand times more powerful than BERT, capable of understanding and generating language, multimodal (text, images, etc.), and trained across 75 languages. This marked the beginning of multimodal search, meaning that various types of content, not just text, were transformed into embeddings. Transformed, does that remind you of anything?

To optimize this process, the documents themselves are also decomposed into vector embeddings for indexing. So, shall we organize all of this into a table to understand it better? That's what I did to make sense of it.

Fundamental differences between tokens and embeddings:

| Feature | Token | Embedding (Dense) |
| --- | --- | --- |
| Representation | Raw text units (words, subwords) | Numerical vectors |
| Focus | Word frequency and text syntax | Semantic meaning and context |
| Similarity | Based on exact keywords and their distribution | Based on proximity of meaning in vector space |
| Main use | Traditional keyword search (lexical search) | Semantic search and AI applications that require understanding of meaning |
| Dimensionality | Can reach tens of thousands of dimensions, with many zeros (sparse) | Generally hundreds or thousands of dimensions, predominantly non-zero (dense) |
| Examples | TF-IDF, BM25, SPLADE | Models like Word2Vec, GloVe, and more recent ones like BERT, MUM, Gemini |

Hybrid search, AI, tokens, and embeddings

The key insight from this change is that, for efficient retrieval with Artificial Intelligence, you don't use just one or the other, but rather a strategic combination: Hybrid Search. If you want to know what that is, click on the link I provided; it will take you to a LinkedIn article that came from research I did on the subject.

In short, hybrid search combines semantic (vector) search with traditional keyword search to meet a very specific need: to find matches outside a given domain of knowledge and to make the system you created and trained understand entities outside of it.

Why do you need a hybrid search?

You'll need this in very specific cases, such as if you're creating an agent that will interact with your customers and they might ask questions outside the domain of knowledge your model was trained on. Think about your business; is this a possibility? Then it's good to be familiar with this search model.

Semantic search, despite being very effective, has a disadvantage: it can have difficulties with information "outside the domain," that is, data on which the embedding model was not trained. Remember when the Claude and ChatGPT prompts leaked and we saw that they perform searches outside the training domain? This compensates for that lack, and it also covers, for example, specific product numbers, new product names, or internal company codes.

In these cases, semantic search comes up empty, because it can only find what it already "knows." And if the user needs something outside of what the model knows, the system resorts to token-based search to fill that gap.

Hybrid search, by integrating semantic search (for more subtle and contextual queries) with traditional keyword search (for specific, out-of-domain terms), seeks the "best of both worlds," ensuring an experience that meets the specific needs of AI models, something Google did not offer before AI Overviews.
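To make the "best of both worlds" idea tangible, here is a minimal sketch of one common fusion strategy, reciprocal rank fusion (RRF). The document IDs and rankings are hypothetical; in a real system they would come from a BM25 index and a vector index respectively:

```python
def reciprocal_rank_fusion(keyword_ranking: list[str],
                           semantic_ranking: list[str],
                           k: int = 60) -> list[str]:
    """Fuse two ranked lists of document IDs. RRF is one common way to
    combine lexical (BM25-style) and vector-search results."""
    scores: dict[str, float] = {}
    for ranking in (keyword_ranking, semantic_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical results: "sku-4821" only surfaces in the keyword ranking
# because the embedding model never saw that product code.
keyword_hits  = ["sku-4821", "doc-a", "doc-b"]
semantic_hits = ["doc-a", "doc-c", "doc-b"]
print(reciprocal_rank_fusion(keyword_hits, semantic_hits))
```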

Could this be why Google took so long to get on board?


So, shall we summarize?

Tokens are the lexical basis of language, while dense embeddings are the numerical representation of its meaning. Modern search, mediated by algorithms and AI, can use both, as in the case of hybrid search, but there is a growing trend to focus on embeddings. The fact that they help models understand context and intent, and increase the "reasoning" capacity of language models, makes choosing them more than obvious.

Part of our job as search specialists is to structure data and content so that these systems can understand them, reason about them, and present them effectively, including in a hyper-personalized way. The era of the AI Agent has arrived, and our next "client" is precisely one of these agents.

Hello, I'm Alexander Rodrigues Silva, SEO specialist and author of the book "Semantic SEO: Semantic Workflow". I've worked in the digital world for over two decades, focusing on website optimization since 2009. My choices have led me to delve into the intersection between user experience and content marketing strategies, always with a focus on increasing organic traffic in the long term. My research and specialization focus on Semantic SEO, where I investigate and apply semantics and connected data to website optimization. It's a fascinating field that allows me to combine my background in advertising with library science. In my second degree, in Library and Information Science, I seek to expand my knowledge in Indexing, Classification, and Categorization of Information, seeing an intrinsic connection and great application of these concepts to SEO work. I have been researching and connecting Library Science tools (such as Domain Analysis, Controlled Vocabulary, Taxonomies, and Ontologies) with new Artificial Intelligence (AI) tools and Large Language Models (LLMs), exploring everything from Knowledge Graphs to the role of autonomous agents. In my role as an SEO consultant, I seek to bring a new perspective to optimization, integrating a long-term vision, content engineering, and the possibilities offered by artificial intelligence. For me, SEO work is a strategy that needs to be aligned with your business objectives, but it requires a deep understanding of how search engines work and an ability to interpret search results.
