What an AI brain taught me
This is the first article in a series I'm writing with the help of one of my agents, built using the same strategy as the agent+Semantic, the service for creating agents specialized in research and content that Semântico SEO offers to anyone who wants to create real content with the help of AI.

In this article, my agent and I investigated a series of videos from the channel “3Blue1Brown,” which were very important to me in my quest to understand how LLMs work. The video series, called Neural Networks, offers a very deep insight into the central mechanisms of neural networks and large language models (LLMs), such as GPT-3.
A significant portion of the video content focuses on the structure and operation of Transformers, detailing the Attention mechanism, the process that allows word vectors to adjust their meanings based on context through query, key, and value weight matrices. When I understood how this works, it opened up a world of understanding and possibilities for how to interact with these models to obtain information, learn, generate content, and share knowledge with them.
And that spark is what gave rise to the Semantic+Agent and this series of articles.
In this first text, I began by asking the model about the central concepts of the Transformer's attention mechanisms, and after some interactions, I decided to focus on the mechanism of progressive adjustment of embeddings and the richer contextual meaning this process generates. I know, it seems complicated, and it really is, but I'll try to make things easier.
A response mechanism that doesn't have pre-set answers?
If you've interacted with a chatbot in the last two years, you might have had the feeling that it works like magic. You ask a complex question and, in seconds, you receive a coherent and creative answer. Not that it's always right, but there's text there that makes sense.
We might be tempted to imagine that what was generated came from an almost unintelligible jumble of words, but in reality it is much more coherent than that. Inside large language models (LLMs) there is an internal mechanism, a system with principles that echo concepts from information science (and I'm biased, but not exclusively).
But to understand how all this works, and to somehow relate it to our SEO work, I had to contain my curiosity and take it step by step. During the Q&A session with my agent, I learned that we can divide our understanding into six parts. That said, here are my notes on what I found.
My 6 surprising discoveries
Finding #1: A word doesn't have a meaning, it has a starting point.

At the beginning of the Transformer process, each word (or "token") in the text receives a numeric vector, a long list of numbers called an "embedding." You've heard that word before, right? That's what it means. But the interesting finding here is that this initial embedding is identical for the same word, regardless of the context in which it appears. Did you know that? I didn't.
An example my agent gave me was this:
Consider the word "mole." In the phrases "the true American mole" and "perform a mole biopsy," the word has two semantically distinct meanings. However, in the first step of the model, the numerical vector for "mole" is exactly the same in both cases.
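To see this in code, here is a minimal sketch in Python. The vocabulary, the random values, and the tiny 8-dimensional vectors (instead of thousands) are all made up; the point is only that the lookup producing the initial embedding depends on the token itself, never on the sentence around it.

```python
import numpy as np

# Toy sketch: a made-up vocabulary and embedding table (8 dimensions
# instead of thousands). The lookup depends only on the token, not its context.
rng = np.random.default_rng(0)
vocab = {"the": 0, "true": 1, "american": 2, "mole": 3, "perform": 4, "a": 5, "biopsy": 6}
embedding_table = rng.normal(size=(len(vocab), 8))

def initial_embedding(token: str) -> np.ndarray:
    """Step one of a Transformer: a context-free vector for each token."""
    return embedding_table[vocab[token]]

# Same word, two very different sentences, identical starting vector.
spy_mole = initial_embedding("mole")    # "the true American mole"
skin_mole = initial_embedding("mole")   # "perform a mole biopsy"
print(np.array_equal(spy_mole, skin_mole))  # True
```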
That initial embedding is just a generic starting point in a gigantic space of meanings. This reminded me a lot of the concept of a semantic field, and I started thinking about the size of the fields the models create for each word they generate.
But the real work of a Transformer is to progressively adjust these embeddings, layer by layer, moving them in this high-dimensional space so that they incorporate a rich, contextual meaning specific to that phrase.
This completely changed my perception of this process. Since the model doesn't "look up" a definition in an internal dictionary, I understood that it constructs meaning in real time, in a process of continuous refinement.
Another example the agent gave me was even more impactful:
Imagine that, at the end of a long mystery novel that concludes with "therefore, the murderer was…", the final vector for the word "was" needs to have absorbed and encoded all the relevant information from the story to be able to correctly predict the murderer's name.
Finding #2: Models ask questions and find answers all the time.
I discovered that LLMs talk to themselves, just like us humans. There's an "attention" mechanism, which is the heart of a Transformer, and it can be understood as a constant internal dialogue. Remember that as the Transformer works, words gain more specific meanings, right? So, each word generates a "Query" vector, which in essence asks a question about the rest of the sentence.
Let's look at another example:
Imagine a noun like "creature." The search for its meaning might generate a query that, in essence, asks something like: "Hey, is there an adjective around here that describes me?"

Other words in the sentence, in turn, generate "Key" vectors, which serve as potential answers. The adjectives "cute" and "blue" would have keys that "answer" affirmatively to the "creature's" question.
The strength of the match between the Query of one word and the Key of another (measured by a mathematical operation called the dot product) determines how relevant one word is to the other in that specific context.
Once relevance is established, the "relevant" word sends its "Value," a packet of information, to update the embedding of the word that asked the question. Yes, I was confused by that part too.
But "attention" is not a foreign process to us, since models emulate our way of thinking. Imagine that this conversation takes place between two areas of your brain. The language part needs to reproduce in speech what the other is thinking. They exchange data between them in real time. As soon as the part that needs to speak finds what it needs to say, the sentence will be spoken.
This is a dynamic process where words ask each other questions, and when they find the most relevant answers, they exchange information to construct a contextualized meaning.
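For the more code-minded, here is a minimal sketch of a single attention head in Python with numpy. The shapes are tiny, the weight matrices are random, and real models add masking, multiple heads, and an output projection, but the Query/Key/Value choreography is the one described above.

```python
import numpy as np

def attention_head(embeddings, W_Q, W_K, W_V):
    """One attention head: every word asks (Q), every word offers answers (K),
    and the strongest matches send information (V) back to the asker."""
    Q = embeddings @ W_Q                               # the questions
    K = embeddings @ W_K                               # the potential answers
    V = embeddings @ W_V                               # the information to be sent
    scores = Q @ K.T / np.sqrt(K.shape[-1])            # dot products: Query/Key match strength
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax
    return embeddings + weights @ V                    # each embedding absorbs the relevant values

# Toy run: three "words" (say "cute", "blue", "creature") in 4 dimensions.
rng = np.random.default_rng(1)
emb = rng.normal(size=(3, 4))
W_Q, W_K, W_V = (rng.normal(size=(4, 4)) for _ in range(3))
print(attention_head(emb, W_Q, W_K, W_V).shape)        # (3, 4): same shape, now context-aware
```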
Finding #3: “Meaning” is merely a direction in space.
This has blown my mind since the first time I heard about it. How can meaning be a direction in space? And what's more: multidimensional!
But let's take it slowly. Consider that word embeddings are not just lists of random numbers; they exist in a very high-dimensional vector space (12,288 dimensions in the case of GPT-3; GPT-4's dimensionality hasn't been published, though its parameter count has been estimated at around 1.5 trillion). The most fascinating thing is that directions in this space correspond to semantic concepts and meanings.
Let's use the image below as a basis for imagination. Notice that it doesn't show all the dimensions; it's a flat projection that uses only two of them. Each of these colored arrows represents a possible meaning for a word. The models use these vectors to calculate the possible meanings we discussed in the previous finding.

But let's look at an example my agent gave me:
The classic example is “vector arithmetic” with words. It has been found that the direction in space from the vector for “man” to the vector for “woman” is very similar to the direction from “king” to “queen”. Conceptually, this can be expressed as: vector(king) - vector(man) + vector(woman) ≈ vector(queen). This demonstrates that the model, during training, learned to encode an abstract concept like “gender” as a specific geometric direction.
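Here is a tiny Python sketch of that arithmetic. The vectors are hand-made three-dimensional stand-ins (one axis loosely playing the role of "gender", another of "royalty"), not real embeddings, but the nearest-neighbor logic is the same as in the classic experiment.

```python
import numpy as np

# Illustrative only: made-up 3-D "embeddings" in which the first axis roughly
# encodes gender and the second encodes royalty.
vectors = {
    "man":   np.array([ 1.0, 0.0, 0.2]),
    "woman": np.array([-1.0, 0.0, 0.2]),
    "king":  np.array([ 1.0, 1.0, 0.3]),
    "queen": np.array([-1.0, 1.0, 0.3]),
}

def cosine(a, b):
    """How closely two vectors point in the same direction."""
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

result = vectors["king"] - vectors["man"] + vectors["woman"]
nearest = max(vectors, key=lambda w: cosine(result, vectors[w]))
print(nearest)  # "queen"
```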
This concept is very complicated to explain in words, so I'm going to recommend this video to you. It's how I finally learned it:
But the most important idea I want to convey to you is how directions in this high-dimensional space of all possible embeddings can correspond to semantic meaning.
The depth of this is mind-boggling. Come with me!
Concepts are not stored in a dictionary, but as geometric relationships. For those from Information Science, this is a fascinating echo of the pillars of knowledge organization, such as taxonomy and ontology, except that here the structure emerges in a purely mathematical way, without any human curation. This has strengths and weaknesses.
This is why the use of tools such as Graphs and Ontologies as guides for the models has shown such encouraging results: they act as preliminary guides so that the models do not have to converse so much before finding the most relevant meanings.
Finding #4: A fact can be a simple "On/Off" switch.
One of the dozens of questions I asked my agent was whether the models store these conclusions about the best meanings of words somewhere. Say I ask something like: does Michael Jordan play basketball?
Where does an LLM store that concrete fact? I discovered that the most recent research suggests that these facts “live” in blocks of the network called multilayer perceptrons (MLPs). And the way they do this is very simple.
I will, again, transcribe the example I received:
Imagine that one of the "rows" in the first matrix of an MLP has been specifically trained to detect the simultaneous presence of the "Michael" and "Jordan" embeddings in an input vector. If both concepts are present and aligned with that row, a specific "neuron" is activated (its value becomes positive). If not, it remains inactive (zero value). Essentially, it works like an "AND" logic gate, which only fires a "true" signal when both conditions are met.
It's like an electrical switch; when both sides are touched, it turns on a light.
And what happens when that neuron fires is even more interesting. Imagine a corresponding "column" in the second matrix of the MLP (I imagined an Excel spreadsheet to make it easier), which in turn was trained to represent the direction of the "basketball" concept. When the neuron fires, this information is added to the original vector (remember: that giant line of numbers).
The result of this interaction is that now, when passing through this block, the embedding that represented "Michael Jordan" also contains the information "basketball." It's as if the attribute "basketball player" is added to the entity Michael Jordan, gaining context.
This is a counterintuitive mechanic for me, but it's brilliant at the same time. Something from the real world (an entity) is broken down into an almost mechanical operation: a conditional trigger that, when activated, adds a new vector of meaning to the flow of information. This happens in milliseconds.
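Here is a hypothetical sketch of that switch in Python. The "concept directions" are hand-picked orthogonal axes so the arithmetic is easy to follow; in a real model they are learned, far messier, and spread across many neurons.

```python
import numpy as np

# Hand-picked, orthogonal "concept directions" in a tiny 8-dimensional space.
dim = 8
michael_dir    = np.eye(dim)[0]   # direction standing in for the concept "Michael"
jordan_dir     = np.eye(dim)[1]   # direction standing in for the concept "Jordan"
basketball_dir = np.eye(dim)[2]   # direction to be added when the fact fires

# One "row" of the first MLP matrix acts as an AND gate over the two concepts.
detector_row = michael_dir + jordan_dir
bias = -1.0   # only fires when BOTH concepts are present (score 2 beats 1)

def mlp_fact_block(x):
    neuron = max(0.0, detector_row @ x + bias)   # ReLU: the on/off switch
    return x + neuron * basketball_dir           # corresponding "column": adds basketball

full_name  = michael_dir + jordan_dir            # embedding carrying both concepts
first_only = michael_dir                         # only "Michael" present

print(mlp_fact_block(full_name) @ basketball_dir)   # 1.0 -> the fact was written into the vector
print(mlp_fact_block(first_only) @ basketball_dir)  # 0.0 -> the switch stays off
```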
Finding #5: The model stores more ideas than it has space for.
Here's something that challenges my three-dimensional intuition: in an n-dimensional space, I would think you can only store n independent ideas or features (that is, directions perpendicular to each other). If I have a 1,000-dimensional world, my mind says, I can only store 1,000 ideas. But that's not how it works.
This is true in our 3-dimensional world, but this rule is broken in very, very high dimensions.
This phenomenon is called "superposition." In high-dimensional spaces, such as the embedding space of an LLM, it is possible to embed an exponentially larger number of vectors that are "almost perpendicular" to each other (for example, all with angles between 89 and 91 degrees).
Remember the image of the vectors in the graph? Imagine that they are close to each other by distances smaller than a hair's width, to use a physical example.
I mentioned GPT-3, which has a space of 12,288 dimensions. It's not limited to storing only 12,288 distinct features but can store orders of magnitude more. This means that a single conceptual feature, such as 'plays basketball' or 'is a famous athlete', may not be represented by a single artificial neuron, going back to our brain example.
Instead, the model can represent millions of features as specific combinations (overlays) of many neurons, allowing for an information density that challenges our way of thinking (mine at least).
This is perhaps the strangest aspect of the "intelligence" inherent in Machine Learning. It operates on a geometry that our minds have difficulty comprehending, and allows for a density of information that seems to violate our rules of space and information organization.
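If you want to convince yourself of the "almost perpendicular" claim, here is a small numpy experiment. It uses random vectors rather than learned features, so it's purely illustrative, but it shows how higher dimensions squeeze random directions ever closer to 90° from each other.

```python
import numpy as np

rng = np.random.default_rng(3)

def angle_spread(dim, n_vectors=1000):
    """Angles (in degrees) between one random unit vector and many others."""
    V = rng.normal(size=(n_vectors, dim))
    V /= np.linalg.norm(V, axis=1, keepdims=True)   # make them unit vectors
    cosines = V[1:] @ V[0]
    angles = np.degrees(np.arccos(cosines))
    return angles.min(), angles.max()

for dim in (3, 100, 12288):
    lo, hi = angle_spread(dim)
    print(f"{dim:>6} dims: angles range from {lo:.1f} to {hi:.1f} degrees")
# In 3 dimensions the angles are all over the place; at 12,288 dimensions almost
# every pair lands within a couple of degrees of 90, which is why far more than
# 12,288 "almost perpendicular" feature directions can coexist in that space.
```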
Finding #6: Learning is, literally, just rolling downhill.
At the start of training, all 175 billion parameters of a model like GPT-3 are random. If you asked it to generate text, it would produce only incomprehensible "garbage." The "learning" process is simply a method to correct this initial mess.
To do this, we define a "cost function"—a single number that measures "how bad" the network is at its task (for example, predicting the next word correctly). This number is the average error of the model across tens of thousands of training examples. The goal of training is simple: to minimize this number.
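To make that "single number" concrete, here is a minimal sketch with made-up predictions and targets: the cost is just the average of the individual errors, and training exists only to push it down.

```python
import numpy as np

# Made-up model outputs vs. correct answers for a tiny batch of examples.
predictions = np.array([0.9, 0.2, 0.7, 0.4])
targets     = np.array([1.0, 0.0, 1.0, 1.0])

# The cost function: one number saying "how bad" the model is, on average.
cost = np.mean((predictions - targets) ** 2)
print(round(float(cost), 3))  # 0.125
```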

The technique used is called "gradient descent," and the best analogy is visual. Here's an example I received:
Imagine the cost function as a mountainous landscape, full of hills and valleys. The training process is like placing a ball at the top of a hill and simply letting it roll down to the nearest valley. At each step, an algorithm (backpropagation) calculates the direction of the "steepest descent" and slightly adjusts all 175 billion parameters in that direction to reduce the cost, that is, to make the ball roll a little further downhill.
This metaphor demystifies the so-called "machine learning." In reality, there is no understanding, only a mathematical optimization process, repeated trillions of times on a massive volume of data. The algorithm tirelessly adjusts the parameters to find a "valley" (a local minimum) where the model's performance on the training data is good.
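And here is the "ball rolling downhill" itself, reduced to a single made-up parameter instead of 175 billion. The landscape is invented, but the loop is the whole idea of gradient descent: compute the slope, take a small step against it, repeat.

```python
# A made-up one-parameter cost landscape standing in for the real one.
def cost(w):
    return (w - 3.0) ** 2 + 1.0        # a valley whose bottom sits at w = 3

def gradient(w):
    return 2.0 * (w - 3.0)             # the slope of the landscape at w

w = 10.0                                # the "ball" starts somewhere arbitrary
learning_rate = 0.1
for step in range(100):
    w -= learning_rate * gradient(w)    # a small step in the steepest-descent direction

print(round(w, 3), round(cost(w), 3))   # w is approximately 3.0, cost 1.0: the valley floor
```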
So the question remains: what does it mean for an LLM to "understand"?
We began this article discussing static vectors, which are merely starting points, and arrived at a dynamic process of meaning-making. Along the way, I showed how AI "thinks" through an exchange of questions and answers, vector additions, and logical triggers, all orchestrated and fine-tuned by an optimization process that resembles a ball rolling downhill.
While I may have gained some more certainty by the end of this article, some ideas still swirl around in my head: if "meaning" can be constructed through geometric operations in a high-dimensional space, and if "learning" is simply the mathematical way of minimizing an error function on a bizarrely colossal scale, are we one miscalculation away from complete disaster?
Is this what we call hallucinating? If so, is hallucinating a mistake or simply something expected in the process of going downhill?
I end this text promising you a second part, perhaps bringing answers to these questions of mine.