LLMs: Self Attention
Hello! Today we will be covering one of the most fascinating mechanisms that changed AI forever: "Self Attention". We will first build some intuition for why it works before jumping directly to the implementation.
I hope you have read my previous blog on Positional Embeddings, where we understood why "environment" information is also needed alongside "positional" information when generating a sentence embedding.
Let me start with an analogy.
Analogy
Let's say you have purchased "n" items that you want to place inside your room. But you are unable to decide on your own which items will go well together such that your "overall" room looks practical or "meaningful".
So you hired an interior designer, because they have "enough experience" in this field, having already designed so many houses.
Let's say you have the following items-
Television
Lamp
Recliner
Bed
Now, how will this "interior designer" work? There will be some "structure/methodology" that he might be following to make his job easy, right?
Let's see how this person will work.
Firstly, he tries to "describe" what exactly an item is "looking for", i.e., the "query" for this item
Television: I am looking for something where people can sit
Lamp: I am looking for something where low light is needed
Recliner: I am looking for something where people can do something while sitting
Bed: I am looking for something that can help people while relaxing
Note- The interior designer was able to put out these descriptive queries, since he has "enough experience" gained from past projects
After that, he tries to "describe" what exactly a given item is "bringing into" the room, i.e., the "key" factor it is trying to fulfill
Television: I provide entertainment when someone is idle or bored
Lamp: I provide dim lighting so that people can relax
Recliner: I provide people an entity to sit
Bed: I provide people an entity to relax or sleep
Note- The interior designer was able to put out these descriptive features, since he has "enough experience" gained from past projects
And at last, he tries to "describe" how to handle or "install" each item such that you get maximum "value" out of it
Television: Place the television on the wall hinge with minimal reflection
Lamp: Place the lamp where it can be reached without any movement
Recliner: Place the recliner such that there is ample space in front to extend it
Bed: Place the bed where there is ample space on the sides
Note- The interior designer was able to put out these descriptive insights, since he has "enough experience" gained from past projects
Now, once he has noted down the "query", "key", and "value" for each item, he will try to figure out the "affinity" of two items based on the query and key, i.e.
Television:
Query -> I am looking for something where people can sit
Candidate Keys:
Lamp: I provide dim lighting so that people can relax
Recliner: I provide people an entity to sit
Bed: I provide people an entity to relax or sleep
For the television's query, he can easily determine which particular item has the highest affinity:
Television -> Lamp : Low Affinity
Television -> Recliner : High Affinity
Television -> Bed : Medium Affinity
Similarly, for the bed:
Bed:
Query -> I am looking for something that can help people while relaxing
Keys->
Television: I provide entertaintment when someone is idle or bored
Lamp: I provide dim lighting so that people can relax
Recliner: I provide people an entity to sit
Bed -> Television : Low Affinity
Bed -> Lamp : High Affinity
Bed -> Recliner : Medium Affinity
So, at the end, he came up with these pairs or "combinations" by just picking the highest-affinity ones:
Television & Recliner
Bed & Lamp
Now, before installing them, he will refer to the "value" insights that he noted earlier, i.e.
Television: Place the television on the wall hinge with minimal reflection
Lamp: Place the lamp where it can be reached without any movement
Recliner: Place the recliner such that there is ample space in front to extend it
Bed: Place the bed where there is ample space on the sides
Now, combining the "affinity" and "values", the room eventually looks like:
A room with a bed at the center, a lamp on the side table, and a recliner facing the television mounted on the wall in the other half of the room.
So, he was able to place the items so that the overall room makes meaningful use of them.
If I quote this in a different way,
"He was able to describe the room in a way that none of the items lose their meaning, i.e., without any information loss." From the above description, you can easily determine which items are there, how they relate, and so on.
You might be wondering why we went through all of this, as this blog is not about mastering "interior design".
Well, while performing the above steps, you just did the actual "self-attention" mechanism!
In the above analogy:
Room -> Sentence
Items -> Tokens
Describe -> Projection
Query, Key, Value of Items -> Query, Key, Value of Tokens
Interior Designer -> Attention Block
Now we have the intuition behind "self-attention". Let's take a look at it formally.
Self Attention
So in our last blog, we learned about "position-aware" embeddings, a.k.a. positional embeddings, i.e., for each token in a sentence, we were able to generate its corresponding embedding while also storing the positional information.
So, let's say you have a sentence with "n" tokens and you want to embed it in a "d" dimensional embedding
So until self-attention, we were able to generate
Sentence -> Token(1) Token(2) Token(3) Token(4) .............. Token(n)
Position Aware Embeddings:
Token1 -> [0.1, 0.6, 0.9 ........ 'd']
Token2 -> [0.3, 0.1, 0.5 ........ 'd']
Token3 -> [0.9, 0.7, 0.8 ........ 'd']
...........................
...........................
...........................
Let's start correlating it with our analogy.
The interior designer first wrote down the key, query, and value insights for each item.
Now, we will have to do the same thing for these tokens, i.e., generating "query", "key", and "value" projections for these tokens
Now, how exactly to do this? Notice the word "projection" here.
Well, you guessed it right. As we saw in Token Embeddings, matrix operations are used to perform any kind of projection. And a matrix multiplication is exactly what a neural network's linear layer computes.
So, we want a matrix of dimension "d x d", where d is the "embedding dimension", such that once we multiply our current "position-aware" embeddings with it, we get the required projections for it, i.e.
query projection
key projection
value projection

Here, once we multiply our initial matrix [1 x d] with the projection matrix [d x d], we get the "transformed/enriched" projections depending on our need, i.e.
Since we need 3 projections, which are "key", "query", and "value", we will have 3 projection matrices corresponding to these, and once we multiply our token embedding with each of them, we will get our required projections
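To make this concrete, here is a minimal sketch (toy dimensions; the random matrices stand in for projection weights that would normally be learned):

```python
import torch

torch.manual_seed(0)

n, d = 4, 8                    # toy values: 4 tokens, embedding dimension 8
x = torch.randn(n, d)          # position-aware token embeddings [n x d]

# one [d x d] projection matrix per role (random here; learned in practice)
W_q = torch.randn(d, d)
W_k = torch.randn(d, d)
W_v = torch.randn(d, d)

Q = x @ W_q                    # query projections [n x d]
K = x @ W_k                    # key projections   [n x d]
V = x @ W_v                    # value projections [n x d]

print(Q.shape, K.shape, V.shape)  # three tensors of shape [4, 8]
```

Note how each projection keeps the [n x d] shape: every token still has one row, but the row now lives in a "query", "key", or "value" space.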

Now, you might be wondering how these projection matrices have the values such that, when multiplied by them, we get the desired projections.
If you go back to our analogy, our interior designer was able to do so because he had enough experience. Similarly, in this case, we will train these projection matrices so that the values are tuned to our use case.
Note- I'll be denoting Query - Q, Key - K, and Value - V
So for the sentence with "n" tokens, our effective Q, K, and V matrices post-projections will look like:

If we continue with our analogy, once the interior designer had "query", "key", and "value" for each item, he tried to figure out the affinity using "query" and "key", and picked the one with the highest affinity and used the "value" corresponding to it.
In our case, we have the Q, K, and V matrices representing the corresponding values for each token, and to compute the affinity, we multiply the Q and K matrices, yielding "weightage/affinity" scores for each token with respect to the other tokens.
Now, since Q and K are both of dimension [n x d], we cannot multiply them directly, so we will take the "transpose" of the Key matrix so that it becomes of dimension [d x n]

Think of a score matrix as a matrix containing affinity or attention scores of a given token with respect to other tokens.
In our analogy, we picked the item with the highest affinity. In self-attention, we instead take a "weighted" sum, i.e., we multiply this score matrix [n x n] with the "value" matrix [n x d]

The output matrix is of dimension [n x d]; you can think of each row in this matrix as the "context-aware" embedding of the corresponding token.
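Putting the score and weighted-sum steps together, a minimal numeric sketch (toy dimensions, random tensors standing in for the projected Q, K, and V) looks like:

```python
import torch

torch.manual_seed(0)

n, d = 4, 8
Q = torch.randn(n, d)          # query projections [n x d]
K = torch.randn(n, d)          # key projections   [n x d]
V = torch.randn(n, d)          # value projections [n x d]

# affinity of every token with every other token: [n x d] @ [d x n] -> [n x n]
scores = Q @ K.transpose(0, 1) / d ** 0.5
weights = torch.softmax(scores, dim=-1)   # each row now sums to 1

# weighted sum of the value rows -> context-aware embeddings [n x d]
attended = weights @ V
print(attended.shape)                     # torch.Size([4, 8])
```

The softmax turns each row of raw affinities into a probability distribution, so each output row is a convex combination of all the value rows.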

If I formulate what we just did, it will be represented as:

Attention(Q, K, V) = softmax(Q · Kᵀ / √d) · V

Note- We also divide the scores by √d before applying the softmax; this keeps the values in a stable range so the softmax does not saturate.
At the end, in our analogy, the interior designer puts everything together. Similarly, we take the mean of our output matrix [n x d] to get the sentence embedding of dimension "d"
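As a tiny sketch of this pooling step (a random tensor standing in for the attention output):

```python
import torch

torch.manual_seed(0)

n, d = 4, 8
attended = torch.randn(n, d)               # context-aware token embeddings [n x d]
sentence_embedding = attended.mean(dim=0)  # average over the n token rows
print(sentence_embedding.shape)            # torch.Size([8])
```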

Cool, so we were able to generate an embedding for a sentence and have built a strong foundation in self-attention.
Now, let's see how we implement and train it so that we have tuned "Q K V" weight matrices to create projections for any token.
Implementation
Self Attention Block:
As discussed, we will have matrices of dimension "d x d" for the Query, Key, and Value projection matrices. Since a linear layer is just a learned matrix multiplication, each Q, K, and V matrix will be a linear layer with input and output size "d".
In the forward pass, we will get corresponding Q K V values for our sentence tokens, after which we will apply the self-attention formula, i.e.

import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.d_model = d_model
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)
        # normalization layer
        self.ln = nn.LayerNorm(d_model)

    def forward(self, sentence_tokens):
        q = self.query(sentence_tokens)
        k = self.key(sentence_tokens)
        v = self.value(sentence_tokens)
        scaling_factor = self.d_model ** 0.5
        attention_scores = torch.matmul(q, k.transpose(-2, -1)) / scaling_factor
        attention_weights = torch.softmax(attention_scores, dim=-1)
        attended_values = torch.matmul(attention_weights, v)
        normalized = self.ln(attended_values + sentence_tokens)
        return normalized

Embedding Model:
In our model, we first have an "embedding_layer", which is a single-layer neural network that specializes in generating Token Embedding
After that, we have an attention layer, which takes these token embeddings and generates the "context aware" embeddings
We also have a method to get "sinusoidal positional embeddings" (It's not a model, as it's a deterministic mathematical function)
In the forward pass
First, we get the token embeddings for our sentence
Then we supply it with the self-attention block
At last, we take the average of all "attended" token embeddings
import math

class EmbeddingModel(nn.Module):
    def __init__(self, tokenizer):
        super().__init__()
        self.tokenizer = tokenizer
        self.embedding_layer = nn.Embedding(VOCAB_SIZE, EMBEDDING_DIM)
        # List of attention layers
        self.attention_layers = nn.ModuleList([
            SelfAttentionBlock(EMBEDDING_DIM)
        ])

    def _get_positional_embedding(self, seq_length):
        # Creates a [seq_length, EMBEDDING_DIM] matrix
        position = torch.arange(seq_length, dtype=torch.float, device=device).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, EMBEDDING_DIM, 2, dtype=torch.float, device=device)
            * (-math.log(10000.0) / EMBEDDING_DIM)
        )
        pe = torch.zeros(seq_length, EMBEDDING_DIM, device=device)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        return pe

    def forward(self, sentence):
        token_ids = self.tokenizer.encode(sentence, out_type=int)
        token_ids = token_ids[:MAX_SEQ_LEN]
        x = torch.tensor(token_ids, dtype=torch.long, device=device)
        x = self.embedding_layer(x)
        x = x + self._get_positional_embedding(x.size(0))
        for attn in self.attention_layers:
            x = attn(x)
        return torch.mean(x, dim=0)

Training
Remember how we performed training for Token Embeddings, where we followed the "training for a proxy" paradigm since we did not have a "supervised" dataset.
Similarly, since we don't have a "supervised" dataset present, i.e.
sentence | embedding
-------------------------------------------------
blah blah blah | [0.67, 0.543, 0.21, 0.21]
................ | .....................
................ | ....................
We will train for a "proxy" task, a.k.a. "unsupervised" training, where our task would be:
"Given a target sentence with the corresponding similar and non-similar pair, compute a confidence score."
Dataset:
main_sentence | similar_sentence | non-similar sentence
------------------------------------------------------------------
tom chase jerry | jerry likes cheese | harry killed voldemort
................ | ..................... | ...................
................ | .................... | ...................
Such a dataset is easy to generate, since the internet is filled with text, and it's a fair assumption that sentences appearing close together are similar, and vice versa.
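One possible way to build such triplets from a plain list of sentences, sketched below. Note that `make_triplets`, the window size, and the toy corpus are all hypothetical illustrations, not code from this blog:

```python
import random

def make_triplets(corpus, window=1, seed=0):
    """Treat sentences within `window` positions as similar,
    and a randomly drawn distant sentence as non-similar."""
    rng = random.Random(seed)
    triplets = []
    for i, sentence in enumerate(corpus):
        # distant sentences are candidates for the non-similar slot
        far = [k for k in range(len(corpus)) if abs(k - i) > window]
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j == i or not far:
                continue
            triplets.append((sentence, corpus[j], corpus[rng.choice(far)]))
    return triplets

corpus = [
    "tom chases jerry",
    "jerry hides in the wall",
    "harry waved his wand",
    "voldemort was defeated",
]
for main, similar, non_similar in make_triplets(corpus)[:2]:
    print(main, "|", similar, "|", non_similar)
```

On a real corpus you would apply the same idea per document, so that "closeness" never crosses document boundaries.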
Proxy Task:
Here, we use our model to generate embeddings for the triplets and then compute the confidence score by taking the dot product and normalizing it
class ProxyTask:
    def __init__(self, embedding_model):
        self.embedding_model = embedding_model

    def execute(self, main_sentence, similar_sentence, non_similar_sentence):
        main_emb = self.embedding_model(main_sentence)
        sim_emb = self.embedding_model(similar_sentence)
        non_sim_emb = self.embedding_model(non_similar_sentence)
        pos_dot = torch.dot(main_emb, sim_emb)
        neg_dot = torch.dot(main_emb, non_sim_emb)
        pos_conf = torch.log(torch.sigmoid(pos_dot))
        neg_conf = torch.log(torch.sigmoid(-neg_dot))
        confidence = pos_conf + neg_conf
        return confidence

Backpropagation (Gradient Descent)
Now, we will start the training loop for our proxy task
Here, our loss function becomes the negation of the confidence score, i.e., low confidence equals high loss and vice versa
After that, we backpropagate the loss via gradient descent, i.e., tuning the value of each parameter, which also includes the Q K V matrices
EPOCHS = 100
LEARNING_RATE = 1e-4

optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
dataloader = DataLoader(dataset, batch_size=1, shuffle=True)

for epoch in range(EPOCHS):
    total_loss = 0.0
    for i, (main_batch, similar_batch, non_similar_batch) in enumerate(dataloader):
        # DataLoader wraps single items in tuples/lists, so we unpack the strings
        main_sentence = main_batch[0]
        similar_sentence = similar_batch[0]
        non_similar_sentence = non_similar_batch[0]

        optimizer.zero_grad()
        confidence = task.execute(main_sentence, similar_sentence, non_similar_sentence)
        loss = -confidence
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

        # Print progress every 100 steps
        if (i + 1) % 100 == 0:
            print(f"Epoch [{epoch+1}/{EPOCHS}], Step [{i+1}/{len(dataloader)}], Loss: {loss.item():.4f}")

    avg_loss = total_loss / len(dataloader)
    print(f"--- Epoch {epoch+1} completed. Average Loss: {avg_loss:.4f} ---")

Once our training loop completes, we have a trained "embedding model" with a trained attention block, i.e., tuned values of the Q, K, and V matrices.
Now, for any new sentence, you can use this trained model to generate an embedding
embedding = model.forward("jerry is in trouble")

Multi Layer Self Attention
If you look at the code of our embedding model, we used just a single attention layer, i.e.
self.attention_layers = nn.ModuleList([
    SelfAttentionBlock(EMBEDDING_DIM)
])

As we know, real-life use cases can be quite complex, which demands more non-linearities. We can fuse "multiple" attention blocks together; for example, here we have 3 attention blocks stacked one after the other, where the output of one block is supplied to the next as input
self.attention_layers = nn.ModuleList([
    SelfAttentionBlock(EMBEDDING_DIM),
    SelfAttentionBlock(EMBEDDING_DIM),
    SelfAttentionBlock(EMBEDDING_DIM)
])

And this is called "Multi Layer Self Attention."
In a nutshell,

Multi-Head Self Attention
Similarly, to add more non-linearities, instead of just stacking attention blocks one after the other, we can also have "a group of attention blocks", also called "heads", working in parallel.

Here, the results of each parallel block are merged and sent to the subsequent layer.
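As an illustrative sketch (not code from this blog) of one common way to wire this up: split the d_model-wide projections into num_heads slices, attend within each slice in parallel, then concatenate the heads and mix them with a final linear layer:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)   # mixes the merged heads

    def forward(self, x):                        # x: [n, d_model]
        n = x.size(0)
        # project, then split the last dimension into heads -> [num_heads, n, d_head]
        q = self.query(x).view(n, self.num_heads, self.d_head).transpose(0, 1)
        k = self.key(x).view(n, self.num_heads, self.d_head).transpose(0, 1)
        v = self.value(x).view(n, self.num_heads, self.d_head).transpose(0, 1)
        # each head attends independently over its own slice
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        weights = torch.softmax(scores, dim=-1)
        attended = weights @ v                   # [num_heads, n, d_head]
        # concatenate the heads back into [n, d_model] and mix them
        merged = attended.transpose(0, 1).reshape(n, -1)
        return self.out(merged)

mha = MultiHeadSelfAttention(d_model=8, num_heads=2)
output = mha(torch.randn(5, 8))
print(output.shape)  # torch.Size([5, 8])
```

Each head sees a smaller d_head-dimensional slice, so different heads are free to specialize in different kinds of token-to-token relationships.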
Note- The basic idea of attention remains the same in every case; these are just different ways of combining multiple blocks to add more non-linearities.
Transformers
Well, it might surprise you, but we just designed the "encoder" part of the "Transformer" architecture, on which a ton of embedding models, like BERT, are based.

You might notice another section in this architecture, the "Decoder", on which all the text-generation models like GPT, Llama, etc. are based. We will cover that in the next blog, where we will take a deep dive into the fascinating world of decoders and unveil the magical capabilities they bring.
I hope you now have a strong intuition about Self Attention and are no longer scared.
Note- I skipped some parts, like pre-/post-normalization placement, residual connections, etc., since the primary aim is to build intuition, and these are more of an optimization.
Thank you for your attention!
Stay Tuned....


