LLMs: Self Attention
Hello! Today we will be covering one of the most fascinating mechanisms that changed AI forever: "Self Attention". We will first build some intuition for why it works before jumping directly to the implementation.
I hope you have read my previous blog on Positional Embeddings, where we understood why "environment" information is also needed alongside "positional" information when generating a sentence embedding.
Let me start with an analogy.
Analogy
Let's say you have purchased "n" items that you want to place inside your room. But you are unable to decide on your own which items will go well together such that your "overall" room looks practical or "meaningful".
So you hired an interior designer, because they have "enough experience" in this field, having already designed so many houses.
Let's say you have the following items-
Television
Lamp
Recliner
Bed
Now, how will this "interior designer" work? There will be some "structure/methodology" that he might be following to make his job easy, right?
Let's see how this person will work.
Firstly, he tries to "describe" what exactly an item is "looking for", i.e., the "query" for this item
Television: I am looking for something where people can sit
Lamp: I am looking for something where low light is needed
Recliner: I am looking for something where people can do something while sitting
Bed: I am looking for something that can help people while relaxing
Note- The interior designer was able to put out these descriptive queries, since he has "enough experience" gained from past projects
After that, he tries to "describe" what exactly a given item is "bringing into" the room, i.e., the "key" factor it is trying to fulfill
Television: I provide entertainment when someone is idle or bored
Lamp: I provide dim lighting so that people can relax
Recliner: I provide people an entity to sit
Bed: I provide people an entity to relax or sleep
Note- The interior designer was able to put out these descriptive features, since he has "enough experience" gained from past projects
And at last, he tries to "describe" how to handle or "install" each item such that you get maximum "value" out of it
Television: Place the television on the wall hinge with minimal reflection
Lamp: Place the lamp where it can be reached without any movement
Recliner: Place the recliner such that there is ample space in front to extend it
Bed: Place the bed where there is ample space on the sides
Note- The interior designer was able to put out these descriptive insights, since he has "enough experience" gained from past projects
Now, once he has noted down the "query", "key", and "value" for each item, he will try to figure out the "affinity" of two items based on the query and key, i.e.
Television:
Query -> I am looking for something where people can sit
Candidate Keys:
Lamp: I provide dim lighting so that people can relax
Recliner: I provide people an entity to sit
Bed: I provide people an entity to relax or sleep
For the television's query, he can easily determine which particular item has the highest affinity:
Television -> Lamp : Low Affinity
Television -> Recliner : High Affinity
Television -> Bed : Medium Affinity
Similarly, for the bed:
Bed:
Query -> I am looking for something that can help people while relaxing
Keys->
Television: I provide entertaintment when someone is idle or bored
Lamp: I provide dim lighting so that people can relax
Recliner: I provide people an entity to sit
Bed -> Television : Low Affinity
Bed -> Lamp : High Affinity
Bed -> Recliner : Medium Affinity
So, at the end, he came up with these pairs or "combinations" by just picking the highest-affinity ones:
Television & Recliner
Bed & Lamp
Now, before installing them, he will refer to the "value" insights that he noted earlier, i.e.
Television: Place the television on the wall hinge with minimal reflection
Lamp: Place the lamp where it can be reached without any movement
Recliner: Place the recliner such that there is ample space in front to extend it
Bed: Place the bed where there is ample space on the sides
Now, combining the "affinity" and "values", the room eventually looks like:
A room with a bed at the center, a lamp on the side table, and a recliner facing the television mounted on the wall in the other half of the room.
So, he was able to place the items so that the overall room makes meaningful use of them.
If I quote this in a different way,
"He was able to describe the room in a way that none of the items lose their meaning, i.e., without any information loss." From the above description, you can easily determine which items are there, how they relate, and so on.
You might be wondering why we went through all of this, as this blog is not about mastering "interior design".
Well, while performing the above steps, you just did the actual "self-attention" mechanism!
In the above analogy:
Room -> Sentence
Items -> Tokens
Describe -> Projection
Query, Key, Value of Items -> Query, Key, Value of Tokens
Interior Designer -> Attention Block
Now we have the intuition behind "self-attention". Let's take a look at it formally.
Self Attention
So in our last blog, we learned about "position-aware" embeddings, a.k.a. positional embeddings, i.e., for each token in a sentence, we were able to generate its corresponding embedding while also storing the positional information.
So, let's say you have a sentence with "n" tokens and you want to embed it in a "d" dimensional embedding
So until self-attention, we were able to generate
Sentence -> Token(1) Token(2) Token(3) Token(4) .............. Token(n)
Position Aware Embeddings:
Token1 -> [0.1, 0.6, 0.9 ........ 'd']
Token2 -> [0.3, 0.1, 0.5 ........ 'd']
Token3 -> [0.9, 0.7, 0.8 ........ 'd']
...........................
...........................
...........................
Let's start correlating it with our analogy.
The interior designer first wrote down the key, query, and value insights for each item.
Now, we will have to do the same thing for these tokens, i.e., generating "query", "key", and "value" projections for these tokens
Now, how exactly to do this? Notice the word "projection" here.
Well, you guessed it right. As we saw in Token Embeddings, matrix operations are used to perform any kind of projection. And a matrix multiplication is exactly what a neural network's linear layer computes.
So, we want a matrix of dimension "d x d", where d is the "embedding dimension", such that once we multiply our current "position-aware" embeddings with it, we get the required projections for it, i.e.
query projection
key projection
value projection

Here, once we multiply our initial matrix [1 x d] with the projection matrix [d x d], we get the "transformed/enriched" projections depending on our need, i.e.
Since we need 3 projections, which are "key", "query", and "value", we will have 3 projection matrices corresponding to these, and once we multiply our token embedding with each of them, we will get our required projections
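To make this concrete, here is a minimal sketch (toy dimensions; the random matrices stand in for projection weights that would normally be learned):

```python
import torch

torch.manual_seed(0)

n, d = 4, 8                    # toy values: 4 tokens, embedding dimension 8
x = torch.randn(n, d)          # position-aware token embeddings [n x d]

# one [d x d] projection matrix per role (random here; learned in practice)
W_q = torch.randn(d, d)
W_k = torch.randn(d, d)
W_v = torch.randn(d, d)

Q = x @ W_q                    # query projections [n x d]
K = x @ W_k                    # key projections   [n x d]
V = x @ W_v                    # value projections [n x d]

print(Q.shape, K.shape, V.shape)  # three tensors of shape [4, 8]
```

Note how each projection keeps the [n x d] shape: every token still has one row, but the row now lives in a "query", "key", or "value" space.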

Now, you might be wondering how these projection matrices have the values such that, when multiplied by them, we get the desired projections.
If you go back to our analogy, our interior designer was able to do so because he had enough experience. Similarly, in this case, we will train these projection matrices so that the values are tuned to our use case.
Note- I'll be denoting Query - Q, Key - K, and Value - V
So for the sentence with "n" tokens, our effective Q, K, and V matrices post-projections will look like:

If we continue with our analogy, once the interior designer had "query", "key", and "value" for each item, he tried to figure out the affinity using "query" and "key", and picked the one with the highest affinity and used the "value" corresponding to it.
In our case, we have the Q, K, and V matrices representing the corresponding values for each token, and to compute the affinity, we multiply the Q and K matrices, yielding "weightage/affinity" scores for each token with respect to the other tokens.
Now, since Q and K are both of dimension [n x d], we cannot multiply them directly, so we will take the "transpose" of the Key matrix so that it becomes of dimension [d x n]

Think of a score matrix as a matrix containing affinity or attention scores of a given token with respect to other tokens.
In our analogy, we picked the item with the highest affinity. In self-attention, we instead take a "weighted" sum, i.e., we multiply this score matrix [n x n] with the "value" matrix [n x d]

The output matrix is of dimension [n x d]; you can think of each row in this matrix as the "context-aware" embedding of the corresponding token.
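Putting the score and weighted-sum steps together, a minimal numeric sketch (toy dimensions, random tensors standing in for the projected Q, K, and V) looks like:

```python
import torch

torch.manual_seed(0)

n, d = 4, 8
Q = torch.randn(n, d)          # query projections [n x d]
K = torch.randn(n, d)          # key projections   [n x d]
V = torch.randn(n, d)          # value projections [n x d]

# affinity of every token with every other token: [n x d] @ [d x n] -> [n x n]
scores = Q @ K.transpose(0, 1) / d ** 0.5
weights = torch.softmax(scores, dim=-1)   # each row now sums to 1

# weighted sum of the value rows -> context-aware embeddings [n x d]
attended = weights @ V
print(attended.shape)                     # torch.Size([4, 8])
```

The softmax turns each row of raw affinities into a probability distribution, so each output row is a convex combination of all the value rows.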

If I formulate what we just did, it will be represented as:

Attention(Q, K, V) = softmax(Q · Kᵀ / √d) · V

Note- We also divide the scores by √d before applying the softmax; this keeps the values in a stable range so the softmax does not saturate.
At the end, in our analogy, the interior designer puts everything together. Similarly, we take the mean of our output matrix [n x d] to get the sentence embedding of dimension "d"
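As a tiny sketch of this pooling step (a random tensor standing in for the attention output):

```python
import torch

torch.manual_seed(0)

n, d = 4, 8
attended = torch.randn(n, d)               # context-aware token embeddings [n x d]
sentence_embedding = attended.mean(dim=0)  # average over the n token rows
print(sentence_embedding.shape)            # torch.Size([8])
```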

Cool, so we were able to generate an embedding for a sentence and have built a strong foundation in self-attention.
Now, let's see how we implement and train it so that we have tuned "Q K V" weight matrices to create projections for any token.
Implementation
Self Attention Block:
As discussed, we will have matrices of dimension "d x d" for the Query, Key, and Value projection matrices. Since a linear layer is just a learned matrix multiplication, each Q, K, and V matrix will be a linear layer with input and output size "d".
In the forward pass, we will get corresponding Q K V values for our sentence tokens, after which we will apply the self-attention formula, i.e.

import torch
import torch.nn as nn

class SelfAttentionBlock(nn.Module):
    def __init__(self, d_model):
        super().__init__()
        self.d_model = d_model
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)
        # normalization layer
        self.ln = nn.LayerNorm(d_model)

    def forward(self, sentence_tokens):
        q = self.query(sentence_tokens)
        k = self.key(sentence_tokens)
        v = self.value(sentence_tokens)
        scaling_factor = self.d_model ** 0.5
        attention_scores = torch.matmul(q, k.transpose(-2, -1)) / scaling_factor
        attention_weights = torch.softmax(attention_scores, dim=-1)
        attended_values = torch.matmul(attention_weights, v)
        normalized = self.ln(attended_values + sentence_tokens)
        return normalized

Embedding Model:
In our model, we first have an "embedding_layer", which is a single-layer neural network that specializes in generating Token Embedding
After that, we have an attention layer, which takes these token embeddings and generates the "context aware" embeddings
We also have a method to get "sinusoidal positional embeddings" (It's not a model, as it's a deterministic mathematical function)
In the forward pass
First, we get the token embeddings for our sentence
Then we supply it with the self-attention block
At last, we take the average of all "attended" token embeddings
import math

class EmbeddingModel(nn.Module):
    def __init__(self, tokenizer):
        super().__init__()
        self.tokenizer = tokenizer
        self.embedding_layer = nn.Embedding(VOCAB_SIZE, EMBEDDING_DIM)
        # List of attention layers
        self.attention_layers = nn.ModuleList([
            SelfAttentionBlock(EMBEDDING_DIM)
        ])

    def _get_positional_embedding(self, seq_length):
        # Creates a [seq_length, EMBEDDING_DIM] matrix
        position = torch.arange(seq_length, dtype=torch.float, device=device).unsqueeze(1)
        div_term = torch.exp(
            torch.arange(0, EMBEDDING_DIM, 2, dtype=torch.float, device=device)
            * (-math.log(10000.0) / EMBEDDING_DIM)
        )
        pe = torch.zeros(seq_length, EMBEDDING_DIM, device=device)
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        return pe

    def forward(self, sentence):
        token_ids = self.tokenizer.encode(sentence, out_type=int)
        token_ids = token_ids[:MAX_SEQ_LEN]
        x = torch.tensor(token_ids, dtype=torch.long, device=device)
        x = self.embedding_layer(x)
        x = x + self._get_positional_embedding(x.size(0))
        for attn in self.attention_layers:
            x = attn(x)
        return torch.mean(x, dim=0)

Training
Remember how we performed training for Token Embeddings, where we followed the "training for a proxy" paradigm since we did not have a "supervised" dataset.
Similarly, since we don't have a "supervised" dataset present, i.e.
sentence | embedding
-------------------------------------------------
blah blah blah | [0.67, 0.543, 0.21, 0.21]
................ | .....................
................ | ....................
We will train for a "proxy" task, a.k.a. "unsupervised" training, where our task would be:
"Given a target sentence with the corresponding similar and non-similar pair, compute a confidence score."
Dataset:
main_sentence | similar_sentence | non-similar sentence
------------------------------------------------------------------
tom chase jerry | jerry likes cheese | harry killed voldemort
................ | ..................... | ...................
................ | .................... | ...................
Such a dataset is easy to generate, since the internet is filled with text, and it's a fair assumption that sentences appearing close together are similar, and vice versa.
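One possible way to build such triplets from a plain list of sentences, sketched below. Note that `make_triplets`, the window size, and the toy corpus are all hypothetical illustrations, not code from this blog:

```python
import random

def make_triplets(corpus, window=1, seed=0):
    """Treat sentences within `window` positions as similar,
    and a randomly drawn distant sentence as non-similar."""
    rng = random.Random(seed)
    triplets = []
    for i, sentence in enumerate(corpus):
        # distant sentences are candidates for the non-similar slot
        far = [k for k in range(len(corpus)) if abs(k - i) > window]
        for j in range(max(0, i - window), min(len(corpus), i + window + 1)):
            if j == i or not far:
                continue
            triplets.append((sentence, corpus[j], corpus[rng.choice(far)]))
    return triplets

corpus = [
    "tom chases jerry",
    "jerry hides in the wall",
    "harry waved his wand",
    "voldemort was defeated",
]
for main, similar, non_similar in make_triplets(corpus)[:2]:
    print(main, "|", similar, "|", non_similar)
```

On a real corpus you would apply the same idea per document, so that "closeness" never crosses document boundaries.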
Proxy Task:
Here, we use our model to generate embeddings for the triplets and then compute the confidence score by taking the dot product and normalizing it
class ProxyTask:
    def __init__(self, embedding_model):
        self.embedding_model = embedding_model

    def execute(self, main_sentence, similar_sentence, non_similar_sentence):
        main_emb = self.embedding_model(main_sentence)
        sim_emb = self.embedding_model(similar_sentence)
        non_sim_emb = self.embedding_model(non_similar_sentence)
        pos_dot = torch.dot(main_emb, sim_emb)
        neg_dot = torch.dot(main_emb, non_sim_emb)
        pos_conf = torch.log(torch.sigmoid(pos_dot))
        neg_conf = torch.log(torch.sigmoid(-neg_dot))
        confidence = pos_conf + neg_conf
        return confidence

Backpropagation (Gradient Descent)
Now, we will start the training loop for our proxy task
Here, our loss function becomes the negation of the confidence score, i.e., low confidence equals high loss and vice versa
After that, we backpropagate the loss via gradient descent, i.e., tuning the value of each parameter, which also includes the Q K V matrices
EPOCHS = 100
LEARNING_RATE = 1e-4

optimizer = optim.Adam(model.parameters(), lr=LEARNING_RATE)
dataloader = DataLoader(dataset, batch_size=1, shuffle=True)

for epoch in range(EPOCHS):
    total_loss = 0.0
    for i, (main_batch, similar_batch, non_similar_batch) in enumerate(dataloader):
        # DataLoader wraps single items in tuples/lists, so we unpack the strings
        main_sentence = main_batch[0]
        similar_sentence = similar_batch[0]
        non_similar_sentence = non_similar_batch[0]

        optimizer.zero_grad()
        confidence = task.execute(main_sentence, similar_sentence, non_similar_sentence)
        loss = -confidence
        loss.backward()
        optimizer.step()
        total_loss += loss.item()

        # Print progress every 100 steps
        if (i + 1) % 100 == 0:
            print(f"Epoch [{epoch+1}/{EPOCHS}], Step [{i+1}/{len(dataloader)}], Loss: {loss.item():.4f}")

    avg_loss = total_loss / len(dataloader)
    print(f"--- Epoch {epoch+1} completed. Average Loss: {avg_loss:.4f} ---")

Once our training loop completes, we have a trained "embedding model" with a trained attention block, i.e., tuned values of the Q, K, and V matrices.
Now, for any new sentence, you can use this trained model to generate an embedding
embedding = model.forward("jerry is in trouble")

Multi Layer Self Attention
If you look at the code of our embedding model, we used just a single attention layer, i.e.
self.attention_layers = nn.ModuleList([
    SelfAttentionBlock(EMBEDDING_DIM)
])

As we know, real-life use cases can be quite complex, which demands more non-linearities. We can fuse "multiple" attention blocks together; for example, here we have 3 attention blocks stacked one after the other, where the output of one block is supplied to the next as input
self.attention_layers = nn.ModuleList([
    SelfAttentionBlock(EMBEDDING_DIM),
    SelfAttentionBlock(EMBEDDING_DIM),
    SelfAttentionBlock(EMBEDDING_DIM)
])

And this is called "Multi Layer Self Attention."
In a nutshell,

Multi-Head Self Attention
Similarly, to add more non-linearities, instead of just stacking attention blocks one after the other, we can also have "a group of attention blocks", also called "heads", working in parallel.

Here, the results of each parallel block are merged and sent to the subsequent layer.
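As an illustrative sketch (not code from this blog) of one common way to wire this up: split the d_model-wide projections into num_heads slices, attend within each slice in parallel, then concatenate the heads and mix them with a final linear layer:

```python
import torch
import torch.nn as nn

class MultiHeadSelfAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0
        self.num_heads = num_heads
        self.d_head = d_model // num_heads
        self.query = nn.Linear(d_model, d_model)
        self.key = nn.Linear(d_model, d_model)
        self.value = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_model, d_model)   # mixes the merged heads

    def forward(self, x):                        # x: [n, d_model]
        n = x.size(0)
        # project, then split the last dimension into heads -> [num_heads, n, d_head]
        q = self.query(x).view(n, self.num_heads, self.d_head).transpose(0, 1)
        k = self.key(x).view(n, self.num_heads, self.d_head).transpose(0, 1)
        v = self.value(x).view(n, self.num_heads, self.d_head).transpose(0, 1)
        # each head attends independently over its own slice
        scores = q @ k.transpose(-2, -1) / self.d_head ** 0.5
        weights = torch.softmax(scores, dim=-1)
        attended = weights @ v                   # [num_heads, n, d_head]
        # concatenate the heads back into [n, d_model] and mix them
        merged = attended.transpose(0, 1).reshape(n, -1)
        return self.out(merged)

mha = MultiHeadSelfAttention(d_model=8, num_heads=2)
output = mha(torch.randn(5, 8))
print(output.shape)  # torch.Size([5, 8])
```

Each head sees a smaller d_head-dimensional slice, so different heads are free to specialize in different kinds of token-to-token relationships.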
Note- The basic idea of attention remains the same in every case; these are just different ways of combining multiple blocks to add more non-linearities.
Transformers
Well, it might surprise you, but we just designed the "encoder" part of the "Transformer" architecture, on which a ton of embedding models, like BERT, are based.

You might notice another section in this architecture, the "Decoder", on which all the text-generation models like GPT, Llama, etc. are based. We will cover that in the next blog, where we will take a deep dive into the fascinating world of decoders and unveil the magical capabilities they bring.
I hope you now have a strong intuition about Self Attention and are no longer scared.
Note- I skipped some parts, like pre-/post-normalization placement, residual connections, etc., since the primary aim is to build intuition, and these are more of an optimization.
Thank you for your attention!
Stay Tuned....


