The first step in using attention for natural language is to embed words into vectors.
Let’s start with a four-word sentence, with word vectors of length three:
Our embedded sentence has a dimensionality of (4, 3):
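As a rough sketch of this step (the sentence, vocabulary and array names are illustrative, not taken from any particular library), we can build the embedding with NumPy:

```python
import numpy as np

np.random.seed(42)

# a hypothetical four-word sentence
sentence = ["the", "cat", "sat", "down"]

# a toy vocabulary mapping each word to an integer index
vocab = {word: idx for idx, word in enumerate(sentence)}

# an embedding table with one length-3 vector per vocabulary entry
d_embedding = 3
embedding_table = np.random.randn(len(vocab), d_embedding)

# look up each word's vector to embed the sentence
embedded_sentence = embedding_table[[vocab[word] for word in sentence]]
print(embedded_sentence.shape)  # (4, 3)
```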
Position Encoding
In the previous section we embedded a sentence into word vectors.
Currently our word vector embedding has no concept of the position or order of the words in the input sequence.
Attention is permutation invariant - it doesn’t consider the order or position of the elements in the input sequence. It treats each word independently. However, word order is critical for understanding language.
We need a way to inject information about the relative or absolute position of the words in the sequence - one way to do this is to add a position encoding to each input embedding.
The position encodings have the same dimension d_embedding as the word embeddings, so we can add them element-wise.
A simple position encoding scheme is to use sine and cosine functions of different frequencies:
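One well-known choice is the sinusoidal encoding from the original transformer paper, where even dimensions use a sine and odd dimensions use a cosine, with the frequency decreasing across dimensions. A minimal NumPy sketch (the function name and the small dimensions are illustrative):

```python
import numpy as np

def position_encoding(seq_len: int, d_embedding: int) -> np.ndarray:
    positions = np.arange(seq_len)[:, np.newaxis]   # (seq_len, 1)
    dims = np.arange(d_embedding)[np.newaxis, :]    # (1, d_embedding)
    # each sin/cos pair shares a frequency of 1 / 10000^(2i / d_embedding)
    angle_rates = 1.0 / np.power(10000, (2 * (dims // 2)) / d_embedding)
    angles = positions * angle_rates
    encoding = np.zeros((seq_len, d_embedding))
    encoding[:, 0::2] = np.sin(angles[:, 0::2])     # even dimensions: sine
    encoding[:, 1::2] = np.cos(angles[:, 1::2])     # odd dimensions: cosine
    return encoding

position_encodings = position_encoding(seq_len=4, d_embedding=3)
print(position_encodings.shape)  # (4, 3)
```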
Each position has a unique encoding. The encodings form a sort of “positional space” where positions close to each other have more similar encodings compared to positions far apart.
Now we can add the position encodings to our word embeddings:
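Continuing the sketch with stand-in arrays for the two (4, 3) matrices:

```python
import numpy as np

np.random.seed(42)
embedded_sentence = np.random.randn(4, 3)    # stand-in for the embedded sentence
position_encodings = np.random.randn(4, 3)   # stand-in for the sinusoidal encodings

# element-wise addition works because both have shape (4, d_embedding)
encoded_sentence = embedded_sentence + position_encodings
print(encoded_sentence.shape)  # (4, 3)
```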
Query, Key and Value Weights
In previous sections we:
embedded a sentence into word vectors,
added positional encoding.
Central to the attention mechanism are the query, key and value vectors. These are the vectors we use to compute attention.
To calculate the query, key and value vectors we need three sets of weights.
The dimensionality of the embeddings and weights determines the dimensionality of the query, key and value vectors:
The query dimensionality is set by the embedding dimension and an arbitrary dimension d_query,
The key dimensionality is also set by d_query, so that queries and keys can be compared with a dot product,
The value dimensionality is set by the embedding dimension and an arbitrary dimension d_values.
The value dimension d_values is arbitrary, and sets the size of the output context vector.
The shapes of these weights do not depend on the length of the input sequence. This is important, as it’s how the attention mechanism can deal with sequences of arbitrary length.
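A sketch of creating the three weight matrices (d_query = 2 and d_values = 4 are arbitrary choices for illustration):

```python
import numpy as np

np.random.seed(42)

d_embedding = 3   # dimension of each word vector
d_query = 2       # arbitrary query/key dimension
d_values = 4      # arbitrary value dimension

# one randomly initialized weight matrix per projection;
# none of these shapes depend on the length of the input sequence
W_query = np.random.randn(d_embedding, d_query)   # (3, 2)
W_key = np.random.randn(d_embedding, d_query)     # (3, 2) - keys share the query dimension
W_value = np.random.randn(d_embedding, d_values)  # (3, 4)
```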
Query, Key and Value for One Word
Let’s select one word to do attention over.
We select the second word, and use a dot product to calculate the query, key and value for a single word:
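A sketch with stand-in inputs and weights (same illustrative dimensions as above):

```python
import numpy as np

np.random.seed(42)
encoded_sentence = np.random.randn(4, 3)   # embeddings + position encodings
W_query = np.random.randn(3, 2)
W_key = np.random.randn(3, 2)
W_value = np.random.randn(3, 4)

# project the second word into query, key and value space
word_2 = encoded_sentence[1]       # shape (3,)
query_2 = word_2 @ W_query         # shape (2,)
key_2 = word_2 @ W_key             # shape (2,)
value_2 = word_2 @ W_value         # shape (4,)
```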
Query, Key and Value for All Words
Let’s now do attention over all the words, which we can do with the same dot product:
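The same projection applied to the whole (4, 3) input at once, again with stand-in arrays:

```python
import numpy as np

np.random.seed(42)
encoded_sentence = np.random.randn(4, 3)
W_query = np.random.randn(3, 2)
W_key = np.random.randn(3, 2)
W_value = np.random.randn(3, 4)

# one matrix multiplication projects every word at once
queries = encoded_sentence @ W_query   # (4, 2)
keys = encoded_sentence @ W_key        # (4, 2)
values = encoded_sentence @ W_value    # (4, 4)
```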
Now our queries, keys and values have an additional dimension - the number of words in the sentence.
Attention Scores
In previous sections we:
embedded a sentence into word vectors,
added positional encoding,
transformed our input sentence into queries, keys and values.
The attention score is a measure of how similar a query is to a key. It’s computed using a dot-product.
Attention Scores for One Word with One Other Word
We can start by calculating the attention scores for a single word:
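For example, the second word's query scored against its own key (stand-in arrays):

```python
import numpy as np

np.random.seed(42)
queries = np.random.randn(4, 2)   # stand-in projected queries
keys = np.random.randn(4, 2)      # stand-in projected keys

# unnormalized attention score of the second word with itself
score_2_2 = np.dot(queries[1], keys[1])
```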
We can also calculate the attention scores between one word and another word:
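For example, the second word's query scored against the fourth word's key:

```python
import numpy as np

np.random.seed(42)
queries = np.random.randn(4, 2)
keys = np.random.randn(4, 2)

# unnormalized attention score between the second and fourth words
score_2_4 = np.dot(queries[1], keys[3])
```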
Attention Scores for One Word with All Other Words
We can extend this to calculate attention scores for one word with all other words:
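A single query against the whole key matrix gives one score per word:

```python
import numpy as np

np.random.seed(42)
queries = np.random.randn(4, 2)
keys = np.random.randn(4, 2)

# scores of the second word's query against every key at once
scores_2 = queries[1] @ keys.T   # shape (4,)
```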
Attention Scores for All Words with All Other Words
Finally, we can calculate the attention scores for all words with all other words:
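Multiplying the full query matrix by the transposed key matrix gives a (4, 4) grid of scores:

```python
import numpy as np

np.random.seed(42)
queries = np.random.randn(4, 2)
keys = np.random.randn(4, 2)

# one row of scores per query word, one column per key word
scores = queries @ keys.T   # shape (4, 4)
```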
Scaling
Scaling is a common technique used in attention - the raw scores are divided by the square root of the key dimension before the softmax. It has several benefits:
It reduces the variance of the dot products, making the softmax function more stable and less prone to yielding extreme values.
It helps maintain reasonable gradients even for large input sequences, facilitating the training process.
It becomes particularly important when using multi-headed attention, as it allows each head to specialize and attend to different aspects of the input.
Scaled Attention Scores
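A sketch of the scaling step, dividing the raw scores by the square root of the key dimension:

```python
import numpy as np

np.random.seed(42)
d_key = 2                          # same as d_query in this sketch
queries = np.random.randn(4, d_key)
keys = np.random.randn(4, d_key)

# scale the dot products by 1 / sqrt(d_key) before the softmax
scaled_scores = (queries @ keys.T) / np.sqrt(d_key)   # shape (4, 4)
```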
Normalization
Normalization is the process of converting the raw scores into a probability distribution using a softmax.
Normalization is important for several reasons:
It converts the raw scores into a valid probability distribution, allowing the model to weigh the importance of each input in a principled way.
It ensures that the attention weights are non-negative and sum up to 1, which is a desirable property for a weighting scheme.
It introduces a degree of competition among the inputs - increasing the weight of one input necessarily decreases the weights of others.
It makes the attention weights more interpretable and amenable to visualization and analysis.
Attention Weights for One Word
We can use the softmax to normalize the scores for a single word:
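A sketch with a small softmax helper and stand-in scores for one word:

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    # subtract the max for numerical stability before exponentiating
    exp_x = np.exp(x - np.max(x))
    return exp_x / exp_x.sum()

np.random.seed(42)
scaled_scores_2 = np.random.randn(4)   # stand-in scaled scores for the second word

attention_weights_2 = softmax(scaled_scores_2)
print(attention_weights_2.sum())       # sums to 1 (up to floating point)
```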
Attention Weights for All The Words
We can use the softmax to normalize the scores for all the words:
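The same idea applied row-wise to the full (4, 4) score matrix:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp_x / exp_x.sum(axis=axis, keepdims=True)

np.random.seed(42)
scaled_scores = np.random.randn(4, 4)   # stand-in scaled scores

# softmax over the key axis, so each row sums to 1
attention_weights = softmax(scaled_scores, axis=-1)
print(attention_weights.sum(axis=-1))   # each row sums to 1
```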
Calculating the Context Vector
In previous sections we:
embedded a sentence into word vectors,
added positional encoding,
transformed our input sentence into queries, keys and values,
used scaling and a softmax normalization to produce attention weights.
The context vector aggregates the information from the entire sequence, weighted by relevance to each input word. It is the output of the attention mechanism.
The context vector is a weighted sum of the value vectors, where the weights are the normalized attention weights.
The context vector captures the relevant information from other parts of the input sequence needed to focus on specific elements during processing.
Context Vector for All The Words
We can calculate the context vector with matrix multiplication:
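A sketch with stand-in attention weights and values (d_values = 4 as before):

```python
import numpy as np

np.random.seed(42)
attention_weights = np.random.rand(4, 4)
attention_weights /= attention_weights.sum(axis=-1, keepdims=True)  # rows sum to 1
values = np.random.randn(4, 4)   # stand-in value vectors, d_values = 4

# each row of the result is a weighted sum of all the value vectors
context_vectors = attention_weights @ values   # shape (4, d_values)
```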
Multi-Head Attention
Multi-head attention involves multiple attention mechanisms - a single attention mechanism is a single head.
Each head has its own set of query, key and value weights. Each head can learn a different representation of the input at the same time.
One head may focus on learning syntax, another on semantics. One head may focus on short-term dependencies (between one token and the next), another on long-term dependencies (between one token and the end of the sentence).
Embedding Words to Vectors
Let’s start as we did previously, by embedding words to vectors.
This step is the same as with a single attention head:
Position Encoding
Next we encode position - this step is the same as with a single attention head:
Multi-Head Attention Weights
Next we create our query, key and value weights.
This step is different from a single attention head, as we create a set of weights for each head:
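A sketch with an assumed n_heads = 2, stacking one weight matrix per head along a leading axis:

```python
import numpy as np

np.random.seed(42)

n_heads = 2       # assumed number of heads for this sketch
d_embedding = 3
d_query = 2
d_values = 4

# one set of query, key and value weights per head
W_query = np.random.randn(n_heads, d_embedding, d_query)   # (2, 3, 2)
W_key = np.random.randn(n_heads, d_embedding, d_query)     # (2, 3, 2)
W_value = np.random.randn(n_heads, d_embedding, d_values)  # (2, 3, 4)
```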
Query, Key and Value for One Word
Let’s select one word to do attention over:
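Projecting the second word once per head (stand-in arrays, n_heads = 2):

```python
import numpy as np

np.random.seed(42)
encoded_sentence = np.random.randn(4, 3)
W_query = np.random.randn(2, 3, 2)   # (n_heads, d_embedding, d_query)
W_key = np.random.randn(2, 3, 2)
W_value = np.random.randn(2, 3, 4)

word_2 = encoded_sentence[1]   # shape (3,)
query_2 = word_2 @ W_query     # (n_heads, d_query) -> (2, 2)
key_2 = word_2 @ W_key         # (2, 2)
value_2 = word_2 @ W_value     # (n_heads, d_values) -> (2, 4)
```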
Query, Key and Value for All Words
First we need to stack input embeddings - one for each head. After this we can calculate the queries, keys and values:
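A sketch of stacking the input per head and using a batched matrix multiplication:

```python
import numpy as np

np.random.seed(42)
n_heads = 2
encoded_sentence = np.random.randn(4, 3)
W_query = np.random.randn(n_heads, 3, 2)
W_key = np.random.randn(n_heads, 3, 2)
W_value = np.random.randn(n_heads, 3, 4)

# stack the input so there is one copy of the embeddings per head
stacked_inputs = np.stack([encoded_sentence] * n_heads)   # (2, 4, 3)

# batched matrix multiplication: one projection per head
queries = stacked_inputs @ W_query   # (n_heads, 4, d_query) -> (2, 4, 2)
keys = stacked_inputs @ W_key        # (2, 4, 2)
values = stacked_inputs @ W_value    # (n_heads, 4, d_values) -> (2, 4, 4)
```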
Scaled Attention Scores
After calculating queries and keys, we can calculate the attention scores:
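The scores are now computed per head, with the same 1 / sqrt(d_key) scaling:

```python
import numpy as np

np.random.seed(42)
d_key = 2
queries = np.random.randn(2, 4, d_key)   # (n_heads, n_words, d_query)
keys = np.random.randn(2, 4, d_key)

# per-head dot products between every query and every key, then scale
scaled_scores = (queries @ keys.transpose(0, 2, 1)) / np.sqrt(d_key)   # (2, 4, 4)
```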
Normalized Attention Weights
As with a single head, we use a softmax function to normalize the attention scores:
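The softmax is applied over the last axis of each head's score matrix:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    exp_x = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return exp_x / exp_x.sum(axis=axis, keepdims=True)

np.random.seed(42)
scaled_scores = np.random.randn(2, 4, 4)   # (n_heads, n_words, n_words)

# each row of each head's matrix now sums to 1
attention_weights = softmax(scaled_scores, axis=-1)
```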
Context Vector
Finally we can calculate the context vector with matrix multiplication:
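One weighted sum of value vectors per head, again with stand-in arrays:

```python
import numpy as np

np.random.seed(42)
attention_weights = np.random.rand(2, 4, 4)                          # (n_heads, n_words, n_words)
attention_weights /= attention_weights.sum(axis=-1, keepdims=True)   # rows sum to 1
values = np.random.randn(2, 4, 4)                                    # (n_heads, n_words, d_values)

# one context vector per word per head
context_vectors = attention_weights @ values   # (n_heads, n_words, d_values)
```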