The purpose of this post is to break down the math behind the
Transformer
architecture, as well as share some helpful resources and gotcha's based on my
experience in learning about this architecture. We start with an exploration of sequence
transduction literature leading up to the Transformer, after which
we dive into the foundational
Attention is All
You Need
paper by Vaswani, et al. (2017).
Before Transformers [Top]
While
Attention is All You Need introduced a watershed neural network architecture with
vast and growing applications, a look into preceding research on sequence transduction is very
instructive. Doing so yields both the
motivation behind, as well as the machinery and techniques that enabled the development of the
Transformer. For an excellent deep dive into this area, with a particular focus on machine
translation, I highly recommend Prof. Graham
Neubig's
tutorial. It builds up from the most basic
frequency and regression-based models, all the way to encoder-decoder and attention-based neural
networks. Much of this section is drawn
from or inspired by various parts of Neubig's tutorial.
The
Attention paper was motivated by the problem of machine translation. In
keeping with the notation in the tutorial, this problem can be expressed as follows: we start
with a source sequence $\vec{F} = (f_0, f_1,
..., f_n)$, where each $f_i$ represents an individual word drawn from a source vocabulary, such
as that of English. We seek to predict a translation of $\vec{F}$ into a different language,
a.k.a. a target sequence $\vec{E} = (e_0,
e_1, ..., e_m)$ comprised of words $e_i$ from a target vocabulary, such as that of French.
A few things before we continue - first, note that the lengths of $\vec{F}$ and
$\vec{E}$ are not necessarily the same. This is one aspect of the translation task that makes it
especially difficult - human languages do
not simply translate word-for-word into other ones. Second, the individual
"words" $e_i$ and $f_i$ are represented as vectors,
usually one-hot encoded to start with.
Expressed probabilistically, we seek to model the distribution $P(\vec{E} | \vec{F}) =
P ( e_0, e_1, ..., e_m | \vec{F})$. One helpful observation is that we can break up the joint
probability of $\vec{E}$ into a product
of conditional probabilities:
$$P(\vec{E} | \vec{F}) = P(e_0 | \vec{F}) \cdot P(e_1 | e_0, \vec{F}) \cdot ... \cdot
P(e_{m - 1} | e_0, ..., e_{m - 2}, \vec{F}) \cdot P(e_m | e_0, ..., e_{m - 1}, \vec{F}) $$
Thinking of this problem this way is convenient
for tackling it with recurrent neural networks, which led to the bulk of recent advances in
machine translation up until the introduction of Transformer-based models.
The paper notes at the beginning of Section 3 on Model Architecture that "Most
competitive neural sequence transduction models have an
encoder-decoder structure." This
is a point worth exploring, as the Transformer
is also an encoder-decoder. The first encoder-decoder models for translation were RNN-based, and
introduced almost simultaneously in 2014 by
Learning Phrase Representations using RNN
Encoder–Decoder for
Statistical Machine Translation and
Sequence to Sequence Learning with Neural
Networks. The encoder-decoder framework in general refers to a situation in which one
process represents, or "encodes," input data into
one vector, and then another process "decodes" that vector into a desired output. It is by no
means specific to NLP, having found many applications in computer vision as well. Below is a
visualization of this process from the
Sequence to Sequence paper above by Sutskever, et
al.
At a high level, encoder-decoder RNN models work as follows: let
$\text{RNN}^{(f)}(\vec{x}, \vec{h})$ be an RNN model that accepts an input vector $\vec{x}$ and
a hidden state $\vec{h}$. This RNN is the "encoder." As before,
we seek to predict $\vec{E} = (e_0, ..., e_m)$ given $\vec{F} = (f_0, ..., f_n)$.
To start, we compute $\text{RNN}^{(f)}(f_0, h_0)$, where $h_0$ is an initial
hidden state vector. This generates an output vector and a hidden state vector $h_1$. Next, we
simply ignore the output vector, and compute
$\text{RNN}^{(f)}(f_1, h_1)$. We repeat the process until we go through the entirety of
$\vec{F}$. The goal is to "prime" the model's hidden state vector such that by the time the
model
processes all of the source sequence elements, the hidden state contains all of the information
needed for another model to then sequentially
generate $\vec{E}$. Accordingly, we introduce a "decoder" RNN, $\text{RNN}^{(e)}(\vec{x},
\vec{h})$. We feed in the final encoder hidden state vector $h_{n+1}$ as the initial hidden
state of
the decoder, along with the first element
of the sequence to decode (more on this in a few paragraphs).
The decoder outputs a categorical distribution over the target vocabulary.
We use this distribution along with the ground truth to
compute the contribution of that target element
to the overall loss, which is some metric appropriate for comparing distributions such
as cross-entropy or KL divergence. There has been some research into modifying this aspect of
the training
process to yield more generalized and less overconfident models. One such technique, label
smoothing, is also employed in the
Attention paper at training time. Label smoothing was
introduced in
Rethinking the Inception
Architecture for Computer Vision in 2015; it involves using a
weighted average of the actual ground truth "distribution"
(categorical with mass 1 at the ground truth
label) and the discrete Uniform distribution, as ground truth.
What is fed into the decoder depends on whether we are doing
training or inference. When training, we feed in the ground truth target $\vec{E}$ one
element at a time.
Using ground truth at each time step, as opposed to outputs generated by the network,
is referred to as
teacher forcing. Meanwhile, inference can be performed in a number of
ways. Since we
do not have a ground truth target sequence $\vec{E}$ during inference,
and since each element of the generated $\vec{E}$ depends on the ones that came before
it, we need to sequentially generate its elements. The target sequence is considered
"finished" when the model says the
<EOS>
tag is the most likely next token.
This may seem straightforward enough, but there are some caveats. The simplest
method of generating target sequences is greedy decoding, which entails setting each target
prediction $\hat{e_i}$ to be the argmax of
the output probability distribution at each step. However, greedy decoding is not guaranteed to
yield the highest probability target
sequence; that is, even if one picks the most likely
next element at each time step, this
does not preclude the possibility of another sequence that has an overall higher likelihood than
the greedy solution. To account for this possibility, another way of performing inference is
beam search, which involves exploring
several target sequence trajectories simultaneously and picking the sequence with the highest
likelihood. The number of trajectories to be explored is a hyperparameter referred to as the
"beam width."
Attention utilizes beam search for benchmarking their various models after they have
been trained.
One drawback of the RNN-based encoder-decoder is that a single
hidden state vector can only carry so much information - the longer an input sequence, the
noisier the information from earlier parts of
that sequence becomes. This limits the ability of RNN-based models to make good predictions
over long sequence lengths. Researchers began exploring the possibility of using all of the
hidden state vectors generated during encoding
to decode the target sequence, as opposed to just the last encoder hidden state. A key
innovation in this direction was
bidirectional encoders, introduced in
Neural Machine Translation by Jointly Learning
to Align and Translate in 2016. In addition to having an RNN go through the source
sequence $\vec{F}$, this approach adds another RNN that goes in reverse through the same
sequence. For each time step, both RNNs cache their hidden states.
This means that once all words in $\vec{F}$ have been traversed, each one has two hidden state
vectors associated with it. The two hidden state vectors are concatenated into one, and all $n$
resulting vectors are concatenated into
one matrix.
But in order for the decoder to be able to use all of the hidden states, this matrix
needs to be condensed into a vector of consistent size - we cannot pass a hidden state into
the decoder whose size varies based on
the number of elements in the source sequence. And so,
attention was born. An attention
mechanism condenses this set of hidden states into a weighted sum of its constituent vectors,
and the weighting would be based in part
on the contents of the vectors. One such weighting was
dot-product attention, introduced
in
Effective Approaches to Attention-based Neural
Machine
Translation in 2015.
Attention Is All You Need [Top]
With the backdrop in place, we can begin to discuss the
Attention paper, in which the
authors introduce a new type of architecture that addresses many of the pitfalls to RNN-based
models in the context of machine translation.
Even with all of the advances in RNN encoder-decoders discussed above, the fact remained
that RNNs are difficult to parallelize due to the fact that they sequentially process input.
The key innovation of this paper is
that the reliance on RNNs and their hidden states are entirely replaced with attention-based
operations that are more efficient in many problem regimes. In doing so, the Transformer
architecture eliminated the most significant
bottleneck in the training process for preceding SOTA models, and suddenly made enormous models
trained on equally enormous datasets feasible to implement.
The following sections will describe the various components of the Transformer in more
detail; for now, we will start with a high-level overview. The architecture diagram from the
paper is shown below.
As mentioned before, the Transformer is an encoder-decoder model. The encoder is
comprised of the $N$ blocks on the left, and the decoder is comprised of the $N$ blocks on the
right.
During training, the input words $\vec{F} = (f_0, ..., f_n)$ are passed into the first
encoder block all at once, and the output of that block is then passed through its successor.
The process is repeated until all $N$
encoding blocks have processed the input. Each block has two components: a Multi-Head
Self-Attention layer, followed by a fully connected layer with ReLU activations that processes
each element of the input sequence in parallel.
When the input has passed through all the encoding blocks, we are left with the encoded
representation of $\vec{F}$. Both the multi-head attention layer and the fully connected layer
are followed by an "Add & Norm" step - the "add"
refers to a residual connection that adds the input of each layer to the output, and the "norm"
refers to
Layer Normalization.
Moving onto the decoder, it consists of three steps: a Masked Multi-Head self-attention
layer, a Multi-Head Attention layer connecting the encoded source representation to the decoder,
and a fully-connected layer with ReLU
activations. Just like in the encoder, each layer is followed by an "Add & Norm" layer. The
decoder accepts all of the target words $\vec{E} = (e_0, ..., e_m)$ as input. There are several
key differences from the encoder to note
here - one is that inputs to the first attention operation in the decoder blocks are
masked, hence the name of the layer. We explore this in detail in the
Multi-Head Attention section, but in
a nutshell, this means that any word in the target output can only attend to words that came
before it. The reason for this is simple: during
inference, we generate the predicted
translation $\vec{E}$ word-by-word using
the source sentence $\vec{F}$. In the process of predicting a word $e_i$, the decoder has access
to previously generated words. It cannot, however, have access to words that follow $e_i$, as
those have yet to be generated and depend
on the model's choice of $e_i$. Masking during training allows us to emulate the conditions
that the model will face during inference.
Another difference from the encoder is the second Multi-Head Attention layer, also
referred to as the Encoder-Decoder Attention layer. Unlike the attention layers at the start of
the encoder and decoder blocks, this layer
is not a self-attention layer. Self-attention is discussed in greater detail in the
Applications of Multi-Head Attention section, but the practical
significance of this is that encoder-decoder attention ties
together the encoded representation of $\vec{F}$ with the target words in $\vec{E}$.
Scaled Dot-Product Attention [Top]
We will now take a look at the Scaled Dot-Product Attention mechanism in the paper, shown below.
This function is used in all of the attention layers in the Transformer. For now, we
will break down the math behind this operation, just to get
a sense of which numbers are going where. Afterwards, we will focus on its applications in the
paper. $$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Scaled Dot-Product Attention is nearly identical
to the Dot-Product Attention mentioned earlier from Luong, et al.
(2015). The only difference is that the
input to the softmax is scaled by a factor
of $\frac{1}{\sqrt{d_k}}$. Once we go through how Scaled Dot-Product Attention works, we will
come back the this point and explore it in greater detail at the end of this section.
In the paper and preceding literature, the rows of $Q$ are referred to as "queries,"
those of $K$ "keys," and finally those of $V$ "values." Just think of these as labels for now;
we can examine their interpretations later.
Here, $Q \in \mathbb R ^{ m \times d_k}$, $K\in \mathbb R ^{n \times d_k }$ and $V \in \mathbb R
^{n \times d_v }$. Note that for the algebra to work out, the number of keys and values $n$ must
be equal, but the number of queries
$m$ can vary. Likewise, the dimensionality of the keys and queries must match, but that of the
values can vary.
We will start our exploration by writing out the individual rows of $Q$ and $K$, and
then express the product $QK^T$ in terms of those rows: $$Q = \begin{pmatrix} - & \vec{q}_0 & -
\\ - & \vec{q}_1 & - \\ - & \vdots &
- \\ - & \vec{q}_m & - \\ \end{pmatrix}; K^T = \begin{pmatrix} | & | & & | \\ \vec{k}_0 &
\vec{k}_1 & \cdots & \vec{k}_n\\ | & | & & | \\ \end{pmatrix} $$ $$ QK^T = \begin{pmatrix}
\vec{q}_0 \cdot \vec{k}_0 & \vec{q}_0 \cdot \vec{k}_1
& \cdots & \vec{q}_0 \cdot \vec{k}_n \\ \vec{q}_1 \cdot \vec{k}_0 & \vec{q}_1 \cdot \vec{k}_1 &
\cdots & \vec{q}_1 \cdot \vec{k}_n \\ \vdots & \vdots & \ddots & \vdots \\ \vec{q}_m \cdot
\vec{k}_0 & \vec{q}_m \cdot \vec{k}_1 &
\cdots & \vec{q}_m \cdot \vec{k}_n \\ \end{pmatrix} $$ Next, we obtain our attention weights by
dividing each element by $\sqrt{d_k}$ and applying the softmax function per row: $$
\text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)
= \begin{pmatrix} \text{softmax}( \frac{1}{\sqrt{d_k}} \langle \vec{q}_0 \cdot \vec{k}_0,
\vec{q}_0 \cdot \vec{k}_1, ..., \vec{q}_0 \cdot \vec{k}_n \rangle ) \\ \text{softmax}(
\frac{1}{\sqrt{d_k}} \langle \vec{q}_1 \cdot \vec{k}_0,
\vec{q}_1 \cdot \vec{k}_1, ..., \vec{q}_1 \cdot \vec{k}_n \rangle ) \\ \vdots \\ \text{softmax}(
\frac{1}{\sqrt{d_k}} \langle \vec{q}_m \cdot \vec{k}_0, \vec{q}_m \cdot \vec{k}_1, ...,
\vec{q}_m \cdot \vec{k}_n \rangle ) \\ \end{pmatrix}
= \begin{pmatrix} s_{0, 0} & s_{0, 1} & \cdots & s_{0, n} \\ s_{1, 0} & s_{1, 1} & \cdots &
s_{1, n} \\ \vdots & \vdots & \ddots & \vdots \\ s_{m, 0} & s_{m, 1} & \cdots & s_{m, n} \\
\end{pmatrix} $$ where for each row $i$, as
a result of the softmax operation, $$ \sum_{j = 0}^n s_{i, j} = 1 $$ The last step is to
multiply this matrix by $V$: $$ \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V =
\begin{pmatrix} s_{0, 0} & s_{0, 1} & \cdots & s_{0,
n} \\ s_{1, 0} & s_{1, 1} & \cdots & s_{1, n} \\ \vdots & \vdots & \ddots & \vdots \\ s_{m, 0} &
s_{m, 1} & \cdots & s_{m, n} \\ \end{pmatrix} \begin{pmatrix} - & \vec{v}_0 & - \\ - & \vec{v}_1
& - \\ - & \vdots & - \\ - & \vec{v}_n
& - \\ \end{pmatrix}= \begin{pmatrix} \sum_{i=0}^n s_{0, i} \vec{v}_i \\ \sum_{i=0}^n s_{1, i}
\vec{v}_i \\ \vdots \\ \sum_{i=0}^n s_{m, i} \vec{v}_i \\ \end{pmatrix} $$ The takeaway here is
that the Attention mechanism results
in a series of weighted averages of the rows of $V$, where the weighting depends on the input
queries and keys. Each of the $m$ queries in $Q$ results in a specific weighted sum of the value
vectors. Of note is that there are
no learnable parameters in this particular procedure - it's entirely comprised of matrix and
vector operations.
Let's zoom in on an individual row $i$ of attention weights: $$ \text{softmax}(
\frac{1}{\sqrt{d_k}} \langle \vec{q}_i \cdot \vec{k}_0, \vec{q}_i \cdot \vec{k}_1, ...,
\vec{q}_i \cdot \vec{k}_n \rangle ) = \frac{1}{S}\langle
\exp{\left( \frac{\vec{q}_i \cdot \vec{k}_0}{\sqrt{d_k}} \right) } , \exp{\left( \frac{\vec{q}_i
\cdot \vec{k}_1}{\sqrt{d_k}} \right) } , ... , \exp{\left( \frac{\vec{q}_i \cdot
\vec{k}_n}{\sqrt{d_k}} \right) } \rangle $$ where
$S$ is the normalization constant: $$ S = \sum_{j=0}^n \exp{\left( \frac{\vec{q}_i \cdot
\vec{k}_j}{\sqrt{d_k}} \right) }$$
Looking at how the weights are constructed, the origin of the names "queries,"
"keys," and "values" is clearer. Just like in a hashtable,
this operation picks out desired values via corresponding, one-to-one keys. The keys that we
seek are indicated by the queries - we can express the dot product between a given key and query
in terms of the angle $\theta$ between
them: $$ \vec{q}_i \cdot \vec{k}_j = \lvert \vec{q}_i \rvert \lvert \vec{k}_j \rvert
\cos(\theta)$$ Ignoring magnitudes for a moment, we can see in the plot below that
exponentiation amplifies positive cosine values and diminishes
negatives ones. Therefore, the closer in angle the key $\vec{k}_j$ and query $\vec{q}_i$ are,
the greater their representation in the attention vector.
To bring back some of the machine translation context into this, each row of the keys,
queries, and values in this setting is a vector representation of elements in a sequence. One
drawback
of RNN-based models that we noted is that they
have difficulty using information from elements observed far in the past. Put more generally,
they have trouble relating sequential information that are far apart from each other. Techniques
such as attention on hidden states and
bidirectional models were attempts to rectify this issue, and served as a natural gateway into
the
techniques in this paper. Scaled dot-product attention enables us to efficiently and rapidly
relate each element in a sequence to all
elements in another sequence, as well as to all others in the same sequence.
Our closing point for this section will be about the scaling factor that is the
namesake of this attention mechanism. The authors of
Attention
note that they divide the inputs to
the softmax function by $\sqrt{d_k}$ to mitigate the effects of large input values, which would
lead to small gradients during training. The authors were concerned that long key/query vectors
would lead to dot products with high
magnitudes. Their choice of the value of the scaling factor can be explained by Footnote 4 in
the paper, along with the fact that with the sole exception of the very first encoder/decoder
blocks, each input matrix goes through
a Layer Normalization step. As for why large softmax arguments lead to small gradients, we can
understand this with some calculus. We will start off with some definitions:
$$\vec{s} = (s_0,
..., s_n) \text{, } S(\vec{s}) = \sum_{i=0}^n
e^{s_i}$$
Next, let $\vec{p}_C(s)$ represent the softmax function whose input vector is scaled
by a factor of $1/C$. Note that if we set $C = 1$, we end up with the original, unscaled
softmax.
$$ \vec{p}_C(\vec{s}) = \text{softmax}(\vec{s}/C)
= \frac{1}{S(\vec{s}/C)}(e^{s_0/C}, ..., e^{s_n/C}) $$
We want to consider the derivative of the
scaled softmax with respect to its input vector $\vec{s}$. As the softmax's inputs and
outputs are both vector-valued, what
we are really looking for is the Jacobian: $$ \frac{d \vec{p}_C}{d
\vec{s}} = \left[ \frac{\partial \vec{p}_C}{\partial s_0}, ..., \frac{\partial
\vec{p}_C}{\partial s_n} \right] $$ $$ = \begin{pmatrix}
\frac{\partial p_0}{\partial s_0} && \frac{\partial p_0}{\partial s_1} && \cdots &&
\frac{\partial p_0}{\partial s_n} \\ \frac{\partial p_1}{\partial s_0} && \frac{\partial
p_1}{\partial s_1} && \cdots && \frac{\partial p_1}{\partial
s_n} \\ \vdots && \vdots && \ddots && \vdots \\ \frac{\partial p_n}{\partial s_0} &&
\frac{\partial p_n}{\partial s_1} && \cdots && \frac{\partial p_n}{\partial s_n} \\
\end{pmatrix}$$ Computing each partial derivative and moving
some terms around, the Jacobian can be expressed succinctly as: $$ = \frac{\text{diag} (\vec
p_c) - \vec p_c \otimes \vec p_c}{C}$$ where the $\text{diag}$ function projects vectors onto a
diagonal matrix, and $\otimes$ is the
outer product. The code below generates
visualizations of the scaled and unscaled
Jacobians evaluated on randomly generated vectors of various sizes. For a given vector length
$d_k$, each element is drawn from $N(0, d_k)$ to align with the assumptions in the paper.
from matplotlib.colors import Normalize
import matplotlib.pyplot as plt
import numpy as np
from scipy.special import softmax
def jacobian(s, C = 1):
""" Evaluates softmax Jacobian for vector `s`. Adds scaling factor `C` if
provided, else defaults to scaling of 1.
"""
softmax_s = softmax(s / C)
return 1./C * ( np.diag(softmax_s) - np.outer(softmax_s, softmax_s) )
def plot_jacobian(matrix, size, ax, norm=None):
im = ax.imshow(matrix, norm=norm, cmap="RdBu")
ax.set_title(
"Length: {} \n Mean: {:.3} \n SD: {:.3} \n Range: {:.3}"
.format(size, np.mean(matrix), matrix.std(), matrix.ptp())
)
ax.set_xticks([])
ax.set_yticks([])
plt.colorbar(im, ax=ax)
def generate_jacobian_comparison(sizes):
fig_scaled, axes_scaled = plt.subplots(1, len(sizes), figsize=(23, 4))
fig, axes = plt.subplots(1, len(sizes), figsize=(23, 4))
fig.suptitle("Unscaled Softmax Jacobian Evaluated at Random Vector", y=1.1)
fig_scaled.suptitle("Scaled Softmax Jacobian Evaluated at Random Vector", y=1.1)
for n, size in enumerate(sizes):
scale = size ** 0.5
# random normal vector with mean 0 and variance `size`.
# this represents the vector "s" in our math
s = np.random.randn(size) * scale
jac_s = jacobian(s, C=1)
jac_s_scaled = jacobian(s, C=scale)
# normalize to extrema of scaled Jacobian to better visualize
# unscaled, which is often dominated by a handful of large values
norm = Normalize(jac_s_scaled.min(), jac_s_scaled.max())
plot_jacobian(jac_s_scaled, size, axes_scaled[n], norm=norm)
plot_jacobian(jac_s, size, axes[n], norm=norm)
plt.show()
sizes = [4, 8, 16, 32, 64] # some sample lengths for our inputs
generate_jacobian_comparison(sizes)
Running the function
generate_jacobian_comparison()
yields outputs like the
below:
Repeated runs show that generally speaking, the elements of the unscaled softmax Jacobian
tend to have a higher variance compared to those of the scaled softmax, which look less peaky.
This is not surprising if we consider
our formula for the Jacobian, along with basic mathematical properties of the softmax.
There are two simple operations in the softmax that are at odds with each other, so
to speak:
exponentiation and addition. Exponentials change rapidly for increasing inputs - after all,
$\frac{d}{dx}e^x = e^x$. If we have a vector
$\vec{v} = \langle 1, 2 \rangle$, exponentiating its components gives us $e^1 \approx 2.7$, and
$e^2 \approx 7.4$. If we instead exponentiate each term in $10 \vec{v}$, we end up with $e^{10}
\approx 22\text{,}046.5$, and $e^{20}
\approx 485\text{,}165\text{,}195.41$.
Once we exponentiate each element of the input vector, we need to compute the
normalization term by adding up the exponentiated terms. Going back to our examples, $e^2$
is larger than $e^1$ by a factor of $e$. So in the normalization term, the second element only
contributes a bit under
three times as much as the first. But when we compute the same with $10\vec{v}$, we can quickly
see that the first element stands no
chance - the second element contributes over $22\text{,}000$ times as much to the normalization
term. When all is said and done, $\text{softmax}(\vec{s}) \approx \langle 0.27, 0.73 \rangle$,
but $\text{softmax}(10\vec{s}) \approx
\langle 4 \times 10^{-5}, 0.999...\rangle$. The takeaway is that otherwise minor differences
between inputs to the softmax are vastly amplified in the corresponding output.
Tying things back together, the Jacobian is again comprised of two parts: the
diagonalized softmax output, minus the outer product of the softmax output. Because of its
composition, the Jacobian matrix inherits the sensitivity
of the softmax to extreme-valued inputs, leading to certain terms in the matrix overwhelming
others. Though this property of the softmax may be desirable in situations such as performing
classification, it is not necessarily desirable
when the operation occurs deep inside a network. The advantage of scaling is that while little
information is lost in dividing the input vector by a scalar
(as the original information is retrievable by simply performing the
reverse, assuming arithmetic underflow is not an issue) the
properties of the softmax mean that doing
so leads to Jacobians with larger values overall during training.
Multi-Head Attention [Top]
Multi-Head Attention is an extension of Scaled Dot-Product Attention, in which linear
transformations are run on the queries, keys, and values to yield multiple sets of inputs to the
attention function. The attention function is computed in parallel given
each of these sets of inputs, and the results concatenated side-by-side into one matrix. The
final result is obtained via a linear transformation of the concatenated matrix to obtain a
matrix with the desired dimensionality. The
authors express this in the form below: $$\text{MultiHead}(Q, K, V) =
\text{Concat}(\text{head}_1, ..., \text{head}_h)W^O$$ Each $\text{head}_i$ is the result of
running Scaled Dot-Product Attention on the $i^{th}$ set of transformed
queries, keys, and values: $$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$ where $Q
\in \mathbb R^{m \times d_{model}}$, $K \in \mathbb R^{n \times d_{model}}$, and $V \in \mathbb
R^{n \times d_{model}}$. Further,
given a hyperparameter $h$ indicating the number of attention heads: $W_i^Q \in \mathbb
R^{d_{model} \times d_k}$, $W_i^K \in \mathbb R^{d_{model} \times d_k}$, $W_i^V \in \mathbb
R^{d_{model} \times d_v}$, and $W^O \in \mathbb
R^{hd_v \times d_{model}}$.
Let's perform a quick sanity check that the matrix multiplication works out. First, we
know from the previous section that each matrix $\text{head}_i$ will have the same number of
rows as $QW_i^Q$, and the same number
of columns as $VW_i^V$. Since $QW_i^Q \in \mathbb R^{m \times d_k}$ and $VW_i^V \in \mathbb R^{n
\times d_v}$, this means $\text{head}_i \in \mathbb R^{m \times d_v}$. When we concatenate $h$
of these together, we have a matrix
in $\mathbb R ^ {m \times hd_v}$. Multiplying with $W^O$ yields a matrix in $\mathbb R ^ {m
\times d_{model}}$. This makes sense - we started with $m$ queries in $Q$, and we ended up with
$m$ responses in the output of the $\text{MultiHead}$
operator.
Notice how each head computation has a different linear transformation for the key,
query, and value matrices. Each of these transformations is learned during training. For readers
familiar with convolutional neural networks,
I think of the heads and their weight matrices as superficially similar to different channels
and their learned kernels in convolutional layers. Both start with the same input, and generate
multiple representations of that input
simultaneously to reveal specific aspects of it.
In the next section, we will discuss the three different ways multi-head attention is
used in the Transformer architecture.
Applications of Multi-Head Attention [Top]
Multi-head attention is used in three ways in
Attention:
(1)
computing self-attention in the
encoder blocks, (2) computing
self-attention with masking in the decoder blocks, and (3
)
encoder-decoder attention within the decoder blocks.
We will first discuss self-attention, which refers to a situation where the keys,
queries, and values that are input into an attention function are one and the same matrix. The
usage of self-attention in the encoder and
decoder blocks is basically the same, but with just one notable caveat in the latter.
In the encoder layers, we perform the following operation: $$\text{MultiHead}(X, X, X)
= \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O$$ where, substituting $X$ for the query,
key, and value matrices in our computation
for each attention head, we get: $$\text{head}_i = \text{Attention}(XW_i^Q, XW_i^K, XW_i^V)$$
The decoder blocks do the same thing, but with one difference: they perform one additional step
in the attention function to ensure that
past elements in the target sequence can only attend to preceding elements. This is achieved by
masking inputs to the softmax function in scaled dot-product attention. Let $Q' =
XW_i^Q$, $K' = XW_i^K$, and $V' = XW_i^V$.
Normally, given $\text{head}_i$ defined just like above, we have: $$\text{Attention}(Q', K', V')
= \text{softmax}\left(\frac{Q'K'^T}{\sqrt{d_k}}\right)V'$$ Since the decoder processes the
target sequence, we need to prevent queries
from seeing keys that follow them in the sequence. In order to understand why, we can think back
to the
Before Transformers section. Consider the conditions of the model at
inference time. We start with a source
sequence $\vec{F}$ and perform a search to generate a target sequence $\vec{E}$ element by
element. At first, the model only has access to the source sequence itself to generate a
predicted first target element $e_0$. The next
iteration, the model has $\vec{F}$ and $e_0$ to work with in making a prediction for $e_1$, and
so on until it generates the
<EOS>
tag. At no point in generating
a target element $e_i$ does the model
have any access to information about future elements $e_j$ for $j > i$.
Masking the target sequence inputs during training emulates these conditions. During
training, the model uses the information in the $i^{th}$ row of its input matrix to make a
prediction about the $(i+1)^{th}$ target token.
In the first row, the only information we have is a null or empty token, as we only start with
the source sequence when generating the first target element
(note that the target sequence is shifted by one position to the
right - see architecture diagram). In the second row, we have
information about the empty token and the first
"real" target token to generate predictions for the second target token.
One way to mask the inputs is to simply add to the softmax argument a matrix $M$
consisting of 0's in its lower triangle and $-\infty$'s everywhere else:
$$\frac{1}{\sqrt{d_k}}Q'K'^T + M = \frac{1}{\sqrt{d_k}} \begin{pmatrix}
\vec{q'}_0 \cdot \vec{k'}_0 & \vec{q'}_0 \cdot \vec{k'}_1 & \cdots & \vec{q'}_0 \cdot \vec{k'}_n
\\ \vec{q'}_1 \cdot \vec{k'}_0 & \vec{q'}_1 \cdot \vec{k'}_1 & \cdots & \vec{q'}_1 \cdot
\vec{k'}_n \\ \vdots & \vdots & \ddots &
\vdots \\ \vec{q'}_n \cdot \vec{k'}_0 & \vec{q'}_n \cdot \vec{k'}_1 & \cdots & \vec{q'}_n \cdot
\vec{k'}_n \\ \end{pmatrix} + \begin{pmatrix} 0 & -\infty & -\infty & \cdots & -\infty & -\infty
\\ 0 & 0 & -\infty & \cdots & -\infty
& -\infty \\ \vdots & \vdots & \vdots & \vdots & \vdots & \vdots \\ 0 & 0 & 0 & \cdots & 0 &
-\infty \\ 0 & 0 & 0 & \cdots & 0 & 0 \\ \end{pmatrix} $$ $$ = \frac{1}{\sqrt{d_k}}
\begin{pmatrix} \vec{q'}_0 \cdot \vec{k'}_0 & -\infty
& -\infty & \cdots & -\infty \\ \vec{q'}_1 \cdot \vec{k'}_0 & \vec{q'}_1 \cdot \vec{k'}_1 &
-\infty & \cdots & -\infty \\ \vdots & \vdots & \ddots & \vdots \\ \vec{q'}_{n - 1} \cdot
\vec{k'}_0 & \vec{q'}_{n - 1} \cdot \vec{k'}_1
& \vec{q'}_{n - 1} \cdot \vec{k'}_2 & \cdots & -\infty \\ \vec{q'}_n \cdot \vec{k'}_0 &
\vec{q'}_n \cdot \vec{k'}_1 & \vec{q'}_n \cdot \vec{k'}_2 & \cdots & \vec{q'}_n \cdot \vec{k'}_n
\\ \end{pmatrix} $$ Then, running softmax
on each row has the effect of sending all of the $-\infty$ cells to 0, leaving only the valid
attention terms.
The third and final use of multi-head attention in the paper is Encoder-Decoder
Attention, which is used in the Decoder blocks directly after the Masked Multi-Head Attention
layer to relate the source and target sequences
to each other. Whereas in self-attention, all three inputs are the same matrix, this is not the
case here. For reference, below is multi-head attention written out: $$\text{MultiHead}(Q, K, V)
= \text{Concat}(\text{head}_1, ...,
\text{head}_h)W^O\text{, head}_i = f(Q, K, V)$$ When it comes to encoder-decoder attention, the
only difference from before is that $Q$ comes from the Masked Multi-Head Attention layer, while
$K$ and $V$ are the encoded representation
of $\vec{F}$. One way of thinking about this is that the model is able to ask a question about
how each position in the target sequence relates to the source, and obtain representations of
the source to use in generating the next
word in the target.
It is important to note that all of the decoder blocks receive the same data from the
encoder. From the first to the $N^{th}$ decoder block, each one takes in the encoded source
sequence to use as keys and values.
Conclusion [Top]
In this post, we went step-by-step through key components of the Transformer architecture
introduced in
Attention is All You Need. We saw how many of the ingredients that led to
the Transformer were developed from preceding
research on sequence transduction. We also explored the various attention mechanisms in the
paper, and in particular, how they utilize simple but effective matrix operations in order to
create representations of input sequences.
In the case of the Decoder blocks, we saw how multi-head attention is applied to relate the
encoded source sequence to each element of the target sequence.
Understanding these components from first principles is a crucial first step towards
learning about subsequent Transformer-based models. Just to go through the classic examples,
BERT is a series of Encoder blocks trained
on a cleverly designed objective function to account for the Encoder's inherent
bidirectionality; likewise, each of the
GPT models are
comprised of longer and longer series of Decoder blocks.
Work based on Transformers continues at a blistering pace, on both the research and
development fronts. Research into Transformer-based models continues to be exciting - recent
research has demonstrated that Transformers
are effective in various computer vision problems,
within GANs, and in areas that relate
natural language processing to CV, such as OpenAI's
DALL-E project. Research has also gone into making
Transformers smaller and/or more efficient.
As for development, researchers and practitioners continue to release large Transformer
models that are pretrained on large corpora, and can either be used out of the box or fine-tuned
on more specific use cases. Companies
such as
Hugging Face are developing open-source tools and
processes to make utilizing these models easier for researchers and engineers. And as the gap
between research and deployment shrinks with
each of these developments, the implications of the Transformer architecture will only continue
to grow.