Attention Is All You Need
Ashish Vaswani, Noam Shazeer, Niki Parmar et al.
2017
The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.
The Transformer Architecture
The Transformer model departs entirely from recurrent and convolutional architectures, relying solely on attention mechanisms. The key innovation is the Multi-Head Self-Attention mechanism.
Scaled Dot-Product Attention
The attention function maps a query and a set of key-value pairs to an output:
$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$
Where $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, $V \in \mathbb{R}^{m \times d_v}$.
The scaling factor $\frac{1}{\sqrt{d_k}}$ counteracts the growth of dot products with $d_k$: for large $d_k$, unscaled dot products grow large in magnitude and push the softmax into saturated regions where its gradients become extremely small.
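The attention equation above can be sketched directly in NumPy; this is a minimal illustration, not the paper's implementation (function and variable names are my own):

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Compute softmax(Q K^T / sqrt(d_k)) V.

    Q: (n, d_k) queries, K: (m, d_k) keys, V: (m, d_v) values.
    Returns an (n, d_v) output: each query gets a weighted
    average of the value rows.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                    # (n, m)
    # Subtract the row-wise max before exponentiating for a
    # numerically stable softmax.
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)     # rows sum to 1
    return weights @ V                                 # (n, d_v)
```

With uniform inputs every key receives equal weight, so the output is the plain average of the value rows.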
Multi-Head Attention
Instead of applying a single attention function, multi-head attention linearly projects queries, keys, and values $h$ times with learned projections:
$$\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O$$
$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
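A hedged NumPy sketch of the two equations above, reusing the attention function from before; the dimensions (d_model = 64, h = 8) are illustrative rather than the paper's base configuration, and the weight layout is one possible choice:

```python
import numpy as np

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V with a stable softmax
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ V

def multi_head(Q, K, V, Wq, Wk, Wv, Wo):
    """Project Q, K, V h times, attend per head, concat, project back.

    Wq, Wk, Wv: (h, d_model, d_k) stacks of per-head projections.
    Wo: (h * d_k, d_model) output projection.
    """
    heads = [attention(Q @ wq, K @ wk, V @ wv)
             for wq, wk, wv in zip(Wq, Wk, Wv)]
    return np.concatenate(heads, axis=-1) @ Wo

# Illustrative self-attention call (Q = K = V = X):
rng = np.random.default_rng(0)
h, d_model = 8, 64
d_k = d_model // h
Wq = rng.normal(size=(h, d_model, d_k))
Wk = rng.normal(size=(h, d_model, d_k))
Wv = rng.normal(size=(h, d_model, d_k))
Wo = rng.normal(size=(h * d_k, d_model))
X = rng.normal(size=(5, d_model))          # 5 token embeddings
out = multi_head(X, X, X, Wq, Wk, Wv, Wo)  # (5, d_model)
```

Each head attends in its own lower-dimensional subspace ($d_k = d_{model}/h$), which is why running $h$ heads costs roughly the same as one full-width attention.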
Positional Encoding
Since the model contains no recurrence or convolution, positional information is injected using sine/cosine functions:
$$PE_{(pos,2i)} = \sin(pos / 10000^{2i/d_{model}})$$ $$PE_{(pos,2i+1)} = \cos(pos / 10000^{2i/d_{model}})$$
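The sine/cosine encodings can be generated in a few lines of NumPy; a minimal sketch of the two formulas above (function name is my own):

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Return a (max_len, d_model) matrix of sinusoidal encodings.

    Even dimensions get sin(pos / 10000^(2i/d_model)),
    odd dimensions get cos of the same angle.
    """
    pos = np.arange(max_len)[:, None]            # (max_len, 1)
    i = np.arange(0, d_model, 2)[None, :]        # even dimension indices
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe
```

At position 0 the sine entries are 0 and the cosine entries are 1; each dimension pair traces a sinusoid of a different wavelength, so nearby positions get similar encodings.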
Training Details
- Dataset: WMT 2014 English-German (4.5M sentence pairs)
- Hardware: 8 P100 GPUs for 12 hours (base) / 3.5 days (big)
- Optimizer: Adam with warmup schedule
- BLEU score: 28.4 (EN-DE), 41.0 (EN-FR) — new state of the art
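The warmup schedule mentioned above follows the paper's formula, which raises the learning rate linearly for the first warmup steps and then decays it proportionally to the inverse square root of the step number. A minimal sketch (the default warmup_steps value matches the paper's setting):

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """Learning rate = d_model^-0.5 * min(step^-0.5, step * warmup_steps^-1.5).

    Increases linearly for the first warmup_steps steps,
    then decays as 1/sqrt(step). Peaks at step == warmup_steps.
    """
    step = max(step, 1)  # avoid division by zero at step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

This schedule is what makes Adam stable here: the tiny initial rates let the layer-normalized residual stack settle before the rate peaks and begins its slow decay.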
References
- Hochreiter & Schmidhuber (1997). Long Short-Term Memory.
- Bahdanau, Cho & Bengio (2015). Neural Machine Translation by Jointly Learning to Align and Translate.
- Gehring et al. (2017). Convolutional Sequence to Sequence Learning.
- Ba et al. (2016). Layer Normalization.
- Kingma & Ba (2015). Adam: A Method for Stochastic Optimization.