Tags: transformers · attention · nlp · sequence-to-sequence · architecture

Attention Is All You Need

Ashish Vaswani, Noam Shazeer, Niki Parmar et al.

2017

The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely.



The Transformer Architecture

The Transformer model departs entirely from recurrent and convolutional architectures, relying solely on attention mechanisms. The key innovation is the Multi-Head Self-Attention mechanism.

Scaled Dot-Product Attention

The attention function maps a query and a set of key-value pairs to an output:

$$\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V$$

Where $Q \in \mathbb{R}^{n \times d_k}$, $K \in \mathbb{R}^{m \times d_k}$, $V \in \mathbb{R}^{m \times d_v}$.

For large $d_k$, the dot products grow large in magnitude, pushing the softmax into regions with extremely small gradients; the scaling factor $\frac{1}{\sqrt{d_k}}$ counteracts this.
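The attention equation above can be sketched directly in NumPy. Subtracting the row maximum before exponentiating is a standard numerical-stability trick, not part of the paper's formulation:

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """softmax(Q K^T / sqrt(d_k)) V.

    Q: (n, d_k), K: (m, d_k), V: (m, d_v)  ->  output: (n, d_v)
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                 # (n, m) similarity logits
    scores -= scores.max(axis=-1, keepdims=True)    # stability: softmax is shift-invariant
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # row-wise softmax
    return weights @ V
```

A quick sanity check of the definition: when all keys are identical, every query attends uniformly, so each output row is simply the mean of the value rows.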

Multi-Head Attention

Instead of applying a single attention function, multi-head attention linearly projects queries, keys, and values $h$ times with learned projections:

$$\text{MultiHead}(Q,K,V) = \text{Concat}(\text{head}_1, ..., \text{head}_h)W^O$$

$$\text{head}_i = \text{Attention}(QW_i^Q, KW_i^K, VW_i^V)$$
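A minimal sketch of the two equations above, using the paper's base configuration ($d_{model} = 512$, $h = 8$, so $d_k = d_v = 64$); the projection matrices are randomly initialized here purely for illustration, whereas in the model they are learned:

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, h = 512, 8          # base config; d_k = d_v = d_model / h = 64
d_k = d_model // h

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def attention(Q, K, V):
    return softmax(Q @ K.T / np.sqrt(Q.shape[-1])) @ V

# Per-head projections W_i^Q, W_i^K, W_i^V and output projection W^O
# (random stand-ins for the learned parameters).
W_Q = rng.standard_normal((h, d_model, d_k))
W_K = rng.standard_normal((h, d_model, d_k))
W_V = rng.standard_normal((h, d_model, d_k))
W_O = rng.standard_normal((h * d_k, d_model))

def multi_head(Q, K, V):
    heads = [attention(Q @ W_Q[i], K @ W_K[i], V @ W_V[i]) for i in range(h)]
    return np.concatenate(heads, axis=-1) @ W_O   # (n, d_model)
```

Because each head projects down to $d_k = d_{model}/h$, the total cost is similar to single-head attention with full dimensionality, while letting each head attend to different representation subspaces.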

Positional Encoding

Since the model contains no recurrence or convolution, positional information is injected using sine/cosine functions:

$$PE_{(pos,2i)} = \sin(pos / 10000^{2i/d_{model}})$$

$$PE_{(pos,2i+1)} = \cos(pos / 10000^{2i/d_{model}})$$
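The two formulas interleave sines (even dimensions) and cosines (odd dimensions) across the embedding. A minimal vectorized sketch, assuming $d_{model}$ is even:

```python
import numpy as np

def positional_encoding(max_len, d_model):
    """Sinusoidal positional encodings, shape (max_len, d_model)."""
    pos = np.arange(max_len)[:, None]                  # (max_len, 1)
    i = np.arange(d_model // 2)[None, :]               # (1, d_model/2)
    angles = pos / np.power(10000.0, 2 * i / d_model)  # (max_len, d_model/2)
    pe = np.empty((max_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # even dimensions: sine
    pe[:, 1::2] = np.cos(angles)   # odd dimensions: cosine
    return pe
```

Each dimension corresponds to a sinusoid of a different wavelength, so relative offsets become linear functions of the encodings.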

Training Details

  • Dataset: WMT 2014 English-German (4.5M sentence pairs)
  • Hardware: 8 P100 GPUs for 12 hours (base) / 3.5 days (big)
  • Optimizer: Adam with warmup schedule
  • BLEU score: 28.4 (EN-DE), 41.0 (EN-FR), a new state of the art at publication
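The warmup schedule the paper pairs with Adam increases the learning rate linearly for the first `warmup_steps` steps, then decays it proportionally to the inverse square root of the step number; 4000 warmup steps and $d_{model} = 512$ are the paper's settings:

```python
def transformer_lr(step, d_model=512, warmup_steps=4000):
    """lrate = d_model^{-0.5} * min(step^{-0.5}, step * warmup_steps^{-1.5})."""
    step = max(step, 1)  # guard against step 0
    return d_model ** -0.5 * min(step ** -0.5, step * warmup_steps ** -1.5)
```

The two branches of the `min` cross exactly at `step == warmup_steps`, which is where the learning rate peaks.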