Gap Discussion
Attention Is All You Need
What did this paper leave unanswered? Identify research gaps, propose solutions, and connect with others building implementations.
2 Discussions
Priya Krishnamurthy
The paper shows Transformers outperform LSTMs on translation, but there's no discussion of inference efficiency. For sequences longer than 8K tokens, the O(n²) attention cost becomes prohibitive. FlashAttention addresses the memory side of this — it still performs O(n²) work but avoids materializing the full n×n score matrix — though it wasn't known in 2017. Has anyone benchmarked vanilla attention vs FlashAttention-2 at context lengths of 32K-100K? What's the practical crossover point for different hardware?
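The memory difference is easy to see in a toy NumPy sketch (not FlashAttention itself, which is a fused GPU kernel — just the online-softmax tiling idea it relies on). The naive version materializes the full (n, n) score matrix; the tiled version streams K/V blocks and keeps only O(n·d) state:

```python
import numpy as np

def naive_attention(q, k, v):
    # Materializes the full (n, n) score matrix: O(n^2) memory.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, block=4):
    # Online-softmax tiling in the spirit of FlashAttention:
    # process K/V in blocks, keeping only running max/sum per query row,
    # so peak extra memory is O(n * block) instead of O(n^2).
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)
    row_max = np.full(n, -np.inf)   # running max of scores per row
    row_sum = np.zeros(n)           # running softmax denominator per row
    for start in range(0, n, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                    # (n, block) score tile
        new_max = np.maximum(row_max, s.max(axis=-1))
        correction = np.exp(row_max - new_max)    # rescale old accumulators
        p = np.exp(s - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=-1)
        out = out * correction[:, None] + p @ vb
        row_max = new_max
    return out / row_sum[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
print(np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v)))  # True
```

Both paths compute the identical softmax(QKᵀ/√d)V; the crossover question in the post is about when the IO savings of the fused kernel start to dominate, which this CPU sketch can't answer.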
Aditya Nair
The positional encoding scheme in the paper uses fixed sinusoidal functions. The authors hypothesized these would extrapolate, but in practice they generalize poorly to sequence lengths beyond those seen during training. ALiBi, RoPE, and YaRN all address this differently. Which works best in practice for lengths significantly beyond the training distribution? Any ablation data?