Gap Discussion
Attention Is All You Need
What did this paper leave unanswered? Identify research gaps, propose solutions, and connect with others building implementations.
2 Discussions
Priya Krishnamurthy
The paper shows Transformers outperform LSTMs on translation, but there's no discussion of inference efficiency. For sequences longer than 8K tokens, the O(n²) attention cost becomes prohibitive. FlashAttention addresses the memory side of this — it still performs O(n²) work but avoids materializing the full n×n score matrix — though it wasn't known in 2017. Has anyone benchmarked vanilla attention vs FlashAttention-2 at context lengths of 32K-100K? What's the practical crossover point for different hardware?
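The memory difference is easy to see in a toy NumPy sketch (not FlashAttention itself, which is a fused GPU kernel — just the online-softmax tiling idea it relies on). The naive version materializes the full (n, n) score matrix; the tiled version streams K/V blocks and keeps only O(n·d) state:

```python
import numpy as np

def naive_attention(q, k, v):
    # Materializes the full (n, n) score matrix: O(n^2) memory.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def tiled_attention(q, k, v, block=4):
    # Online-softmax tiling in the spirit of FlashAttention:
    # process K/V in blocks, keeping only running max/sum per query row,
    # so peak extra memory is O(n * block) instead of O(n^2).
    n, d = q.shape
    scale = 1.0 / np.sqrt(d)
    out = np.zeros_like(q)
    row_max = np.full(n, -np.inf)   # running max of scores per row
    row_sum = np.zeros(n)           # running softmax denominator per row
    for start in range(0, n, block):
        kb, vb = k[start:start + block], v[start:start + block]
        s = (q @ kb.T) * scale                    # (n, block) score tile
        new_max = np.maximum(row_max, s.max(axis=-1))
        correction = np.exp(row_max - new_max)    # rescale old accumulators
        p = np.exp(s - new_max[:, None])
        row_sum = row_sum * correction + p.sum(axis=-1)
        out = out * correction[:, None] + p @ vb
        row_max = new_max
    return out / row_sum[:, None]

rng = np.random.default_rng(0)
q, k, v = (rng.standard_normal((16, 8)) for _ in range(3))
print(np.allclose(naive_attention(q, k, v), tiled_attention(q, k, v)))  # True
```

Both paths compute the identical softmax(QKᵀ/√d)V; the crossover question in the post is about when the IO savings of the fused kernel start to dominate, which this CPU sketch can't answer.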
Aditya Nair
The positional encoding scheme in the paper uses fixed sinusoidal functions. The authors hypothesized these would extrapolate, but in practice they generalize poorly to sequence lengths beyond those seen during training. ALiBi, RoPE, and YaRN all address this differently. Which works best in practice for lengths significantly beyond the training distribution? Any ablation data?