Gap Discussion

Attention Is All You Need

What did this paper leave unanswered? Identify research gaps, propose solutions, and connect with others building implementations.

2 Discussions
PK
Priya Krishnamurthy
Nov 15, 2024
47

The paper shows Transformers outperform recurrent models on translation, but there's no discussion of inference efficiency. For sequences longer than 8K tokens, the O(n²) attention complexity becomes prohibitive. FlashAttention mitigates the memory cost (compute is still O(n²)), but it didn't exist in 2017. Has anyone benchmarked vanilla attention vs FlashAttention-2 at context lengths of 32K-100K? What's the practical crossover point for different hardware?
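A quick way to see why the crossover matters is to compute the size of the full n×n score matrix that vanilla attention materializes and FlashAttention avoids. A back-of-envelope sketch (assumes fp16 storage, per head, per batch element; the constants are illustrative, not a benchmark):

```python
# Memory for the full n x n attention score matrix that vanilla
# attention materializes. FlashAttention tiles the computation and
# never stores this matrix, which is why it wins at long context.
BYTES_PER_ELEM = 2  # fp16 assumed

def attn_matrix_gib(seq_len: int, num_heads: int = 1) -> float:
    """GiB needed to store the seq_len x seq_len score matrix."""
    return seq_len * seq_len * num_heads * BYTES_PER_ELEM / (1024 ** 3)

for n in (8_192, 32_768, 131_072):
    print(f"n={n:>7}: {attn_matrix_gib(n):8.2f} GiB per head")
# n=   8192:     0.12 GiB per head
# n=  32768:     2.00 GiB per head
# n= 131072:    32.00 GiB per head
```

Multiply by head count and batch size and the 32K-100K range blows past typical accelerator memory, which is consistent with Priya's question about where the practical crossover sits on real hardware.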

AN
Aditya Nair
Has PoC
Nov 12, 2024
38

The positional encoding scheme in the paper uses fixed sinusoidal functions. This fundamentally limits generalization to sequence lengths beyond those seen during training. ALiBi, RoPE, and YaRN all address this differently. Which works best in practice for lengths significantly beyond the training distribution? Any ablation data?
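For reference, the fixed encoding Aditya is referring to can be sketched in a few lines; this is a minimal pure-Python version of the paper's formula (assumes an even d_model, interleaved sin/cos layout):

```python
import math

def sinusoidal_pe(pos: int, d_model: int) -> list[float]:
    """Fixed positional encoding from the paper:
    PE(pos, 2i)   = sin(pos / 10000^(2i/d_model))
    PE(pos, 2i+1) = cos(pos / 10000^(2i/d_model))
    """
    pe = []
    for i in range(d_model // 2):
        freq = 1.0 / (10000 ** (2 * i / d_model))
        pe.append(math.sin(pos * freq))  # even dimension
        pe.append(math.cos(pos * freq))  # odd dimension
    return pe

# Position 0 encodes to alternating [0.0, 1.0, 0.0, 1.0, ...]
print(sinusoidal_pe(0, 8))
```

The function is defined for any integer pos, so nothing breaks numerically beyond training length; the limitation is that the model never learned to attend over the frequency patterns of unseen positions, which is exactly the gap RoPE, ALiBi, and YaRN attack in different ways.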