Definition
A mechanism within Transformer architectures that computes a weighted representation of an input sequence by calculating the relevance of every token to every other token. In retrieval-augmented generation (RAG) and agentic workflows, it enables the model to resolve coreferences and capture nuanced dependencies across long-range context, though its computational cost grows quadratically, O(n²), with sequence length n.
Conceptual Overview
Self-attention builds a new representation of each token by scoring its relevance to every other token in the same sequence. Each token's output is a weighted sum over all tokens, with the weights obtained by applying a softmax to scaled dot products between query and key vectors. In RAG and agentic workflows, this lets the model resolve coreferences and capture nuanced dependencies across long-range context. Because every token is compared with every other token, the computational cost grows quadratically, O(n²), with sequence length n.
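The computation described above can be sketched in a few lines. This is a minimal illustration rather than a production implementation: it assumes the queries, keys, and values are the raw token vectors themselves (a real Transformer layer first applies learned projection matrices), and the names self_attention, softmax, and tokens are hypothetical. The n x n weights table is where the quadratic cost comes from.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x):
    """Scaled dot-product self-attention over a single sequence.

    x: list of n token vectors, each of dimension d.
    Assumption: queries, keys, and values are all x itself
    (no learned projections), to keep the sketch short.
    Returns (outputs, weights); weights has n rows of n entries.
    """
    d = len(x[0])
    weights = []
    for query in x:
        # Score this token against every token in the same sequence.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in x
        ]
        weights.append(softmax(scores))

    # Each output vector is the weighted sum of all value vectors.
    outputs = [
        [sum(w * value[j] for w, value in zip(row, x)) for j in range(d)]
        for row in weights
    ]
    return outputs, weights


# Hypothetical toy input: 3 tokens with 2-dimensional embeddings.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens)
```

Each row of weights sums to 1 and describes how strongly one token attends to every token in the sequence, itself included. Doubling the number of tokens quadruples the number of entries in weights.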
Disambiguation
Distinguish from Cross-Attention, which relates tokens between two different sequences; Self-Attention only relates tokens within the same sequence.
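The distinction shows up in where the queries and keys come from. The sketch below uses a hypothetical helper, attention_weights, with made-up toy sequences seq_a and seq_b, and again assumes no learned projections. Passing the same sequence twice gives self-attention; passing two different sequences gives cross-attention.

```python
import math


def attention_weights(queries, keys):
    """Softmax of scaled dot products: one row per query, one column per key."""
    d = len(keys[0])
    rows = []
    for query in queries:
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(d)
            for key in keys
        ]
        peak = max(scores)  # subtract the max for numerical stability
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        rows.append([e / total for e in exps])
    return rows


# Hypothetical toy sequences with 2-dimensional token vectors.
seq_a = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 3 tokens
seq_b = [[0.5, 0.5], [1.0, -1.0]]             # 2 tokens

# Self-attention: queries and keys come from the same sequence.
self_weights = attention_weights(seq_a, seq_a)

# Cross-attention: queries from seq_a, keys from a different sequence.
cross_weights = attention_weights(seq_a, seq_b)
```

Here self_weights has 3 rows of 3 entries (every token of seq_a against every token of seq_a), while cross_weights has 3 rows of 2 entries (every token of seq_a against every token of seq_b).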
Visual Analog
A high-powered searchlight that, for every word in a sentence, simultaneously illuminates all other words with varying intensities based on their relevance to the current word's meaning.