In 2017, a paper titled “Attention Is All You Need” revolutionized machine learning.1 Vaswani et al., “Attention Is All You Need” (2017). The title itself is a philosophical statement. The Transformer architecture the paper introduced now underlies essentially every major language model, powering everything from GPT to BERT to LLaMA to the AI assistants we talk to daily.

But beyond its technical brilliance, the attention mechanism offers a surprisingly profound insight about how intelligence might work.

The Core Idea

Traditional neural networks processed sequences step by step, maintaining a hidden state that theoretically encoded everything that came before. The problem? Information had to survive a long game of telephone.2 RNNs and LSTMs suffered from the vanishing gradient problem—information from early in a sequence would get diluted as it passed through many layers. Attention solved this by allowing direct connections.

Attention said: forget that. Let every part of the input directly communicate with every other part.

# Simplified self-attention: every word attends to every other word
# (a runnable NumPy version of the original pseudocode)
import numpy as np

def attention(X):  # X: (seq_len, d) array of word vectors
    scores = X @ X.T  # similarity of each word with every other word
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)  # softmax rows
    return weights @ X  # each row is a relevance-weighted sum of all words

Instead of a linear memory that degrades over distance, attention creates a fully connected graph where any piece of information can directly influence any other.
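For reference, the attention actually used in the Transformer first projects each word into query, key, and value vectors, and scales the scores by the square root of the key dimension before the softmax. A minimal NumPy sketch, with random matrices standing in for the learned projection weights:

import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def scaled_dot_product_attention(X, Wq, Wk, Wv):
    Q, K, V = X @ Wq, X @ Wk, X @ Wv         # queries, keys, values
    scores = Q @ K.T / np.sqrt(K.shape[-1])  # scaled similarity scores
    return softmax(scores) @ V               # relevance-weighted sum of values

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 16))                 # five "words", 16-dim embeddings
Wq, Wk, Wv = (rng.normal(size=(16, 16)) for _ in range(3))
print(scaled_dot_product_attention(X, Wq, Wk, Wv).shape)  # -> (5, 16)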

Why This Matters Beyond ML

Here’s what strikes me: the attention mechanism isn’t just computationally useful—it’s a theory of understanding.

1. Context Is Everything

The word “bank” means something different in “river bank” vs. “bank account.” Attention allows the model to attend to relevant context to disambiguate meaning.
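To see this in miniature, here is a toy demonstration with hand-built two-dimensional embeddings, invented purely for illustration (one axis loosely “nature,” the other “finance”). The ambiguous “bank” vector sits between the two, and attention pulls it toward whichever sense the context supports:

import numpy as np

def attend(X):  # plain self-attention, as in the sketch above
    scores = X @ X.T
    weights = np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)
    return weights @ X

river = np.array([2.0, 0.0])    # strongly "nature"
account = np.array([0.0, 2.0])  # strongly "finance"
bank = np.array([1.0, 1.0])     # ambiguous: a bit of both

print(attend(np.stack([river, bank]))[1])    # -> [1.5, 0.5]: bank tilts toward "nature"
print(attend(np.stack([bank, account]))[0])  # -> [0.5, 1.5]: bank tilts toward "finance"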

Humans do this constantly. We understand sentences not as isolated word sequences but as webs of interconnected meaning. Language is inherently contextual.3 This is what Wittgenstein meant by “meaning is use”—words don’t have fixed meanings in isolation, only in context. Attention mechanisms operationalize this philosophical insight.

2. Relevance Is Dynamic

The attention weights aren’t fixed. They’re computed per input, meaning the model dynamically decides what’s important based on what it’s looking at right now.
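A quick way to see this: the weight matrix below is not a stored parameter. It is recomputed for every input, so changing one word shifts how every other word distributes its attention. (The vectors here are random placeholders.)

import numpy as np

def attention_weights(X):
    scores = X @ X.T
    return np.exp(scores) / np.exp(scores).sum(axis=-1, keepdims=True)

rng = np.random.default_rng(1)
X = rng.normal(size=(4, 8))     # a four-word input
Y = X.copy()
Y[0] = rng.normal(size=8)       # swap out only the first "word"

print(attention_weights(X)[2])  # how word 2 spreads its attention...
print(attention_weights(Y)[2])  # ...changes, even though word 2 itself didn't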

This mirrors something in cognitive science: relevance is contextual. What matters in one situation might be irrelevant in another. Attention isn’t a fixed filter—it’s adaptive.

3. Not All Information Is Equal

Some words in a sentence matter more than others. Attention learns to focus on what’s important and effectively ignore what isn’t.

"The quick brown fox jumps over the lazy dog"
          ↓         ↓              ↓
        [lower]   [HIGH]         [medium]

This selective focus is the essence of attention—both artificial and human.

The Philosophical Rabbit Hole

Cognitive scientists have studied attention for over a century. William James wrote in 1890:

“Everyone knows what attention is. It is taking possession by the mind, in clear and vivid form, of one out of what seem several simultaneously possible objects or trains of thought.” — William James, The Principles of Psychology (1890)

The Transformer attention mechanism is a mathematical approximation of this intuition. It’s not just that it works well—it works well because it captures something real about how understanding functions.5 This doesn’t mean Transformers “understand” in the human sense. But the architectural insight—that understanding requires dynamic, context-dependent weighting of information—seems genuinely deep.

Multi-Head Attention: Multiple Perspectives

Transformers don’t use a single attention mechanism—they use multi-head attention: multiple attention mechanisms running in parallel, each potentially learning to focus on a different aspect:

  • One head might focus on syntax
  • Another on semantic similarity
  • Another on positional relationships
  • Another on coreference (what “it” refers to)

It’s like having multiple experts, each paying attention to what they specialize in, then combining their insights.6 Research by Olah, Clark et al. at Anthropic has shown that different attention heads do specialize. Some track syntax, others semantics, others handle specific linguistic phenomena like negation or quotation.
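A minimal sketch of the mechanics, assuming random matrices in place of learned projections (real implementations also add a final output projection):

import numpy as np

def softmax(s):
    e = np.exp(s - s.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(X, n_heads=4):
    n, d = X.shape
    dk = d // n_heads                     # each head works in a smaller subspace
    rng = np.random.default_rng(42)
    heads = []
    for _ in range(n_heads):              # each head gets its own projections
        Wq, Wk, Wv = (rng.normal(size=(d, dk)) for _ in range(3))
        Q, K, V = X @ Wq, X @ Wk, X @ Wv
        heads.append(softmax(Q @ K.T / np.sqrt(dk)) @ V)
    return np.concatenate(heads, axis=-1) # combine the heads' perspectives

X = np.random.default_rng(0).normal(size=(5, 16))
print(multi_head_attention(X).shape)      # -> (5, 16): same shape, four viewpoints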

Does this remind you of anything? It reminds me of how humans can hold multiple perspectives simultaneously, considering a problem from different angles before synthesizing a response.

Attention as Bottleneck

Here’s another insight: attention is fundamentally about managing scarcity.

You can’t process everything equally. There’s too much information. So you have to choose—what deserves focus? Attention is the mechanism that makes this choice.
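The softmax at the heart of attention makes this scarcity literal: each word’s attention weights are forced to sum to 1, so attending more to one thing necessarily means attending less to everything else. A tiny demonstration:

import numpy as np

def softmax(s):
    e = np.exp(s - s.max())
    return e / e.sum()

scores = np.array([1.0, 2.0, 3.0])
print(softmax(scores).sum())  # sums to 1: a fixed budget of attention

scores[2] += 2.0              # one item becomes more relevant...
print(softmax(scores))        # ...and every other weight shrinks to pay for it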

In a world of infinite information, the scarce resource isn’t data—it’s attention. The bottleneck isn’t compute—it’s relevance.7 This connects to Herbert Simon’s observation: “A wealth of information creates a poverty of attention.” The Transformer architecture is, in a sense, an engineering solution to this fundamental problem.

What Attention Doesn’t Explain

Let’s not overstate the case. Attention mechanisms:

  • Don’t explain consciousness
  • Don’t necessarily mean the model “understands” anything
  • Are mathematical operations, not magic
  • Don’t address grounding, embodiment, or real-world causation

But they do offer a computational theory of a cognitive process. And that’s valuable even if it’s not the whole story.

The Meta-Lesson

The reason Transformers work so well might be that they capture a fundamental truth: intelligence is less about raw processing power and more about knowing what to focus on.

This is true for machines. It’s true for humans. It might be a universal principle of cognition.


And now, the irony: you, a biological attention system, have chosen to attend to these words about artificial attention systems. Meta.


Further Reading


Changelog

  • 2026-01-03: Initial draft
  • 2026-01-29: Added sidenotes with citations, expanded philosophical connections, added James quote, improved Further Reading section