Inside Transformers: Scaled Dot-Product Attention & the Role of Position

Dive into the heart of transformer layers with a step-by-step look at scaled dot-product attention and discover how adding positional embeddings lets models capture both meaning and order.

Many modern retrieval systems are built on transformer architectures, so understanding the transformer layer is essential to see how these models learn from our data. In this post, we will dive into the scaled dot-product mechanism and explore the crucial role of positional embeddings.

Scaled Dot-Product Attention in the transformer layer

In the self-attention mechanism of a transformer layer, the scaled dot-product operation is used to compute the compatibility between each pair of elements in the input sequence. This compatibility score determines how much attention a given token should pay to other tokens in the sequence.

The compatibility score is given by:

$$ e_{i,j} = \frac{(x_i W^Q)(x_j W^K)^T}{\sqrt{d_z}} $$
where:

  • $e_{i,j}$ represents the scaled compatibility score between the $i$-th and $j$-th elements in the sequence.
  • $x_i$ and $x_j$ are the input token embeddings.
  • $W^Q$ and $W^K$ are learned projection matrices that map the input tokens into separate query and key spaces.
  • $d_z$ is the dimension of the transformed representation; scaling by $\sqrt{d_z}$ prevents excessively large values that could destabilize the training process.
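As a concrete sketch of this formula, the full matrix of compatibility scores can be computed with a few matrix products. This uses NumPy with toy dimensions and random matrices chosen purely for illustration; none of these values come from the post.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions, chosen for illustration only
seq_len, d_model, d_z = 4, 8, 8

X = rng.normal(size=(seq_len, d_model))   # input token embeddings x_i (rows)
W_Q = rng.normal(size=(d_model, d_z))     # learned query projection W^Q
W_K = rng.normal(size=(d_model, d_z))     # learned key projection W^K

Q = X @ W_Q                               # q_i = x_i W^Q
K = X @ W_K                               # k_j = x_j W^K

# Scaled compatibility scores e_{i,j}, one per (i, j) pair
E = (Q @ K.T) / np.sqrt(d_z)
print(E.shape)  # one row of scores per query token
```

Each row `E[i]` holds token $i$'s scaled compatibility with every token in the sequence, which the softmax later turns into attention weights.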

Understanding the Scaled Dot-Product

We can break down what happens when we compute compatibility:

  1. Projecting each input into query and key spaces:
    Each input token $x_i$ is mapped into two distinct representations:
    $$ q_i = x_i W^Q, \qquad k_j = x_j W^K $$
    The query $q_i$ represents token $i$ when it attends to other tokens, while the key $k_j$ represents how token $j$ presents itself to be attended to.
  2. Measuring Similarity: The dot product between $q_i$ and $k_j$ quantifies their similarity. If the two vectors are highly aligned in the transformed space, they receive a higher score, indicating a stronger relationship.
  3. Scaling for Stability: Since the dot product values tend to grow with the embedding dimension $d_z$, dividing by $\sqrt{d_z}$ ensures that the values remain in a stable range, preventing extreme softmax outputs.
  4. Semantic Interpretation: By transforming the input embeddings into a new semantic space, the self-attention mechanism measures the relevance of each token to every other token. Therefore, tokens that are semantically closer will have higher compatibility scores, leading to stronger attention weights.
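The effect described in step 3 is easy to demonstrate. The sketch below (NumPy, with an arbitrary dimension picked only to make the effect visible) compares softmax outputs computed from raw versus scaled dot products:

```python
import numpy as np

rng = np.random.default_rng(0)
d_z = 512  # an illustrative embedding dimension

def softmax(v):
    w = np.exp(v - v.max())  # subtract the max for numerical stability
    return w / w.sum()

q = rng.normal(size=d_z)         # one query vector
K = rng.normal(size=(4, d_z))    # four key vectors

raw = K @ q                  # dot products grow with d_z
scaled = raw / np.sqrt(d_z)  # scaled scores stay in a stable range

# Without scaling, the softmax concentrates almost all mass on one token;
# with scaling, the attention distribution stays noticeably smoother.
print(softmax(raw).max(), softmax(scaled).max())
```

Because dividing by $\sqrt{d_z}$ only shrinks the logits, the scaled softmax is always flatter than the unscaled one, which keeps gradients from vanishing on all but the top-scoring token.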

Decomposing the scaled dot-product with positional embeddings

A key limitation of vanilla self-attention is its lack of positional awareness. Word order is crucial in natural language, so positional embeddings were introduced. We represent the inputs as:
$$ X = X_e + X_p $$
where:

  • $X_e$ represents the token embeddings
  • $X_p$ represents the positional embeddings.

Substituting $X$ into the attention formula (omitting the $\sqrt{d_z}$ scaling for brevity):
$$ (X_e + X_p)W^Q\left((X_e+X_p)W^K\right)^T = (X_e + X_p)W^Q (W^K)^T(X_e+X_p)^T $$
Distributing the terms:
$$ \left(X_eW^Q(W^K)^T + X_pW^Q(W^K)^T\right)(X_e+X_p)^T $$
Further expanding:
$$ X_eW^Q(W^K)^T X_e^T + X_eW^Q(W^K)^T X_p^T + X_pW^Q(W^K)^T X_e^T + X_pW^Q(W^K)^T X_p^T $$
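This expansion is a purely algebraic identity, so it can be checked numerically. The sketch below (NumPy, random toy matrices chosen only for illustration) confirms that the four terms sum back to the full attention product:

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 4, 8  # toy sizes for illustration

X_e = rng.normal(size=(seq_len, d))  # token embeddings
X_p = rng.normal(size=(seq_len, d))  # positional embeddings
W_Q = rng.normal(size=(d, d))        # query projection
W_K = rng.normal(size=(d, d))        # key projection

X = X_e + X_p
full = (X @ W_Q) @ (X @ W_K).T       # (X W^Q)(X W^K)^T before scaling

M = W_Q @ W_K.T                      # shared middle factor W^Q (W^K)^T
tok_tok = X_e @ M @ X_e.T            # token-to-token similarity
tok_pos = X_e @ M @ X_p.T            # token attends to position
pos_tok = X_p @ M @ X_e.T            # position attends to token
pos_pos = X_p @ M @ X_p.T            # position-to-position bias

print(np.allclose(full, tok_tok + tok_pos + pos_tok + pos_pos))  # True
```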


These four terms each play a distinct role:

  1. Token-to-Token Similarity
    $$ X_eW^Q(W^K)^T X_e^T $$
    This is the core attention term that captures the similarity between token embeddings. It determines how much attention each token pays to others based purely on semantic meaning.
  2. Token-Position Interaction (Two Cross Terms)
    $$ X_eW^Q(W^K)^T X_p^T + X_pW^Q(W^K)^T X_e^T $$
    These cross terms encode interactions between token embeddings and positional embeddings, meaning that where a token sits in the sequence influences how much attention it gives to, and receives from, other tokens.
  3. Position-to-Position Similarity
    $$ X_pW^Q(W^K)^T X_p^T $$
    This term captures positional relationships independent of token content, introducing an attention bias based purely on the positions of the tokens.

This decomposition highlights why positional embeddings are essential: token-only models like bag-of-words ignore order. Positional embeddings ensure self-attention captures sequence structure, blending semantic (token) and syntactic (position) information in a simple yet powerful extension of the scaled dot-product. In retrieval systems, where relevance depends not just on which terms appear but on their context and order, this ability to blend meaning and position is what lets transformer-based ranking models outperform traditional approaches.

What's Next?
In our next post, we will dive into advanced positional mechanisms. We will explore relative positional embeddings, which let tokens attend based on their relative distances rather than absolute positions, and rotary positional embeddings, which encode position by rotating the query and key vectors, helping the model better track token order.