Beyond Absolute Positional Embeddings with Relative and Rotary Methods
This post explores how positional embeddings evolved from absolute to relative to rotary forms, showing how each approach helps transformers capture sequence order and relationships more effectively while balancing flexibility, efficiency, and model complexity.
Position embedding is a simple yet clever extension to a vanilla transformer. It allows the same token to have slightly different embeddings depending on its position in the sequence. When we decompose the scaled dot-product attention score (more details in this post), we find that it breaks into three kinds of interaction: token-to-token similarity, token-to-position interactions, and position-to-position similarity. This breakdown gives us insight into how token semantics and positional information interplay.
In the original transformer paper, positional embeddings were introduced to assign a fixed embedding to each position in the sequence. This method is known as absolute positional embedding. It’s simple and effective, but not without limitations.
Limitation of Absolute Positional Embeddings
A key limitation of absolute positional embeddings is that they treat each position in the sequence as a fixed, independent location. While this does introduce a sense of order into the model, it fails to capture the relative distance or relationship between tokens.
This becomes problematic in natural language, where meaning often depends more on how close tokens are to each other than on their exact positions. Take the phrase "I think" as an example. These two words frequently appear together and form a coherent expression, but they don’t always occur at the beginning of a sequence:
- I think it’s going to rain.
- She, I think, will enjoy the concert.
In both cases, "I" and "think" appear close to each other, but not at fixed positions. Their connection is relative. With absolute positional embeddings, the model treats position 1 and position 2 as entirely different contexts, even when the semantic relationship between the tokens remains unchanged.
This lack of flexibility makes it harder for the model to generalize recurring patterns across different parts of a sentence or sequences of varying lengths.
Relative Positional Embedding
To address this limitation, relative positional embeddings were introduced. They directly model the distance between tokens, regardless of their absolute positions in the sequence.
To see how relative position can be incorporated into the scaled dot-product attention, let's start with the attention score.
Given tokens $x_i$ and $x_j$, the attention score between them is:
$$ \text{score}_{i,j} = (x_iW_Q)(x_jW_K)^T = \big((x_{e_i} + x_{p_i})W_Q\big)\big((x_{e_j} + x_{p_j})W_K\big)^T $$
Here:
- $x_{e_i}$ is the token embedding at position $i$
- $x_{p_i}$ is the positional embedding at position $i$
- $x_i = x_{e_i} + x_{p_i}$ is their sum, the input to attention (and analogously for position $j$)
Expanding the terms:
$$ \text{score}_{i,j} = x_{e_i}W_QW_K^Tx_{e_j}^T + x_{e_i}W_QW_K^Tx_{p_j}^T + x_{p_i}W_QW_K^Tx_{e_j}^T + x_{p_i}W_QW_K^Tx_{p_j}^T $$
- The first term represents token-to-token similarity.
- The remaining terms capture position-related interactions.
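To make this decomposition concrete, here is a minimal NumPy sketch (toy dimension, random stand-in weights, nothing tied to a specific model) that computes the score both ways and checks that they agree:

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8  # embedding dimension (illustrative)

# Token and positional embeddings at positions i and j (random stand-ins)
x_e_i, x_p_i = rng.normal(size=d), rng.normal(size=d)
x_e_j, x_p_j = rng.normal(size=d), rng.normal(size=d)
W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))

# The attention inputs are the sums of token and positional embeddings
x_i, x_j = x_e_i + x_p_i, x_e_j + x_p_j

# Direct form: (x_i W_Q)(x_j W_K)^T
score = (x_i @ W_Q) @ (x_j @ W_K)

# Expanded form: token-token + token-position + position-token + position-position
M = W_Q @ W_K.T
expanded = (x_e_i @ M @ x_e_j + x_e_i @ M @ x_p_j
            + x_p_i @ M @ x_e_j + x_p_i @ M @ x_p_j)

assert np.isclose(score, expanded)  # both forms give the same score
```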
Shaw et al.'s Relative Positional Approach
Researchers at Google (Shaw et al.) proposed removing the need to explicitly compute and project separate positional embeddings. Instead, they introduced a single learned embedding that depends only on the relative position between tokens.
The modified attention score is:
$$ \text{score}_{i,j} = x_{e_i}W_Q(x_{e_j}W_K + a_{i,j})^T $$
Where:
- $x_{e_i} \in \mathbb{R}^d$ : token embedding at position $i$
- $W_Q, W_K \in \mathbb{R}^{d \times d}$ : learned projection matrices
- $a_{i,j} \in \mathbb{R}^d$ : a relative positional embedding vector, parameterized by the offset $j-i$
To simplify notation, the authors write $a_{i,j} = w_{j-i}$, where $w_{j-i}$ is the learned embedding for the relative offset $j-i$. The attention score then becomes:
$$ \text{score}_{i,j} = x_{e_i}W_Q (x_{e_j}W_K)^T + x_{e_i}W_Q w_{j-i}^T $$
- The first term is the semantic similarity: the standard attention term, measuring how well token $i$ attends to token $j$ based on meaning.
- The second term is a relative positional bias. It adjusts attention based on token distance or direction; for example, tokens may attend more to nearby tokens than to distant ones.
- Because $w_{j-i}$ is indexed by relative distance, this embedding is shared across all token pairs with the same offset, as sketched below.
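A rough NumPy sketch of this idea follows (toy sizes; the offset clipping and random initialization are illustrative assumptions, not the paper's settings). Each pair $(i, j)$ gets a bias from a shared table of per-offset vectors:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, max_dist = 6, 8, 4  # sequence length, embedding dim, clipping distance (illustrative)

X = rng.normal(size=(n, d))                   # token embeddings x_e (no positions added)
W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))
w = rng.normal(size=(2 * max_dist + 1, d))    # one vector per clipped relative offset j - i

Q, K = X @ W_Q, X @ W_K
offsets = np.arange(n)[None, :] - np.arange(n)[:, None]   # offsets[i, j] = j - i
idx = np.clip(offsets, -max_dist, max_dist) + max_dist    # shift into [0, 2 * max_dist]

# score[i, j] = x_e_i W_Q (x_e_j W_K)^T + x_e_i W_Q . w_{j-i}
scores = Q @ K.T + np.einsum('id,ijd->ij', Q, w[idx])
scores /= np.sqrt(d)                          # usual scaled dot-product normalization
```

Because the bias is looked up by offset rather than computed per pair, the extra cost is just a single table of `2 * max_dist + 1` vectors.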
DeBERTa's Disentangled Attention
The DeBERTa model extends the relative position idea by letting both query and key vectors interact with relative position embeddings. The modified attention score is:
$$ \text{score}_{i,j} = q _i \cdot k_j + q_i \cdot r_{i-j} + r_{i-j} \cdot k_j $$
Here:
- $r_{i-j}$ is the relative position embedding
- The first term is a standard token-to-token interaction.
- The second and third terms are query-to-position and position-to-key interactions, respectively.
This design makes the bias term more expressive, enabling richer modeling of how context and position interact. However, the trade-off is higher computational cost and more parameters.
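The sketch below follows the simplified score written above; the actual DeBERTa model additionally projects the relative embeddings with their own matrices, which is omitted here, so treat this as an illustration rather than the model's implementation:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, max_dist = 6, 8, 4
X = rng.normal(size=(n, d))                   # content (token) embeddings
W_Q, W_K = rng.normal(size=(d, d)), rng.normal(size=(d, d))
R = rng.normal(size=(2 * max_dist + 1, d))    # relative position embedding table r

q, k = X @ W_Q, X @ W_K
rel = np.arange(n)[:, None] - np.arange(n)[None, :]   # rel[i, j] = i - j
idx = np.clip(rel, -max_dist, max_dist) + max_dist

# score[i, j] = q_i . k_j + q_i . r_{i-j} + r_{i-j} . k_j
scores = (q @ k.T
          + np.einsum('id,ijd->ij', q, R[idx])   # query-to-position
          + np.einsum('ijd,jd->ij', R[idx], k))  # position-to-key
```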
Rotary Positional Embeddings (RoPE)
The relative positional approach allows the model to capture richer and more expressive patterns in text. However, it comes at the cost of extra parameters and increased computation during both training and inference.
To address these drawbacks, we turn to Rotary Positional Embeddings (RoPE). The key motivation behind RoPE is to encode relative positional information into the attention mechanism without adding any new parameters.
Main Idea
The main idea behind RoPE is to rotate the query and key vectors, with the rotation angle determined by the token's position in the sequence. Intuitively, tokens that are close together are rotated by similar angles, so their rotated vectors stay better aligned, which in turn leads to higher attention scores.
The authors define an unnormalized attention score as:
$$ \text{score}_{i,j} = (\mathcal{R}_{\Theta, i}^d W_Q x_i)^T (\mathcal{R}_{\Theta, j}^d W_K x_j) $$
Here:
- $x_i, x_j$ : token embeddings at positions $i$ and $j$
- $W_Q, W_K$ : query and key projection matrices
- $\mathcal{R}_{\Theta, i}^d$ : a rotation matrix applied at position $i$, with dimensionality $d$ and frequency parameter set $\Theta$
Although the rotation is applied using absolute position, the final attention score depends on the relative position $j-i$.
Rewriting the score:
$$ \text{score}_{i,j} = (x_i^T W_Q^T)(\mathcal{R}_{\Theta, i}^d)^T (\mathcal{R}_{\Theta, j}^d W_K x_j) $$
Because rotation matrices are orthogonal (so the transpose is the inverse, i.e. a rotation by the opposite angle) and composing two rotations adds their angles, we have $(\mathcal{R}_{\Theta, i}^d)^T \mathcal{R}_{\Theta, j}^d = \mathcal{R}_{\Theta, j-i}^d$, and the score simplifies to:
$$ \text{score}_{i,j} = x_i^T W_Q^T \mathcal{R}_{\Theta, j-i}^d W_K x_j $$
This shows that the relative position $j-i$ is embedded directly into the attention score through the rotation operation.
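As a sanity check of this claim, here is a small NumPy sketch (toy dimension, the usual RoPE-style frequency schedule assumed) that builds the block-diagonal rotation matrix, rotates an already-projected query and key by their absolute positions, and confirms that the score does not change when both positions are shifted by the same amount:

```python
import numpy as np

def rotation_matrix(pos, d, base=10000.0):
    """Block-diagonal rotation R_{Theta, pos}^d built from d/2 independent 2x2 rotations."""
    R = np.zeros((d, d))
    for m in range(d // 2):
        theta = pos * base ** (-2 * m / d)    # angle for the m-th 2D pair
        c, s = np.cos(theta), np.sin(theta)
        R[2 * m:2 * m + 2, 2 * m:2 * m + 2] = [[c, -s], [s, c]]
    return R

rng = np.random.default_rng(0)
d = 8
q, k = rng.normal(size=d), rng.normal(size=d)   # stand-ins for W_Q x_i and W_K x_j

i, j, shift = 3, 7, 5
score = (rotation_matrix(i, d) @ q) @ (rotation_matrix(j, d) @ k)
shifted = (rotation_matrix(i + shift, d) @ q) @ (rotation_matrix(j + shift, d) @ k)
assert np.isclose(score, shifted)               # the score depends only on j - i
```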
How does the Rotation Matrix work?
RoPE builds on a rotation matrix. In 2D Cartesian space, a vector $(x,y)$ can be represented in polar form:
$x = r \cos{(\alpha)}, y = r \sin{(\alpha)}$
where:
- $r$ is the vector's length
- $\alpha$ is the angle between the vector and the x-axis
If we want to rotate this vector by an additional angle $\beta$, the new coordinates become:
$x' = r \cos{(\alpha + \beta)}, y' = r \sin{(\alpha + \beta)}$
Using trigonometric identities:
- $\cos(\alpha+\beta) = \cos (\alpha) \cos(\beta) - \sin(\alpha)\sin(\beta)$
- $\sin(\alpha+\beta) = \cos (\alpha)\sin(\beta) + \sin(\alpha)\cos(\beta)$
Substituting back $x = r\cos(\alpha)$ and $y = r\sin(\alpha)$, we have:
$x' = x\cos(\beta) - y\sin(\beta)$
$y' = x\sin(\beta) + y\cos(\beta)$
In matrix form:
$$\begin{bmatrix} x' \\ y' \end{bmatrix} = \begin{bmatrix} \cos \beta & -\sin \beta \\ \sin \beta & \cos \beta \end{bmatrix}\begin{bmatrix} x \\ y \end{bmatrix}$$
RoPE extends this by applying such 2D rotations pairwise across embedding dimensions.
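In practice the full $d \times d$ rotation matrix is never materialized: each (even, odd) pair of dimensions is rotated directly with its own position-dependent angle. A minimal sketch, assuming the common $10000^{-2m/d}$ frequency schedule:

```python
import numpy as np

def apply_rope(x, pos, base=10000.0):
    """Rotate a vector of even length d as d/2 independent 2D rotations for position `pos`."""
    d = x.shape[-1]
    theta = pos * base ** (-2 * np.arange(d // 2) / d)   # one angle per 2D pair
    cos, sin = np.cos(theta), np.sin(theta)
    x1, x2 = x[..., 0::2], x[..., 1::2]                  # (x, y) components of each pair
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin                 # x' = x cos - y sin
    out[..., 1::2] = x1 * sin + x2 * cos                 # y' = x sin + y cos
    return out

# Usage: rotate projected query/key vectors by their positions, then take the dot product
q_rot = apply_rope(np.arange(8, dtype=float), pos=3)
k_rot = apply_rope(np.arange(8, dtype=float), pos=7)
score = q_rot @ k_rot
```

Many implementations use an equivalent "rotate half" layout instead of interleaved pairs; the pairing above follows the original interleaved formulation.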
Relative Position from Absolute Rotations
We define:
- The rotation angle for position $i$ as $\alpha = \Theta_i$
- The rotation angle for position $j$ as $\beta = \Theta_j$
The 2D rotation matrix for position $i$ is:
$$\mathcal{R}_{\Theta, i} = \begin{bmatrix} \cos \alpha & -\sin \alpha \\ \sin \alpha & \cos \alpha \end{bmatrix} $$
Similarly, the matrix for position $j$ is:
$$\mathcal{R}_{\Theta, j} = \begin{bmatrix} \cos \beta & -\sin \beta \\ \sin \beta & \cos \beta \end{bmatrix} $$
Now, computing the product $\mathcal{R}_{\Theta, i}^T \mathcal{R}_{\Theta, j}$:
$$ \mathcal{R}_{\Theta, i}^T \mathcal{R}_{\Theta, j} = \begin{bmatrix} \cos \alpha & \sin \alpha \\ -\sin \alpha & \cos \alpha \end{bmatrix}\begin{bmatrix} \cos \beta & -\sin \beta \\ \sin \beta & \cos \beta \end{bmatrix} $$
Compute each entry of the resulting matrix:
- Top-left: $\cos \alpha \cos \beta + \sin \alpha \sin \beta = \cos(\alpha-\beta)$
- Top-right: $-\cos \alpha \sin \beta + \sin \alpha \cos \beta = \sin(\alpha-\beta)$
- Bottom-left: $-\sin \alpha \cos \beta + \cos \alpha \sin \beta = -\sin(\alpha-\beta)$
- Bottom-right: $\sin \alpha \sin \beta + \cos \alpha \cos \beta = \cos (\alpha -\beta)$
Using $\cos(\alpha - \beta) = \cos(\beta - \alpha)$ and $\sin(\alpha - \beta) = -\sin(\beta - \alpha)$, we can rewrite the matrix as:
$$ \mathcal{R}_{\Theta, i}^T \mathcal{R}_{\Theta, j} = \begin{bmatrix} \cos(\beta-\alpha) & -\sin (\beta -\alpha) \\ \sin (\beta-\alpha) & \cos (\beta-\alpha) \end{bmatrix} = \mathcal{R}_{\Theta, j-i}$$
Thus, the product of a transposed rotation matrix at position $i$ and a rotation matrix at position $j$ results in a new rotation matrix that encodes the relative angle $\beta - \alpha$, which corresponds to the relative position $j-i$. This is the mathematical key to RoPE.
Even though each vector is rotated using its absolute position, the dot product between two rotated vectors is equivalent to applying a relative rotation. This means the attention mechanism is now sensitive to relative positions, without explicitly using relative embeddings or additional parameters.
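A few lines of NumPy (arbitrary angles, purely illustrative) confirm the 2D identity numerically:

```python
import numpy as np

def rot(angle):
    """2D rotation matrix for the given angle."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([[c, -s], [s, c]])

alpha, beta = 0.3, 1.2   # stand-ins for the angles at positions i and j
assert np.allclose(rot(alpha).T @ rot(beta), rot(beta - alpha))
```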
Closing
From absolute embeddings to relative embeddings and finally to RoPE, we see a clear progression toward making positional encoding both more flexible and more efficient. Absolute positional embeddings give the model a sense of order but struggle with generalizing across varying sequence lengths. Relative positional embeddings address this by directly modeling distances, at the cost of extra parameters and computation. RoPE offers a clever middle ground by encoding relative position information implicitly through rotation, without adding complexity to the model's architecture.
This ability to capture positional relationships efficiently is crucial for tasks like language modeling, retrieval, and ranking, where the meaning of a sequence depends not only on the words themselves but also on their relationships in context.