How to understand RoPE

First of all, the method described in this post is due to Jianlin Su, from his paper and blog: https://spaces.ac.cn/archives/8265/comment-page-4#comments

RoPE was proposed to solve the relative position embedding problem. More specifically, suppose we attach the position information $m$ and $n$ to the query and key: $\tilde{q}_m = f(q, m),\ \tilde{k}_n = f(k, n).$

Here $f$ is the embedding function. We also want position $0$ to leave the vector unchanged, which gives the boundary condition: $f(q, 0) = q,\ f(k, 0) = k.$

Intuitively, we want the embedding to make the similarity depend on the positions only through the relative position $m - n$:

\[\langle f(q, m), f(k, n) \rangle = g(q, k, m - n).\]

Here $g$ is just an abstract function that depends on the positions only through $m - n$; we will work out its structure below.

Solving Path

First, consider $q, k \in \mathbb{C}$, identified with $\mathbb{R}^2$ whenever the context calls for it. Then $\langle q, k \rangle = \text{Re}(qk^\ast).$
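The identity $\langle q, k \rangle = \text{Re}(qk^\ast)$ is easy to check numerically. Below is a minimal sketch using only the Python standard library; the helper names `dot2` and `re_q_kconj` and the sample values are made up for illustration.

```python
import math

def dot2(a, b):
    """Ordinary inner product of two vectors in R^2."""
    return a[0] * b[0] + a[1] * b[1]

def re_q_kconj(q, k):
    """Re(q * conj(k)) for complex numbers q and k."""
    return (q * k.conjugate()).real

# Arbitrary sample values, viewed both as complex numbers and as 2D vectors.
q = complex(1.5, -2.0)
k = complex(0.5, 3.0)

lhs = dot2((q.real, q.imag), (k.real, k.imag))
rhs = re_q_kconj(q, k)
print(lhs, rhs, math.isclose(lhs, rhs))
```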

So the former equation becomes:

\[g(q, k, m - n) = \text{Re}(f(q, m) f^\ast(k, n)).\]

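Since these are complex numbers, we can always write them in polar form (modulus and angle). To do so we impose the slightly stronger condition $f(q, m)\, f^\ast(k, n) = g(q, k, m - n)$, with $g$ now allowed to be complex-valued; taking the real part recovers the equation above. Write: $f(\ast, m) = R_f(\ast, m)\, e^{\hat{i}\, \Theta_f(\ast, m)},\ g(q, k, m - n) = R_g(q, k, m-n)\, e^{\hat{i}\, \Theta_g(q, k, m-n)}.$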

Substituting into the last formula and matching moduli and angles: $R_f(q, m)\, R_f(k, n) = R_g(q, k, m - n),\ \Theta_f(q, m) - \Theta_f(k, n) = \Theta_g(q, k, m - n).$

By the boundary condition, when $m = n$,

\[R_f(q, m)\, R_f(k, m) = R_g(q, k, 0) = R_f(q, 0)\, R_f(k, 0) = \|q\|\,\|k\|.\]

So the simplest choice is $R_f(\ast, m) = \|\ast\|$, i.e. the modulus does not depend on the position.

For the second equation:

\[\Theta_f(q, m) - \Theta_f(k, m) = \Theta_g(q, k, 0) = \Theta_f(q, 0) - \Theta_f(k, 0) = \Theta(q) - \Theta(k),\]

which shows that the angle difference between $q$ and $k$ at the same position is the same as it is without any position information. Here $\Theta$ denotes the angle (argument) of $q$ or $k$.

Rearranging the last equation:

\[\Theta_f(q, m) - \Theta(q) = \Theta_f(k, m) - \Theta(k) := \varphi(m).\]

The two sides are equal for any $q$ and $k$, so this quantity depends only on $m$ and not on the vector. Notice:

\[\varphi(m) - \varphi(m-1) = \Theta_f(q, m) - \Theta(q) - \Theta_f(k, m-1) + \Theta(k) = \Theta_g(q, k, 1) + \Theta(k) - \Theta(q) := \theta.\]

Since $\theta$ does not depend on $m$, $\varphi$ is an arithmetic sequence. Together with $\varphi(0) = \Theta_f(q, 0) - \Theta(q) = \Theta(q) - \Theta(q) = 0$, we get $\varphi(m) = m\theta$, so:

\[\Theta_f(\ast, m) = m\theta + \Theta(\ast).\]

Now we have both parts of the complex number: the modulus and the angle.

Embedding Form

To conclude:

\[f(q, m) = \|q\|\, e^{\hat{i}(\Theta(q) + m\theta)} = q\, e^{\hat{i} m\theta}.\]
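As a quick sanity check, this form can be plugged back into the original requirement, using only the symbols defined above:

\[f(q, m)\, f^\ast(k, n) = q\, e^{\hat{i} m\theta} \left(k\, e^{\hat{i} n\theta}\right)^\ast = q k^\ast\, e^{\hat{i}(m - n)\theta}.\]

The real part of the right-hand side depends on the positions only through $m - n$, and setting $m = 0$ in $f(q, m) = q\, e^{\hat{i} m\theta}$ gives back $f(q, 0) = q$, so both the relative-position property and the boundary condition hold.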

In matrix form:

\[f(q, m) \rightarrow \begin{pmatrix} \cos(m\theta) & -\sin(m\theta)\\ \sin(m\theta) & \cos(m\theta) \end{pmatrix} \begin{pmatrix} \text{Re}(q)\\ \text{Im}(q) \end{pmatrix}.\]
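The complex form and the matrix form should give the same rotated vector, and the resulting score should depend only on $m - n$. Here is a small sketch that checks both with the Python standard library; the function names `rope_complex`, `rope_matrix` and `score`, and the value of `theta`, are hypothetical choices for illustration only.

```python
import cmath
import math

def rope_complex(q, m, theta):
    """f(q, m) = q * exp(i * m * theta), with q a complex number."""
    return q * cmath.exp(1j * m * theta)

def rope_matrix(q, m, theta):
    """Same rotation, written as a 2x2 matrix acting on (Re(q), Im(q))."""
    c, s = math.cos(m * theta), math.sin(m * theta)
    x, y = q.real, q.imag
    return complex(c * x - s * y, s * x + c * y)

def score(q, k, m, n, theta):
    """Re(f(q, m) * conj(f(k, n))), i.e. the inner product after embedding."""
    return (rope_complex(q, m, theta) * rope_complex(k, n, theta).conjugate()).real

# Arbitrary sample values.
q, k, theta = complex(0.3, -1.2), complex(2.0, 0.7), 0.37

# Both forms agree.
print(rope_complex(q, 5, theta), rope_matrix(q, 5, theta))
# Positions (5, 2) and (13, 10) share m - n = 3, so the scores should match.
print(score(q, k, 5, 2, theta), score(q, k, 13, 10, theta))
```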

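The tricky thing is that $\theta = \Theta_g(q, k, 1) + \Theta(k) - \Theta(q),$ but $g$ is still only abstractly defined, so the derivation does not pin $\theta$ down. For a $d$-dimensional token, the coordinates are grouped into $d/2$ two-dimensional pairs, and the $i$-th pair uses its own angle, chosen as: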

\[\theta_i = \frac{1}{10000^{\frac{2(i-1)}{d}}}, \quad i \in \left[1, \frac{d}{2}\right], \quad \text{token} \in \mathbb{R}^d.\]
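Putting the pieces together for a $d$-dimensional vector might look like the sketch below. It is not a reference implementation: the names `rope_frequencies`, `apply_rope` and `dot` are made up here, and it assumes that consecutive coordinates $(x_{2j}, x_{2j+1})$ form the rotated pairs (other implementations pair coordinates differently). The loop index `j` is 0-based, so `j` corresponds to $i - 1$ in the formula above.

```python
import math

def rope_frequencies(d, base=10000.0):
    """theta_i = base^(-2(i-1)/d) for i = 1..d/2; here j = i - 1."""
    if d % 2 != 0:
        raise ValueError("d must be even")
    return [base ** (-2.0 * j / d) for j in range(d // 2)]

def apply_rope(x, m, base=10000.0):
    """Rotate each consecutive pair (x[2j], x[2j+1]) by the angle m * theta_j.

    Assumption: pairs are formed from adjacent coordinates.
    """
    out = []
    for j, theta in enumerate(rope_frequencies(len(x), base)):
        c, s = math.cos(m * theta), math.sin(m * theta)
        a, b = x[2 * j], x[2 * j + 1]
        out += [c * a - s * b, s * a + c * b]
    return out

def dot(u, v):
    """Plain inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

# Arbitrary sample query and key with d = 8.
q = [0.1, -0.4, 1.3, 0.7, -2.0, 0.5, 0.9, -1.1]
k = [1.0, 0.2, -0.6, 0.8, 0.3, -1.5, 0.4, 0.0]

# Positions (7, 3) and (104, 100) share m - n = 4, so the scores should match.
print(dot(apply_rope(q, 7), apply_rope(k, 3)))
print(dot(apply_rope(q, 104), apply_rope(k, 100)))
```

Each pair is rotated independently, so the relative-position property of the 2D case carries over to the full inner product as a sum over pairs.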

Mathematically, the whole construction is just a rotation of the vector by the angle $m\theta$ in the complex plane, yet the results it achieves are remarkable.



