How to understand RoPE
The method described in this post is due to Jianlin Su; see his paper and blog: https://spaces.ac.cn/archives/8265/comment-page-4#comments
The method was proposed to solve the relative position embedding problem. More specifically, consider attaching the position information $m, n$ to the query and key: $\tilde{q}_m = f(q, m),\ \tilde{k}_n = f(k, n).$
Here $f$ is the embedding function. Position 0 should leave the original element unchanged, which gives the boundary condition: $f(q, 0) = q,\ f(k, 0) = k.$
Intuitively, we want the embedding to preserve the property that the similarity depends on the positions only through the relative position $m - n$:
\[\langle f(q, m), f(k, n) \rangle = g(q, k, m - n).\]
Here $g$ is an as-yet-unspecified function that sees the positions only through $m - n$; the goal is to find its structure.
Solving Path
First, consider the two-dimensional case and treat $q, k$ as elements of $\mathbb{C}$ or $\mathbb{R}^2$ interchangeably, depending on context. In this setting, $\langle q, k \rangle = \text{Re}(qk^\ast).$
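As a quick sanity check of this identity, here is a minimal sketch using NumPy. The vectors are arbitrary example values, not taken from the original post.

```python
import numpy as np

# Arbitrary example vectors in R^2 (illustrative values only).
q_vec = np.array([1.5, -0.5])
k_vec = np.array([0.25, 2.0])

# The same vectors viewed as complex numbers x + iy.
q = complex(q_vec[0], q_vec[1])
k = complex(k_vec[0], k_vec[1])

dot_real = float(np.dot(q_vec, k_vec))   # <q, k> computed in R^2
dot_complex = (q * k.conjugate()).real   # Re(q k*) computed in C

print(dot_real, dot_complex)
```

Both numbers agree because $qk^\ast = (q_1 k_1 + q_2 k_2) + \hat{i}(q_2 k_1 - q_1 k_2)$, whose real part is exactly the dot product.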
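So the requirement becomes:
\[\text{Re}(f(q, m) f^\ast(k, n)) = g(q, k, m - n).\]
For simplicity, assume this already holds at the complex level: there is a complex-valued $g$ (reusing the name) with $f(q, m) f^\ast(k, n) = g(q, k, m - n)$, and taking the real part recovers the condition above. Since these are complex numbers, each can be written in terms of a modulus and an angle: $f(\ast, m) = R_f(\ast, m)\, e^{\hat{i}\, \Theta_f(\ast, m)},\ g(q, k, m - n) = R_g(q, k, m-n)\, e^{\hat{i}\, \Theta_g(q, k, m-n)}.$
Substituting into the complex identity and matching moduli and angles gives: $R_f(q, m)\, R_f(k, n) = R_g(q, k, m - n),\ \Theta_f(q, m) - \Theta_f(k, n) = \Theta_g(q, k, m - n).$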
Setting $m = n$ in the first equation and using the boundary condition,
\[R_f(q, m)\, R_f(k, m) = R_g(q, k, 0) = R_f(q, 0)\, R_f(k, 0) = \|q\|\,\|k\|.\]
A natural choice is therefore $R_f(\ast, m) = \|\ast\|$, i.e. the embedding does not change the modulus.
For the second equation, again with $m = n$:
\[\Theta_f(q, m) - \Theta_f(k, m) = \Theta_g(q, k, 0) = \Theta_f(q, 0) - \Theta_f(k, 0) = \Theta(q) - \Theta(k),\]
which shows that the angle difference between query and key at the same position is the same as without any position information. Here $\Theta(q)$ and $\Theta(k)$ denote the angles (arguments) of $q$ and $k$.
Rearranging the last equation:
\[\Theta_f(q, m) - \Theta(q) = \Theta_f(k, m) - \Theta(k) := \varphi(m).\]
The offset is the same for every vector, so it depends only on $m$. Next, notice:
\[\varphi(m) - \varphi(m-1) = \Theta_f(q, m) - \Theta(q) - \Theta_f(k, m-1) + \Theta(k) = \Theta_g(q, k, 1) + \Theta(k) - \Theta(q) := \theta.\]
The right-hand side does not depend on $m$, so $\varphi$ is an arithmetic sequence with common difference $\theta$. Since $\varphi(0) = \Theta_f(q, 0) - \Theta(q) = \Theta(q) - \Theta(q) = 0$, we get $\varphi(m) = m\theta$, so:
\[\Theta_f(\ast, m) = m\theta + \Theta(\ast).\]
Now we have both parts of the complex number: the modulus and the angle.
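The derived modulus and angle can be checked numerically. The sketch below builds $f$ directly from $R_f(\ast, m) = \|\ast\|$ and $\Theta_f(\ast, m) = \Theta(\ast) + m\theta$, then tests the boundary condition and the relative-position property. The value of `theta`, the example vectors, and the helper names `f` and `score` are illustrative assumptions, not part of the original derivation.

```python
import numpy as np

def f(x: complex, m: int, theta: float) -> complex:
    """Embedding assembled from the derived modulus and angle."""
    modulus = abs(x)                   # R_f(x, m) = |x|
    angle = np.angle(x) + m * theta    # Theta_f(x, m) = Theta(x) + m * theta
    return modulus * np.exp(1j * angle)

def score(q: complex, k: complex, m: int, n: int, theta: float) -> float:
    """<f(q, m), f(k, n)> computed as Re(f(q, m) f*(k, n))."""
    return (f(q, m, theta) * np.conj(f(k, n, theta))).real

theta = 0.3                            # hypothetical rotation step
q, k = 1.0 + 2.0j, -0.5 + 0.7j         # arbitrary example query and key

# Same relative position m - n = 3 at two different absolute positions.
s1 = score(q, k, 5, 2, theta)
s2 = score(q, k, 13, 10, theta)

# Boundary condition: position 0 leaves the vector unchanged.
boundary_ok = bool(np.isclose(f(q, 0, theta), q))

print(s1, s2, boundary_ok)
```

The two scores coincide because only $m - n$ enters the angle of $f(q, m) f^\ast(k, n)$; changing the relative position changes the score.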
Embedding Form
To conclude:
\[f(q, m) = \|q\|\, e^{\hat{i}(\Theta(q) + m\theta)} = q\, e^{\hat{i} m\theta}.\]
In matrix form:
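\[f(q, m) \rightarrow \begin{pmatrix} \cos(m\theta) & -\sin(m\theta)\\ \sin(m\theta) & \cos(m\theta) \end{pmatrix} \begin{pmatrix} \text{Re}(q)\\ \text{Im}(q) \end{pmatrix}.\]
The subtle point is that $\theta = \Theta_g(q, k, 1) + \Theta(k) - \Theta(q)$, but $g$ is still an unspecified function, so the derivation does not pin down $\theta$. It is a free parameter, and it is chosen as:
\[\theta_i = \frac{1}{10000^{\frac{2(i-1)}{d}}}, \quad i \in \left[1, \frac{d}{2}\right], \quad \text{token} \in \mathbb{R}^d.\]
That is, a $d$-dimensional token vector is split into $d/2$ two-dimensional pairs, and the $i$-th pair is rotated with its own frequency $\theta_i$. Mathematically, the whole construction is just rotating each pair by the angle $m\theta_i$ in the complex plane, yet this simple rotation is enough to make the inner product depend on position only through $m - n$.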
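To make the $d$-dimensional case concrete, here is a minimal NumPy sketch that computes the frequencies $\theta_i$ and applies the $2 \times 2$ rotation to each pair. It assumes that adjacent coordinates `(x[0], x[1]), (x[2], x[3]), ...` form the pairs (other pairings are possible), that the input is a single 1-D vector, and the function names `rope_frequencies` and `apply_rope` are hypothetical, chosen for illustration only.

```python
import numpy as np

def rope_frequencies(d: int, base: float = 10000.0) -> np.ndarray:
    """theta_i = base^(-2(i-1)/d) for i = 1, ..., d/2."""
    i = np.arange(1, d // 2 + 1)
    return base ** (-2.0 * (i - 1) / d)

def apply_rope(x: np.ndarray, m: int, base: float = 10000.0) -> np.ndarray:
    """Rotate each adjacent pair of a 1-D vector x by the angle m * theta_i."""
    d = x.shape[-1]
    if d % 2 != 0:
        raise ValueError("d must be even")
    angles = m * rope_frequencies(d, base)
    cos, sin = np.cos(angles), np.sin(angles)
    x_re, x_im = x[0::2], x[1::2]          # real and imaginary parts of each pair
    out = np.empty_like(x, dtype=float)
    out[0::2] = cos * x_re - sin * x_im    # first row of the rotation matrix
    out[1::2] = sin * x_re + cos * x_im    # second row of the rotation matrix
    return out

# Arbitrary example query and key in R^8.
rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)

# Same relative position (3) at very different absolute positions.
s_a = apply_rope(q, 7) @ apply_rope(k, 4)
s_b = apply_rope(q, 103) @ apply_rope(k, 100)
print(s_a, s_b)
```

The two scores agree up to floating-point error, and `apply_rope(q, 0)` returns `q` unchanged, matching the boundary condition $f(q, 0) = q$.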