Hello, Codeforces!↵
↵
I think many of you know about Mo's algorithm. For those who don't know, please read this [blog](https://mirror.codeforces.com/blog/entry/7383).

Here, I consider an alternative (and faster) approach to sorting the queries in Mo's algorithm.
↵
# Table of contents↵
* [Canonical version of the algorithm](#s1)
* [Achieving a better time complexity](#s2)
  * [Relation to TSP](#s2-1)
  * [Hilbert curve](#s2-2)
* [Benchmarks](#s3)
* [Applicability](#s4)
↵
<a name="s1"></a>↵
# Canonical version of the algorithm↵
↵
The canonical version of this algorithm has $O((n + q)\cdot\sqrt{n})$ time complexity if insertions and deletions work in $O(1)$.↵
↵
Usually, one of two comparators is used to sort the queries. The slower one:
↵
~~~~~
struct Query {
    int l, r, idx;

    inline pair<int, int> toPair() const {
        return make_pair(l / block, r);
    }
};

inline bool operator<(const Query &a, const Query &b) {
    return a.toPair() < b.toPair();
}
~~~~~
↵
We know it can be optimized a little if for odd-numbered blocks we use the reverse order for the right boundary:
↵
~~~~~
struct Query {
    int l, r, idx;

    inline pair<int, int> toPair() const {
        return make_pair(l / block, ((l / block) & 1) ? -r : +r);
    }
};

inline bool operator<(const Query &a, const Query &b) {
    return a.toPair() < b.toPair();
}
~~~~~
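
For reference, here is a minimal sketch of the main loop that consumes the sorted queries; it uses the `Query` struct from above, and `add`, `remove`, `answer` are placeholders for the problem-specific $O(1)$ operations (their names and signatures are illustrative, not part of any fixed API):

~~~~~
#include <algorithm>
#include <cstdint>
#include <vector>

// Problem-specific O(1) operations (assumed; fill in for a concrete problem).
void add(int i);       // include element i into the current segment
void remove(int i);    // exclude element i from the current segment
int64_t answer();      // answer for the current segment

void processQueries(std::vector<Query> &queries, std::vector<int64_t> &res) {
    std::sort(queries.begin(), queries.end());
    int curL = 0, curR = -1;  // the current segment [curL, curR] starts empty
    for (const Query &q : queries) {
        while (curL > q.l) add(--curL);
        while (curR < q.r) add(++curR);
        while (curL < q.l) remove(curL++);
        while (curR > q.r) remove(curR--);
        res[q.idx] = answer();
    }
}
~~~~~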
↵
<a name="s2"></a>↵
# Achieving a better time complexity↵
↵
We can achieve $O(n\sqrt{q})$ complexity for Mo's algorithm (still assuming that insertions and deletions are $O(1)$). Note that $O(n\sqrt{q})$ is always better than $O((n + q)\sqrt{n})$. We can prove it as follows:
↵
$n\sqrt{q} < (n + q)\sqrt{n},$↵
↵
$n^{2}q < n(n+q)^2,$↵
↵
$nq < n^2 + 2nq + q^2,$↵
↵
$n^2 + nq + q^2 > 0.$

The last inequality always holds, and all the steps are reversible for positive $n$ and $q$, so the claim follows. But how can this complexity be achieved?
↵
<a name="s2-1"></a>↵
## Relation to TSP↵
↵
First, notice that changing the segment $(l_1; r_1)$ to $(l_2; r_2)$ takes $|l_1 - l_2| + |r_1 - r_2|$ insertions and deletions.
↵
Denote the queries as $(l_1; r_1), (l_2; r_2), \dots, (l_q; r_q)$. We need to permute them to minimize the number of operations, i.e. to find a permutation $p_1, p_2, \dots, p_q$ such that the total number of operations
↵
$S = \sum_{i = 1}^{q-1}{|l_{p_i} - l_{p_{i+1}}| + |r_{p_i} - r_{p_{i+1}}|}$↵
↵
is as small as possible.
↵
Now we can see the relationship between Mo's algorithm and [TSP](https://en.wikipedia.org/wiki/Travelling_Salesman_Problem) in the Manhattan metric. The boundaries of the queries can be represented as points on the 2D plane, and we need to find an optimal route that visits all these points (processes all the queries).
↵
But TSP is NP-hard, so it cannot help us find the optimal query order. Instead, we should use a fast heuristic approach. A good one is to use recursive space-filling curves: Wikipedia mentions the [Sierpiński curve](https://en.wikipedia.org/wiki/Sierpi%C5%84ski_curve) as a basis for finding good TSP solutions.
↵
However, the Sierpiński curve is not very convenient to implement with integer point coordinates, so we will use another recursive curve: the [Hilbert curve](https://en.wikipedia.org/wiki/Hilbert_curve).
↵
<a name="s2-2"></a>↵
## Hilbert curve↵
↵
Let's build a Hilbert curve on a $2^k\times 2^k$ matrix and visit all its cells according to this curve. Denote by $ord(i, j, k)$ the number of cells visited before the cell $(i, j)$ in the order of the Hilbert curve on the $2^k\times 2^k$ matrix. The following picture shows $ord(i, j, 3)$ for each cell of the $8\times 8$ matrix:
↵
![image](https://i.imgur.com/Cuw8guP.png?3)↵
↵
Now we can write a function to determine $ord(i, j, k)$. It will do the following:↵
↵
* Divide the matrix into four parts (as the Hilbert curve is built recursively, each part is a rotated Hilbert curve).
* Determine in which of the four parts the cell $(i, j)$ is located.
* Add the number of cells in the parts visited earlier: each part contains $4^{k-1}$ cells, so if $s$ parts come before the current one, add $4^{k-1}\cdot s$.
* Recursively descend into the selected part (do not forget that it can be rotated).
↵
The code looks as follows:
↵
~~~~~
inline int64_t hilbertOrder(int x, int y, int pow, int rotate) {
    if (pow == 0) {
        return 0;
    }
    int hpow = 1 << (pow - 1);
    // Determine in which of the four quadrants the cell lies.
    int seg = (x < hpow) ? (
        (y < hpow) ? 0 : 3
    ) : (
        (y < hpow) ? 1 : 2
    );
    seg = (seg + rotate) & 3;  // apply the current rotation
    const int rotateDelta[4] = {3, 0, 0, 1};
    // Coordinates relative to the chosen quadrant.
    int nx = x & (x ^ hpow), ny = y & (y ^ hpow);
    int nrot = (rotate + rotateDelta[seg]) & 3;
    int64_t subSquareSize = int64_t(1) << (2 * pow - 2);
    int64_t ans = seg * subSquareSize;  // cells in the quadrants visited earlier
    int64_t add = hilbertOrder(nx, ny, pow - 1, nrot);
    // Quadrants 1 and 2 are traversed forwards, 0 and 3 in reverse.
    ans += (seg == 1 || seg == 2) ? add : (subSquareSize - add - 1);
    return ans;
}
~~~~~
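
As a quick sanity check (an illustrative test, not part of the original code), one can print the order for the $8\times 8$ case and compare it with the picture above, keeping in mind that the orientation of the axes in the picture may differ from the one used here:

~~~~~
#include <cstdint>
#include <cstdio>

// Paste hilbertOrder from above here.

int main() {
    // Print ord(i, j, 3) for all cells of the 8x8 matrix.
    for (int i = 0; i < 8; i++) {
        for (int j = 0; j < 8; j++)
            printf("%3lld", (long long) hilbertOrder(i, j, 3, 0));
        printf("\n");
    }
    return 0;
}
~~~~~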
↵
Assume the matrix has size $2^k\times 2^k$, where $2^k \geq n$ (you can take $k = 20$ for most cases, or find the minimal $k$ such that $2^k \geq n$). Now denote by $o_i$ the value $ord(l_i, r_i, k)$ on this matrix, and sort the queries according to $o_i$.
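
Putting it together, here is a minimal sketch of how the queries can be prepared and sorted (the `ord` field and the constant $k = 20$ are illustrative choices; precomputing `ord` once per query avoids calling `hilbertOrder` inside the comparator):

~~~~~
struct Query {
    int l, r, idx;
    int64_t ord;  // Hilbert order of the point (l, r), computed once

    inline void calcOrder() {
        ord = hilbertOrder(l, r, 20, 0);  // 2^20 >= n for n up to 10^6
    }
};

inline bool operator<(const Query &a, const Query &b) {
    return a.ord < b.ord;
}
~~~~~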
↵
Here is the proof of why this works in $O(n\sqrt{q})$ time. Suppose $n = 2^k$ and $q = 4^l$ for some integers $k$ and $l$ (if $n$ and $q$ are not powers of $2$ and $4$ respectively, we round them up; this does not affect the asymptotics). Now divide the matrix into squares of size $\frac{n}{2^l}\times\frac{n}{2^l}$; there are $4^l$ of them, and the Hilbert curve visits them one after another. To travel between a pair of adjacent squares, we need $O(\frac{n}{2^l})$ operations, so we can travel between all the squares in $O(\frac{n}{2^l}\cdot 4^l) = O(2^l\cdot n) = O(n\sqrt{q})$ operations. Now consider the groups of queries inside one $\frac{n}{2^l}\times\frac{n}{2^l}$ square. Within a square we can move from one query to the next in $O(\frac{n}{2^l})$ operations, so all such groups of queries are processed in $O(q\cdot\frac{n}{2^l}) = O(q\cdot\frac{n}{\sqrt{q}}) = O(n\sqrt{q})$. So the total time to process all the queries is $O(n\sqrt{q})$, which was to be proved.
↵
<a name="s3"></a>↵
# Benchmarks↵
↵
Let's compare the canonical version of Mo's algorithm with the version that uses the Hilbert order. To do this, we will use the problem [problem:617E] with different constraints on $n$ and $q$. The implementations are here:
↵
* Standard implementation: [code](https://ideone.com/pojbqb)↵
* Mo with Hilbert curves: [code](https://ideone.com/WZfHWs)↵
↵
To reduce the amount of input and output, the generators are built into the code and the output is hashed.↵
↵
For benchmarks I used Polygon. The results are here:↵
↵
<table>↵
↵
<tr>↵
<th>$n$</th>↵
<th>$q$</th>↵
<th>$\frac{n}{q}$</th>↵
<th>Standard Mo time</th>↵
<th>Mo+Hilbert time</th>↵
<th>Time ratio (standard / Hilbert)</th>
</tr>↵
↵
<tr>↵
<td>400000</td>↵
<td>400000</td>↵
<td>1</td>↵
<td>2730 ms</td>↵
<td>2698 ms</td>↵
<td>1.012</td>↵
</tr>↵
↵
<tr>↵
<td>1000000</td>↵
<td>1000000</td>↵
<td>1</td>↵
<td>13602 ms</td>↵
<td>10841 ms</td>↵
<td>1.255</td>↵
</tr>↵
↵
<tr>↵
<td>500000</td>↵
<td>250000</td>↵
<td>2</td>↵
<td>3369 ms</td>↵
<td>2730 ms</td>↵
<td>1.234</td>↵
</tr>↵
↵
<tr>↵
<td>1000000</td>↵
<td>500000</td>↵
<td>2</td>↵
<td>10077 ms</td>↵
<td>7644 ms</td>↵
<td>1.318</td>↵
</tr>↵
↵
<tr>↵
<td>600000</td>↵
<td>200000</td>↵
<td>3</td>↵
<td>4134 ms</td>↵
<td>2901 ms</td>↵
<td>1.425</td>↵
</tr>↵
↵
<tr>↵
<td>1000000</td>↵
<td>333333</td>↵
<td>3</td>↵
<td>8767 ms</td>↵
<td>6240 ms</td>↵
<td>1.405</td>↵
</tr>↵
↵
<tr>↵
<td>600000</td>↵
<td>150000</td>↵
<td>4</td>↵
<td>4851 ms</td>↵
<td>2496 ms</td>↵
<td>1.944</td>↵
</tr>↵
↵
<tr>↵
<td>1000000</td>↵
<td>250000</td>↵
<td>4</td>↵
<td>8672 ms</td>↵
<td>5553 ms</td>↵
<td>1.561</td>↵
</tr>↵
↵
<tr>↵
<td>700000</td>↵
<td>140000</td>↵
<td>5</td>↵
<td>6255 ms</td>↵
<td>2854 ms</td>↵
<td>2.192</td>↵
</tr>↵
↵
<tr>↵
<td>1000000</td>↵
<td>200000</td>↵
<td>5</td>↵
<td>8423 ms</td>↵
<td>5100 ms</td>↵
<td>1.652</td>↵
</tr>↵
↵
<tr>↵
<td>750000</td>↵
<td>100000</td>↵
<td>7.5</td>↵
<td>5116 ms</td>↵
<td>2667 ms</td>↵
<td>1.918</td>↵
</tr>↵
↵
<tr>↵
<td>1000000</td>↵
<td>133333</td>
<td>7.5</td>↵
<td>7924 ms</td>↵
<td>4009 ms</td>↵
<td>1.976</td>↵
</tr>↵
↵
<tr>↵
<td>1000000</td>↵
<td>100000</td>↵
<td>10</td>↵
<td>7425 ms</td>↵
<td>3977 ms</td>↵
<td>1.866</td>↵
</tr>↵
↵
<tr>↵
<td>1000000</td>↵
<td>40000</td>↵
<td>25</td>↵
<td>9671 ms</td>↵
<td>2355 ms</td>↵
<td>4.107</td>↵
</tr>↵
↵
<tr>↵
<td>1000000</td>↵
<td>20000</td>↵
<td>50</td>↵
<td>9016 ms</td>↵
<td>1590 ms</td>↵
<td>5.670</td>↵
</tr>↵
↵
<tr>↵
<td>1000000</td>↵
<td>10000</td>↵
<td>100</td>↵
<td>6879 ms</td>↵
<td>1185 ms</td>↵
<td>5.805</td>↵
</tr>↵
↵
<tr>↵
<td>1000000</td>↵
<td>5000</td>↵
<td>200</td>↵
<td>5802 ms</td>↵
<td>857 ms</td>↵
<td>6.770</td>↵
</tr>↵
↵
<tr>↵
<td>1000000</td>↵
<td>2500</td>↵
<td>400</td>↵
<td>4897 ms</td>↵
<td>639 ms</td>↵
<td>7.664</td>↵
</tr>↵
↵
</table>↵
↵
Interestingly, when I ran these codes locally, the Hilbert version worked about three times faster even on tests with $n = q$ (e.g. $n = q = 400000$).
↵
<a name="s4"></a>↵
# Applicability↵
↵
As you can see, this sorting order doesn't make Mo's algorithm slower, and when $q$ is significantly less than $n$, it works much faster than the classical version. So it is ideal for problems with $n = 10^6$ and $q = 10^4$. For smaller $q$, solutions with naive query processing can pass.
↵
Thanks to everyone who read this article! I hope it will be useful for someone.↵
↵
If something is unclear in the post or the post contains bugs, feel free to write in the comments.
↵