Standard approach
One of the standard ways to implement constant-time LCA queries on a tree is to preprocess it with an Eulerian tour, which creates an array of pairs (depth, index) for the consecutive vertices visited on the tour, and then to reduce an LCA query to an RMQ query on a certain fragment of this array: the vertex of minimum depth visited between the first visits of $$$u$$$ and $$$v$$$ is exactly their LCA, since the tour has to pass through it when moving between the two vertices.
const int N = 1<<LOGN;
typedef pair<int,int> pii;
int n, T;
vector<int> adj[N]; // adjacency lists
pii euler[2*N];     // Eulerian tour of pairs (depth, index); length T <= 2n-1
int tin[N];         // time of the first visit for each vertex

void dfs(int v, int par, int depth) {
  tin[v] = T;
  euler[T++] = make_pair(depth, v);
  FORE(i,adj[v]) if (*i != par) {
    dfs(*i, v, depth+1);
    euler[T++] = make_pair(depth, v); // return to v after visiting the subtree
  }
}

dfs(0, -1, 0); // called once, before answering any queries
int lca(int u, int v) {
  // half-open interval of the tour between the first visits of u and v
  return query_rmq(min(tin[u], tin[v]), max(tin[u], tin[v]) + 1).second;
}
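The snippets above use the usual short-loop macros. Here is a minimal sketch of how they could be defined, together with a toy driver (the macros and the main function below are my illustration, not part of the original code):
#include <bits/stdc++.h>
using namespace std;
#define REP(i,n) for (int i = 0; i < (n); ++i)
#define FOR(i,a,b) for (int i = (a); i < (b); ++i)
#define FORE(i,c) for (auto i = (c).begin(); i != (c).end(); ++i)
// ... the declarations, dfs, init_rmq and lca from this post go here ...
int main() {
  // a toy tree: edges 0-1, 0-2, 2-3, rooted at 0
  n = 4;
  adj[0] = {1, 2}; adj[1] = {0}; adj[2] = {0, 3}; adj[3] = {2};
  dfs(0, -1, 0); // builds euler[] (here of length T = 7) and tin[]
  init_rmq();
  printf("%d %d\n", lca(1, 3), lca(3, 2)); // prints "0 2"
}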
The RMQ problem can be solved with a sparse table, which after $$$O(n \log n)$$$ preprocessing allows us to answer RMQ queries in constant time (the two halves of a query may overlap, which is fine because taking the minimum is idempotent):
pii sparse[LOGN+2][2*N]; // the tour has length up to 2N-1, hence the doubled width and extra levels
int lg[2*N];             // floor of binary logarithm ("log" would clash with the C library function)

void init_rmq() {
  REP(i,T) { sparse[0][i] = euler[i]; }
  int logT = 0;
  while ((1<<logT) < T) { ++logT; }
  REP(f,logT) REP(i,T - (1<<f)) {
    sparse[f+1][i] = min(sparse[f][i], sparse[f][i + (1<<f)]);
  }
  lg[0] = -1;
  FOR(i,1,T+1) { lg[i] = 1 + lg[i>>1]; } // covers all query lengths up to T
}
pii query_rmq(int a, int b) {
  // query range [a, b)
  int f = lg[b-a];
  return min(sparse[f][a], sparse[f][b - (1<<f)]);
}
The drawback of the solution above is its long preprocessing. We can reduce it to $$$O(n)$$$ by using a clever trick: we split the array into blocks of length $$$m$$$ and produce a sparse table for the array of size $$$O(n/m)$$$ created by taking the minimum of each block. Then every query range either is an infix of some block, or consists of several consecutive full blocks (which are handled by the new sparse table in constant time) plus some prefix and some suffix of a block (which are also infixes).
Queries for infixes of blocks are preprocessed by brute force: since the RMQ problem obtained from LCA has the property that differences between consecutive depths in the array are either +1 or -1, we can represent each block as its first element plus a binary string of length $$$m-1$$$ encoding the changes in depths. Thus there are $$$2^{m-1}$$$ different bitmasks, and for each of them we have $$$O(m^2)$$$ infixes. For each infix we want to calculate the minimum value relative to the first element. It's easy to preprocess all of them in $$$O(2^m m^2)$$$ time and memory, and if we take $$$m = \frac{\log n}{2}$$$, this is actually $$$O(n)$$$.
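For concreteness, here is a hedged sketch of that brute-force step (the array best, the constant M and init_blocks are my hypothetical naming, not taken from the e-maxx code); best[mask][l][r] stores the position of the minimum within the infix $$$[l, r]$$$ of a block whose depth changes are encoded by mask, using the encoding fixed in the next section (bit 1 for a -1 change):
const int M = 10;           // m = (log n) / 2, e.g. 10 for n around 10^6
int best[1 << (M-1)][M][M]; // O(2^m m^2) memory, about 200 KB here

void init_blocks() {
  REP(mask, 1 << (M-1)) REP(l, M) {
    int depth = 0, bestDepth = 0;
    best[mask][l][l] = l;
    FOR(r, l+1, M) {
      depth += (mask >> (r-1) & 1) ? -1 : 1; // bit r-1 encodes the change between positions r-1 and r
      if (depth < bestDepth) { bestDepth = depth; best[mask][l][r] = r; }
      else { best[mask][l][r] = best[mask][l][r-1]; }
    }
  }
}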
Conceptually this idea is quite easy, but in terms of implementation it can get rather lengthy (see the e-maxx implementation). Now I will present my alternative approach to this problem, which results in slightly shorter code.
Alternative approach
In the standard approach the details of encoding a block into a binary string of length $$$m-1$$$ were not important, as long as we encoded blocks consistently. Now we will fix a certain encoding, namely one in which the $$$m-1$$$ differences in depth are encoded as consecutive bits (the first difference from the left being the least significant bit), where a +1 change is encoded as bit 0 and a -1 change as bit 1. For example, a block with depths $$$(2,1,2,3)$$$ is encoded as the first element 2 plus the bitmask $$$001_2 = 1$$$. Suppose that we want to calculate the relative minimum value for a block described by some bitmask $$$mask$$$. If the LSB of $$$mask$$$ is 1, that means that we have a -1 change, and after it a mask $$$mask' = \lfloor mask/2 \rfloor$$$ of size $$$m-2$$$ shifted one level down. That means that the minimal position for $$$mask$$$ is the minimal position for $$$mask'$$$ plus 1, and its relative depth is smaller by 1. On the other hand, if the LSB of the mask is 0, the minimum is either at position 0 or in $$$mask'$$$ shifted one level up.
It seems, however, that we still need to calculate values for shorter masks (with $$$m-2$$$ bits and fewer). But observe that extending a mask by adding 0 bits after the MSB creates a longer mask with exactly the same value, since those 0 bits correspond to a series of +1 changes at the end of the block, which don't affect the minimum (e.g. the 2-bit mask $$$01_2$$$ and the 4-bit mask $$$0001_2$$$ have the same minimum). Thus we can just treat $$$mask'$$$ as a mask of size $$$m-1$$$. Therefore we only need to preprocess $$$O(2^m)$$$ bitmasks, and we can do it in constant time per bitmask, so we will be able to take $$$m = \log n$$$ to get $$$O(n)$$$ complexity of preprocessing.
const int m = LOGN;               // block length
const int NBYLOGN = 2*N/LOGN + 1; // number of blocks; the tour has length up to 2N-1
pii sparse[LOGN][NBYLOGN];
int blmask[NBYLOGN]; // masks for blocks
pii lookup[1<<LOGN]; // minima for all masks as pairs (relative depth, relative index)

void init_rmq() {
  int Tm = (T+m-1) / m;
  REP(i,Tm) {
    sparse[0][i] = *min_element(euler+i*m, euler+min((i+1)*m, T)); // don't read past the tour
    blmask[i] = 0;
    REP(j,m-1) if (i*m+j+1 < T) { // missing bits of the last block stay 0, i.e. +1 changes
      blmask[i] |= (euler[i*m+j+1] < euler[i*m+j]) << j; // bit 1 encodes a -1 change
    }
  }
  int logT = 0;
  while ((1<<logT) < Tm) ++logT;
  REP(f,logT) REP(i,Tm - (1<<f)) {
    sparse[f+1][i] = min(sparse[f][i], sparse[f][i + (1<<f)]);
  }
  lg[0] = -1;
  FOR(i,1,Tm) { lg[i] = 1 + lg[i>>1]; }
  lookup[0] = make_pair(0,0);
  FOR(mask,1,1<<(m-1)) {
    pii F = lookup[mask>>1];
    lookup[mask] = min(make_pair(0,0), make_pair(F.first + (mask&1 ? -1 : 1), F.second + 1));
  }
}
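As a quick sanity check: for $$$mask = 101_2$$$ (changes -1, +1, -1 from the left) the recurrence gives lookup[1] = (-1, 1), then lookup[2] = (0, 0), and finally lookup[5] = (-1, 1), i.e. the minimum is one level below the first element, at offset 1, as expected. Note also that pairs are compared lexicographically, so ties in depth are broken toward the leftmost position, which is harmless for LCA.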
If we want to make a prefix query of size $$$k$$$ for a block, we need to take the submask consisting of its first $$$k-1$$$ bits, but as we said before, we can equivalently just zero the remaining bits. On the other hand, if we want to take a suffix, we can just shift the whole mask to the right. Therefore the following code calculates the value for a non-empty range $$$[a,b)$$$ contained in one block:
pii query_rmq_block(int a, int b) {
  // shift out the bits before position a, keep the first b-a-1 bits of the rest
  int mask = (blmask[a/m] >> (a % m)) & ((1 << (b-a-1)) - 1);
  return euler[lookup[mask].second + a];
}
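For example, with $$$m = 8$$$ the call query_rmq_block(10, 13) looks at block 1, shifts blmask[1] right by $$$10 \bmod 8 = 2$$$ and keeps the lowest $$$b-a-1 = 2$$$ bits, i.e. exactly the changes between positions 10, 11 and 12, and lookup then points at the minimum within $$$[10, 13)$$$.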
Finally, we have the code for a general query, which takes care of two cases: either the query is contained in one block, or it spans multiple blocks (and then it handles the two border blocks separately, leaving the rest to the sparse table):
pii query_rmq(int a, int b) {
  int A = a / m, B = (b-1) / m; // blocks containing the endpoints
  if (A == B) {
    return query_rmq_block(a, b);
  } else {
    pii F = min(query_rmq_block(a, (A+1)*m), query_rmq_block(B*m, b));
    if (A+1 != B) { // full blocks in between, covered by the sparse table
      int f = lg[B - (A+1)];
      F = min(F, min(sparse[f][A+1], sparse[f][B - (1<<f)]));
    }
    return F;
  }
}
As we said before, taking $$$m = \log n$$$ gives us the theoretical $$$O(n)$$$ complexity. But in practice it could be useful to check empirically which value works best. For instance, we could try setting $$$m$$$ to a power of two (e.g. 16 when $$$n \approx 10^6$$$), so that we can replace the costly divisions in queries with shifts.
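For illustration, here is a hedged sketch of the two query functions with a hypothetical fixed choice $$$m = 16$$$ instead of LOGN (the constant LOGM is my addition); only the divisions and moduli change:
const int LOGM = 4, m = 1 << LOGM; // hypothetical fixed block length m = 16

pii query_rmq_block(int a, int b) {
  // a / m and a % m replaced by a shift and a bitwise AND
  int mask = (blmask[a >> LOGM] >> (a & (m-1))) & ((1 << (b-a-1)) - 1);
  return euler[lookup[mask].second + a];
}

pii query_rmq(int a, int b) {
  int A = a >> LOGM, B = (b-1) >> LOGM; // block indices without division
  if (A == B) return query_rmq_block(a, b);
  pii F = min(query_rmq_block(a, (A+1) << LOGM), query_rmq_block(B << LOGM, b));
  if (A+1 != B) {
    int f = lg[B - (A+1)];
    F = min(F, min(sparse[f][A+1], sparse[f][B - (1<<f)]));
  }
  return F;
}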