Convolution, In the Perspective of 3-Dimensional Tensors

Правка en1, от LorentzianExpanders, 2024-12-31 13:59:52

Hello everyone!

Convolution is perhaps the first class of algorithms that one encounters in competitive programming requiring algebraic knowledge. There are many types of convolutions that we are interested in, including:

  • Sequential convolution (i.e., polynomial multiplication),
  • XOR convolution,
  • AND/OR convolution,
  • GCD/LCM convolution,
  • Subset convolution, and others.

One of the most well-known ways to understand fast algorithms for these convolutions is through their associated algebraic structures. These algebraic structures can often be decomposed using the Chinese Remainder Theorem (CRT) into simpler components, i.e., direct products. For instance, in the case of XOR convolution, the algebraic structure arises from the computation rules of the group algebra:

$$$ R[\mathbb Z_2^n]\colon \left(\sum_{g\in \mathbb Z_2^n} a_g g\right) \left(\sum_{g\in \mathbb Z_2^n} b_g g\right) = \left(\sum_{g,h\in \mathbb Z_2^n} a_g b_h (g+h) \right), $$$

and we have the following isomorphism of algebras:

$$$ \begin{align*} R[\mathbb Z_2^n] &\cong R[X_1,\dots,X_n]/(X_1^2-1, \dots, X_n^2 - 1) \\ &= \prod_{e\in \\\{\pm 1\\\}^n } R[X_1,\dots,X_n]/(X_1-e_1,\dots,X_n-e_n)\\ &\cong R^{2^n}, \end{align*} $$$

assuming 2 is invertible in $$$R$$$.

Such isomorphism induced by CRT is essentially the associated "transform" of the convolution. For sequential convolution, the corresponding transform is the Fourier transform, and for XOR convolution, the corresponding transform is the Walsh-Hadamard transform, etc. The isomorphism induced by CRT essentially defines the "transform" associated with the convolution. For example:

  • For sequential convolution, the corresponding transform is the Fourier transform.
  • For XOR convolution, the corresponding transform is the Walsh-Hadamard transform.

All the convolutions listed earlier can be understood in this framework, although subset convolution is somewhat different. However, not all convolutions fit neatly into this algebraic perspective. For example, consider the following problem from adamant:

Mex Convolution. Compute

$$$ c_k = \sum_{i\circ j = k} a_i b_j, $$$

where $$$i,j,k\in \{0,1,2\}^n$$$ and $$$k = i\circ j$$$ is taking $$$k_\ell = \operatorname{mex} \{ i_\ell, j_\ell \}$$$ for all $$$\ell$$$.

One can verify that this convolution does not form an associative algebra, so the rich theory of algebras does not apply here. The intended solution for Mex convolution has a complexity of $$$N^{\log_3 4} \approx N^{1.262}$$$.

In this blog, I will introduce a new perspective for understanding convolution: 3-dimensional tensors, and discuss some insights provided by this perspective.

Bilinear Algorithms, Its Rotations, and Tensors

Now we suppose we are working on a field $$$F$$$. (Most algorithms conceptually work on arbitrary rings, but some characterization theorems require the field assumption.)

The first thing I want to define is called a bilinear problem. Given a 3-dimensional array $$$T_{ijk}$$$, the bilinear problem takes arrays $$$X_i$$$ and $$$Y_j$$$ and computes the array $$$Z_k$$$ such that

$$$ Z_k = \sum_{i,j} X_i Y_j T_{ijk}. $$$

Clearly, this class of problems includes all the convolutions mentioned above. The convolution is a bilinear problem with $$$T_{ijk} = [i \circ j = k]$$$.

Given one bilinear problem, the only information defining the problem is the 3-dimensional array $$$T_{ijk}$$$. Then here naturally come two rotations of the problem:

  • The first rotation is the problem of computing $$$X_i$$$ given $$$Y_j$$$ and $$$Z_k$$$, i.e.,
$$$ X_i = \sum_{j,k} Y_j Z_k T_{ijk}. $$$
  • The second rotation is the problem of computing $$$Y_j$$$ given $$$X_i$$$ and $$$Z_k$$$, i.e.,
$$$ Y_j = \sum_{i,k} X_i Z_k T_{ijk}. $$$

We remark that there is no relation between the arrays $$$X_i$$$, $$$Y_j$$$, and $$$Z_k$$$ between the original problem and the two rotations. Those familiar with the transposition principle may be accustomed to this kind of rotation.

Finally, we define the trilinear problem as the problem of computing

$$$ \sum_{i,j,k} X_i Y_j Z_k T_{ijk}. $$$

We usually call such a trilinear form $$$T(X,Y,Z)$$$ the tensor.

Therefore, the convolution problem can be written as tensors

$$$ T = \sum_{i,j} X_i Y_j Z_{i \circ j}. $$$

For XOR convolution and subset convolution, it's also convenient to write them in a symmetric form: For XOR convolution, we have

$$$ T_{\mathbb Z_2^n} = \sum_{\substack{i,j,k\in\mathbb Z_2^n \\ i+j+k=0}} X_i Y_j Z_k. $$$

For subset convolution, we have

$$$ T = \sum_{A \sqcup B \sqcup C = [n]} X_A Y_B Z_C. $$$

The important observation is that the bilinear problem, its rotations, and the trilinear problem are all equivalent.

Claim. Given a tensor $$$T$$$, its associated bilinear problem, its two rotations, and the trilinear problem can be computed in the same time complexity.

One direction is straightforward: Suppose we have an algorithm computing the bilinear problem (or its rotations), then we can compute the trilinear problem by running the algorithm to compute

$$$ z_k = \sum_{i,j} X_i Y_j T_{ijk}, $$$

and then compute the trilinear form by inner product:

$$$ T(X,Y,Z) = \sum_k z_k Z_k. $$$

The other direction is more interesting. We need the following result:

Baur-Strassen's Algorithm. Let $$$f(X_1, \dots, X_n)$$$ be a polynomial that can be computed in $$$M$$$ operations. Then we can compute all the partial derivatives

$$$ \frac{\partial}{\partial X_i} f(X_1, \dots, X_n) $$$

in $$$O(M)$$$ operations.

Those familiar with the transposition principle can see that Baur-Strassen's algorithm is a vast generalization. In fact, the algorithm itself is also a generalization of the transposition principle. The key idea is to first write down a sequence of arithmetic operations $$$O_1, \dots, O_M$$$ that ends with the desired polynomial. Considering only the last $$$k$$$ operations, this is also a polynomial where the variables are some intermediate variables $$$g_k = g_k(Y_1, \dots, Y_r)$$$. We could consider computing the partial derivatives of $$$g_k$$$ with $$$k$$$ growing from $$$1$$$ to $$$M$$$, thus gradually computing the partial derivatives of $$$f$$$.

Now here are two possibilities:

  • If $$$O_{M-k}$$$ is an addition operator, without loss of generality, assume $$$Y_1 = Y'_1 + Y'_2$$$. Then we have $$$g_{k+1}(Y'_1, Y'_2, Y_2, \dots, Y_r) = g_k(Y'_1 + Y'_2, Y_2, \dots, Y_r)$$$. Thus,
$$$ \frac{\partial}{\partial Y'_i} g_{k+1} = \frac{\partial}{\partial Y_1} g_k, $$$

and for $$$i > 1$$$,

$$$ \frac{\partial}{\partial Y_i} g_{k+1} = \frac{\partial}{\partial Y_i} g_k. $$$
  • If $$$O_{M-k}$$$ is a multiplication operator, without loss of generality, assume $$$Y_1 = Y'_1 Y'_2$$$. Then we have $$$g_{k+1}(Y'_1, Y'_2, Y_2, \dots, Y_r) = g_k(Y'_1 Y'_2, Y_2, \dots, Y_r)$$$. Thus,
$$$ \frac{\partial}{\partial Y'_i} g_{k+1} = Y'_{3-i} \frac{\partial}{\partial Y_1} g_k, $$$

and for $$$i > 1$$$,

$$$ \frac{\partial}{\partial Y_i} g_{k+1} = \frac{\partial}{\partial Y_i} g_k. $$$

So, passing from $$$g_1$$$ to $$$g_M$$$, we can compute all the partial derivatives of $$$f$$$, with each step requiring only $$$O(1)$$$ operations. Thus, the total complexity is $$$O(M)$$$.

Having Baur-Strassen's algorithm, we can compute the bilinear problem by computing the partial derivatives of the trilinear form:

$$$ \frac{\partial}{\partial Z_k} T(X,Y,Z) = \sum_{i,j} X_i Y_j T_{ijk}. $$$

This also gives an alternative explanation of the algorithm of those "transposed convolutions" we usually encounter in practice.

Tensor Rank Decomposition

Let us now revisit the perspective of bilinear algorithms computing $$$Z_k = \sum_{i,j} X_i Y_j T_{ijk}$$$. We introduce the following assumption about the algorithm:

Assumption. For each operation $$$O_i$$$, let $$$g_i$$$ denote the polynomial computing the intermediate variable. Then $$$g_i$$$ is either a constant, a linear form in $$$X$$$, a linear form in $$$Y$$$, or a bilinear form in $$$X$$$ and $$$Y$$$.

This assumption is not restrictive because any algorithm can be rewritten to satisfy it by splitting each operation into its homogeneous components. This process, called homogenization, simplifies the algorithm's structure, though we omit the technical details here.

Under this assumption, the algorithm naturally divides into three phases:

  1. The first phase computes a linear transformation $$$R_X = A X$$$.
  2. The second phase computes a linear transformation $$$R_Y = B Y$$$.
  3. In the third phase, with $$$(R_Z)_i = (R_X)_i (R_Y)_i$$$, the algorithm computes a linear transformation $$$Z = C^{\mathsf T} R_Z$$$.

Rewriting the algorithm in terms of the trilinear form, the final step involves taking the inner product of $$$C^{\mathsf T} R_Z$$$ and $$$Z$$$:

$$$ T(X,Y,Z) = (C Z)^{\mathsf T} R_Z = \sum_{i=1}^r (AX)_i (BY)_i (CZ)_i. $$$

In other words, the tensor $$$T$$$ can be expressed as a sum of $$$r$$$ products of linear forms in $$$X$$$, $$$Y$$$, and $$$Z$$$:

$$$ T(X,Y,Z) = \sum_{i=1}^r u_i(X) v_i(Y) w_i(Z). $$$

This representation is known as the rank decomposition of the tensor $$$T$$$. The smallest $$$r$$$ for which such a decomposition exists is called the tensor rank of $$$T$$$, denoted $$$\mathbf{R}(T)$$$.

The concept of tensor rank extends to $$$d$$$-dimensional tensors, where the case $$$d=2$$$ corresponds to matrix rank. However, for $$$d=3$$$, many properties of matrix rank no longer hold, and determining the tensor rank becomes an NP-hard problem.

Tensor rank serves as a relaxed measure of the complexity of bilinear problems. While it provides a lower bound on complexity—since any algorithm must use at least this many multiplications—it ignores the costs of the linear transformations $$$A$$$, $$$B$$$, and $$$C$$$. Thus, for general bilinear problems, tensor rank is not always an accurate measure of complexity.

However, for many convolution problems discussed earlier, this limitation is not significant because these problems exhibit a tensor power structure. Specifically, let $$$T$$$ and $$$T'$$$ be two tensors. The tensor product $$$T \otimes T'$$$ is defined as:

$$$ (T \otimes T')_{ii', jj', kk'} = T_{i,j,k} T'_{i',j',k'}. $$$

The tensor product of $$$n$$$ copies of $$$T$$$ is denoted $$$T^{\otimes n}$$$.

A key observation is that, for XOR/AND/OR convolutions and subset convolution, the tensor $$$T_n$$$ can be written as $$$T_1^{\otimes n}$$$ for some base tensor $$$T_1$$$. For instance:

$$$ T_{\mathbb{Z}_2^n} = T_{\mathbb{Z}_2}^{\otimes n}. $$$

In such cases, tensor rank becomes a meaningful complexity measure because the linear transformations $$$A$$$, $$$B$$$, and $$$C$$$ in $$$T_n$$$ derive from those in $$$T_1$$$. By the mixed-product property of the tensor product, the computation of $$$A^{\otimes n}$$$ can be decomposed into:

$$$ A^{\otimes n} = (A \otimes I \cdots I) \cdot (I \otimes A \otimes I \cdots I) \cdots (I \cdots I \otimes A). $$$

If $$$A$$$ is an $$$r \times a$$$ matrix with $$$r = a$$$, the time complexity is $$$O(n \cdot a^n)$$$; if $$$r > a$$$, the complexity is $$$O(r^n)$$$. This method is often referred to as Yates' algorithm.

For example, the XOR convolution has a tensor rank decomposition:

$$$ T_{\mathbb{Z}_2} = \frac{1}{2} \left[ (X_0 + X_1)(Y_0 + Y_1)(Z_0 + Z_1) + (X_0 - X_1)(Y_0 - Y_1)(Z_0 - Z_1) \right], $$$

where the linear transformation $$$A$$$ is:

$$$ A = \begin{pmatrix} 1 & 1 \\ 1 & -1 \end{pmatrix}. $$$

Hence, the linear transformation $$$A^{\otimes n}$$$ corresponds to the Walsh-Hadamard transform.

Border Rank

The concept of tensor rank successfully explains the fast transforms associated with XOR/AND/OR convolution. However, a subtle issue arises with subset convolution. The tensor rank of the subset convolution tensor $$$T_1$$$ is not $$$2$$$ but $$$3$$$. This issue can be resolved using the concept of border rank.

Consider the subset convolution tensor $$$T_1$$$:

$$$ T_1 = X_0Y_0Z_1 + X_0Y_1Z_0 + X_1Y_0Z_0. $$$

Although its tensor rank is $$$3$$$, we can perturb it slightly to approximate a tensor of rank $$$2$$$:

$$$ \frac{1}{\epsilon} \left[ (X_0 + \epsilon X_1)(Y_0 + \epsilon Y_1)(Z_0 + \epsilon Z_1) - X_0Y_0Z_0 \right] = T_1 + O(\epsilon). $$$

This might seem nonsensical from a computational perspective, but it can be converted into an exact algorithm. The key idea is to treat $$$\epsilon$$$ as a formal variable. Rewriting the equation gives:

$$$ T_1' = (X_0 + \epsilon X_1)(Y_0 + \epsilon Y_1)(Z_0 + \epsilon Z_1) - X_0Y_0Z_0 = \epsilon T_1 + O(\epsilon^2). $$$

Here, the terms in the decomposition are polynomials in $$$\epsilon$$$. The tensor power of $$$T_1'$$$ then becomes:

$$$ T_1'^{\otimes n} = \epsilon^n T_1^{\otimes n} + O(\epsilon^{n+1}). $$$

Thus, we can compute the tensor power of $$$T_1$$$ exactly by maintaining the formal variable $$$\epsilon$$$ to a precision of $$$\epsilon^n$$$.

This method leads to the standard $$$O(n^2 2^n)$$$ algorithm for subset convolution, which is essentially equivalent with maintaining the so-called rank vector.

Formally, the border rank $$$\underline{\mathbf{R}}(T)$$$ is defined as the smallest $$$r$$$ such that a tensor can be expressed as:

$$$ \sum_{i=1}^r u_i(X) v_i(Y) w_i(Z) = \epsilon^d T + O(\epsilon^{d+1}). $$$

The above algorithm demonstrates the use of border rank, as it enables an algorithm with a time complexity of $$$\tilde{O}(\underline{\mathbf{R}}(T)^n)$$$.

An intriguing result by Alder (1981) states that if $$$F$$$ is an algebraically closed field, and $$$R_{\leq r}$$$ denotes the set of tensors with rank at most $$$r$$$, then $$$\underline{\mathbf{R}}(T) \leq r$$$ if and only if $$$T$$$ lies in the Zariski closure of $$$R_{\leq r}$$$. This result provides a geometric interpretation of border rank, justifying that $$$\epsilon$$$ is really an "infinitesimal" in the algebraic sense.

More Topics

In this blog, tensor rank and border rank appear as alternative perspectives to describe well-known algorithms. However, perhaps the tensor that has garnered the most attention in the study of tensor rank is the matrix multiplication tensor. It is commonly written as:

$$$ \langle n, m, p \rangle = \sum_{i \in [n], j \in [m], k \in [p]} X_{ij} Y_{jk} Z_{ki}. $$$

The famous Strassen's algorithm can be viewed as finding a decomposition with $$$\mathbf{R}(\langle 2, 2, 2 \rangle) \leq 7$$$.

Interestingly, to compute the tensor power $$$T^{\otimes n}$$$, one does not necessarily need to start with the tensor rank decomposition of $$$T$$$. Instead, one can start with a fixed tensor power $$$T^{\otimes k}$$$ and compute $$$(T^{\otimes k})^{\otimes (n/k)}$$$ using the same method. This leads to the notion of asymptotic tensor rank:

$$$ \widetilde{\mathbf{R}}(T) = \lim_{k \to \infty} \mathbf{R}(T^{\otimes k})^{1/k}. $$$

Thus, the central question in algebraic complexity can be rephrased as determining $$$\widetilde{\mathbf{R}}(\langle 2, 2, 2 \rangle) = 2^\omega$$$, where $$$\omega$$$ is the exponent of matrix multiplication.

This is an area full of open questions, but there are still some interesting results:

For any $$$n \times n \times n$$$ tensor $$$T$$$, a trivial decomposition yields $$$\mathbf{R}(T) \leq n^2$$$. However, Strassen's asymptotic submultiplicativity shows that $$$\widetilde{\mathbf{R}}(T) \leq n^{2\omega / 3}$$$. This means that matrix multiplication is universal in the sense that it can speed up the computation of any tensor power. The current world record is $$$\omega \leq 2.371339$$$ (arXiv:2404.16349). For more on this topic, see the great lecture notes by Virginia Vassilevska Williams (starting from Lecture 19), which discuss key ideas in this line of research.

Now fix a finite group $$$G$$$, and let $$$T_G$$$ be the tensor of the group algebra of $$$G$$$ with coefficients in $$$\mathbb C$$$. When $$$G$$$ is abelian, we know that $$$\widetilde{\mathbf{R}}(T_G) = |G|$$$. What about the non-abelian case? This can be fully determined by $$$\omega$$$. Suppose the irreducible representations of $$$G$$$ have dimensions $$$d_1, \dots, d_k$$$. Then:

$$$ \widetilde{\mathbf{R}}(T_G) = \sum_i d_i^\omega. $$$

The inequality $$$\leq$$$ is relatively straightforward to observe, while the reverse direction follows from a deeper result known as Schönhage's asymptotic sum inequality. However, even the trivial bound $$$\omega \leq 3$$$ already leads to non-trivial algorithms in some sense.

История

 
 
 
 
Правки
 
 
  Rev. Язык Кто Когда Δ Комментарий
en1 Английский LorentzianExpanders 2024-12-31 13:59:52 16197 Initial revision (published)