Fastest bitwise convolutions

Hi everyone!

There are 3 roughly similar problems on Library Checker:

All of them ask us to find $$$c_0, \dots, c_{2^n-1}$$$, where

$$$ c_k = \sum\limits_{i \circ j = k} a_i b_j, $$$

where $$$\circ$$$ stands for bitwise xor, bitwise and, and "disjoint or" correspondingly.

As of now, my submissions are the fastest one for all 3 problems, and in this blog I'd like to explain how it is achieved.

XOR / AND convolutions are fairly simple, which is why my 30 ms submissions are closely followed by others. However, for subset convolution, my submission uses 94 ms with just 34 MiB, while all other submissions (except the one by Qwerty1232 on the second place, whose approach is similar to mine) use at least 300 ms and 180 MiB each.

If you want to check whether your subset convolution is good, I would suggest trying out 1034E - Little C Loves 3 III. Its constraints are very tight in both time and memory specifically to cut off subset convolution, but an optimal implementation would pass (see e.g. 355217827 by me or 355275135 by Qwerty1232).

Prerequisite: Know how the problems above are solved (e.g. see this blog).

Xor, And convolutions

Let's start with bitwise xor / and, as they're much simpler. If you look over some other top solutions, you would often see some manual vectorization with SIMD intrinsics. It is however completely unnecessary, as the whole convolution can be implemented like this:

Code

    enum transform_dir { forward, inverse };
    // Recursive Zeta / Möbius (AND) transform for size N (power of two)
    template<auto N, transform_dir direction>
    void and_transform(auto &&a) {
        if constexpr (N == 1) {
            return;
        } else {
            constexpr auto half = N / 2;
            and_transform<half, direction>(&a[0]);
            and_transform<half, direction>(&a[half]);
            for (uint32_t i = 0; i < half; i++) {
                if constexpr (direction == forward) {
                    a[i] += a[i + half];
                } else {
                    a[i] -= a[i + half];
                }
            }
        }
    }
    // Wrapper that deduces N at compile time via with_bit_floor
    template<transform_dir direction>
    inline void and_transform(auto &&a, auto n) {
        with_bit_floor(n, [&]<auto NN>() {
            assert(NN == n);
            and_transform<NN, direction>(a);
        });
    }
    template<transform_dir direction = forward>
    inline void and_transform(auto &&a) {
        and_transform<direction>(a, std::size(a));
    }
    // In-place AND convolution on sequences of equal length (power of two)
    void and_convolution_inplace(auto &a, auto &b) {
        auto N = static_cast<uint32_t>(std::size(a));
        and_transform(a);
        and_transform(b);
        for (uint32_t i = 0; i < N; i++) {
            a[i] *= b[i];
        }
        and_transform<inverse>(a);
    }

Here, with_bit_floor is a helper function that, given an integer $$$n$$$, returns $$$2^{\lfloor \log_2 n\rfloor}$$$ as a template parameter:

    template<int fl = 0>
    void with_bit_floor(size_t n, auto &&callback) {
        if constexpr (fl >= 63) {
            return;
        } else if (n >> (fl + 1)) {
            with_bit_floor<fl + 1>(n, callback);
        } else {
            callback.template operator()<1ULL << fl>();
        }
    }

You may see that, other than this helper function, the whole code is pretty much just a straightforward implementation of the "AND convolution". So, what makes it faster than everything else? There are several key reasons.

As a general rule, I always try to use autovectorization, rather than using intrinsics manually, because compilers are usually smart. So, if they have sufficient information about what you want them to do at compile time, they are often able to produce assembly that is close to what you'd want it to be if you were good at writing assembly yourself and did it manually.

Because of it, my main idea here was to provide the compiler as much information as possible about my problem, and in this particular case while the value of $$$n$$$ can be quite large, we always have it as a power of $$$2$$$, for which, on practice, there are only, like, 20 values of interest. As the implementation is recursive, we care about efficiency on all levels, so the best way to make sure compiler produces optimal code for all cases is, well, to simply give it the information about which level it currently works with at compile time.

This is basically what we do by putting the power at the compile-known template argument, rather than runtime-known regular function argument. The only thing left for us now is to ensure that the compiler vectorizes our code, for which we need to ensure the following:

Autovectorization is enabled: Use #pragma GCC with optimize("O3") and target("avx2");
Modint type is vectorization friendly: Addition and subtraction in modint should be branchless, or implemented with ternary operator (which autovectorizes well, unlike plain if-else).

XOR convolution then can be implemented in a very similar manner:

Code


    // Recursive FWHT (XOR) transform for size N (power of two)
    template<auto N>
    void xor_transform(auto &&a) {
        if constexpr (N == 1) {
            return;
        } else {
            constexpr auto half = N / 2;
            xor_transform<half>(&a[0]);
            xor_transform<half>(&a[half]);
            for (uint32_t i = 0; i < half; i++) {
                auto x = a[i] + a[i + half];
                auto y = a[i] - a[i + half];
                a[i] = x;
                a[i + half] = y;
            }
        }
    }

    // FWHT wrapper that deduces N at compile time via with_bit_floor
    inline void xor_transform(auto &&a, auto n) {
        with_bit_floor(n, [&]<auto NN>() {
            assert(NN == n);
            xor_transform<NN>(a);
        });
    }

    inline void xor_transform(auto &&a) {
        xor_transform(a, std::size(a));
    }

    // In-place XOR convolution on sequences of equal length (power of two)
    void xor_convolution_inplace(auto &a, auto &b) {
        auto N = static_cast<uint32_t>(std::size(a));
        xor_transform(a);
        xor_transform(b);
        for (uint32_t i = 0; i < N; i++) {
            a[i] *= b[i];
        }
        xor_transform(a);
        using base = std::decay_t<decltype(a[0])>;
        base ni = base(N).inv();
        for (auto &it : a) {
            it *= ni;
        }
    }

Subset convolutions

Okay, xor / and convolutions were a fairly easy warm-up exercise.

For the subset convolution, recall the general idea of subset convolution:

Group inputs by popcount;
Do xor / and transformation in each group independently;
For each mask, multiply its groups as polynomials;
Do inverse xor / and transformation in each group;

Thus, there are several main things we need to consider here:

Optimal layout;
Memory consumption;
Recursive transformation;
20x20 polynomial multiplication of $$$n$$$ polynomials;

Let's tackle them one by one.

Layout

Generally, we should decide, whether we want to store all groups per mask, or to store all masks per group. The second option might seem better in the sense that we may then just call standard OR / AND transformations for each group independently. This is, however, sub-optimal, because storing all groups per mask allows for better utilization of consecutive memory chunks (better for caching + more efficient vectorization).

That said, we will use A[mask][popcount] as our main layout.

Memory consumption

By default, subset convolution requires $$$2^n n$$$ memory, which is quite a lot (~180 MiB for N around $$$2^{20}$$$).

It is, however, possible to reduce it by a factor of $$$2^k$$$ at the cost of additional $$$3^k 2^{n-k}$$$ (with AND transformation) or $$$2^{n+k}$$$ (with XOR transformation) time consumption, an idea that I first saw in Qwerty1232's solutions. It, of course, has another downside that we will have to process rank vectors one by one "online" in increasing order of their mask, rather than having them all at once at our disposal.

The idea here is quite simple: Besides recursive formulations for AND / XOR transformations, we can express them explicitly as sum over supermasks (for AND convolution) or a WHT / weighted sum over all masks (for XOR convolution). As we only have $$$2^n$$$, rather than $$$2^n n$$$ elements in the input and output, we can "shortcut" computations at the top $$$k$$$ layers of the input/output transformations by directly going over corresponding masks and taking contribution only from non-zero inputs, or accumulating it only in outputs of interest.

In this manner, we will split $$$2^n$$$ masks into $$$2^{k}$$$ groups of $$$2^{n-k}$$$ masks each, and process one group at a time, going recursively only inside the group, and working around contributions from/to masks with different top $$$k$$$ bits manually.

While $$$3^k 2^{n-k}$$$ seems better than $$$2^{n+k}$$$, I nevertheless ended up using XOR convolution, for the reason that will be apparent below.

Recursive transformation

While reading up on old discussions on the topic, I found this comment by pajenegod which revealed that using XOR convolution instead of AND convolution allows us to reduce the number of masks we work with in half, because, unlike AND convolution, with XOR convolution, we can drop one bit from the mask, and in the end we will still be able to recover it from the popcount parity!

This is a pretty huge improvement, as it also reduces in half the heavier $$$2^n n^2$$$ time summand, ultimately outweighting the penalty we get from doing $$$2^{n+k}$$$ extra work on top $$$k$$$ bits instead of $$$3^k 2^{n-k}$$$.

Another useful improvements in this part of the algorithm (suggested by Qwerty1232):

On the lower levels, it's better to explicitly implement iterative version of the transformation;
On the top levels, it's better to recurse into 4 parts, rather than 2 parts.

Altogether, this yields the following code for the transformation:

Code

    template<auto N>
    inline void xor_transform(auto &&a) {
        if constexpr (N >> max_logn) {
            throw std::runtime_error("N too large for xor_transform");
        } else if constexpr (N <= 32) {
            for (size_t i = 1; i < N; i *= 2) {
                for (size_t j = 0; j < N; j += 2 * i) {
                    for (size_t k = j; k < j + i; k++) {
                        for (size_t z = 0; z < max_logn; z++) {
                            auto x = a[k][z] + a[k + i][z];
                            auto y = a[k][z] - a[k + i][z];
                            a[k][z] = x;
                            a[k + i][z] = y;
                        }
                    }
                }
            }
        } else {
            auto add = [&](auto &a, auto &b) __attribute__((always_inline)) {
                auto x = a + b, y = a - b;
                a = x, b = y;
            };
            constexpr auto quar = N / 4;

            for (size_t i = 0; i < (size_t)quar; i++) {
                auto x0 = a[i + (size_t)quar * 0];
                auto x1 = a[i + (size_t)quar * 1];
                auto x2 = a[i + (size_t)quar * 2];
                auto x3 = a[i + (size_t)quar * 3];

                #pragma GCC unroll max_logn
                for (size_t z = 0; z < max_logn; z++) {
                    add(x0[z], x2[z]);
                    add(x1[z], x3[z]);
                }
                #pragma GCC unroll max_logn
                for (size_t z = 0; z < max_logn; z++) {
                    add(x0[z], x1[z]);
                    add(x2[z], x3[z]);
                }

                a[i + (size_t)quar * 0] = x0;
                a[i + (size_t)quar * 1] = x1;
                a[i + (size_t)quar * 2] = x2;
                a[i + (size_t)quar * 3] = x3;
            }
            xor_transform<quar>(&a[quar * 0]);
            xor_transform<quar>(&a[quar * 1]);
            xor_transform<quar>(&a[quar * 2]);
            xor_transform<quar>(&a[quar * 3]);
        }
    }
    
    inline void xor_transform(auto &&a, auto n) {
        with_bit_floor(n, [&]<auto NN>() {
            assert(NN == n);
            xor_transform<NN>(a);
        });
    }
    
    inline void xor_transform(auto &&a) {
        xor_transform(a, std::size(a));
    }

20x20 polynomial multiplication

Now, there are several ways to implement basic 20x20 multiplication as well. You may try to do convolution for each mask individually, but it creates overhead due to alignment. You may consider doing something with FFT or Karatsuba, but it doesn't seem better than naive multiplication at these sizes.

What I ultimately ended up doing is, I process $$$K=4$$$ consecutive masks at once, putting their groups into 20 SIMD values, and then use _mm256_mul_epu32 to find their convolution as 64-bit numbers, and use Montgomery reduction to drop them back to being modulo $$$M$$$ in the end. The whole code then looks fairly simple:

Code

        constexpr size_t lgn = max_logn;
        outpa = on_rank_vectors([](auto &a, auto const& b) {
            std::decay_t<decltype(a)> res = {};
            const auto mod = base::mod();
            const auto imod = math::inv2(-mod);
            const auto r4 = u64x4() + uint64_t(-1) % mod + 1;
            auto add = [&](size_t i) {
                for(size_t j = 0; i + j + 1 < lgn; j++) {
                    res[i + j + 1] += (u64x4)_mm256_mul_epu32(__m256i(a[i]), __m256i(b[j]));
                }
                if constexpr (lgn >= 20) if (i == 15) {
                    for(size_t k = 0; k < lgn; k++) {
                        res[k] -= (res[k] >= base::modmod8()) & base::modmod8();
                    }
                }
            };
            for(size_t i = 0; i < lgn; i++) { add(i); }
            for(size_t k = 0; k < lgn; k++) {
                res[k] = montgomery_reduce(res[k], mod, imod);
                res[k] = montgomery_mul(res[k], r4, mod, imod);
                a[k] = res[k] >= mod ? res[k] - mod : res[k];
            }
        }, f, g);

Here, on_rank_vectors is the routine that does most of the heavy lifting, and just feeds SIMD-grouped values for the convolution into the callback. You may notice that I use i+j+1 in the summation, rather than i+j, which is another optimization that drops the case of popcount=0, which only happens in the zero-mask, and processes it externally. While seemingly insignificant, it keeps arrays per mask as multiples of $$$4$$$, which is pretty good for alignment.