Hi everyone!
There are 3 roughly similar problems on Library Checker:
All of them ask us to find $$$c_0, \dots, c_{2^n-1}$$$, where

$$$c_k = \sum\limits_{i \circ j = k} a_i b_j,$$$

and $$$\circ$$$ stands for bitwise xor, bitwise and, and "disjoint or" correspondingly.
As of now, my submissions are the fastest ones for all 3 problems, and in this blog I'd like to explain how this is achieved.
XOR / AND convolutions are fairly simple, which is why my 30 ms submissions are closely followed by others. However, for subset convolution, my submission uses 94 ms with just 34 MiB, while all other submissions (except the one by Qwerty1232 on the second place, whose approach is similar to mine) use at least 300 ms and 180 MiB each.
If you want to check whether your subset convolution is good, I would suggest trying out 1034E - Little C Loves 3 III. Its constraints are very tight in both time and memory specifically to cut off subset convolution, but an optimal implementation would pass (see e.g. 355217827 by me or 355275135 by Qwerty1232).
Prerequisite: Know how the problems above are solved (e.g. see this blog).
Xor, And convolutions
Let's start with bitwise xor / and, as they're much simpler. If you look over some other top solutions, you will often see manual vectorization with SIMD intrinsics. It is, however, completely unnecessary, as the whole convolution can be implemented like this:
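The code block itself isn't reproduced here, but to illustrate the idea, here is a minimal sketch (my reconstruction, not the exact submission) of an AND convolution in this compile-time-size style, with plain int arithmetic standing in for the modint:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: AND convolution via the "sum over supermasks" transform,
// with the array size as a compile-time constant so that every
// recursion level can be fully autovectorized.
template<size_t N>
void and_transform(int *a, bool inverse) {
    if constexpr (N > 1) {
        constexpr size_t H = N / 2;
        // masks without the top bit absorb (or give back) those with it
        for (size_t i = 0; i < H; i++)
            a[i] = inverse ? a[i] - a[i + H] : a[i] + a[i + H];
        and_transform<H>(a, inverse);
        and_transform<H>(a + H, inverse);
    }
}

template<size_t N>
std::vector<int> and_convolution(std::vector<int> a, std::vector<int> b) {
    and_transform<N>(a.data(), false);
    and_transform<N>(b.data(), false);
    for (size_t i = 0; i < N; i++)
        a[i] *= b[i];           // pointwise product in transformed space
    and_transform<N>(a.data(), true);
    return a;
}
```

In the real code, the runtime size would be dispatched into such a template through the with_bit_floor helper, and int would be replaced by a branchless modint.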
Here, with_bit_floor is a helper function that, given an integer $$$n$$$, returns $$$2^{\lfloor \log_2 n\rfloor}$$$ as a template parameter:
template<int fl = 0>
void with_bit_floor(size_t n, auto &&callback) {
    if constexpr (fl >= 63) {
        return; // no larger bit floors representable in 64 bits
    } else if (n >> (fl + 1)) {
        // n has a set bit above position fl, keep searching
        with_bit_floor<fl + 1>(n, callback);
    } else {
        // 2^fl <= n < 2^{fl+1}, so 2^fl is the bit floor of n
        callback.template operator()<1ULL << fl>();
    }
}
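For example, a small (hypothetical) usage of this helper, turning a runtime value into its compile-time bit floor, could look as follows (the helper is repeated so the snippet compiles on its own):

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

template<int fl = 0>
void with_bit_floor(size_t n, auto &&callback) {
    if constexpr (fl >= 63) {
        return;
    } else if (n >> (fl + 1)) {
        with_bit_floor<fl + 1>(n, callback);
    } else {
        callback.template operator()<1ULL << fl>();
    }
}

// Dispatch a runtime value into a callback that receives
// its bit floor as a compile-time template parameter.
uint64_t bit_floor_of(size_t n) {
    uint64_t res = 0;
    with_bit_floor(n, [&]<auto N>() { res = N; });
    return res;
}
```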
You may see that, other than this helper function, the whole code is pretty much just a straightforward implementation of the "AND convolution". So, what makes it faster than everything else? There are several key reasons.
As a general rule, I always try to rely on autovectorization rather than manual intrinsics, because compilers are usually smart: given sufficient compile-time information about what you want them to do, they often produce assembly close to what a skilled assembly programmer would write by hand.
Because of this, my main idea was to give the compiler as much information about the problem as possible. In this particular case, while the value of $$$n$$$ can be quite large, it is always a power of $$$2$$$, and in practice there are only about 20 values of interest. As the implementation is recursive, we care about efficiency on all levels, so the best way to make sure the compiler produces optimal code for all cases is to simply tell it, at compile time, which level it is currently working with.
This is exactly what we do by passing the power of $$$2$$$ as a compile-time template argument, rather than a runtime function argument. The only thing left is to ensure that the compiler actually vectorizes our code, for which we need the following:
- Autovectorization is enabled: Use #pragma GCC with optimize("O3") and target("avx2");
- Modint type is vectorization friendly: Addition and subtraction in modint should be branchless, or implemented with the ternary operator (which autovectorizes well, unlike plain if-else).
XOR convolution then can be implemented in a very similar manner:
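Again, the original block isn't shown here; a sketch in the same spirit (plain int arithmetic instead of a modint), using the usual Walsh-Hadamard butterflies, might look like:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sketch: XOR (Walsh-Hadamard) transform in the same
// compile-time-size style; applying it twice scales by N.
template<size_t N>
void xor_transform(int *a) {
    if constexpr (N > 1) {
        constexpr size_t H = N / 2;
        for (size_t i = 0; i < H; i++) {
            int x = a[i], y = a[i + H];
            a[i] = x + y;       // "+1" character of the top bit
            a[i + H] = x - y;   // "-1" character of the top bit
        }
        xor_transform<H>(a);
        xor_transform<H>(a + H);
    }
}

template<size_t N>
std::vector<int> xor_convolution(std::vector<int> a, std::vector<int> b) {
    xor_transform<N>(a.data());
    xor_transform<N>(b.data());
    for (size_t i = 0; i < N; i++)
        a[i] *= b[i];
    xor_transform<N>(a.data());
    for (auto &x : a)
        x /= (int)N;            // undo the scaling of the double transform
    return a;
}
```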
Subset convolutions
Okay, xor / and convolutions were a fairly easy warm-up exercise.
For subset convolution, recall the general idea:
- Group inputs by popcount;
- Do xor / and transformation in each group independently;
- For each mask, multiply its groups as polynomials;
- Do inverse xor / and transformation in each group;
Thus, there are several main things we need to consider here:
- Optimal layout;
- Memory consumption;
- Recursive transformation;
- 20x20 polynomial multiplication for each of the $$$2^n$$$ masks;
Let's tackle them one by one.
Layout
Generally, we should decide whether we want to store all groups per mask, or all masks per group. The second option might seem better, as we could then just call the standard OR / AND transformations for each group independently. It is, however, sub-optimal, because storing all groups per mask allows for better utilization of consecutive memory chunks (better for caching + more efficient vectorization).
Therefore, we will use A[mask][popcount] as our main layout.
Memory consumption
By default, subset convolution requires $$$2^n n$$$ memory, which is quite a lot (~180 MiB for N around $$$2^{20}$$$).
It is, however, possible to reduce it by a factor of $$$2^k$$$ at the cost of $$$3^k 2^{n-k}$$$ (with the AND transformation) or $$$2^{n+k}$$$ (with the XOR transformation) additional time, an idea that I first saw in Qwerty1232's solutions. Of course, it has another downside: we have to process rank vectors one by one "online", in increasing order of their mask, rather than having them all at once at our disposal.
The idea here is quite simple: Besides recursive formulations for AND / XOR transformations, we can express them explicitly as sum over supermasks (for AND convolution) or a WHT / weighted sum over all masks (for XOR convolution). As we only have $$$2^n$$$, rather than $$$2^n n$$$ elements in the input and output, we can "shortcut" computations at the top $$$k$$$ layers of the input/output transformations by directly going over corresponding masks and taking contribution only from non-zero inputs, or accumulating it only in outputs of interest.
In this manner, we will split $$$2^n$$$ masks into $$$2^{k}$$$ groups of $$$2^{n-k}$$$ masks each, and process one group at a time, going recursively only inside the group, and working around contributions from/to masks with different top $$$k$$$ bits manually.
While $$$3^k 2^{n-k}$$$ seems better than $$$2^{n+k}$$$, I nevertheless ended up using XOR convolution, for the reason that will be apparent below.
Recursive transformation
While reading up on old discussions on the topic, I found this comment by pajenegod, which revealed that using the XOR convolution instead of the AND convolution allows us to halve the number of masks we work with: unlike with the AND convolution, with the XOR convolution we can drop one bit from the mask, and in the end we will still be able to recover it from the popcount parity!
This is a pretty huge improvement, as it also halves the heavier $$$2^n n^2$$$ time summand, ultimately outweighing the penalty we get from doing $$$2^{n+k}$$$ extra work on the top $$$k$$$ bits instead of $$$3^k 2^{n-k}$$$.
Other useful improvements in this part of the algorithm (suggested by Qwerty1232):
- On the lower levels, it's better to explicitly implement an iterative version of the transformation;
- On the top levels, it's better to recurse into 4 parts, rather than 2.
Altogether, this yields the following code for the transformation:
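The original code isn't reproduced here; a simplified, non-SIMD sketch of a transform with this structure (iterative base case below a cutoff, radix-4 recursion above it) could look like:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch: XOR transform that recurses into 4 quarters on large sizes
// (combining the two top bits in one pass over the array) and switches
// to a plain iterative version below a cutoff.
template<size_t N>
void xor_transform_rec(int64_t *a) {
    if constexpr (N <= 64) {
        // iterative version on the lower levels
        for (size_t h = 1; h < N; h <<= 1)
            for (size_t m = 0; m < N; m++)
                if (m & h) {
                    int64_t x = a[m ^ h], y = a[m];
                    a[m ^ h] = x + y;
                    a[m] = x - y;
                }
    } else {
        constexpr size_t Q = N / 4;
        // radix-4 butterfly over the two top bits
        for (size_t i = 0; i < Q; i++) {
            int64_t x0 = a[i], x1 = a[i + Q];
            int64_t x2 = a[i + 2 * Q], x3 = a[i + 3 * Q];
            a[i]         = x0 + x1 + x2 + x3;
            a[i + Q]     = x0 - x1 + x2 - x3;
            a[i + 2 * Q] = x0 + x1 - x2 - x3;
            a[i + 3 * Q] = x0 - x1 - x2 + x3;
        }
        xor_transform_rec<Q>(a);
        xor_transform_rec<Q>(a + Q);
        xor_transform_rec<Q>(a + 2 * Q);
        xor_transform_rec<Q>(a + 3 * Q);
    }
}
```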
20x20 polynomial multiplication
Now, there are several ways to implement basic 20x20 multiplication as well. You may try to do convolution for each mask individually, but it creates overhead due to alignment. You may consider doing something with FFT or Karatsuba, but it doesn't seem better than naive multiplication at these sizes.
What I ultimately ended up doing is processing $$$K=4$$$ consecutive masks at once, putting their groups into 20 SIMD values, then using _mm256_mul_epu32 to compute their convolution as 64-bit numbers, and applying Montgomery reduction in the end to bring them back modulo $$$M$$$. The whole code then looks fairly simple:
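The actual SIMD code isn't reproduced here; as a scalar stand-in (my sketch, with a hypothetical multiply_ranks helper and the modulus hard-coded), the per-mask step amounts to:

```cpp
#include <cassert>
#include <cstdint>

// Scalar stand-in for the per-mask polynomial step: the rank vectors
// of a and b for one mask are multiplied as polynomials in the
// popcount variable, with products accumulated as 64-bit numbers and
// reduced only occasionally, so the accumulator never overflows.
constexpr uint64_t M = 998244353;

void multiply_ranks(const uint32_t *a, const uint32_t *b,
                    uint32_t *c, int n) {
    for (int k = 0; k <= n; k++) {
        uint64_t acc = 0;
        for (int i = 0; i <= k; i++) {
            acc += (uint64_t)a[i] * b[k - i]; // each product < M^2 < 2^60
            if (i % 16 == 15)
                acc %= M; // rare reduction keeps acc below 2^64
        }
        c[k] = (uint32_t)(acc % M);
    }
}
```

The real version performs this for 4 masks in parallel, keeping the 64-bit products in AVX2 registers via _mm256_mul_epu32 and replacing the % by Montgomery reduction.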
Here, on_rank_vectors is the routine that does most of the heavy lifting, and just feeds SIMD-grouped values for the convolution into the callback. You may notice that I use i+j+1 in the summation, rather than i+j, which is another optimization that drops the case of popcount=0, which only happens in the zero-mask, and processes it externally. While seemingly insignificant, it keeps arrays per mask as multiples of $$$4$$$, which is pretty good for alignment.
Other problems
I already mentioned 1034E - Little C Loves 3 III where this can be useful. Here are some others:
- Exp of Set Power Series
- Polynomial Composite Set Power Series (see blog by Elegia)
- Power Projection of Set Power Series (Transpose of the algorithm above)
- Chromatic Polynomial
Each of them uses subset convolution as a black box, and my solution is top-1 in all of them too, both for time and memory consumption. For exp and similar algorithms, it is also possible to implement them directly on rank vectors (typically using differential techniques), but this is usually more complicated than just relying on an already efficient implementation of subset convolution.