Hi everyone!

Some time ago I started developing a linear algebra library for Library Checker, and it turned out to be much fun. I'd like to write this blog to outline how one can implement a linalg library with a very good Library Checker performance, while also avoiding highly technical low-level details that are common in other similar blogs. Instead, we will learn a few high level ideas that will let us write a code that the compiler will then optimize using all these fancy techniques *in our stead*.

Huge thanks to ToxicPie9 and magnus.hegdahl for explaining me some of the optimizations used here! You may also look into sslotin's entry about how it can be sped up even further with more advanced and low-level techniques.

### Tl'dr.

In this blog, I present my matrix library, which maintains a simple and elegant implementation of basic matrix operations without advanced techniques, while also having a remarkable performance on most Library Checker linear algebraic tasks:

- $$$1024\times 1024$$$ matrix product: 400ms, roughly top-15% of Library Checker submissions.
- $$$200\times 200$$$ matrix pow: 27ms with Frobenius form,
**top-1**performance. 200ms with binary exponentiation, top-10 performance. - $$$500 \times 500$$$ determinant: 29ms, top-2 performance.
- $$$nm \leq 2.5 \cdot 10^5$$$ rank of matrix: 38ms,
**top-1**performance. - $$$500 \times 500$$$ system of linear equations: 39ms,
**top-1**performance. - $$$500 \times 500$$$ inverse matrix: 84ms, top-2 performance (current top-1 is the same code + fast IO).
- $$$500 \times 500$$$ characteristic polynomial: 93ms,
**top-1**performance.

Unlike most implementations, I do not manually optimize each of these tasks separately, but rather use a uniform optimized approach to implementing them via simple array operations, and only the later ones are somewhat optimized (with a very simple high-level approach). This also allows to avoid a significant code bloat into unreadable mess, which is common for regular Library Checker's superfast solutions.

**Note**: All the implementations above are focused on integer arithmetics module a prime number known at compilation time. Results of operations when the underlying type does not form a field (e.g. pure integers, composite mod, etc) may be unexpected, and will most likely be so if any division is expected.

While the library theoretically also supports floating point underlying data types (as in, compiles when they're plugged in), this functional is completely experimental, untested and may easily behave in an unexpected manner.

### Vectorization

What would you consider to be the basic building block of matrix operations? Think about all the algorithms that you use. Finding inverse matrix, solving a system of linear equations, multiplying two matrices, adding two matrices. What do all they have in common?

Try to think as broadly as you possibly can, so that we can factor this part out, optimize it, and then rely on its efficient implementation in all these algorithms. Ideally, an operation that we *already know* how to implement efficiently and take advantage of when we work in vector spaces over $$$\mathbb F_2$$$ (yes, I'm talking about the xor basis via 64-bit integers or bitsets).

You may think about arithmetic operations on individual numbers here, but if you compare this idea to what we do in xor basis construction, you'll understand that we actually do our best to *avoid* working with individual remainders modulo $$$2$$$, and instead try to batch the operations, so that they're done "in parallel". So, the answer to my initial question would be the following operation:

Given a sequence $$$a_1,\dots,a_m$$$, a sequence $$$b_1,\dots,b_m$$$ and a constant $$$t$$$, find the sequence $$$a_1 + tb_1, \dots, a_m + tb_m$$$.

On one hand, you are likely to encounter such operation in whatever linear algebraic operation you do. On the other hand, modern compilers are very good at auto-vectorizing such operation via SIMD, if you write them as a simple for-loop. This vectorization, in its nature, is quite similar to how bitsets let us "parallelize" bit operations (although the speed up is typically much lower, around x4, not x64).

In other words, `#pragma GCC target("avx2,tune=native")`

tends to be *really* useful in linear algebraic operations, as long as you implement them in a way that is compatible with whatever compiler is able to automatically optimize.

In the future, we will refer to this operation as **Fused Multiply-Add** (FMA).

So, for now we should accept the paradigm, that whenever we do linear algebraic operations, we should perceive them as operations on *vectors* (represented by contiguous arrays), rather than individual numbers. In computer science, you would likely call such approach an array programming, as it forces us to focus on arrays of data as a whole, rather than on individual elements.

There is of course still a place in linear algebra for operations that deal with coefficients directly, especially when we're talking about efficiently implementing FMA itself or about some operations of much lower computational complexity compared to the "main" operation we're doing. But even then, direct number manipulation tend to make the code less elegant and more prone to slowdowns, thus it should be avoided as much as possible.

### Matrix multiplication

To put up an example, let's start with the simplest of operations: Matrix multiplication. If you have a matrix $$$A \in \mathbb R^{n\times t}$$$ and $$$B \in \mathbb R^{t \times m}$$$, then their product is the matrix $$$C = AB \in \mathbb R^{n\times m}$$$, such that $$$C_{ij} = \sum_k A_{ik} B_{kj}$$$. With this definition, you might be tempted to implement matrix multiplication as follows:

```
for(size_t i = 0; i < n; i++) {
for(size_t j = 0; j < m; j++) {
for(size_t k = 0; k < t; k++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
```

Surely enough, this is a terrible implementation. It doesn't vectorize, and it also has a lot of cache misses. One common advice to improve its performance is to transpose matrix $$$B$$$, so that your main loop operation turns into `C[i][j] += A[i][k] * B[j][k]`

instead.

And this, in fact, is a bad advice too! While it reduces cache misses, it still isn't good enough for auto-vectorization. The reason for this is that while we do iterate `A[i]`

and `B[j]`

as vectors, their product is stored in the same variable `C[i][j]`

, which prevents parallelesation. So, how do we mitigate this? We rewrite matrix product as follows:

```
for(size_t i = 0; i < n; i++) {
for(size_t k = 0; k < t; k++) {
for(size_t j = 0; j < m; j++) {
C[i][j] += A[i][k] * B[k][j];
}
}
}
```

This doesn't seem like a significant change at first. We just swapped the order of two loops, what effect could it have? But look, there is nothing prone to cache misses now, as the inner-most variable $$$j$$$ is only used to access matrices $$$C$$$ and $$$B$$$ by the second index. But even more importantly, the inner-most operation itself is the aforementioned fused multiply-add operation! Indeed, we treat `C[i]`

as a vector, to which we add a vector `B[k]`

multiplied by the constant `A[i][k]`

.

In other words, it would make more sense if we implement it roughly as follows, treating `C[i]`

and `B[k]`

as *vectors*:

```
for(size_t i = 0; i < n; i++) {
for(size_t k = 0; k < t; k++) {
C[i] += A[i][k] * B[k];
}
}
```

Note how this approach also simplifies the implementation notationally if you implement separate wrappers for vector operations!

Let's see how the two implementations mentioned above compare on **Library Checker — Matrix Product**:

- 3897ms — Naive implementation
- 1855ms — Vectorized implementation (without avx)
- 1394ms — Vectorized implementation (with avx)

Note the speedup from first to second implementation is just from rearranging loops in a more cache friendly manner, and the actual vectorization speedup only kicks in with the second implementation. Another important thing to mention is that in these submissions, we relied on `__int128`

, rather than utilizing the fact that the problem asks us to print the remainder modulo $$$998\;244\;353$$$, which allows us to optimize even further. If we *do* utilize some very simple modular arithmetics tricks, we can improve even further:

- 835ms — Optimized implementation (without avx)
- 377ms — Optimized implementation (with avx)

From 3897ms to 377ms... Almost 10x improvement. Quite a bit, isn't it? If you look in the rankings, the solution above lands us on the second page of fastest submissions, meaning that it is roughly in top-10% of submissions, **and we didn't use anything fancy to get here**!

### Considerations for modular arithmetics

In the scenaries above, we assumed that we work with integers, for which there are SIMD instructions that could be utilised by the compiler for vectorization. Surely enough, there don't seem to be such instructions when remainders are involved. So... What do we do about it?

There are several options that we have:

- Use
`__int128`

. A pretty bad option because operations with them are quite expensive. Moreover, it only works for computing matrix product, and if instead we would like to e.g. find the determinant, or find the inverse matrix,`__int128`

won't save us. - Use
`uint64_t`

for remainders and do`r = min(r, r - mod*mod)`

in the FMA body. This is a better idea because AVX-512 has min_epu64 operation, and because in the library implementation, we can simply use it as a part of modular arithmetic class. - Use
`uint64_t`

for remainders and do`r = min(r, r - 8*mod*mod)`

every 8 FMA operations. This is the best option in terms of performance, esp. if AVX-512 is not supported, but it can't be easily implement as a part of a separate modular arithmetics class.

We will focus on the third option. It looks roughly as follows:

```
for(int i = 0; i < n; i++) {
for(int k = 0; k < t; k++) {
for(int j = 0; j < m; j++) {
C[i][j] += A[i][k] * B[k][j];
}
if(k % 8 == 0) {
for(int j = 0; j < m; j++) {
C[i][j] = min(C[i][j], C[i][j] - modmod8);
}
}
}
}
```

To better explain the problem with it, when you write a library, you generally want to encapsulate things a lot, and then you'd likely want to keep the implementation of actual linear algebraic operations independent from the implementation of the underlying storage type (e.g. floating-point numbers or a modular arythmetics type). And while the third option provides the best performance, you can't reasonably book-keep the number of prior FMA operations within modular arithmetics type. The solution that I ultimately decided to follow is to adhere to the following model:

- modint<mod> implements only basic modular arithmetics using
`uint64_t`

as a storage for remainders, but also provides functions to do`add_unsafe`

(without checking whether the result is below`mod`

),`normalize`

(take remainder modulo`mod`

) or`pseudonormalize`

(take remainder modulo`8*mod*mod`

). - vec<base> serves as a wrapper to
`std::valarray<base>`

, but also provides its own`normalize`

function to normalize all its elements and`add_scaled`

to do FMA, but also maintains a counter of how many FMA operations were applied to it, and doing pseudo-normalize on the elements if the counter hits $$$8$$$. - matrix<base> avoids working with
`base`

directly, and refers to`vec<base>`

's FMA and`normalize`

instead.

A separation like this allows us to reduce to minimum how much we need to think about (pseudo)normalization of remainders when implementing actual matrix functions. In this paradigm e.g. matrix product takes the following form:

```
matrix operator *(matrix const& b) const {
matrix res(n, b.m);
for(size_t i = 0; i < n; i++) {
for(size_t j = 0; j < m; j++) {
res[i].add_scaled(b[j], row(i)[j]);
}
}
return res.normalize();
}
```

At the same time, `vec<base>`

has an override for `add_scaled`

when `base`

is a modular arithmetics class:

**add_scaled implementation**

We store matrix rows as `vec<base>`

and then call to `add_scaled`

and `normalize`

instead of working with `base`

.

### Gaussian elimination

When dealing with linear algebra, you typically expect the following operations to be available:

- Matrix product;
- Finding an inverse matrix;
- Finding the determinant of a matrix;
- Finding the rank of a matrix;
- Finding a solution to a system of linear equations.

We already covered matrix product, as it is the simpler of operations. All the remaining operations, however, require gaussian elimination in one way or another. If you're just presented with one particular of these problems, it's natural to implement a specific version of gaussian elimination that is relevant to it. But when you write a library code, it seems somewhat unreasonable to implement gaussian elimination over and over again. What can we do about it?

Ideally, we would like to concentrate the essence of gaussian elimination in a single form, and then use its results to compute whatever we want, without having any intermediate knowledge about what was going on in the intermediate steps. In its essense, what gaussian elimination does for any matrix is finding its row echelon form, with some nuances.

Also, to be more specific gaussian elimination only uses one of the following "elementary operations":

- Multiply a row by a number $$$t$$$, and add the result to another row;
- Multiply a row by a number $$$t$$$;
- Swap two rows.

Noteworthy, the first operation does not affect matrix's determinant, while the second option multiplies the determinant by $$$t$$$ and the third option multiplies the determinant by $$$-1$$$. As already mentioned, we would want to only care about the result of gaussian elimination, without caring about what happened during it. As such, **we will restrict ourselves to only using the following operations**:

- Multiply a row by a number $$$t$$$, and add the result to another row;
- Swap two rows
*and multiply one of them with $$$-1$$$*.

Furthermore, we will separate the elimination in two parts, one only doing the first operation (hence, find the reduced form for a basis of matrix rows), while the second will assume that the basis is already reduced, and will simply do the sorting of rows and classification of coordinates (columns/variables) into "free" and "pivots". Let's discuss these two parts separately.

#### Basis reduction

If you're familiar with xor basis, you know that it roughly looks as follows:

```
// Subtract a's contribution from b
void reduce(int &b, int a) {
b = min(b, b ^ a);
}
void add_vector(vector<int> &basis, int x) {
for(int it: basis) {
reduce(x, it);
}
if(x) {
basis.push_back(x);
// Optional: Reduce basis vectors by x and sort them
}
}
```

And you may also know that it *somehow* related to gaussian elimination, except for people usually write gaussian elimination in another way because... Reasons? Well, the fact is, we absolutely *can* write general gaussian elimination in the same paradigm:

```
enum gauss_mode {normal, reverse};
template<gauss_mode mode = normal>
matrix& gauss() {
for(size_t i = 0; i < n(); i++) {
row(i).normalize();
for(size_t j = (mode == normal) * i; j < n(); j++) {
if(j != i) {
row(j).reduce_by(row(i));
}
}
}
return normalize();
}
```

The only thing that we need is implement `reduce_by`

in the `vec<base>`

class, which will keep track of the leading non-zero element of each row (and its inverse), and then subtract the row appropriately from another row, if the leading coefficient of the subtracted row occurs later than the leading coefficient of the row from which it is subtracted. This is a general equivalent of `min(b, b ^ a)`

:

**reduce_by implementation**

#### Basis sorting / variable classification

Note than in your typical gaussian elimination, you end up with a matrix in the row echelon form. However, after doing the simplified algorithm above, while we get something similar to a row echelon form, the rows of the matrix are *not* sorted by their leading element, and it might prevent us from using the algorithm above for finding determinant, inverse matrix, etc.

To mitigate this, we add another function that does one final sweep over the matrix and sorts its rows, while also classifying the columns of the matrix into ones that have a corresponding pivotal row, and those that do not. This classification is crucial for e.g. solving systems of linear equations, as pivotal columns/variables allow us to determine a particular solution to the system, while the free variables allow us to find the basis in the kernel of the matrix, and thus describe all general solutions to the linear system.

So, we end up with a function `echelonize`

that does both gaussian elimination and the consequent sorting:

```
template<gauss_mode mode = normal>
auto echelonize(size_t lim) {
return gauss<mode>().sort_classify(lim);
}
template<gauss_mode mode = normal>
auto echelonize() {
return echelonize<mode>(m());
}
```

Here `lim`

is the number of columns/variables of interest, importance of which we will see later.

**sort_classify implementation**

#### Implementing some standard operations

Now that we have the code above, we can find matrix rank, determinant or the inverse matrix in a uniform short manner:

```
size_t rank() const {
if(n() > m()) {
return T().rank();
}
return size(matrix(*this).echelonize()[0]);
}
base det() const {
assert(n() == m());
matrix b = *this;
b.echelonize();
base res = 1;
for(size_t i = 0; i < n(); i++) {
res *= b[i][i];
}
return res;
}
std::optional<matrix> inv() const {
assert(n() == m());
matrix b = *this | eye(n());
if(size(b.echelonize<reverse>(n())[0]) < n()) {
return std::nullopt;
}
for(size_t i = 0; i < n(); i++) {
b[i] *= base(1) / b[i][i];
}
return b.submatrix(std::slice(0, n(), 1), std::slice(n(), n(), 1));
}
```

Note that for inverse matrix calculation, we used two new sub-routines:

- Operator
`|`

for matrix concatenation:`A | B`

appends all rows of`B`

to rows of`A`

. `submatrix(x, y)`

takes two std::slice, and returns a submatrix of rows and columns indexed by`x`

and`y`

correspondingly.

Together, these two functions allow us to implement inverse matrix computation in a very natural way that mirrors the standard algorithm:

To compute an inverse matrix $$$A^{-1}$$$, take a block matrix $$$A | I_n$$$ and transform it into $$$I_n | B$$$ using only row operations, then $$$B = A^{-1}$$$.

Thinking this way also largely helped me when implementing Frobenius form via the algorithm described by Elegia's alt account, as I simply appended an extra vector to each of "random" vectors to keep track of linear combinations they form during basis reduction.

### Conclusion

As you see, we got a top performing matrix library, while not really caring about performance in the implementation of actual linear algebraic operations, which allowed us to keep these implementations reasonably simple and elegant. Moreso, even where the performance actually mattered (vector FMA), we managed to only utilize the high level auto-vectorization idea to get a top-performing solution, which marks how useful even very superficial understanding of modern high performance computing can make a difference, *even* when you are not ready to go the extra mile to dive deeper into it.