I have written a matrix multiplication program that runs ~50x faster than the plain naive implementation and ~3x faster than an IKJ-order version with modular tricks, in just 55 lines of code (see lines 256 to 310 in my submission).
Typically, to reach this kind of speed (top 4 on Library Checker), you would need 300+ lines of code for a Strassen implementation.
Here is the link: https://judge.yosupo.jp/submission/310249. A lot of the code was based on https://mirror.codeforces.com/blog/entry/101655, aside from the tmp part.

Very cool! Do you have any advice on how to start programming for speed like this, like books to read or anything?
I learned about things like this by reading articles and figuring things out by myself. To get performance estimates, https://uops.info/ has a lot of per-instruction throughput and latency data, and https://godbolt.org/ is a useful tool for inspecting the generated assembly.
https://en.algorithmica.org/hpc/ is an excellent resource by sslotin (the author of the blog linked in the post).
Nice work! I think this kind of speed mainly from your excellent kernel.
An extension of this to FP32: https://judge.yosupo.jp/submission/314813. It reaches about 91% efficiency ($$$102/112$$$ GFLOPS for $$$m=n=k=5440$$$, counting multiplication and addition as two distinct operations) on a 3.5 GHz Zen 3.