Fast matrix multiplication does not need to be hard

By i_love_sqrt_decomp, history, 8 months ago, In English

I have written a matrix multiplication program that runs about 50x faster than the plain naive implementation and about 3x faster than IKJ loop order with modular tricks, in just 55 lines of code (see lines 256 to 310 in my submission).

Typically, to get this kind of speed (top 4 on Library Checker), you would have to write 300+ lines of code for a Strassen implementation.

Here is the link: https://judge.yosupo.jp/submission/310249. A lot of the code was based on https://mirror.codeforces.com/blog/entry/101655, aside from the tmp part.
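For readers unfamiliar with the two baselines mentioned above, here is a minimal sketch of what they might look like. It is not taken from the linked submission: the modulus 998244353, the `Matrix` alias, and the function names `mul_naive` and `mul_ikj` are assumptions made for illustration, and "modular tricks" is interpreted here as accumulating in 64-bit integers and reducing only every 16 steps.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

using u32 = std::uint32_t;
using u64 = std::uint64_t;
using Matrix = std::vector<std::vector<u32>>;

constexpr u32 MOD = 998244353;  // assumed modulus, entries are expected in [0, MOD)

// Plain IJK order: the inner loop walks down a column of b (strided access)
// and performs one modular reduction per multiply-add.
Matrix mul_naive(const Matrix& a, const Matrix& b) {
    std::size_t n = a.size(), m = b.size(), p = b.empty() ? 0 : b[0].size();
    Matrix c(n, std::vector<u32>(p, 0));
    for (std::size_t i = 0; i < n; ++i)
        for (std::size_t j = 0; j < p; ++j) {
            u64 sum = 0;
            for (std::size_t k = 0; k < m; ++k)
                sum = (sum + (u64)a[i][k] * b[k][j]) % MOD;
            c[i][j] = (u32)sum;
        }
    return c;
}

// IKJ order: the inner loop walks along a row of b (contiguous access).
// Each product is below MOD^2 < 2^60, so 16 of them plus a reduced value
// still fit in 64 bits, and the reduction can be delayed.
Matrix mul_ikj(const Matrix& a, const Matrix& b) {
    std::size_t n = a.size(), m = b.size(), p = b.empty() ? 0 : b[0].size();
    Matrix c(n, std::vector<u32>(p, 0));
    std::vector<u64> acc(p);
    for (std::size_t i = 0; i < n; ++i) {
        std::fill(acc.begin(), acc.end(), 0);
        for (std::size_t k = 0; k < m; ++k) {
            const u64 aik = a[i][k];
            for (std::size_t j = 0; j < p; ++j) acc[j] += aik * b[k][j];
            if ((k & 15) == 15)  // reduce after every 16 accumulated products
                for (std::size_t j = 0; j < p; ++j) acc[j] %= MOD;
        }
        for (std::size_t j = 0; j < p; ++j) c[i][j] = (u32)(acc[j] % MOD);
    }
    return c;
}
```

Both functions return the same product; the second is only a rough stand-in for the IKJ baseline, and the actual speedups quoted above come from the kernel in the linked submission, not from this sketch.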

Comments (5)

coordinatebash

8 months ago

Very cool, do you have advice on how to start programming for speed like this, like books to read or anything?

i_love_sqrt_decomp

8 months ago

I learned about things like this by reading articles and figuring things out by myself. For performance estimates, https://uops.info/ has a lot of performance data for individual instructions. https://godbolt.org/ is another useful tool for inspecting the compiled code.

virinci

7 months ago

https://en.algorithmica.org/hpc/ is an excellent resource by sslotin (the author of the blog linked in the post).

QedDust413

8 months ago

Nice work! I think this kind of speed comes mainly from your excellent kernel.

i_love_sqrt_decomp

7 months ago

An extension of this to FP32: https://judge.yosupo.jp/submission/314813. It gets about 91% efficiency ($102/112$ GFLOPS for $m=n=k=5440$, counting multiplication and addition as 2 distinct operations) on a 3.5 GHz Zen 3.
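The 112 GFLOPS figure is not broken down in the comment. One way to arrive at it, sketched below, is to assume a single core with 2 FMA pipes, each processing 8 FP32 lanes (256-bit vectors) per cycle, and to count each fused multiply-add as 2 operations. The constant and function names are hypothetical and only serve the arithmetic.

```cpp
// Assumed single-core peak model (not stated in the comment itself):
// 3.5 GHz clock, 2 FMA pipes, 8 FP32 lanes per 256-bit vector,
// 2 floating-point operations (multiply and add) per FMA.
constexpr double kClockGhz = 3.5;
constexpr int kFmaPipes = 2;
constexpr int kFp32Lanes = 8;
constexpr int kFlopsPerFma = 2;

// Theoretical peak in GFLOPS under the assumptions above.
constexpr double peak_gflops() {
    return kClockGhz * kFmaPipes * kFp32Lanes * kFlopsPerFma;
}

// Fraction of the peak reached by a measured throughput.
constexpr double efficiency(double achieved_gflops) {
    return achieved_gflops / peak_gflops();
}

// Operation count of an m x k by k x n product: one multiply and one add
// per (i, j, k) triple.
constexpr double matmul_flops(double m, double n, double k) {
    return 2.0 * m * n * k;
}
```

Under these assumptions `peak_gflops()` evaluates to 112 and `efficiency(102.0)` to roughly 0.91, which matches the numbers quoted in the comment.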
