Gornak40's blog

By Gornak40, history, 2 years ago, translation, In English

GCC supports inline assembly. This can be useful, for example, for multiplying two 64-bit numbers modulo a 64-bit modulus.

When two 64-bit values are multiplied with the one-operand form of mul/imul, the processor stores the 128-bit result in a register pair: rdx (upper half) and rax (lower half). Division works in a similar way: the dividend is taken from rdx:rax, after which the quotient is stored in rax and the remainder in rdx.
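To make the register-pair convention concrete, here is a small sketch that emulates it with GCC's unsigned __int128 instead of real registers. The names Wide, QuotRem, wide_mul and wide_div are hypothetical helpers introduced only for illustration.

```cpp
#include <cstdint>

// Hypothetical helper: the two 64-bit halves of a 128-bit value,
// laid out the way the CPU uses rdx (hi) and rax (lo).
struct Wide { uint64_t hi, lo; };

// Hypothetical helper: quotient and remainder, as left in rax and rdx.
struct QuotRem { uint64_t quot, rem; };

// Emulates the one-operand `mul`: full 64x64 -> 128-bit product.
inline Wide wide_mul(uint64_t a, uint64_t b) {
	unsigned __int128 p = (unsigned __int128)a * b;
	return { (uint64_t)(p >> 64), (uint64_t)p };
}

// Emulates `div`: divides hi:lo by m.
// Precondition: x.hi < m, otherwise the quotient does not fit in 64 bits
// (the real instruction raises a divide error in that case).
inline QuotRem wide_div(Wide x, uint64_t m) {
	unsigned __int128 n = ((unsigned __int128)x.hi << 64) | x.lo;
	return { (uint64_t)(n / m), (uint64_t)(n % m) };
}
```

Chaining the two, wide_div(wide_mul(a, b), m).rem is a * b mod m, which is what the mul/div instruction pair computes when a and b are already smaller than m.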

Using this knowledge, you can implement an analog of the following function:

inline long long mul(long long a, long long b) {
	return (__int128)a * b % 1000000014018503;
}

using inline assembly:

inline long long mul(long long a, long long b) {
	long long res;
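	asm(
		"mov %1, %%rax\n"                // rax = a
		"mov %2, %%rbx\n"                // rbx = b
		"imul %%rbx\n"                   // rdx:rax = rax * rbx
		"mov $1000000014018503, %%rbx\n" // rbx = modulus
		"idiv %%rbx\n"                   // rax = quotient, rdx = remainder
		"mov %%rdx, %0\n"                // res = remainder
		:"=r"(res)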
		:"a"(a), "b"(b)
	);
	return res;
}

We declare res as an output operand and a and b as input operands; they are referred to as %0, %1 and %2 respectively. The "a" and "b" constraints place the inputs in rax and rbx. Instructions are written in the standard AT&T syntax.
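The snippet above overwrites rax, rbx and rdx without telling the compiler. Below is a sketch of one way to express the same computation so that every register the assembly touches appears in the operand lists. The name mul_asm is hypothetical, the __int128 fallback for non-x86-64 targets is an addition for portability, and the inputs are assumed to be already reduced (|a|, |b| < MOD), since idiv faults when the quotient does not fit in 64 bits.

```cpp
#include <cstdint>

// Sketch: signed modular multiplication by the constant from the post.
// Assumes |a| < MOD and |b| < MOD so that the idiv quotient fits in 64 bits.
inline long long mul_asm(long long a, long long b) {
	const long long MOD = 1000000014018503LL;
#if defined(__GNUC__) && defined(__x86_64__)
	long long rem;
	asm(
		"imul %[b]\n\t"   // rdx:rax = rax * b
		"idiv %[m]\n\t"   // rax = quotient, rdx = remainder
		: "=&d"(rem),     // remainder comes back in rdx (early clobber)
		  "+&a"(a)        // a goes in through rax, which is overwritten
		: [b] "r"(b), [m] "r"(MOD)
		: "cc"            // both instructions modify the flags
	);
	return rem;
#else
	// Fallback for other targets (an assumption, not part of the post).
	return (long long)((__int128)a * b % MOD);
#endif
}
```

The early-clobber markers keep the compiler from placing b or MOD in rax or rdx, so no explicit mov instructions or fixed scratch registers are needed.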

Now you can compute hashes modulo a 64-bit modulus, which is roughly equivalent to using a pair of 32-bit moduli, without using __int128.


»
2 years ago, # |
  Vote: I like it +32 Vote: I do not like it

You haven't declared any of the fixed registers you clobber with this code, so it's terrible undefined behavior: If the compiler was using rax for anything you are toast. Also, 64-bit idiv is very slow on some systems: You may find a floating-point-based method much faster. (And for hashing applications you can probably use Montgomery reduction instead of "ordinary" modmul for even better performance.)
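The Montgomery reduction mentioned above can be sketched roughly as follows. This is a minimal illustration, not a tuned implementation: the struct Montgomery and its member names are hypothetical, it relies on GCC's unsigned __int128, and it assumes an odd modulus with all inputs already smaller than the modulus.

```cpp
#include <cstdint>

using u64 = uint64_t;
using u128 = unsigned __int128;

// Hypothetical helper type: Montgomery arithmetic with R = 2^64.
struct Montgomery {
	u64 n;     // odd modulus
	u64 ninv;  // n^{-1} mod 2^64
	u64 r2;    // 2^128 mod n, used to convert into Montgomery form

	explicit Montgomery(u64 mod) : n(mod), ninv(mod) {
		// Newton iteration: n * n == 1 (mod 8) for odd n, and every step
		// doubles the number of correct low bits (3, 6, 12, 24, 48, 96).
		for (int i = 0; i < 5; ++i) ninv *= 2 - n * ninv;
		r2 = (u64)(-(u128)n % n);
	}

	// Returns x * 2^{-64} mod n for x < n * 2^64, without any division.
	u64 reduce(u128 x) const {
		u64 q = (u64)x * ninv;                 // q * n has the same low 64 bits as x
		u64 m = (u64)(((u128)q * n) >> 64);
		u64 hi = (u64)(x >> 64);
		return hi >= m ? hi - m : hi - m + n;  // (x - q * n) / 2^64, made non-negative
	}

	u64 to(u64 a) const { return reduce((u128)a * r2); }         // a -> a * 2^64 mod n
	u64 from(u64 a) const { return reduce(a); }                  // back to a normal residue
	u64 mul(u64 a, u64 b) const { return reduce((u128)a * b); }  // both in Montgomery form
};
```

For hashing, values would typically be converted with to() once, multiplied many times with mul(), and converted back with from() only when an ordinary residue is needed.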

»
2 years ago, # |
Rev. 3   Vote: I like it +16 Vote: I do not like it

I see a few problems with this code:

  • Codeforces runs on Windows, so rbx should be preserved (source); otherwise it may cause trouble when combined with GCC-generated code.
  • If you look at the compiled code, the assembly forces many values to be moved between registers, making it inefficient.
    • If you want to write an entire function in asm, I suggest using GCC's __attribute__((naked)) (source).
  • Integer division instructions are very slow, and dividing by a constant can be optimized a lot. You can find many resources for fast division on Codeforces (like this blog or this).

That being said, using x86 instructions directly is significantly faster than running __int128 division (which calls a large, slower function __modti3) when you only need 64 bits of modulus and output.

Here is my version of the function in assembly (Windows calling convention):

__attribute__((naked)) long long modmul(long long, long long, long long) {
    asm(R"(
        mov %rcx, %rax
        imul %rdx
        idiv %r8
        mov %rdx, %rax
        ret
    )");
}
»
2 years ago, # |
  Vote: I like it 0 Vote: I do not like it

Codeforces has a gym called "Fast modular multiplication", where I tested how fast the inline assembly is.

Assembler insertion ~1326 ms
unsigned __int128 multiplication ~1482 ms

So the inline assembly is slightly faster, but not significantly.

  • »
    »
    2 years ago, # ^ |
      Vote: I like it 0 Vote: I do not like it

    It's also possible to implement an inline assembly version without any function call overhead:

    inline uint64_t mulmod64(uint64_t a, uint64_t b, uint64_t c) {
    	uint64_t res;
    	asm(
    		"mul %2\n"
    		"div %3\n"
    		:"=&d"(res), "+a"(a)
    		:"r"(b), "r"(c)
    		:"cc"
    	);
    	return res;
    }
    

    But in all code variants almost all CPU cycles are spent executing the very slow division instruction. The __int128 variant without inline assembly also uses the same division instruction (after some extra checks to ensure that a division overflow cannot be triggered).

    • »
      »
      »
      2 years ago, # ^ |
        Vote: I like it 0 Vote: I do not like it

      Yes, and it is faster, ~1248 ms. Your function can also be inlined (this removes one jmp and two mov), even though the assembler output is otherwise the same (up to instruction order and a ud2 opcode) on Linux: https://godbolt.org/z/r5MeMxdbM

      On Windows your function produces one extra mov, but inlining removes one jmp and one mov.

»
21 month(s) ago, # |
  Vote: I like it +3 Vote: I do not like it

If you really need 64-bit modular multiplication, consider this variant instead:

using uint64 = unsigned long long;
using ll = long long;
uint64 modmul(uint64 a, uint64 b, uint64 M) {
	// Estimate the quotient in long double, then fix the off-by-one error.
	ll ret = a * b - M * uint64(1.L / M * a * b);
	return ret + M * (ret < 0) - M * (ret >= (ll)M);
}

There is a proof that it works here: https://github.com/kth-competitive-programming/kactl/blob/main/doc/modmul-proof.pdf

This is much faster. It is incorrect only for some very large int64 values (KACTL documents it for moduli up to about 7.2·10^18), which are almost certainly much bigger than any modulus you would choose.
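As a usage sketch, the same modmul can drive binary exponentiation. The modpow below follows the usual square-and-multiply pattern; the type aliases are local to this example, and the code assumes a platform where long double has at least a 64-bit mantissa (as with GCC on x86-64) and inputs already reduced modulo M.

```cpp
#include <cstdint>

using ull = unsigned long long;
using ll = long long;

// Same technique as above: estimate the quotient in long double,
// then correct the result if it is off by one multiple of M.
ull modmul(ull a, ull b, ull M) {
	ll ret = a * b - M * ull(1.L / M * a * b);
	return ret + M * (ret < 0) - M * (ret >= (ll)M);
}

// Square-and-multiply exponentiation built on modmul.
// Assumes b < mod and mod > 1.
ull modpow(ull b, ull e, ull mod) {
	ull ans = 1;
	for (; e; b = modmul(b, b, mod), e /= 2)
		if (e & 1) ans = modmul(ans, b, mod);
	return ans;
}
```

A quick sanity check is to compare modmul against (unsigned __int128)a * b % M on random inputs before relying on it with a new compiler or platform.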