kayak's blog

By kayak, 7 years ago, In English

In this comment, it's mentioned that the complexity of __builtin_popcount for any integer j with j = O(2^N) is O(N) (i.e. O(log j)) instead of O(1). So to count the number of ones in a large binary string of length n with n >> 64, if I split it into n/N substrings (with N = 64 / 32 / 16), apply the builtin popcount to each substring and add the results up, the total time complexity should be O((n/N) · N) = O(n) instead of O(n/N).
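
For concreteness, here is a minimal sketch of the splitting I have in mind (the word-array representation and the function names are my own):

#include <cstdint>
#include <vector>

// The bit string stored as words of 64 bits (N = 64).
long long count_ones_64 (const std::vector<uint64_t> &words) {
	long long s = 0;
	for (uint64_t w : words)
		s += __builtin_popcountll (w); // one call per 64 bits
	return s;
}

// The same bit string stored as words of 32 bits (N = 32):
// there are twice as many words, hence twice as many calls.
long long count_ones_32 (const std::vector<uint32_t> &words) {
	long long s = 0;
	for (uint32_t w : words)
		s += __builtin_popcount (w); // one call per 32 bits
	return s;
}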

But page 101 of the Competitive Programmer's Handbook, on the topic of counting subgrids, reports measured times, and by the reasoning above they should be the same whether N = 64 or N = 32. Yet it turns out that they're different: "the bit optimized version only took 3.1 seconds with N = 32 (int numbers) and 1.7 seconds with N = 64 (long long numbers)".

Why does N = 64 take less time?


»
7 years ago, # |

Big-O notation doesn't capture constant factors. Technically the complexity of __builtin_popcount is indeed O(number of bits), but its constant is much, much smaller than that of a for loop checking each bit one by one, even though both have the same complexity, O(number of bits). So when you are using int numbers, every loop has to run twice as many iterations, and the loop overhead has a larger constant than __builtin_popcount on a long long.
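
To illustrate (a sketch of my own, not from the comment above): both functions below have complexity O(number of bits), but the loop does far more work per bit.

#include <cstdint>

// Naive bit-by-bit count: 64 iterations, each with a shift, a mask and an add.
int popcount_loop (uint64_t x) {
	int c = 0;
	for (int i = 0; i < 64; i++)
		c += (x >> i) & 1;
	return c;
}

// The intrinsic compiles to a few instructions, or a single POPCNT
// when the target supports it, so its constant factor is tiny.
int popcount_intrinsic (uint64_t x) {
	return __builtin_popcountll (x);
}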

  • »
    »
    4 years ago, # ^ |

    What do you mean by "the constant is very small"? Do you mean that if one function has complexity O(N) and another has a for loop running N times, the two can still differ by a constant factor? And what is the significance of that constant?

»
7 years ago, # |

Whatever the complexity of __builtin_popcount may be, it does not surprise me that N = 64 is faster. Given that you are effectively calling the function n/N times, with N = 32 you make twice as many function calls as with N = 64, and each call carries some overhead, which could be noticeable in huge tests like yours seems to be. It's important to note that your results could be slightly skewed depending on how you ran your testing, for example whether or not you ran it a significantly large number of times.

Also, depending on the compiler, built-in functions can have significant differences in performance.

»
7 years ago, # |

It's because of a false data dependency that the compiler isn't aware of. There is no actual computational advantage to a 64-bit type, but because of how the compiler works, it's less likely to hit this issue when you use one.

https://stackoverflow.com/questions/25078285/replacing-a-32-bit-loop-count-variable-with-64-bit-introduces-crazy-performance
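
A rough sketch of the loop shape that question benchmarks (my reconstruction; the names buf and size are mine). On the affected Intel CPUs, POPCNT has a false dependency on its destination register, so back-to-back popcounts accumulating into the same register may serialize, and whether the compiler happens to break the chain can depend on details such as the width of the loop variable.

#include <cstddef>
#include <cstdint>

uint64_t popcount_buffer (const uint64_t *buf, size_t size) {
	uint64_t count = 0;
	for (size_t i = 0; i < size; i++)
		count += __builtin_popcountll (buf[i]); // may stall on the false dependency
	return count;
}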

»
7 years ago, # |

On modern hardware, there is a POPCNT processor instruction to count the number of set bits.

To utilize this instruction, the GCC compiler should be run with an option enabling the respective instruction set; POPCNT is part of SSE4. Here is how to enable it right from the source code:

#pragma GCC target ("sse4.2") // lets GCC emit the POPCNT instruction for __builtin_popcount
int s;
int main (void) {
	for (int i = 0; i < 1000000000; i++)
		s += __builtin_popcount (i);
	return 0;
}

In the Codeforces custom test, I just checked with the GNU G++11 5.1.0 compiler: with the #pragma, the code runs in ~560 ms; without it, the time increases to ~2370 ms, which is roughly four times slower.
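
As far as I know, the same effect can be achieved from the command line by compiling with -msse4.2 (or -mpopcnt) instead of using the #pragma.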

  • »
    »
    5 years ago, # ^ |
    Rev. 3

    Gassa, I saw another implementation of __builtin_popcount in this comment. Could you please tell me which one looks better to you? Are they essentially the same? Running that popcount function with your test gives a worse runtime, though.

    • »
      »
      »
      5 years ago, # ^ |

      If it was a bottleneck in a piece of code which runs for hours or days, I'd take various implementations (intrinsic, assembler, O(log log n), O(1) with precomputed tables), measure the time for each one in my particular use case, and settle on a winner.
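
      For instance, the precomputed-table option might look like this (a sketch with my own names, storing the counts of all 16-bit values):

      #include <cstdint>

      // Precompute popcounts of all 16-bit values once, then answer each
      // 64-bit query with four table lookups.
      static uint8_t bit_table[1 << 16];

      void init_bit_table (void) {
      	for (int i = 1; i < (1 << 16); i++)
      		bit_table[i] = bit_table[i >> 1] + (i & 1);
      }

      int popcount_table (uint64_t x) {
      	return bit_table[x & 0xFFFF] + bit_table[(x >> 16) & 0xFFFF] +
      	       bit_table[(x >> 32) & 0xFFFF] + bit_table[x >> 48];
      }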

      Incidentally, that's what I did back in 2007, which then resulted in the following piece of code (don't laugh, there was no popcnt instruction in my processor back then):

      res = (res & 0x55555555) + ((res >> 1) & (0x55555555)); // sum adjacent bits into 2-bit fields
      res = (res & 0x33333333) + ((res >> 2) & (0x33333333)); // sum 2-bit fields into 4-bit fields
      res = ((res + (res >> 4)) & 0x0F0F0F0F); // sum 4-bit fields into bytes
      res += (res >> 8) + (res >> 16) + (res >> 24); // the total lands in the low byte (res & 0xFF)
      

      However, I expect the answer to vary between architectures and use cases. So, for a one-off program, such as a contest solution, I'd use the thing which is (1) easy to write and (2) usually not much slower than the other approaches. The __builtin_popcount intrinsic seems to be designed with these exact goals in mind; please correct me if I'm wrong!