defnotmee's blog

By defnotmee, history, 3 years ago, In English

Hello codeforces! I've come to ask help in a time of great mystery.

I was helping to debug a problem where the code would pass the 2s TL when on a 64-bit compiler but would get confortably under 1s on a 32-bit one when some really bizarre things started to happen.

First, pinning down where the problematic parts were, it turns out this was the thing consuming practically the whole time limit:

set<ll> s;
for (int i(0); i < k; ++i) {
    int pp;
    cin >> pp;
    s.insert(pp);
}

Im not joking. Also k <= 3e5 btw.

It really shouldnt be a big deal, and normally that part should take less than 140ms, but both on C++20 (64) and C++17 (64) make it take from 1900ms to 2000ms.

I made a submission that outputs the time taken to do the input, the inserting, and the ammount of elements in the set; and with that even more weird stuff appears. Look at the result:

142730404

Output

In this test case until the 90000-ish iteration there are distinct elements and after that the input time starts to raise aggressively, while the insert time grows at a basically unaltered pace. Maybe they start repeating a relatively big number, but even then, it goes from 30ms to read 50000 guys to 500ms. It's insane.

If you want, take a look at the version of the output when using 32-bit for comparison (the numbers will actually be a bit inflated because you need to get the timings with std::chrono):

142730436

Output

Ending thoughts

  • The problematic test case is case 6, however case 5 has the same input size and the set ends up with more items (see 142732612). Why is the problem only with test 6???

  • If instead of reading a variable and instantly inserting it into the set, you first input an array and then insert it into the set, no tragedy happens (142732784).

  • How can the input go so wrong? A 20x slowdown out of nowhere should not be happening.

  • Is there a way to get the raw test data from the authors or something similar to see what actually was on test 6 so we can try to replicate it?

  • Thanks qwexd for doing nothing.

  • Vote: I like it
  • +192
  • Vote: I do not like it

»
3 years ago, # |
  Vote: I like it +37 Vote: I do not like it

Bumping post. I personally don't see any plausible reason why cin would randomly slow down. My best hypothesis would be that undefined behaviour likely edited some internal variable in cin, which could cause the slowdown, but that would imply there is a bug in the implementation of std::set. Just to rule out some other factors I would recommend checking the number of digits read in each block and printing that as well.

  • »
    »
    3 years ago, # ^ |
      Vote: I like it +35 Vote: I do not like it

    It seems that problem happens when set reaches size of exactly 99904. The numbers in set don't matter. Check this submission: https://mirror.codeforces.com/contest/796/submission/142771971

    And I have no idea why this happens.

    • »
      »
      »
      3 years ago, # ^ |
        Vote: I like it +43 Vote: I do not like it

      Ok i made the debug trigger on test 5 instead with your code and now it TLEs like when on test 6, so not only do the numbers on the set dont matter but also the numbers on the input 142784418

      I also tried changing it to other size to other sizes like 100000, 99904+1, 99904-1, 99904+2, 99904/2, 99904/7, 99904*2, but none of them TLE like 99904

»
3 years ago, # |
Rev. 4   Vote: I like it +36 Vote: I do not like it

Honestly, I can’t think of any reason outside of cache locality, but it is not super convincing to me either.

Observations:

  1. Swapping set for unordered_set actually makes C++x86 2x slower: 142768060 142767613. This seems to all be accounted for in the timing for cin >>.
  2. Using custom buffer + fread alleviates the time diff: 142776426 142776447.
  3. Note that the C++x64 compilers run on Windows while the x86 compilers do not; there may be system differences.

Problems:

  1. C++x64 seems to be slower for some inputs.
  2. Using unordered_set reverses the time diff.
  3. Slowness is always accounted for in the cin >> operation.
  4. Using array intermediary solves the problem (1).
  5. Using fread with custom cache solves problem (1).

Some possible answers:

  1. C++x64 runs on a system with smaller cache than C++x86 and misses far more for certain inputs.

  2. unordered_set prefers a larger cache than set (which might make sense considering it is a hash table). But this cache preference is larger than both system’s cache sizes. So when both C++x64 and C++x86 see lots of cache misses, C++x64 starts performing better…

  3. As we insert more into the set or unordered_set, eventually the cache is too small, and we always kick the cin buffer out of the cache when doing the insert. So timing cin >> accounts for loading into the cache at that point and looks much slower when the set/unordered_set reaches a large size. (3.1) But why do we not get the same slowdown for .insert when we reach cache limits? Likely because .insert nearly always cache misses as it traverses different paths in the underlying BBST regardless of the number of nodes. So there is no slowdown as it gets large, since it is always slow.

  4. Maybe the array is much smaller than the cin buffer and easier to fit in cache. But I think streambuf is fairly light on memory… maybe cin has a large putback buffer too. (4.1) This is a stretch, but the compiler could have optimized the code here to sort the array and then construct a BBST out of a sorted array, which is much faster than constructing a BBST from random order. godbolt doesn’t seem to support this hypothesis though, so take it lightly.

  5. Honestly the reason for this is probably similar to the reason why using an array intermediary solves the problem.

  • »
    »
    3 years ago, # ^ |
      Vote: I like it +25 Vote: I do not like it

    Thanks for your insight! It seems that MrDindows's observation also supports your idea of super weird cache misses being the problem, since the TLE happens only for specific set size/sizes.

    I also tried priority_queue (142783105) and (funnily) the pbds set (142782176) and they also both pass.

    A couple of points id like to address:

    C++x64 runs on a system with smaller cache than C++x86 and misses far more for certain inputs.

    The particular input doesnt seem like the problem. Forcing the size of the set to be 99904 also makes it TLE on another test case: 142784418

    This is a stretch, but the compiler could have optimized the code here to sort the array and then construct a BBST out of a sorted array, which is much faster than constructing a BBST from random order. godbolt doesn’t seem to support this hypothesis though, so take it lightly.

    I think it's possible the compiler tries to optimize something by ordering things differently (maybe reading like 50 numbers at a time and then inputting those 50 numbers into the set, idk) and the fact that trying to isolate the timing with chrono like in the blog makes the total time be really inflated (142730436 shows it took around 500ms to input and insert, while 142729216 says 137ms). Even if this is true i dont think its the sole reason tho

    • »
      »
      »
      3 years ago, # ^ |
      Rev. 2   Vote: I like it +20 Vote: I do not like it

      In general the compiler is allowed to reorder those timing calls so microbenchmarking is pretty hard: https://stackoverflow.com/questions/37786547/enforcing-statement-order-in-c (there's a nice DoNotOptimize helper in the first answer that can help with this)

      I wonder if this test case works on other problems where the natural solution is to read the input into a set. This might start a new era of std::set being unusable... (like how vanilla std::unordered_map is unusable due to hacking hashes today)

      • »
        »
        »
        »
        3 years ago, # ^ |
          Vote: I like it 0 Vote: I do not like it

        Sorry for bumping this post. But, has anyone found a way to reproduce this time limit exceeded on other problems?

        I tried to read input 99904 times this way in other (permutation) problems, but got no time out. 157125980 157127092

        In OP's solution, I changed cin to scanf and it also worked somehow. 157000430

»
15 months ago, # |
  Vote: I like it +112 Vote: I do not like it

I just ran into the same issue. My solution to today's F got TLE49. However, if I read and store the array before adding indices to corresponding vectors, it passes in about half of the time limit. In fact, when I ran the first code in a mashup, it took 14s! In 32-bit compilers, both run in time.