Arpa's blog

By Arpa, 9 years ago, In English

Hi!

I thought that unordered_set would be faster than set, because set operations are logarithmic in the container's size while unordered_set operations are constant time on average (source: cplusplus.com).

But today I saw this:

TLE (>2000 ms) with unordered_set: 15494816

Accepted (155 ms) with set: 15494846

Also, my other solution with unordered_map (which I submitted during the contest) got hacked :'(

Now my question is: is unordered_set faster than set?

UPD: I got it accepted (15497282). The unordered_set time beat the set time (15494846): 93 ms < 155 ms.

All it took was adding this one line: s.max_load_factor(0.25); s.reserve(500);

UPD2: it seems it is better to use a power of two in reserve (i.e. s.reserve(512)). (Thanks to keyvankhademi.)
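
For reference, here is a minimal, self-contained sketch of that tuning (the key type and the element count are illustrative, not taken from the actual submission):

    #include <unordered_set>

    int main() {
        std::unordered_set<long long> s;
        s.max_load_factor(0.25); // rehash sooner, so buckets stay sparse
        s.reserve(512);          // pre-allocate buckets for 512 elements
        for (long long i = 0; i < 512; ++i)
            s.insert(i * i);     // no rehashing happens during these inserts
        return 0;
    }

Note that reserve(n) accounts for the load factor: with max_load_factor(0.25), s.reserve(512) requests at least 512 / 0.25 = 2048 buckets.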


»
9 years ago, # |

The default hash function for integers does not perform very well. I suggest trying to look up other hash function alternatives.

In my experience, both unordered set and map have worked well for strings.

  • »
    »
    9 years ago, # ^ |

    No hash function performs well against adversarial input unless it is chosen randomly from a large family of functions.

    • »
      »
      »
      9 years ago, # ^ |

      There's a difference between functions that work badly only when someone wants them to, and functions that just work badly.

»
9 years ago, # |

unordered_set is not always faster than set; it's like comparing the complexity of quicksort and merge sort.

  • »
    »
    9 years ago, # ^ |
    Rev. 2

    To be fair, quicksort is O(n lg n) on average and O(n^2) in the worst case, but merge sort is O(n lg n) in both the average and the worst case.

    But unordered_set is O(1) on average and O(n) in the worst case, while set is O(lg n) in both the average and the worst case.

    Quicksort's and merge sort's average complexities are equal, so (looking at the worst case) it is better to use merge sort.

    But because unordered_set's average complexity is better than set's, it seems that unordered_set must be better.

    • »
      »
      »
      9 years ago, # ^ |

      Adding pos.reserve(500); brings the running time below the tree-based set for the current tests: http://mirror.codeforces.com/contest/620/submission/15496964

      I have no idea, though, why setting a much higher value in place of 500 increases the time again.

    • »
      »
      »
      9 years ago, # ^ |

      Yeah, sorry for the bad interpretation. But my point still stands: unordered_set's performance is unstable, so whether unordered_set is faster than set depends on the test case.

    • »
      »
      »
      9 years ago, # ^ |

      it is better to use merge sort.

      Umm, no, it's better to use quicksort. Proof: this passage from Skiena's The Algorithm Design Manual:

      "What we can say is that experiments show that where a properly implemented quicksort is implemented well, it is typically 2-3 times faster than mergesort or heapsort. The primary reason is that the operations in the innermost loop are simpler."

      Basically, quicksort has better cache locality. Most C++ standard library implementations use introsort (quicksort combined with heapsort and insertion sort) for std::sort.

»
9 years ago, # |
Rev. 3

Well, that depends on the library implementation (and specially crafted data can trigger the bad case; see the UPD below). In theory, and most of the time, the hash-based containers are faster than the tree-based ones (which is why hashing is so popular in real-world work).

And as it turns out, for the Microsoft .NET Framework implementations, the answer is yes.

15474905: the HashSet one, 218 ms.

15496423: the SortedSet one, 311 ms.

As for their implementations in the Microsoft .NET Framework, they are already open-sourced: corefx/src/System.Collections/src/System/Collections/Generic/HashSet.cs and corefx/src/System.Collections/src/System/Collections/Generic/SortedSet.cs

As for the GNU C++ implementation (the hash table lives in <bits/hashtable.h>, and the default hash function for int is just static_cast<size_t>(__val)), I'm surprised to find a try-catch block in it (and it's not so easy to understand how it works).
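
A quick way to see that default hash for yourself (the output shown is what libstdc++ produces; other standard libraries are free to hash integers differently):

    #include <functional>
    #include <iostream>

    int main() {
        std::hash<int> h;
        // On libstdc++ this prints "42 1337": the hash of an int is the int
        // itself, so keys that are congruent modulo the bucket count all
        // land in the same bucket.
        std::cout << h(42) << ' ' << h(1337) << '\n';
    }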


And thank you for the heads up. I'll think twice before using unordered_set/unordered_map on GNU C++.


UPD:

The problem reminds me of a famous security issue: Hash Collision DoS (a hot topic during 2011-2012), which, according to some reports, affected (and maybe still affects) a lot of popular languages, including PHP <= 5.3.8, Perl, Java, ASP.NET, the V8 JavaScript engine, Ruby <= 1.8.7-p356, etc. A specially crafted set of keys can trigger hash collisions, degrading hash table operations from the expected/average O(1) to the worst case O(n).

Some information here: https://bugzilla.redhat.com/show_bug.cgi?id=750533

(Some discussion shows that some languages just limit the number of request parameters to "hide" the problem inside their hash table implementation (source: http://stackoverflow.com/questions/8672985/how-was-the-hash-collision-issue-in-asp-net-fixed-ms11-100). You may want to try to hack my C# solution.)

»
9 years ago, # |

Sometimes it is.

I once got a problem AC by changing set to unordered_set in C++.

However, it depends on the hashing. I always go with the ordered set first, and if it TLEs and needs optimizing, I try the unordered one.

»
9 years ago, # |
Rev. 2

Looks like someone went to the trouble of generating an extremely bad test input for hacking 620C - Pearls in a Row. Now the hack is included in the system tests as Test 42. It causes a huge number of collisions in that specific G++ implementation of unordered_map.

  • a submission with unordered_map fails with TL: 15496737
  • a submission with map gets AC (only one line of code changed!): 15496747
  • a submission with unordered_map and a custom hash function: AC again! 15497036. A bit slower than map, though (see the sketch after the next paragraph).

To the OP: good idea to change the max load factor! It looks like a simpler fix than writing a custom hash function.

Tweaking the load factor and pre-creating many buckets (with reserve) helps because you only get bad performance when there are many items per bucket. The worst case is when all items land in a single bucket: then lookup in the hash map degrades to a linked-list scan.
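
For completeness, here is a sketch of one common shape for such a custom hash: splitmix64 mixing plus a seed chosen at startup. This illustrates the general technique, and is not necessarily the function used in 15497036:

    #include <chrono>
    #include <cstddef>
    #include <cstdint>
    #include <unordered_map>

    // A strong integer mixer (splitmix64) combined with a per-run seed,
    // so a precomputed test file cannot target the hash function.
    struct custom_hash {
        static uint64_t splitmix64(uint64_t x) {
            x += 0x9e3779b97f4a7c15ULL;
            x = (x ^ (x >> 30)) * 0xbf58476d1ce4e5b9ULL;
            x = (x ^ (x >> 27)) * 0x94d049bb133111ebULL;
            return x ^ (x >> 31);
        }
        size_t operator()(uint64_t x) const {
            static const uint64_t seed = std::chrono::steady_clock::now()
                                             .time_since_epoch().count();
            return splitmix64(x + seed);
        }
    };

    std::unordered_map<long long, int, custom_hash> safe_map;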

  • »
    »
    9 years ago, # ^ |
    Rev. 2

    Yes, I'm afraid that was me.

    To sum it up, unordered_set is generally much faster than set. set guarantees O(log n) per operation, while unordered_set is O(1) on average. However, while set ensures O(log n), unordered_set can take O(n) time per insertion. This is very rare in practice unless the dataset is deliberately chosen to cause a large number of collisions, degrading unordered_set to O(n).
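
    To make that worst case concrete, here is a small self-contained demo. It forces every key into one bucket with a deliberately bad hash; the hack test achieves the same effect against the default hash by choosing keys that are congruent modulo the table's bucket count:

        #include <chrono>
        #include <cstddef>
        #include <iostream>
        #include <unordered_set>

        // Deliberately terrible hash: every key lands in bucket 0, so the
        // table degenerates into one long linked list.
        struct all_collide {
            size_t operator()(long long) const { return 0; }
        };

        int main() {
            std::unordered_set<long long, all_collide> s;
            auto t0 = std::chrono::steady_clock::now();
            for (long long i = 0; i < 30000; ++i)
                s.insert(i); // each insert scans the whole chain: O(n) per op
            auto ms = std::chrono::duration_cast<std::chrono::milliseconds>(
                          std::chrono::steady_clock::now() - t0).count();
            std::cout << s.size() << " inserts took " << ms << " ms\n";
        }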