My issue concerns 1840D - Wooden Toy Festival. Within the contest I used an approach similar to the one given by the authors of the problem in tutorial, and the first step was to sort an input array and don’t consider duplicates. As usual, I used a combination sorted(set(a))
, which always worked well and seemed to me quite reliable. The submission 208933459 successfully passed all pretests. However, then I got a hack with TL exceed. After the contest I submitted an almost identical submission 208932147 with the only one exception: I replaced the combination above with the following block:
a.sort()
b = [a[0]]
for i in a:
if i != b[-1]:
b.append(i)
As I understand it, it performs no better. The first step of sorting the array works with $$$O(n \log(n))$$$ complexity, and the second one of removing duplicates with $$$O(n)$$$. However, with the such replacement my solution passed all the tests. Could you explain why this is so? And does it mean that henceforth it’s better to use something like the block above instead of using combination sorted(set(a)))
? I would appreciate a lot!
As a python user, my advice is to avoid using set() in CP. It doesn't perform well on large datasets
Then what do you suggest? Sometimes we have no choice but to use a hashset structure?
When O(1) lookup is required, I use a dict, never got an issue with it. defaultdict performs bad as well, I use dict.get(x, y) instead.
Your code is working fine in python3 (287 ms). Its only giving tle for pypy3. Eager to see if someone can tell why this is so :)
pypy3 and python use different hashing algorithms for set(), which is why the testcase engineered to kill pypy3 set() solutions will only work on pypy3.
This is due to the input being crafted to make python set() have collisions which will make the code run in n^2 time for set().
I fixed it by shuffling the array before i do sorted(set(a)): https://mirror.codeforces.com/contest/1840/submission/208991870
This would likely not happen in div1 / div2 rounds, only where rounds with a lot of time to hack (12h open hacking phase), people would try to hash collide. I still feel that this is quite dumb as this will be added to the system tests?
To the rest reading this, what are your opinions?
Nice!
It was indeed very informative, thanks a lot! But I didn't quite grasp an idea about how shuffling actually affects this situation? Since Python set() is implemented with using hash-table, lookups work with O(1) in average, but when it's up to collisions, according to Fluent python, its handling works somewhat like an Open addressing method, doesn't it? And does it make any difference in which order to get the hashes if it leads to collisions anyway?