Hi Everyone, i have been seeing this for some days now on codeforces. In yesterday's educational round, I submitted problem D with https://mirror.codeforces.com/contest/1107/submission/49013323 solution. Later I realized that it takes minimum 5040*5040*60 operations in worst case. After contest, i tried hacking my solution but it passed successfully. Also I think testcase 20 is having same worst case input (all 0's and last rows last element if say 1), but it passes there in 732 ms. In my local env it takes more than 4s for this to pass.
Also in CF#534, I tried hacking this O(n^2) solution https://mirror.codeforces.com/contest/1104/submission/48743647, which was doing more than 10^9 operation on worst case (ababa....aa..bababababa string). But it passed in 640 ms. From my understanding it takes 1s for approx 10^8 operations.
Has codeforces upgraded there testing servers or I am missing something here?
I don't know about your code( didn't even opened it ) but I used bitset, which divides the operations by 32. And btw in worse case it takes 5200*5200*48( not 60 ).
5040 has 60 divisors. It takes 1524096000 number of operations (+ some other computations) :) While 5200 has 30 divisors and it takes 811200000 operations. Worst case is 5040 from my understanding.
OK. Anyway, you don't need to post a blog for such things!!!
Ohh. sorry didn't know exact way to ask this. I thought other people may also be seeing this and it would be good to ask if something is missing. Sorry for all the inconveniences it caused to you
I don't care, but as I know they'll downvote, maybe a comment somewhere on tutorial or announcement could be enough...
Seems they just downvote me... Why always me?? What have I done to you?? OK downvote, I love downvotes!!!
Just to add to his answer that 108 is a heuristic bound and must in no way understood as a hard cut. Sometimes, if operation per loop iterations are heavy, 107 might be tricky to squeeze in, whereas 109 may pass if per iteration work is simple.
Also, cache plays a big role where memory accesses are linear, one after the other and can reduce time by factors. I believe in the question above, array accesses were quite linear and operations were just simple comparison.
Yes that is correct that it's only heuristic bound but thing is that most of the time when i see these solutions passing, there is always a better solution which fits into 1s approx = 10^8 operations time limit. I have seen a comment from Tanya_Romanova and many other people about this 10^8 operations thing but not able to find it now. One more ref: https://www.quora.com/How-many-instructions-can-be-executed-in-Codeforces-and-TopCoder-judge-PCs-in-one-second. Also there is one blog on TC https://www.topcoder.com/community/competitive-programming/tutorials/computational-complexity-section-1/.
Also concern is that should these solutions really pass for respective problems? It makes tough to understand what exactly is expected from solution when these solution passes.
I'm also interested in such question.
What has happened with C++ STL string?
Why solutions where string length is about 300 000 characters and which erase small peace (of 2 characters) in the center of the string while it is not empty passes in 1 second on Codeforces?
Isn't there about 3*1.5*10^10 operations? Or string is not a continuous memory block now, it is like an amazing "list" with indexation of elements?
Examle of 2 different problems and their solutions that I can't hacked and even final tests can't do it too:
1) https://mirror.codeforces.com/contest/990/submission/39115847 (passed in 1 second for 300 000 string!)
2) https://mirror.codeforces.com/contest/1104/submission/48742260 (passed in 0.078 second for 100 000 string!!!)
Are you kidding me? :))
What has happened?
You forgot to divide number of operations by
sizeof(YMM-register) / sizeof(char)
const for a string. So, erasing fromstd::string
takesn / sizeof(YMM-register) * sizeof(char) = n / 256 * 8 = n / 32
operations. Estimate isn*(n+1)/2 / 32 = 300000^2 / 64 = 1.4 * 10^9
operations. Most costly operation in this case is loading / storing data to / from registers. It is1s = 10^9
in aligned by cache line size case and2s = 10^9
in unaligned case on codeforces with sequential access and cache prefetching.So, memmove() can process 256 bits in one processor clock cycle?
Yes, it is.
As additional, you can see AVX, AVX2 and FMA in my solution for gym problem G. Underpalindromity. With n =
200000
andk = 100000
it isn * k / 2 / sizeof(YMM-register) * sizeof(int) = 200000 * 100000 / 8 = 2.5 * 10^9
operations, I think.If submission not available for you, this is source code
OMG, problem setters, let our restrictions for string/array length be 1e6 or 1e7 to avoid passing solutions with N^2 and architecture level optimizations.
For example, you can't increase
n
andk
in problem G. Underpalindromity, because there are solutions with correct algorithm inO(n log(n))
with working time0.7s-1s
.Learning algorithms and data structures — one way, learning architecture level's optimizations — another way. When you are using array instead list and iterate over it, you already using architecture level optimizations. Both knowledge can be combined, for example, you can use sqrt-decomposition with such optimizations and sqrt-list and it will be faster, than segment tree and cartesian tree.
A bit simpler and faster solution: 49057315
Fantastic, speed up in 2 times and gcc can vectorize even with addition to
int64_t
!If you are interested, there are another problem 101341I - Matrix God "Matrix God" with 1000 x 1000 matrix multiplication by modulo 10^9+7 and fantastic solution 45798437 by Constantine Drozdov in
374 ms
MikeMirzayanov, consider dmkz to be the tester of CF rounds!