Borisp's blog

By Borisp, 14 years ago, In English

Topcoder Open 2011 started recently. This weekend was the first qualification round of the algorithm competition. I had an interesting experience during it, which I want to share.

The second problem was not that hard, though my solution seems to have been too stupid for it. You can check out the statement here. I solved it with quite a naive approach, as follows:

#include <cstring>
#include <vector>
using namespace std;

double dp[80002][50];

class FoxListeningToMusic {
public:
    vector<double> getProbabilities(vector<int> length, int T) {
        memset(dp, 0, sizeof(dp));
        int n = length.size();

        // dp[t][k] is built so that dp[T][k] ends up as the answer for song k;
        // at t = 0 every song is equally likely to be playing
        for(int i = 0; i < n; i++)
            dp[0][i] = 1.0 / (double)n;

        double mul = 1.0 / (double)n;
        int idx;
        for(int i = 1; i <= T; i++) {
            for(int j = 0; j < n; j++) {
                idx = i - length[j];
                if(idx >= 0) {
                    // the first chosen song j has already finished by time i
                    for(int k = 0; k < n; k++)
                        dp[i][k] += mul * dp[idx][k];
                }
                else {
                    // song j is still playing at time i
                    dp[i][j] += mul;
                }
            }
        }

        vector<double> v(n);
        for(int i = 0; i < n; i++)
            v[i] = dp[T][i];
        return v;
    }
};

It is not important whether the code solves the problem correctly, at least for what I am going to discuss. The fact is that I got time limit exceeded with this code. That was somewhat expected, since the complexity here is O(T * length.size()^2), which becomes about 2 * 10^8 if we take the problem constraints into account. However, the interesting thing is that I tested in the arena before submitting. The case I used seemed to be the "worst case" for my solution: 50 ones in length and T = 80000. The code ran for 0.365 seconds, well below the time limit of 2 seconds. I submitted and was then quite surprised to see that I barely managed to qualify for Round 1, with only the 250-point problem.

I say the case I used is the worst case because the number of instructions executed depends only on the branching condition idx >= 0 in the inner loop. When it is true, one more loop of complexity O(n) is executed; otherwise only a single O(1) operation is. And as you can see, the smaller the elements of length are, the more often this condition holds.

Despite this reasoning, my solution fails on the following case:
length = {1, 1, 1, 1, 3, 3, 3, 3, 1, 3, 3, 2, 3, 2, 3, 3, 1, 2, 3, 1, 2, 3, 2, 1, 3, 1, 1, 1, 2, 3, 2, 3, 2, 2, 1, 3, 1, 1, 3, 1, 3, 1, 3, 2, 3, 1, 1, 3, 2, 76393} T= 77297.
For this case my program runs for 5.204 seconds on my computer (my machine is significantly slower than Topcoder's, but I am quoting runtimes on my machine throughout, since it is the ratio that matters here), whereas
length = {1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1} T = 80000 runs for 0.750 seconds on my machine.

My first assumption was that the reason for this unexpected ratio of runtime measurements (given that in the first case we should expect considerably fewer processor instructions to be executed) was that the processor somehow caches the similar calculations: in my example the calculations are symmetric with respect to all the elements of length, and a really clever processor could use that to avoid repeating the same sequence of instructions. So I tried composing another example, this time with different values in the length array:

length = {1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49, 77943} T = 80000 runs for 0.813 seconds on my machine.

After this example I can no longer explain these timings: my second example seems to require more processor instructions than the test that failed me, and it does not allow the kind of caching I thought was happening in the first example. I have not been able to pin down the cause of this behavior, but I am quite sure it has something to do either with processor caches or with pipelining. I am very curious how these experiments behave on different chipsets, so feel free to comment here.

Also, if anyone more knowledgeable about hardware than me can explain this behavior, it would be appreciated.

Until then, there is a note I should make for myself: when estimating algorithm complexity, do not underestimate processor optimizations. Sometimes they significantly decrease or increase the amortized speed of specific examples.

PS: I wonder whether the Topcoder guys specifically thought of such "processor breaking" examples?


14 years ago, # |
Lol, looks like there's something in my message that is bugging CF. I was including a hyperlink, maybe that's it.

So, there's an interesting discussion about it on TC's forum. It seems the main issue is caused by denormal numbers (the link was pointing to Wikipedia).

Also, when working with doubles it's good to be more prudent and keep the operation count below 100 million. I say this based on geometry problems, which tend to be much slower.
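
For reference, the boundary of the subnormal range of a regular IEEE 754 double can be checked with a few lines of standard C++ (a small self-contained sketch, nothing contest-specific):

#include <cstdio>
#include <cmath>
#include <limits>

int main() {
    // Smallest positive normal double (~2.2e-308) and smallest positive
    // subnormal double (~4.9e-324). Values strictly between zero and the
    // normal minimum are stored with reduced precision and are typically
    // handled by much slower hardware/microcode paths.
    std::printf("smallest normal:    %g\n", std::numeric_limits<double>::min());
    std::printf("smallest subnormal: %g\n", std::numeric_limits<double>::denorm_min());
    std::printf("1e-310 is subnormal: %d\n", std::fpclassify(1e-310) == FP_SUBNORMAL);
    return 0;
}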

14 years ago, # |

Lol, your code didn't compile. :D Had to fix it.

Just add a line
if( dp[idx][k] > 1e-300 )
before
dp[i][k] += mul * dp[idx][k];
and you get a solution that runs in 867 ms.

And the reason is subnormal numbers below 1e-306, which make the program run about 50-100 times slower.
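
In context, the patched inner loop would look roughly like this (1e-300 is just a convenient cutoff, comfortably above the point where doubles become subnormal):

                if(idx >= 0)  {
                    for(int k = 0; k < n; k++) {
                        // skip contributions that are already negligible, so the
                        // dp values never fall into the subnormal range, where
                        // arithmetic is dramatically slower
                        if( dp[idx][k] > 1e-300 )
                            dp[i][k] += mul * dp[idx][k];
                    }
                }
                else
                    dp[i][j] += mul;
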
  • 14 years ago, # ^ |
    Probably the reason the code did not compile is that it is just a cut of the relevant code (and I forgot to put "};" at the end of the class).

    Also, I still do not understand why their test introduces more subnormal values than mine; that should account for the performance difference.

    Still, thanks for the response, now I know more :)
  • 14 years ago, # ^ |
    Cool! Even a "naive" solution in Java passes the system tests when subnormal numbers are rounded to zero. Thanks!
14 years ago, # |
The exact same thing happened to me. I have learned from this experience what to test for when checking time limits. Even more importantly, we both qualified for Round 1.
14 years ago, # |
I thought that the reason my program failed was memorisation; that's why I started this thread on Topcoder. It is very interesting that the actual problem is not that, but the use of doubles.
14 years ago, # |
It's interesting.
I used 50 '1's with T = 80000 to challenge two O(T * n^2) solutions, but neither challenge succeeded ...
I was shocked by the speed of the Topcoder servers at the time, but now I know doubles are the reason.
14 years ago, # |
If I understood the problem correctly, we have discussed it in this post. The fact is, operations with denormalized doubles, such as 1e-304, are 30 times slower than with "normal" ones:

public class DoubleSpeed {
    static long test(double theValue) {
        long t0 = System.currentTimeMillis();
        double sum = 0;
        // 10^8 additions; when theValue / 10000.0 is subnormal,
        // the arithmetic falls into the slow denormal path
        for (int i = 0; i < 100000000; ++i) {
            sum += theValue / 10000.0;
        }
        // (int) (sum * 0.0) is always 0, but it keeps the JIT
        // from optimizing the loop away
        return (int) (sum * 0.0) + System.currentTimeMillis() - t0;
    }
    public static void main(String[] args) {
        // sample every tenth exponent, plus every exponent near the subnormal boundary
        for (int i = 10; i <= 307; ++i) {
            if (i > 290 || i % 10 == 0) {
                double theValue = Double.parseDouble("1e-" + i);
                System.out.print(theValue + ": ");
                System.out.println(test(theValue));
            }
        }
    }
}


...
1.0E-298: 121
1.0E-299: 122
1.0E-300: 121
1.0E-301: 122
1.0E-302: 121
1.0E-303: 121
1.0E-304: 6374
1.0E-305: 6373
1.0E-306: 6377
1.0E-307: 6375
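
For completeness: on x86-64, where double arithmetic goes through the SSE unit, there is also a hardware-level way to avoid the slow path, namely putting the FPU into flush-to-zero mode so that results which would be subnormal are replaced by zero. A minimal C++ sketch (note that this changes the floating-point behavior of the whole thread, so it is a trade-off rather than a free fix):

#include <cstdio>
#include <xmmintrin.h>   // _MM_SET_FLUSH_ZERO_MODE and the SSE control register

int main() {
    // Flush results that would be subnormal to zero instead of producing them,
    // so the slow microcode path is never taken.
    _MM_SET_FLUSH_ZERO_MODE(_MM_FLUSH_ZERO_ON);

    volatile double tiny = 1e-304;     // volatile keeps the compiler from folding the loop
    double sum = 0;
    for (int i = 0; i < 100000000; ++i)
        sum += tiny / 10000.0;         // the 1e-308 quotient would be subnormal; flushed to 0

    std::printf("%g\n", sum);          // prints 0 under flush-to-zero
    return 0;
}

The loop mirrors the Java benchmark above; with flush-to-zero enabled the quotient never becomes subnormal, so no subnormal arithmetic happens at all. The per-addition guard from the earlier comment achieves a similar effect without changing global FPU state.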