Hello there, welcome to my first Codeforces blog post! Today I’d like to share an educational (and frankly fun) observation on speeding up binary search for the integer square root of large integers. This may not be something you’d use in your everyday code (after all, the built-in sqrtl() is hard to beat in speed), but it’s a neat mathematical trick that can save you seconds when processing many numbers.

Before we dive in, a huge thanks to chromate00. An answer of mine getting hacked in one of his contests sparked this discovery.

Aim

In a typical binary search algorithm for computing the integer square root of a number S, we initialize the search range with:

low = 0
high = S

This approach finds the square root in O(log(S)) iterations. But if we have to compute square roots for a large number of values (say $$$10^8$$$ of them), the total cost adds up. If we instead choose low and high based on the binary structure of S, the search range shrinks to about $$$\sqrt{S}$$$ values, which roughly halves the number of iterations. That is a significant improvement for very large numbers.
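To make the iteration cost concrete, here is a minimal sketch that counts loop iterations for any starting range. The names `SearchResult` and `isqrt_in_range` are hypothetical helpers for illustration, not part of the code in this post, and the sketch assumes `val >= 1` and `low >= 1`.

```cpp
#include <cstdint>

// Hypothetical helper type for illustration (not part of the post's code).
struct SearchResult {
    std::int64_t root;  // largest value in [low, high] whose square is <= val
    int iterations;     // number of loop iterations performed
};

// Binary search for the largest mid in [low, high] with mid * mid <= val.
// Assumes val >= 1, low >= 1, and low * low <= val.
SearchResult isqrt_in_range(std::int64_t val, std::int64_t low, std::int64_t high) {
    SearchResult res{low, 0};
    while (low <= high) {
        std::int64_t mid = low + (high - low) / 2;
        ++res.iterations;
        if (mid <= val / mid) {  // same as mid * mid <= val, without overflow
            res.root = mid;
            low = mid + 1;
        } else {
            high = mid - 1;
        }
    }
    return res;
}
```

For val = 1000000, calling `isqrt_in_range(val, 1, val)` takes around 20 iterations, while a range such as [512, 1024] takes around 10. Both return 1000.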

Code

Here is the optimized version of finding the integer square root for some integer val:

Optimized Code

long long custom_sqrt(long long val)
{
    // optimized binary search
    if (val < 0)
        return -1;
    if (val < 2)
        return val;

    // msb of val
    int msb_Val = 63 - __builtin_clzll(val);

    long long low, high, ans;

    if (__builtin_popcountll(val) == 1)
    {
        // S is a power of 2, so log2(S) = msb_Val exactly
        // and sqrt(S) = 2^(msb_Val / 2), which gives
        // high = 1 << ceil(msb_Val / 2)
        low = 1LL << (msb_Val >> 1);
        high = 1LL << ((msb_Val + 1) >> 1);
    }
    else
    {
        // log2(S) = msb_Val + fraction, with 0 < fraction < 1
        // so log2(S) / 2 < (msb_Val >> 1) + 1, and
        // high = 1 << ((msb_Val >> 1) + 1) is a valid upper bound
        low = 1LL << (msb_Val >> 1);
        high = 1LL << ((msb_Val >> 1) + 1);
    }
    ans = low;
    while (low <= high)
    {
        long long mid = (low + high) >> 1;
        if (mid <= val / mid) // avoid overflow
        {
            ans = mid, low = mid + 1;
        }
        else
        {
            high = mid - 1;
        }
    }
    return ans;
}

Concepts Needed (the Geeky Part)

Exponential Representation:
Any positive number x can be written as:
x = $$$e^{\ln(x)}$$$ (or, equivalently, x = $$$2^{\log_2(x)}$$$)
Floor and Ceiling:
For any real x we have:
$$$\text{floor}(x)$$$ $$$\leq$$$ $$$x$$$ $$$\leq$$$ $$$\text{ceil}(x)$$$
Bit Shifting and Powers of 2:
For a non-negative integer x smaller than the bit width of the type, $$$2^x$$$ is equivalent to 1 << x.
Integer and Fractional Parts:
Every real number x can be decomposed as:
x = [x] + {x}, where [x] = floor(x) is the integer part and {x} is the fractional part, with $$$0 \leq \{x\} < 1$$$.
Scaling the Fraction:
Dividing the fractional part by a constant k (greater than or equal to 1) keeps it a valid fraction: $$$0 \leq \{x\}/k < 1$$$. Note that {x}/k is in general not equal to {x/k}. What the proof needs is that for an integer m, floor((m + {x})/2) = floor(m/2).
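As a quick numeric illustration of the integer/fractional decomposition, the sketch below checks that halving m + f and taking the floor agrees with the shift m >> 1. The helper names `frac_part` and `halving_matches_shift` are hypothetical, and the check assumes m >= 0 and 0 <= f < 1.

```cpp
#include <cmath>

// Hypothetical helper: fractional part {x} = x - floor(x), always in [0, 1).
double frac_part(double x) {
    return x - std::floor(x);
}

// For an integer m >= 0 and a fraction f in [0, 1), halving m + f and
// flooring gives the same result as the integer shift m >> 1.
bool halving_matches_shift(int m, double f) {
    return static_cast<int>(std::floor((m + f) / 2.0)) == (m >> 1);
}
```

For example, $$$\log_2(1000) \approx 9.97$$$, so m = 9 and f is about 0.97. Halving gives about 4.98, whose floor is 4, the same as 9 >> 1.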

The Proof

Let’s denote:
- $$$\text{S}$$$ as our input number,
- $$$\text{A}$$$ as its square root, so $$$\text{A}$$$ = $$$\sqrt{S}$$$

Since any number can be expressed using logarithms, we have:

$$$\text{A}$$$ = $$$\sqrt{S}$$$ = $$$2^{\log_2(\sqrt{S})}$$$

Because we are interested in the integer part of $$$\text{A}$$$ (let’s call it $$$\text{P}$$$), by applying floor/ceiling properties we get:

$$$2^{\text{floor}(\log_2(\sqrt{S}))}$$$ $$$\leq$$$ $$$\text{P}$$$ $$$\leq$$$ $$$2^{\text{ceil}(\log_2(\sqrt{S}))}$$$

Using the equivalence with bit shifting, we can rewrite these bounds as:

(1 << $$$\text{floor}(\log_2(\sqrt{S}))$$$) $$$\leq$$$ $$$\text{P}$$$ $$$\leq$$$ (1 << $$$\text{ceil}(\log_2(\sqrt{S}))$$$)

Now, here’s the key insight:
When $$$\text{S}$$$ is represented in binary, the floor of $$$\log_2(S)$$$ is the index of the most significant set bit (MSB). For $$$S > 0$$$, $$$\text{MSB}(S)$$$ can be found as follows:

$$$\text{MSB}(S)$$$ = $$$63 -$$$ __builtin_clzll(S)

Since $$$\log_2(\sqrt{S}) = \log_2(S) / 2$$$ and $$$\log_2(S)$$$ = $$$\text{MSB}(S)$$$ + fraction with 0 $$$\leq$$$ fraction < 1, we can estimate:

Lower bound:
$$$\text{P}$$$ $$$\geq$$$ 1 << $$$\text{floor}(\text{MSB}(S) / 2)$$$

(Using bit-shift: 1 << (MSB(S) >> 1))

Upper bound:
$$$\text{P}$$$ $$$\leq$$$ 1 << $$$(\text{floor}(\text{MSB}(S) / 2) + 1)$$$

(Using bit-shift: 1 << ((MSB(S) >> 1) + 1))

These bounds are computed in O(1) time and reduce the range for binary search from $$$[0, S]$$$ to an interval of width $$$2^{\text{floor}(\text{MSB}(S)/2)}$$$, which is at most $$$\sqrt{S}$$$, around the actual square root.
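The bounds above can be sketched as a small standalone function, together with a check that they really enclose the square root. The names `sqrt_bounds` and `bounds_enclose_root` are hypothetical, and the sketch assumes 1 <= s <= INT64_MAX and a GCC/Clang compiler (for `__builtin_clzll`).

```cpp
#include <cstdint>
#include <utility>

// Returns {low, high} with low <= floor(sqrt(s)) <= high.
// Assumes s >= 1: __builtin_clzll(0) is undefined.
std::pair<std::int64_t, std::int64_t> sqrt_bounds(std::int64_t s) {
    // Index of the most significant set bit, i.e. floor(log2(s)).
    int msb = 63 - __builtin_clzll(static_cast<unsigned long long>(s));
    std::int64_t low = std::int64_t{1} << (msb >> 1);
    std::int64_t high = std::int64_t{1} << ((msb >> 1) + 1);
    return {low, high};
}

// True if low^2 <= s < high^2, checked by division so nothing overflows.
bool bounds_enclose_root(std::int64_t s) {
    std::pair<std::int64_t, std::int64_t> b = sqrt_bounds(s);
    return b.first <= s / b.first && b.second > s / b.second;
}
```

For S = 1000 the MSB index is 9, so the range is [16, 32], and the integer square root 31 lies inside it.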

Results

I benchmarked the square root calculation for all integers less than $$$2 \times 10^8$$$:

Standard binary search (with range 0 to S): 29 seconds
Optimized binary search (with our computed bounds): 12 seconds
Built-in function (sqrtl()): 5 seconds

While the built-in function remains the fastest (thanks to hardware acceleration), the optimized binary search is significantly faster than the standard approach.

You can run this on your machine and see the difference:

Code Snippet

#include <bits/stdc++.h>
#include <chrono>
using namespace std;
#define ll long long int
#define vll vector<ll>
#define kxxprintln(x) cout << x << endl;

ll custom_sqrt(ll val)
{
    // optimized binary search
    if (val < 0)
        return -1;
    if (val < 2)
        return val;

    // msb of val
    int msb_Val = 63 - __builtin_clzll(val);

    ll low, high, ans;

    if (__builtin_popcountll(val) == 1)
    {
        // S is a power of 2, so log2(S) = msb_Val exactly
        // and sqrt(S) = 2^(msb_Val / 2), which gives
        // high = 1 << ceil(msb_Val / 2)
        low = 1LL << (msb_Val >> 1);
        high = 1LL << ((msb_Val + 1) >> 1);
    }
    else
    {
        // log2(S) = msb_Val + fraction, with 0 < fraction < 1
        // so log2(S) / 2 < (msb_Val >> 1) + 1, and
        // high = 1 << ((msb_Val >> 1) + 1) is a valid upper bound
        low = 1LL << (msb_Val >> 1);
        high = 1LL << ((msb_Val >> 1) + 1);
    }
    ans = low;
    while (low <= high)
    {
        ll mid = (low + high) >> 1;
        if (mid <= val / mid) // avoid overflow
        {
            ans = mid, low = mid + 1;
        }
        else
        {
            high = mid - 1;
        }
    }
    return ans;
}
ll sqt(ll val)
{
    // standard binary search
    if (val < 2)
        return val;
    ll low = 0, high = val, ans = 0;
    while (low <= high)
    {
        ll mid = (low + high) >> 1;
        if (mid <= val / mid)
        {
            ans = mid;
            low = mid + 1;
        }
        else
        {
            high = mid - 1;
        }
    }
    return ans;
}
int main()
{
    ios::sync_with_stdio(false);
    cin.tie(nullptr);
    int till = 2e8 + 5e5;

    // reference results from the built-in; this allocates about 1.6 GB
    vll arr(till, 0);

    kxxprintln("time taken to calculate square root of all integers less than " << till << " now starting:");

    auto start1 = chrono::high_resolution_clock::now();

    for (int i = 0; i < till; i++)
    {
        // builtin
        arr[i] = sqrtl(i);
    }

    auto end1 = chrono::high_resolution_clock::now();

    auto duration1 = chrono::duration_cast<chrono::milliseconds>(end1 - start1);

    kxxprintln("time taken: " << duration1.count() << " ms using inbuilt function");

    auto start2 = chrono::high_resolution_clock::now();

    for (int i = 0; i < till; i++)
    {
        // optimized
        ll x = custom_sqrt(i);
        // on a mismatch with the built-in result, print the value and stop
        if (x - arr[i])
        {
            kxxprintln(i);
            return 0;
        }
    }

    auto end2 = chrono::high_resolution_clock::now();

    auto duration2 = chrono::duration_cast<chrono::milliseconds>(end2 - start2);

    kxxprintln("time taken: " << duration2.count() << " ms using optimized binary search");

    auto start3 = chrono::high_resolution_clock::now();

    for (int i = 0; i < till; i++)
    {
        // standard
        ll x = sqt(i);
        // on a mismatch with the built-in result, print the value and stop
        if (x - arr[i])
        {
            kxxprintln(i);
            return 0;
        }
    }

    auto end3 = chrono::high_resolution_clock::now();

    auto duration3 = chrono::duration_cast<chrono::milliseconds>(end3 - start3);

    kxxprintln("time taken: " << duration3.count() << " ms using binary search");

    return EXIT_SUCCESS;
}

P.S.

I realize that the proof might be a bit dense at first glance, so I’ve attached a detailed, easy-to-follow handwritten version that should help clarify the details. If you find any flaw in this observation, please do let me know. For all integers up to $$$2 \times 10^8$$$ I found no mismatch with the built-in function, and edge cases like Int64_max also give the correct result.
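If you want to check edge cases without relying on floating point, here is a minimal sketch. `isqrt_narrowed` is a condensed restatement of the search from this post, and `is_floor_sqrt` is a hypothetical checker that uses `unsigned __int128`, a GCC/Clang extension, so treat both as assumptions about your toolchain. An integer-only check also avoids depending on the precision of long double, which differs between platforms.

```cpp
#include <cstdint>

// Condensed restatement of the narrowed-range search (GCC/Clang builtins).
std::int64_t isqrt_narrowed(std::int64_t val) {
    if (val < 0) return -1;
    if (val < 2) return val;
    int msb = 63 - __builtin_clzll(static_cast<unsigned long long>(val));
    std::int64_t low = std::int64_t{1} << (msb >> 1);
    std::int64_t high = std::int64_t{1} << ((msb >> 1) + 1);
    std::int64_t ans = low;
    while (low <= high) {
        std::int64_t mid = low + (high - low) / 2;
        if (mid <= val / mid) {  // mid * mid <= val, without overflow
            ans = mid;
            low = mid + 1;
        } else {
            high = mid - 1;
        }
    }
    return ans;
}

// Hypothetical checker: true if p == floor(sqrt(val)), integers only.
// Assumes val >= 0 and p >= 0.
bool is_floor_sqrt(std::int64_t val, std::int64_t p) {
    unsigned __int128 v = static_cast<unsigned __int128>(val);
    unsigned __int128 q = static_cast<unsigned __int128>(p);
    return q * q <= v && (q + 1) * (q + 1) > v;
}
```

For instance, `isqrt_narrowed(INT64_MAX)` returns 3037000499, and `is_floor_sqrt(INT64_MAX, 3037000499)` confirms it.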

PROOF

Final Thoughts

While this optimized approach isn’t likely to replace the built-in functions in production code, it’s a fun and educational exercise that showcases how a deep understanding of binary representations and bitwise operations can lead to practical performance improvements. If you’re interested in algorithm optimization or just love geeky math, I hope you found this exploration as exciting as I did.

Happy coding and keep optimizing!

Feel free to leave your comments, suggestions, or critiques below. I’d love to hear your thoughts and any further optimizations you might suggest!
