Ok, so first of all, I don't know if this already exists, but I'll share my algorithm here anyway.
Motivation!
Radix sort with base > 10?
My motivation came when I wondered: why can't we use base 1e5 or 1e6 in radix sort? That led me to the following idea.
Most problems on CF limit values to <= 1e9, so my algorithm focuses mainly on that range.
Algorithm.
Let us define a constant base. Now we make groups based on ai % base:
- group[j] -> contains ai / base for every ai with ai % base == j.
- counts[j] -> stores how many times ai / base equals j. Note that both arrays have length base.
Next question: how can we find the index (in the sorted array) of any value of any group?
Observe: let x be any value in group[i]; then value = x * base + i. Its index is at least the prefix sum of counts up to x - 1, and among values with the same x, it comes after those whose group number is smaller.
Final touch
- After making groups and counts, build the prefix-sum array of counts (let us call that array pref).
- Define another array live of length base (initialize all elements to 0).
- Iterate over groups 0 -> base - 1. For every element x in group i do the following:
- value = x * base + i
- the index of this value = pref[x - 1] + live[x]
- now do live[x] += 1
- update array[index] = value.
- The sorting is done!
Implementation!
void sortit(vector<int> &arr, int base = 1 << 15) {
    int n = arr.size();
    vector<vector<int>> grps(base);
    vector<int> cnts(base + 1), live(base);
    for (int i = 0; i < n; i++) {
        int z = arr[i] / base;
        grps[arr[i] % base].push_back(z); // adding each element to its respective group
        cnts[z + 1]++;                    // increasing the count of z (shifted by one)
    }
    for (int i = 1; i < base; i++) cnts[i] += cnts[i - 1]; // making the prefix array
    for (int i = 0; i < base; i++) {
        for (int &x : grps[i]) arr[cnts[x] + live[x]++] = x * base + i;
    }
}
Analysis.
For all values less than base^2 (base = 2^15), the time complexity comes out to O(n + base): 2n from the two loops over n, plus base for the array initialization and the prefix array. You can choose base depending on the max value of the array, n, and memory.
For n > base it becomes faster and faster than the C++ std::sort.
Results.
These results are from runs on my laptop. UPD: added a radix sort comparison as well!
Size 1000 Sorting....
Time taken by sortit() -> 1ms
Time taken by cpp sort() -> 0ms
Time taken by radix sort(base 256)(4 rounds) -> 0ms
Time taken by radix sort(base 1 << 16)(2 rounds) -> 4ms
Size 10000 Sorting....
Time taken by sortit() -> 1ms
Time taken by cpp sort() -> 1ms
Time taken by radix sort(base 256)(4 rounds) -> 1ms
Time taken by radix sort(base 1 << 16)(2 rounds) -> 5ms
Size 100000 Sorting....
Time taken by sortit() -> 9ms
Time taken by cpp sort() -> 15ms
Time taken by radix sort(base 256)(4 rounds) -> 5ms
Time taken by radix sort(base 1 << 16)(2 rounds) -> 18ms
Size 1000000 Sorting....
Time taken by sortit() -> 40ms
Time taken by cpp sort() -> 185ms
Time taken by radix sort(base 256)(4 rounds) -> 54ms
Time taken by radix sort(base 1 << 16)(2 rounds) -> 90ms
Size 10000000 Sorting....
Time taken by sortit() -> 316ms
Time taken by cpp sort() -> 2171ms
Time taken by radix sort(base 256)(4 rounds) -> 572ms
Time taken by radix sort(base 1 << 16)(2 rounds) -> 508ms
Size 100000000 Sorting....
Time taken by sortit() -> 3056ms
Time taken by cpp sort() -> 24149ms
Time taken by radix sort(base 256)(4 rounds) -> 5747ms
Time taken by radix sort(base 1 << 16)(2 rounds) -> 4749ms
Size 1000000000 Sorting....
Time taken by sortit() -> 35071ms
Time taken by cpp sort() -> 277987ms
Time taken by radix sort(base 256)(4 rounds) -> 65696ms
Time taken by radix sort(base 1 << 16)(2 rounds) -> 50732ms
For higher values of n it tends to pull further ahead, so it is well suited to large n.
If you find any similar or same algorithm, please let me know!

Auto comment: topic has been updated by Jrke (previous revision, new revision, compare).
std::sort() after this blog: bro what did I do :(
https://www.youtube.com/watch?v=Y95a-8oNqps a little late bro.. but keep up the good work!!
Cool, but usually sort is enough
Will it work with non-positive integers?
If you want, it can. You just have to support negative indices in an array, which you can do by treating the middle element as zero or something. Or you can just sort negatives and positives separately, since that only adds O(n) operations anyway.
logn never hurt nobody
I don't want to hurt you, but this algorithm has already existed for a long time (it's called radix sort). It has been used to build suffix arrays (SA).
I don’t think it’s radix sort. It’s more like he’s placing values at their corresponding indices, but in an optimized way. He needs roughly √(max(val)) of storage, so I’d say it’s closer to square root decomposition.
I think this is better for bigger n. Also, for negative numbers you can add 1e9 to every element while making grps, and then subtract 1e9 from each value while placing values back into the array.
If you add 1e9 beforehand, you’ll need a base of √(2×1e9). Also, it’s only efficient for a single test case, not for multiple test cases. You could also say the time complexity is O(√(2×max)), not O(n).
You seem to have provided a new radix sort implementation that saves one array sweep, but pays extra for using vectors and growing them.
Your benchmark should be against standard radix sort with base 256 (4 rounds), base 1<<11 (3 rounds), or base 1<<16 (2 rounds). It is clear that radix sort is a lot faster than standard sort on large arrays. I fail to see how vector creation and capacity expansion can be faster than just doing a radix sort normally with 2 rounds, and empirically 4 rounds seem to do better than 2 rounds anyway.
In summary, please prove it is superior to radix sort with base 256 and 1<<16, but I don’t think this is the case.
With this implementation of radix sort,
I got these outputs:
Size 1000 Sorting....
Size 10000 Sorting....
Size 100000 Sorting....
Size 1000000 Sorting....
Size 10000000 Sorting....
Size 100000000 Sorting....
Wrong implementation: radix sort can be implemented without growing vectors. Also, in both of your implementations, replace modulo/division with the appropriate bitwise AND / bit shift.
In each round: count up (x % base), do one prefix sum on the bin sizes, then write directly into the second array.
Well, radix sort can be even faster.
There's a hard problem in Chinese: https://www.luogu.com.cn/problem/P4604.
Its first task is to sort $$$2\times 10^8$$$ unsigned ints in 6 s, and you even need loop unrolling and other tricks to pass it... You can try it if you want :)
This is amazing! One issue is that $$$O(n+\sqrt{\max(a_i)})$$$ is very hackable: since many CF problems have multiple test cases, you can just spam [1e9] a bunch of times and cause TLE.
Try comparing to the implementation at the top of https://github.com/nealwu/competitive-programming/blob/69e8834975dbd55bbfb3b715caeb925a705fde48/miscellaneous/radix_sort.cc (until line 72)
I just copy-pasted this code up to line 72 and didn't change anything. I also optimized my code as arvindf232 suggested, regarding bitwise AND and bit shifts, thanks!
These were the times taken:
Size 1000 Sorting....
Size 10000 Sorting....
Size 100000 Sorting....
Size 1000000 Sorting....
Size 10000000 Sorting....
Size 100000000 Sorting....
Size 1000000000 Sorting....
Weird, in my local tests neal's implementation is ~3x faster than yours with radix 256 on uniform random inputs, and can be optimized to ~4x by removing the min-max calculation that is not necessary in your case and overlapping the first and last inner loops of adjacent passes. On codeforces it's harder to benchmark precisely, but results are pretty similar.
How do you generate inputs? How do you measure execution time?
i ran this code -> https://pastebin.com/sr3zS8HT
Maybe you didn't enable O2... When O2 is disabled, radix sort is just 100ms slower than
sortit() in $$$10^8$$$ sorting... This is my result with O2:
Without O2:
What 123asdf123 said, and don't use rand().
bruh, your code still has push_back's
Could you share your implementation for "can be optimized to ~4x by removing the min-max calculation that is not necessary in your case and overlapping the first and last inner loops of adjacent passes"?
It's a bit ugly, sorry for that. I had hoped to further improve performance on large uniform arrays with the (MSB-LSB) hybrid version because of cache locality, but it only pays off if the LSB version doesn't fuse scattering with the next histogram computation.
You're right, but I'm not even sure if my program could take the whole input of 1000000000 numbers.
He used this:
However, rand() % (1 << 30) doesn't generate a uniform positive 30-bit integer; it only generates values in [0, 32767]. Jrke
Based on my testing, this is faster: link (implementation)
Isn't this, at larger values of n, just bucketing values into a frequency table and then shifting them into a vector?