Ok, so first of all, I don't know if this already exists, but I'll share my algorithm here anyway.
Motivation!
Radix sort with base > 10?
My motivation came when I wondered: why can't we use base 1e5 or 1e6 in radix sort? That led me to the following idea.
Most problems on CF limit values to <= 1e9, so my algorithm focuses mainly on that range.
Algorithm.
Let us define a constant base. Now we make groups based on ai % base:
- group[j] -> contains ai / base for every ai with ai % base == j.
- counts[j] -> stores how many times ai / base equals j. Note that both arrays have length base.
Next question: how can we find the index (in the sorted array) of any value of any group?
Observe: let x be any value in group[i]; then value = x * base + i. Its index is at least the prefix sum of counts up to x - 1, and among values with the same x, it comes after those whose group number is smaller.
Final touch
- After making groups and counts, build the prefix-sum array of counts (let us call that array pref).
- Define another array live of length base (initialize all elements to 0).
- Iterate over groups 0 -> base - 1. For every element x in group i do the following:
- value = x * base + i
- the index of this value = pref[x - 1] + live[x]
- now do live[x] += 1
- update array[index] = value.
- The sorting is done!
Implementation!
void sortit(vector<int> &arr, int base = 1 << 15) {
    int n = arr.size();
    vector<vector<int>> grps(base);
    vector<int> cnts(base + 1), live(base);
    for (int i = 0; i < n; i++) {
        int z = arr[i] / base;
        grps[arr[i] % base].push_back(z); // adding each element to its respective group
        cnts[z + 1]++;                    // increasing the count of z (shifted by one)
    }
    for (int i = 1; i < base; i++) cnts[i] += cnts[i - 1]; // making the prefix array
    for (int i = 0; i < base; i++) {
        for (int &x : grps[i]) arr[cnts[x] + live[x]++] = x * base + i;
    }
}
Analysis.
For all values less than base^2 (base = 2^15), the time complexity comes out to O(n + base): 2n from the two loops over n, plus base for the array initialization and the prefix array. You can choose base depending on the max value of the array, n, and memory.
For n > base it becomes faster and faster than the C++ std::sort.
Results.
These results are from runs on my laptop. UPD: added a radix sort comparison as well!
Size 1000 Sorting....
Time taken by sortit() -> 1ms
Time taken by cpp sort() -> 0ms
Time taken by radix sort(base 256)(4 rounds) -> 0ms
Time taken by radix sort(base 1 << 16)(2 rounds) -> 4ms
Size 10000 Sorting....
Time taken by sortit() -> 1ms
Time taken by cpp sort() -> 1ms
Time taken by radix sort(base 256)(4 rounds) -> 1ms
Time taken by radix sort(base 1 << 16)(2 rounds) -> 5ms
Size 100000 Sorting....
Time taken by sortit() -> 9ms
Time taken by cpp sort() -> 15ms
Time taken by radix sort(base 256)(4 rounds) -> 5ms
Time taken by radix sort(base 1 << 16)(2 rounds) -> 18ms
Size 1000000 Sorting....
Time taken by sortit() -> 40ms
Time taken by cpp sort() -> 185ms
Time taken by radix sort(base 256)(4 rounds) -> 54ms
Time taken by radix sort(base 1 << 16)(2 rounds) -> 90ms
Size 10000000 Sorting....
Time taken by sortit() -> 316ms
Time taken by cpp sort() -> 2171ms
Time taken by radix sort(base 256)(4 rounds) -> 572ms
Time taken by radix sort(base 1 << 16)(2 rounds) -> 508ms
Size 100000000 Sorting....
Time taken by sortit() -> 3056ms
Time taken by cpp sort() -> 24149ms
Time taken by radix sort(base 256)(4 rounds) -> 5747ms
Time taken by radix sort(base 1 << 16)(2 rounds) -> 4749ms
Size 1000000000 Sorting....
Time taken by sortit() -> 35071ms
Time taken by cpp sort() -> 277987ms
Time taken by radix sort(base 256)(4 rounds) -> 65696ms
Time taken by radix sort(base 1 << 16)(2 rounds) -> 50732ms
For higher values of n it tends to pull further ahead, so it is well suited to large n.
If you find any similar or same algorithm, please let me know!

Auto comment: topic has been updated by Jrke (previous revision, new revision, compare).
std::sort() after this blog: bro what did I do :(
https://www.youtube.com/watch?v=Y95a-8oNqps a little late bro.. but keep up the good work!!
Cool, but usually sort is enough
Will it work with non-positive integers?
If you want, it can. You just have to support negative indices in an array, which you can do by treating the middle element as zero or something. Or you can just sort negatives and positives separately, since that only adds O(n) operations anyway.
logn never hurt nobody
I don't want to hurt you, but this algorithm has already existed for a long time (it's called radix sort). It has been used to build suffix arrays (SA).
I don’t think it’s radix sort. It’s more like he’s placing values at their corresponding indices, but in an optimized way. He needs roughly √(max(val)) of storage, so I’d say it’s closer to square root decomposition.
I think this is better for bigger n. Also, for negative numbers you can add 1e9 to every element while making grps, and then subtract 1e9 from each value while placing values back into the array.
If you add 1e9 beforehand, you’ll need a base of √(2×1e9). Also, it’s only efficient for a single test case, not for multiple test cases. You could also say the time complexity is O(√(2×max)), not O(n).
You seem to have provided a new radix sort implementation that saves one array sweep, but pays extra for using vectors and growing them.
Your benchmark should be against standard radix sort with base 256 (4 rounds), base 1<<11 (3 rounds), or base 1<<16 (2 rounds). It is clear that radix sort is a lot faster than standard sort on large arrays. I fail to see how vector creation and capacity expansion can be faster than just doing a radix sort normally with 2 rounds, and empirically 4 rounds seem to do better than 2 rounds anyway.
In summary, please prove it is superior to radix sort with base 256 and 1<<16, but I don’t think this is the case.
With this implementation of radix sort,
I got these outputs:
Size 1000 Sorting....
Size 10000 Sorting....
Size 100000 Sorting....
Size 1000000 Sorting....
Size 10000000 Sorting....
Size 100000000 Sorting....
Wrong implementation: radix sort can be implemented without growing vectors. Also, in both of your implementations, replace modulo/division with the appropriate bitwise AND / bit shift.
In each round: count up (x % base), do one prefix sum on the bin sizes, then write directly into the second array.
Well, radix sort can be even faster.
There's a hard problem in Chinese: https://www.luogu.com.cn/problem/P4604.
Its first task is to sort $$$2\times 10^8$$$ unsigned ints in 6 s, and you even need loop unrolling and other tricks to pass it... You can try it if you want :)
This is amazing! One issue is that $$$O(n+\sqrt{\max(a_i)})$$$ is very hackable: since many CF problems have multiple test cases, you can just spam [1e9] a bunch of times and cause TLE.
Try comparing to the implementation at the top of https://github.com/nealwu/competitive-programming/blob/69e8834975dbd55bbfb3b715caeb925a705fde48/miscellaneous/radix_sort.cc (until line 72)
I just copy-pasted this code up to line 72 and didn't change anything. I also optimized my code as arvindf232 suggested, regarding bitwise AND and bit shifts, thanks!
These were the times taken:
Size 1000 Sorting....
Size 10000 Sorting....
Size 100000 Sorting....
Size 1000000 Sorting....
Size 10000000 Sorting....
Size 100000000 Sorting....
Size 1000000000 Sorting....
Weird, in my local tests neal's implementation is ~3x faster than yours with radix 256 on uniform random inputs, and can be optimized to ~4x by removing the min-max calculation that is not necessary in your case and overlapping the first and last inner loops of adjacent passes. On codeforces it's harder to benchmark precisely, but results are pretty similar.
How do you generate inputs? How do you measure execution time?
i ran this code -> https://pastebin.com/sr3zS8HT
Maybe you didn't enable O2... When O2 is disabled, radix sort is just 100ms slower than
sortit() in $$$10^8$$$ sorting... This is my result with O2:
Without O2:
What 123asdf123 said, and don't use rand().
bruh, your code still has push_back's
Could you share your implementation for "can be optimized to ~4x by removing the min-max calculation that is not necessary in your case and overlapping the first and last inner loops of adjacent passes"?
It's a bit ugly, sorry for that. I had hoped to further improve performance on large uniform arrays with the (MSB-LSB) hybrid version because of cache locality, but it only pays off if the LSB version doesn't fuse scattering with the next histogram computation.
You're right, but I'm not even sure if my program could take the whole input of 1000000000 numbers.
He used this:
However, rand() % (1 << 30) doesn't generate a uniform positive 30-bit integer; it only generates values in [0, 32767]. Jrke
Based on my testing, this is faster: link (implementation)
Isn't this, at larger values of n, just bucketing values into a frequency table and then shifting them into a vector?