It is a well-known problem for beginners (after learning searching minimum through array): to choose M
smallest elements of an array of size N
.
Here is an evolution of this problem on bigdata scale:
We have terabytes of log-files, say with
N
records. Each record contains value for some parameterA
(among other auxiliary data). We want to chooseM
(say, million) records with smallest values of this parameter. How to manage with it?
There are two difficulties:
- source data could not be stored into RAM, so we could not access arbitrary element in
O(1)
— instead we rely on sequential processing; - algorithm should be parallelizable since
N
is about1e12
and single pass with a single thread will took about a day (however source data are stored in distibuted file system and are accessible by fragments of about100Mb - 1Gb
so several processors may work with different fragments simultaneously).
What approaches there are?
I came up with few basic ideas:
Go through source data and store them into binary heap of maximum size
M
— i.e. next element is skipped if is greater than maximum of the heap — and if it is otherwise stored, then the maximum is popped out of a heap to preserve the size. It will work with aboutN + M*log(M)
but I know not how to parallelize the binary heap... So probably it will not do.If
M
itself is so large that could not be stored in RAM we can store elements in the list instead of binary heap. After list grows to size2M
for example, we sort it and cut to the sizeM
throwing away unnecessary elements and proceeding. This is well paralelizable but speed is aboutN*log(M)
I think.I like stochastic variant: just pass through data and choose all for which
A
is less than some thresholdk
. How to determine this threshold? For example we can make pre-pass and choose randomly as many records as could be stored into RAM. Then chose among them few minimal (in proportion toM/N
) and assign the maximum of these minimums as thresholdk
. Of course we can end up with few or less thanM
elements (but we can artificially increasek
and at the end throw away the surplus).
However all these variants are not extremely good. Could you propose something better, please?