2 sum problem expanded to distributed systems/map reduce

So recently I was interviewing for a big data engineer role, I had cleared the first round which was just DSA, SQL, and discussion on the big data tech that I had worked on, in the second round I came across an interesting problem so I wanted to share it with you all, it was same as 2 sum but i was expected to use map-reduce:

Problem statement:

There are a large number of numbers residing on multiple nodes (storage + commodity compute), you have one main compute server where you can perform your computing, find all distinct pairs which equal to a sum of "target" and the count of those pairs, now you can't load all the numbers on your main compute as you do not have enough storage to do so.

find all x+y=target

The solution the interviewer expected:

approach: Map-Reduce:

send a map program to all the nodes where the map function calculates the following :

(number,count(number))

next is the reduce phase where we use the following hashing algorithm in order to assign the above-calculated info in various nodes into buckets in our main compute

The bucket a given number is assigned to, is calculated as follows: min(number, target-number)

in that way, a pair, if they have a sum equal to target, will be assigned to the same bucket and we can compute our answers easily (edge case for equal numbers but yeah that can be handled somehow)

Now I know this question has a bunch of loop holes but the I hope the intent of the question was understood.

This was the first time I came across such a problem, can someone recommend me resources to learn/practice such problems?

#	User	Rating
1	Benq	3792
2	VivaciousAubergine	3647
3	Kevin114514	3603
4	jiangly	3583
5	turmax	3559
6	tourist	3541
7	strapple	3515
8	ksun48	3461
9	dXqwq	3436
10	Otomachi_Una	3413

#	User	Contrib.
1	Qingyu	157
2	adamant	153
3	Um_nik	147
3	Proof_by_QED	147
5	Dominater069	145
6	errorgorn	142
7	cry	139
8	YuukiS	135
9	TheScrasse	134
10	chromate00	133

Saucemaster102's blog