Rapleaf Array Absurdity or On streaming problems in disguise
From the interview challenges of an up and coming web startup, three problems that range from the trivial to the impossible. The key to the the solution is to recognize that the setting is close to that of streaming algorithms, which allows for very limited space resources compared to the size of the input. This challenge assumes the data is in an array, an additional benefit not assumed in the streaming literature, but it's close enough to use some of those ideas, since random access to the array doesn't seem to help for the specific questions.
Imagine we have an immutable array of size N which we know to be filled with integers ranging from $0$ to $N-2$, inclusive.
Imagine we have an immutable array of size N which we know to be filled with integers ranging from $0$ to $N-2$, inclusive.
- Write a function that checks to see if the array contains a duplicated entry. This function should run as quickly as possible. That is, complete the follow ing function:
- Suppose we know that the array contains exactly one duplicated entry and that duplicate appears exactly twice. Find, in constant space and time proportional to $N$, the duplicated entry. That is, complete the following function:
- Suppose that we drop the guarantee that the array contains exactly one duplicated entry. Write another find_duplicate function so that it returns one of the duplicated entries. It should still run in constant space and time proportional to $N$.
bool contains_duplicate(int* array, int N) { // ... }Solution: Always true by the pigeonhole principle.
int find_unique_duplicate(int* array, int N) { // ... }Solution: \[\sum_i \mathrm{array}[i] - \frac{(N-2)(N-1)}{2}\] This is a known trick in the field of streaming algorithms.
int find_duplicate(int* array, int N) { // ... }Solution or lack thereof: It's impossible in the asymptotic complexity sense. Check J. Tarui, "Finding a duplicate and a missing item in a stream". In TAMC, pages 128–135, 2007 -- specifically Theorem 1 in Section 3. There is $O(N log N)$ solution, see P. Gopalan, J. Radhakrishnan "Finding duplicates in a data stream". Briefly, one can do a binary search for the value of a duplicated item by splitting the current search interval and counting in one array pass if there are more items above or below the threshold. Pick the interval with the larger count and repeat. If you make the size of the array and size of the range of elements independent, then this solution is O(N log M) and since C ints are a finite set, you could see this as $O(N)$, but if $M = N - 2$ then this is a finite problem and there is no asymptotic analysis one can apply. I guess this is more about clarifying the problem then actually solving it
-- Graham
Here's the question http://domino.research.ibm.com/comm/wwwr_ponder.nsf/Challenges/January2004.html , you can also find a solution that works there.