From the interview challenges of an up and coming web startup, three problems that range from the trivial to the impossible. The key to the the solution is to recognize that the setting is close to that of streaming algorithms, which allows for very limited space resources compared to the size of the input. This challenge assumes the data is in an array, an additional benefit not assumed in the streaming literature, but it's close enough to use some of those ideas, since random access to the array doesn't seem to help for the specific questions.

Imagine we have an immutable array of size N which we know to be filled with integers ranging from $0$ to $N-2$, inclusive.
  1. Write a function that checks to see if the array contains a duplicated entry. This function should run as quickly as possible. That is, complete the follow ing function:
  2. bool contains_duplicate(int* array, int N) { // ... }
    Solution: Always true by the pigeonhole principle.
  3. Suppose we know that the array contains exactly one duplicated entry and that duplicate appears exactly twice. Find, in constant space and time proportional to $N$, the duplicated entry. That is, complete the following function:
  4. int find_unique_duplicate(int* array, int N) { // ... }
    Solution: \[\sum_i \mathrm{array}[i] - \frac{(N-2)(N-1)}{2}\] This is a known trick in the field of streaming algorithms.
  5. Suppose that we drop the guarantee that the array contains exactly one duplicated entry. Write another find_duplicate function so that it returns one of the duplicated entries. It should still run in constant space and time proportional to $N$.
  6. int find_duplicate(int* array, int N) { // ... }
    Solution or lack thereof: It's impossible in the asymptotic complexity sense. Check J. Tarui, "Finding a duplicate and a missing item in a stream". In TAMC, pages 128–135, 2007 -- specifically Theorem 1 in Section 3. There is $O(N log N)$ solution, see P. Gopalan, J. Radhakrishnan "Finding duplicates in a data stream". Briefly, one can do a binary search for the value of a duplicated item by splitting the current search interval and counting in one array pass if there are more items above or below the threshold. Pick the interval with the larger count and repeat. If you make the size of the array and size of the range of elements independent, then this solution is O(N log M) and since C ints are a finite set, you could see this as $O(N)$, but if $M = N - 2$ then this is a finite problem and there is no asymptotic analysis one can apply. I guess this is more about clarifying the problem then actually solving it

Comments

Anonymous
The trick for this problem is to start from the N'th location in the array: the preconditions ensure that there is no legitimate cycle which can bring you back there, hence you will find a duplicate element via Floyd's alg.
-- Graham
piccolbo
Actually, thinking a bit harder about what constant space implies, what we have here is a read only Turing machine that can only recognize regular languages. Since the language of duplicate-free strings is not closed under concatenation (with itself, for instance), it is not regular and not recognizable by a finite state machine, which is equivalent to a read only Turing Machine. Which prompts a correction about problem 2, since to be able to perform that sum one needs at least log space. Constant space is a very severe restrictions and i am not sure people really mean it when they formulate this kind of problems. A constant sized set of integer variables is not the same thing unless we have a bound on the range of those variables.
piccolbo
Thanks for the pointer. That article reproduces Floyd's cycle finding algorithm (http://en.wikipedia.org/wiki/Floyd%27s_cycle-finding_algorithm#Tortoise_and_hare) with a wrong attribution. Cycle finding is not the same as duplicate finding. 2 1 represents a cycle but has no duplicates. In the IBM version it seems we are guaranteed there is at least one duplicate, so this counterexample doesn't work, but 2 1 3 3 does (the algorithm, depending on the initialization, could get stuck in the first cycle and never find the repeating element). It's surprising that this is is reported as the correct solution, and the attribution is also surprising. Maybe I am missing something, but I think the error comes from confusing the statement "if there is a duplicate, there is a cycle", which is true, with the converse "if there is a cycle, there is a duplicate" or the related "if we find a cycle, it must contain a duplicate". You are correct in that I did not prove formally that for the purpose of duplicate finding a read only array with linear time and constant read/write memory is equivalent to the streaming setting. Random access to the read only array alone could prove important. So the negative result by Tarui might or might not apply. My intuition is that since every permutation P(A) has the same number of duplicates, random access won't help. But since one of those permutations is the sorted permutation, which makes the problem trivial, maybe my intuition is wrong. Actually, would love to see an algorithm, but the one you pointed to is not it.
Anonymous
It's possible because you're working with an array not a stream.

Here's the question http://domino.research.ibm.com/comm/wwwr_ponder.nsf/Challenges/January2004.html , you can also find a solution that works there.