# Find the odd bag

From a job interview challenge, an interesting probability exercise in two parts. One of the themes here is pretty standard fare. You are given a clearly defined random procedure whose outcome is a mixture of two distributions. The problem is, given a certain set of outcomes, find which of the two distributions it is coming from. For instance, imagine you have to assign one of two classes to an item based on repeated noisy measurements and you know the relative size of the two classes (a priori probability of belonging to one of the two). The second part of the challenge is a bit more interesting but also eccentric. It is asking for a best case outcome that would make it easiest (smallest sample) to detect the class of the item with a certain error probability. I am not aware of any practical statistical question where such a best case problem arises, even if we consider the converse, the worst case outcome. But being used to worst case analysis from my CS training, I came up with an optimality proof based on induction and manipulation of binomial coefficients, which confirms the intuition that a very unlikely, extreme outcome is the best one. The main idea is that when lower bounding an expression including binomial coefficients, it is somehow easier to prove a tight lower bound because the binomial coefficients on the two sides of the inequality are very similar in that case and one can simplify a lot and then use simple algebra. It won’t set the world of Mathematics abuzz, but it seemed interesting enough to share.

Part 1 The following random experiment is described. There are 5 identical bags, 4 of which contain 4 read beads and 96 black ones, the 5th instead has 7 and 93 resp. Select one bag according to the uniform distribution and sample three beads, 1 red and 2 black. What is the probability that the selected bag was the 5th?

Part 2 Let’s go back to the initial condition, pick one bag and then pick one bead at a time from that bag and stop when the probability of having picked the 5th bag is greater than 1/2. In the best case, how many beads do you need to pick?

Solution

Let *N* be the number of beads in each bag, *n* the size of the sample, *m* and *m*′ the number of red beads in each bag with *m*′ > *m* and *k* the number of read beads in the sample. *N* = 100, *n* = 3, *m*′ = 7, *m* = 4 and *k* = 1 in the first part of the interview challenge, with *n* and *k* becoming variable in the second part. Let *X* be the random variable corresponding to the number of red beads present in a sample. Conditional to the knowledge of the bag from which the extraction occurred, this variable has an hypergeometric distribution. Let M be the random variable corresponding to the number of red beads in the chosen bag.

$$
P(X = k|M=m) = \frac{\binom{m}{m}\binom{N-m}{n-k}}{\binom{N}{n}}
$$

that is *X* follows the hypergeometric distribution with parameters *N*, *n*, *m* conditional to having selected a type of bag and

*P*(*M* = *m*) = 4/5

*P*(*M* = *m*′) = 1/5

assuming the uniform distribution in bag selection. We are interested in:

*P*(*M* = *m*′|*X* = *k*)

that is distribution over bag types conditional to the outcome of a random draw. Using the definition of conditional probability we have

$$
P(M = m'|X = k) = \frac{P(M = m' \wedge X = k)}{P(X = k)}
$$

and applying the same definition again we have

$$
P(M = m'|X = k) = \frac{P(X = k | M = m')P(M = m')}{P(X = k)}
$$

The numerator is the product of a hypergeometric distribution with parameters *N*, *n*, *m*′ and a constant. At the numerator, we apply the law of alternatives to get

*P*(*X* = *k*) = *P*(*X* = *k*|*M* = *m*′)*P*(*M* = *m*′) + *P*(*X* = *k*|*M* = *m*)*P*(*M* = *m*)

Combining the last two we have

*P*(*M* = *m*′|*X* = *k*) =

$$
\frac{P(X = k | M = m')P(M = m')}{P(X = k| M= m') P(M = m')+P(X = k| M= m) P(M = m)} =$$

$$
\frac{1}{1+\frac{P(X = k| M= m) P(M = m)}{P(X = k| M= m') P(M = m')}} =$$

$$
\frac{1} {1+ \frac{\frac{\binom{m}{m}\binom{N-m}{n-k}}{\binom{N}{n}} \frac{4}{5}} {\frac{\binom{m}{m}\binom{N-m'}{n-k}}{\binom{N}{n}} \frac{1}{5} }}
$$

which can be simplified to

$$\frac{1}{1+\frac{4\binom{m}{k}\binom{N-m}{n-k}}{\binom{m'}{k}\binom{N-m'}{n-k}}}\qquad(1)
$$

We need only to substitute in the values to obtain the desired probability.

Now to the second part of the challenge, whereby one needs to find the smallest *n* such that for some *k* the above expression is greater than 1/2 (with the other parameters as before and within the allowed range for *n* and *k*). We want to show that the solution is *n* = *k* = 3 and we will do it in two parts. First we will show that Eq. (1) is maximized, for any given *n*, when *k* = *n*, which supports the intuition that the red bead rich bag will be more promptly identified when all the sampled beads are red. Then we will show that the smallest *n* such that Eq. (1) with *k* = *n* is greater than 1/2 is 3. To establish the first part we will show that

$$\frac{1}{1+\frac{4\binom{m}{k'}\binom{N-m}{n-k'}}{\binom{m'}{k'}\binom{N-m'}{n-k'}}} \ge
\frac{1}{1+\frac{4\binom{m}{k}\binom{N-m}{n-k}}{\binom{m'}{k}\binom{N-m'}{n-k}}} \qquad (2)
$$

if and only if *k*′ ≥ *k*. We will prove the special case *k*′ = *k* + 1 from which the general case follows by induction.

By simple algebraic manipulation and substituting *k*′ Eq. (2) is equivalent to:

$$\frac{\binom{m}{k+1}\binom{N-m}{n-k-1}}{\binom{m'}{k+1}\binom{N-m'}{n-k-1}} < \frac{\binom{m}{k}\binom{N-m}{n-k}}{\binom{m'}{k}\binom{N-m'}{n-k}}$$

Expanding the binomial coefficients we get:

$$\frac{\frac{m!}{(k+1)!(m-k-1)!}\frac{(N-m)!}{(n-k-1)!(N-m-n+k+1)!}} {\frac{m'!}{(k+1)!(m'-k-1)!}\frac{(N-m')!}{(n-k-1)!(N-m'-n+k+1)!}} < \frac{\frac{m!}{k!(m-k)!}\frac{(N-m)!}{(n-k)!(N-m-n+k)!}} {\frac{m'!}{k!(m'-k)!}\frac{(N-m')!}{(n-k)!(N-m'-n+k)!}}$$

By simple algebraic manipulations we have:

$$\frac{N-m'-n+k+1}{N-m-n+k+1} < \frac{m'-k}{m-k}$$

Since *m*′ > *m* we can upper bound the left side with 1 and lower bound the left side with 1, which completes this part of the proof.

Now we have established Eq. (2), we know that Eq. (1) is maximized, for every *n*, by setting *k* = *n*. With this substitution our goal becomes:

$$\frac{1}{1+\frac{4\binom{m}{n}}{\binom{m'}{n}}} \ge \frac{1}{2}
$$

which is equivalent to

$$
4\binom{m}{n}\le \binom{m'}{n}
$$

Substituting in the values of *m* and *m*′ and trying *n* ∈ {1, 2, 3} we find that *n* = 3 is the solution.