
Randomised Algorithms and Order Statistics

As an example of randomised algorithms, we shall analyse the time complexity of quicksort in detail (for both cases: fixed pivot and random pivot).

Quicksort is a classic divide-and-conquer sorting algorithm proposed by C.A.R. Hoare in 1961. It sorts “in place” and is very fast in practice as it makes use of locality of reference (caching on modern processors).

The algorithm for quicksort is described as follows (a minimal code sketch is given after the list):

  1. Divide: Partition the array into 2 subarrays around a pivot $x$ such that elements in the lower subarray $\leq x \leq$ elements in the upper subarray. This is done by comparing each element with the pivot and performing swaps as necessary. It takes $\Theta(n)$ time to partition an array of $n$ elements since exactly $n - 1$ comparisons are performed (i.e., comparing each element with the pivot).
  2. Conquer: Recursively sort the 2 subarrays
  3. Combine: Trivial.
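
To make the above steps concrete, here is a minimal Python sketch of quicksort (our own illustration, using Lomuto-style partitioning; the function names and the choice of partitioning scheme are assumptions, not a prescribed implementation). The `random_pivot` flag switches between the fixed-pivot and random-pivot variants discussed just below:

```python
import random

def partition(arr, lo, hi, pivot_index):
    """Partition arr[lo..hi] around arr[pivot_index]; return the pivot's final position."""
    arr[pivot_index], arr[hi] = arr[hi], arr[pivot_index]   # move the pivot to the end
    pivot, store = arr[hi], lo
    for k in range(lo, hi):                                 # exactly (hi - lo) comparisons with the pivot
        if arr[k] <= pivot:
            arr[store], arr[k] = arr[k], arr[store]
            store += 1
    arr[store], arr[hi] = arr[hi], arr[store]               # place the pivot between the two subarrays
    return store

def quicksort(arr, lo=0, hi=None, random_pivot=False):
    if hi is None:
        hi = len(arr) - 1
    if lo >= hi:                                            # arrays of size 0 or 1 are already sorted
        return
    pivot_index = random.randint(lo, hi) if random_pivot else lo   # fixed pivot = first element
    p = partition(arr, lo, hi, pivot_index)
    quicksort(arr, lo, p - 1, random_pivot)                 # conquer: sort the lower subarray
    quicksort(arr, p + 1, hi, random_pivot)                 # conquer: sort the upper subarray

a = [5, 2, 9, 1, 7, 3]
quicksort(a, random_pivot=True)
print(a)   # [1, 2, 3, 5, 7, 9]
```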

There are different variants of quicksort: one that uses a fixed pivot (say the first element as the pivot) and another that uniformly at random picks an element as the pivot. We shall discuss both versions separately but it is important to note the difference:

  • In the case of a fixed pivot, there is no randomness inherent in the algorithm. Given the same input, the algorithm executes in exactly the same manner every time it is run. Here, the randomness lies in the distribution of the input. So, we talk about average case analysis, which describes the behaviour of the algorithm over all possible inputs, assuming each input permutation is equally likely. Note that this version is NOT a randomised algorithm.
  • In the case of random pivot, even for the same input, the algorithm can execute differently each time it is run. This means that we don’t have to consider the randomness of the input, since randomness is now transferred to be within the algorithm itself. Any input can be “good” or “bad” depending on the choice of pivot for that particular run of the algorithm (so there’s no point talking about the input distribution). In such cases, we are interested in calculating the expected running time, i.e., how long our algorithm takes on average, given any input (might even be fixed input - doesn’t matter).

It is important to understand where the randomness lies.

think

Q. Why do we often use random pivots in quicksort?

Ans: Using random pivots makes quicksort insensitive to the input distribution → time taken does not depend on the initial permutation but on the choice of pivot. So, even if the inputs are not chosen uniformly at random but by an adversary, the algorithm is robust in terms of the running time since there is no “bad input” anymore.

Analysis of Fixed-Pivot QuickSort

Worst Case Analysis

Denote $T(n)$ to be the worst-case running time on an array of $n$ elements. Denote $A(n)$ to be the average-case running time on an array of $n$ elements.

In the worst case, we always pick the smallest element (or largest element) as the pivot (e.g. quicksort trying to sort an already sorted array by picking the first element as the pivot). Then, $T(n) = T(0) + T(n-1) + \Theta(n)$.

This can be easily solved to give: $T(n) = \Theta(n^2)$ (using arithmetic series).

So, the worst case running time of a fixed pivot quicksort algorithm is $O(n^2)$.
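
For completeness, unrolling the recurrence makes the arithmetic series explicit (absorbing the constant $T(0)$ terms into the $\Theta$ work at each level):

$$\begin{aligned} T(n) &= T(n-1) + \Theta(n) \\ &= T(n-2) + \Theta(n-1) + \Theta(n) \\ &= \dots \\ &= \Theta(1) + \Theta(2) + \dots + \Theta(n) = \Theta\left(\frac{n(n+1)}{2}\right) = \Theta(n^2) \end{aligned}$$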

Average Case Analysis

Let’s now perform the average-case analysis of QuickSort (this is not so trivial and the proof is quite involved).

Our analysis assumes all input elements are distinct (but if duplicates exist, we can use 3-way partitioning to improve the running time too).

First, given an input array of size $n$, how many possible inputs are there? Well, infinitely many! Because there are infinitely many numbers!

But we observe that the execution of quicksort does not depend on the absolute numbers themselves but on the relative ordering of numbers within the array. In particular, if $e_i$ denotes the $i^{th}$ smallest element of the input array $A$, then the execution of quicksort depends upon the permutation of the $e_i$'s and not on the values taken by the $e_i$'s. For example, the 2 arrays below will have the same execution steps of quicksort since their permutation of $e$'s is the same.

QuickSort Permutations with Same Execution

Since we’re dealing with average running time, we need to have some assumption about the distribution of possible inputs. In our case, we define the average running time for quicksort on an input of size $n$ to be the expected running time when the input is chosen uniformly at random from the set of all $n!$ permutations (of the $e_i$'s), i.e., the expectation is over the random choices of the input.

In other words, $A(n) = \frac{1}{n!} \sum_\pi Q(\pi)$, where $Q(\pi)$ is the time complexity (or number of comparisons) when the input permutation is $\pi$.

We define $P(i)$ to be the set of all those permutations of $\{e_1, e_2, \dots, e_n\}$ that begin with $e_i$. Clearly $\bigcup_i P(i) =$ the set of all permutations, and for each $i \neq j$, $P(i) \cap P(j) = \phi$. This means that the $P(i)$'s form a partition of the input space.

Partition of Input Space

Since each of the $n!$ permutations is equally likely, the size of each $P(i)$ is the same and equal to $(n-1)!$. So, each $P(i)$ constitutes $1/n$ of all the permutations of the total input space.

We define $G(n, i)$ to be the average running time of QuickSort over $P(i)$. Clearly, $A(n) = \frac{1}{n} \sum_{i=1}^n G(n, i)$ (that is, the average running time over all inputs is the average of the average running times over the groups of $(n-1)!$ permutations, since each group is equally likely). We can prove this formally as follows:

Observe that $G(n, i) = \frac{1}{(n-1)!} \sum_{\pi \in P(i)} Q(\pi)$

Now,

\begin{equation} \begin{split} A(n) &= \dfrac{1}{n!} \sum_\pi Q(\pi) \\ &= \dfrac{1}{n!} \left ( \sum_{\pi \in P(1)} Q(\pi) + \sum_{\pi \in P(2)} Q(\pi) + \dots + \sum_{\pi \in P(n)} Q(\pi) \right ) \\ &= \dfrac{1}{n} \left ( \dfrac{1}{(n - 1)!}\sum_{\pi \in P(1)} Q(\pi) + \dfrac{1}{(n - 1)!} \sum_{\pi \in P(2)} Q(\pi) + \dots + \dfrac{1}{(n - 1)!} \sum_{\pi \in P(n)} Q(\pi) \right ) \\ &= \dfrac{1}{n} \sum_{i=1}^n G(n, i) \end{split} \end{equation}

Now, we need to derive an expression for $G(n, i)$. For this purpose, we need to have a closer look at the execution of QuickSort over $P(i)$.

Mapping Permutation

We define $S(i)$ to be the set of all possible permutations of the $e$'s after partitioning takes place on an input from $P(i)$.

Consider what happens after partitioning on an input from $P(i)$. There will be $(i-1)$ elements less than the pivot, which will be placed in the first $i-1$ positions, and the remaining $(n-i)$ elements will be placed after the pivot. It should be obvious that the number of permutations possible after partitioning is less than the number of input permutations (since we’re imposing more constraints and decreasing the entropy of the space). In particular, there are $(i-1)! \times (n-i)!$ such permutations.

Observe that many different input permutations will give rise to the same permutation after partitioning takes place. The real question now is: Is the number of input permutations that map to a single permutation after partitioning the same for every such permutation? (In a mathematical sense, let $partition$ be a function. Then, we’re interested in finding out whether, for each $y \in S(i)$, the size of $\{ x : x \in P(i) \land partition(x) = y \}$ is the same.) This is called a uniform mapping.

And the answer is yes! There’s no reason why some permutations will have more preimages than others, because each input permutation in $P(i)$ is chosen uniformly at random (equally likely). In fact, there are exactly $\binom{n-1}{i-1}$ permutations from $P(i)$ that get mapped to a single permutation in $S(i)$ (why? 🤔)

Recall that $G(n, i)$ is the average running time of QuickSort over $P(i)$. Then, $G(n, i) = A(i - 1) + A(n - i) + (n - 1)$. The last term comes because we perform exactly $(n-1)$ comparisons in the partitioning step, comparing each element with the pivot $e_i$; the first two terms are justified by the uniform mapping above, which guarantees that each of the two subarrays is itself a uniformly random permutation, so their average sorting times are $A(i-1)$ and $A(n-i)$ respectively.

We have already seen that $A(n) = \frac{1}{n} \sum_{i=1}^n G(n, i)$

So, $A(n) = \frac{1}{n} \sum_{i=1}^n \left( A(i-1) + A(n-i) + n - 1 \right)$. We observe that each of the $A(i-1)$ terms will appear twice in the sum → once from the first term, once from the second term (e.g. $A(0)$ will appear when $i = 1$ as well as when $i = n$).

So, $A(n) = \frac{2}{n} \sum_{i=1}^n A(i-1) + (n-1)$. (Each $A(i-1)$ appears twice, giving the factor of $2$; we took the $(n-1)$ out of the summation, since it is added $n$ times and then divided by $n$.)

Multiplying both sides by $n$, we get:

\begin{equation} nA(n) = 2 \sum_{i=1}^n A(i-1) + n(n-1) \end{equation}

We don’t know how to solve the above recurrence relation easily, but we can try to get rid of the summation and proceed from there.

We know that this must hold true for any value of $n$. In particular, the same recurrence relation should also be true for $(n-1)$. So,

\begin{equation} (n-1)A(n-1) = 2 \sum_{i=1}^{n-1} A(i-1) + (n-1)(n-2) \end{equation}

Subtracting equation $(3)$ from $(2)$, we get: $nA(n) - (n-1)A(n-1) = 2A(n-1) + 2(n-1)$ (since all other terms in the summation get cancelled).

We can rewrite this in the following form using partial fractions (so that we can apply the telescoping method later):

$$nA(n) - (n+1)A(n-1) = 2(n-1) \\ \implies \dfrac{A(n)}{n + 1} - \dfrac{A(n-1)}{n} = \dfrac{2(n-1)}{n(n+1)} = \dfrac{4}{n+1} - \dfrac{2}{n}$$
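
The partial-fraction step can be checked by recombining the two terms over a common denominator:

$$\dfrac{4}{n+1} - \dfrac{2}{n} = \dfrac{4n - 2(n+1)}{n(n+1)} = \dfrac{2n - 2}{n(n+1)} = \dfrac{2(n-1)}{n(n+1)}$$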

We expand the series using the telescoping method as follows:

\begin{equation} \begin{split} \dfrac{A(n)}{n + 1} - \dfrac{A(n-1)}{n} &= \dfrac{4}{n+1} - \dfrac{2}{n} \\ \dfrac{A(n-1)}{n} - \dfrac{A(n-2)}{n-1} &= \dfrac{4}{n} - \dfrac{2}{n-1} \\ \dfrac{A(n-2)}{n-1} - \dfrac{A(n-3)}{n-2} &= \dfrac{4}{n-1} - \dfrac{2}{n-2} \\ \dots \\ \dfrac{A(1)}{2} - \dfrac{A(0)}{1} &= \dfrac{4}{2} - \dfrac{2}{1} \end{split} \end{equation}

Summing up all the above equations (and splitting each term of the form $\frac{4}{i}$ into $\frac{2}{i} + \frac{2}{i}$ so that one of them cancels out with the corresponding negative $\frac{2}{i}$ term), we get:

$$\dfrac{A(n)}{n+1} - \dfrac{A(0)}{1} = \dfrac{4}{n+1} - 2 + \sum_{i=2}^n \dfrac{2}{i}$$

Obviously $A(0) = 0$ (it takes $0$ comparisons to sort an array of size $0$)

So, we have: $\dfrac{A(n)}{n + 1} = \dfrac{4}{n+1} - 2 + \sum_{i=2}^n \dfrac{2}{i}$.

Then, adding the missing $i = 1$ term ($\frac{2}{1} = 2$) to the series and subtracting $2$ to compensate, we get the harmonic series form:

$$\dfrac{A(n)}{n + 1} = \dfrac{4}{n+1} - 4 + \sum_{i=1}^n \dfrac{2}{i}$$

Factoring out the $2$ from the series,

$$\dfrac{A(n)}{n + 1} = \dfrac{4}{n+1} - 4 + 2 \sum_{i=1}^n \dfrac{1}{i}$$

This gives us the harmonic series, and so,

$$\dfrac{A(n)}{n + 1} = \dfrac{4}{n+1} - 4 + 2 H(n)$$

where $H(n)$ is the harmonic series up to $n$ terms.

Multiplying both sides by $(n+1)$,

$$A(n) = 4 - 4(n+1) + 2(n+1)H(n)$$

And we know that $H(n) \approx \ln(n) + \gamma$, where $\gamma \approx 0.5772$ is the Euler-Mascheroni constant.

Substituting, we find that $A(n) \approx 1.39\, n\log_2 n$.

We care about the constant in this case because we want to show that the constant factor is not very large, hence, QuickSort is fast.
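
As a quick numerical sanity check (our own illustrative script, not part of the original analysis), we can compute $A(n)$ exactly from the recurrence $A(n) = \frac{2}{n}\sum_{i=1}^{n} A(i-1) + (n-1)$ and compare it against the closed form $A(n) = 4 - 4(n+1) + 2(n+1)H(n)$ derived above:

```python
import math

def average_comparisons(n_max):
    """Exact A(n) from the recurrence A(n) = (2/n) * (A(0) + ... + A(n-1)) + (n - 1)."""
    A = [0.0] * (n_max + 1)
    prefix = 0.0                                   # running sum A(0) + ... + A(n-1)
    for n in range(1, n_max + 1):
        A[n] = 2.0 * prefix / n + (n - 1)
        prefix += A[n]
    return A

N = 1000
A = average_comparisons(N)
H = sum(1.0 / i for i in range(1, N + 1))          # harmonic number H(N)
closed_form = 4 - 4 * (N + 1) + 2 * (N + 1) * H    # closed form from the derivation above

print(A[N], closed_form)                           # the two values agree (up to floating-point error)
# The leading term is 2(N+1)H(N) ~ 2N ln N = (2 ln 2) N log2(N) ~ 1.39 N log2(N);
# at moderate N the lower-order -4N term is still visible.
```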

QuickSort vs. MergeSort

QuickSort is still the most preferred algorithm in practice. This is because of the overhead of copying the array in MergeSort (to perform the merging operation) and higher frequency of cache misses in case of MergeSort.

What makes QuickSort popular is that as $n$ increases, the chances of deviation from the average case decrease (which means that it performs close to the average-case $n\log n$ with very high probability as $n$ increases). Hence, the reliability of QuickSort increases as $n$ increases.

However, there is a strong assumption here: the input permutations must be random for QuickSort to be reliably fast. In other words, Fixed Pivot QuickSort is distribution sensitive, i.e., time taken depends on the initial input permutation (not just input size!).

Oftentimes, real data is not random and it’ll be quite unfortunate if QuickSort performs in $O(n^2)$ on real data. So, we face a difficult question: can we make QuickSort distribution-insensitive?

And, the answer is YES! We’ll take a look at how to achieve this next.

Analysis of Random-Pivot QuickSort

We have seen that using a fixed pivot makes the running time dependent on the input permutation, which can be set by an adversary. Our goal in using a randomised version of QuickSort is to be able to guarantee an expected running time of $O(n \log n)$ regardless of the input permutation. In some sense, we are transferring the randomness from the choice of input to the choice of pivot, which is internal to the algorithm.

This is necessary because often, real data is not random → suppose we have a nearly sorted array, then if we naively pick the first element as the pivot, we might end up with $O(n^2)$ time (in the case of an already sorted array).

All algorithms we’ve seen so far have been deterministic. That is, given a fixed input, the algorithm follows the same sequence of execution of instructions. Now, we’re looking at a randomised algorithm, i.e., the sequence of instructions depends not only on the input but also on some random bits, which potentially change each time the function is called or the program is run. Since the sequence of instructions executed is different each time, the output as well as the running time is a function of the random bits as well as the input chosen.

This means that sometimes the output produced may be incorrect/unexpected, and sometimes the running time may be longer than other times.

theorem [Colin McDiarmid, 1991]:

Probability that the run time of Randomised QuickSort exceeds the average by $x\%$ is $n^{-\frac{x}{100} \ln(\ln(n))}$.

In particular, the probability that the run time of randomised quicksort is double the average for $n \geq 10^6$ is about $10^{-15}$. This is a very, very low probability. So, with very high confidence/probability, we can say that QuickSort will perform well in nearly all cases.

Very often, randomised algorithms are much simpler to understand (and code!) than regular algorithms. And they’re also often much faster!

⭐ There are 2 types of randomised algorithms:

  1. Randomised Las Vegas Algorithms - Output is always correct, but the running time is a random variable (e.g. QuickSort, QuickSelect)
  2. Randomised Monte Carlo Algorithms - Output may be incorrect with some small probability, running time is deterministic (e.g. Freivald’s algorithm for matrix multiplication)

Let's revisit a familiar problem, Matrix Multiplication, to see the power (and elegance) of randomised algorithms.

Matrix Multiplication

Problem Definition: Given 3 matrices, $A, B, C$, decide whether $A \cdot B = C$

Deterministic algorithm: Multiply $A$ and $B$ (which takes $O(n^{2.37})$ time) and then check equality (which takes $O(n^2)$ time)

Randomised algorithm (Freivald’s algorithm): Pick $x \in \{0, 1\}^n$ uniformly at random, and check whether $A \cdot (B \cdot x) = C \cdot x$ in $O(n^2)$ time. We can repeat this $k$ times independently. Then, the time complexity is $O(kn^2)$. It can be shown that the error probability is at most $2^{-k}$ for any $k$ we desire, i.e., the probability that the algorithm will claim that $A \cdot B = C$ even when they’re actually unequal is at most $2^{-k}$ (false positive rate). Obviously, it can never have false negatives because if indeed $A \cdot B = C$, then for any vector $x$ the equality will hold true and the algorithm will always output “YES” (and so it’ll be correct).

Observe that Freivald’s algorithm is very easy to code and understand, and it also has lower time complexity than the best known matrix multiplication algorithm. So, in practice, we often use it instead (when we can afford to be wrong with a very small probability and time is more important than being 100% correct).

So, Freivald’s algorithm is a randomised Monte Carlo algorithm.
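
To see how simple the algorithm is, here is a minimal sketch of the randomised check described above (our own illustration using NumPy; the function and parameter names are assumptions):

```python
import numpy as np

def probably_equal(A, B, C, k=20, rng=None):
    """Monte Carlo check of A @ B == C with one-sided error, using k random 0/1 vectors."""
    rng = np.random.default_rng() if rng is None else rng
    n = B.shape[1]
    for _ in range(k):
        x = rng.integers(0, 2, size=n)                 # random vector in {0, 1}^n
        if not np.array_equal(A @ (B @ x), C @ x):     # two matrix-vector products: O(n^2) per check
            return False                               # definitely A @ B != C
    return True                                        # all checks passed; wrong with probability <= 2^(-k)

rng = np.random.default_rng(0)
A = rng.integers(0, 10, (300, 300))
B = rng.integers(0, 10, (300, 300))
C = A @ B
print(probably_equal(A, B, C))    # True
C[0, 0] += 1
print(probably_equal(A, B, C))    # False (with very high probability)
```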

think

Question: Why do we need to take $x \in \{0, 1\}^n$? We can take any vector of the correct dimension, i.e., $n \times 1$, right? In fact, if the elements of $x$ are chosen uniformly at random from a larger set, the probability of being wrong will be even lower, right?

Ans: YES! You can pick any vector $x$. In fact, if you don’t limit yourself to only $0$'s and $1$'s, then the probability of false positives is much, much lower. In theory and literature, we use a 0/1 vector as the probability is easier to analyse - but in practice, to increase the probability of catching an inequality, we can use any range of values for $x$ (the larger the better).

Analysis of QuickSort using Balls into Bins

We’ll now start to analyse the running time of randomised quicksort by trying to understand the balls and bins experiment. Suppose there are $m$ balls and $n$ bins, and each ball selects its bin uniformly at random and independent of other balls and falls into it. There are certain questions that we might have:

  1. What is the expected number of balls in any bin? (obviously $m/n$)
  2. What is the expected number of balls in the heaviest bin? (You can show that if $m = n$, then this is $\Theta(\frac{\log n}{\log \log n})$, as discussed in tutorial)
  3. What is the probability that there is at least one empty bin? (You can use the Principle of Inclusion-Exclusion here)

Note that all these have very important applications in Computer Science, especially in hashing and load balancing in distributed environments.

Let’s try to answer the third question (though we won’t finish it, since we’ll get side-tracked midway and start focusing more on QuickSort once we understand the concept behind how to solve such problems).

What is the probability that there is at least one empty bin? Well, the size of the sample space is $|S| = n^m$ (each of the $m$ balls can go into any of the $n$ bins). Let us define the event $E_j^i$: the $j^{th}$ ball falls into the $i^{th}$ bin. Then,

  1. $E_j^i$ and $E_j^k$ are disjoint when $i \neq k$ (the $j^{th}$ ball can only go into one bin - either $i$ or $k$ but not both)
  2. $E_j^i$ and $E_l^k$ are independent when $j \neq l$ (since balls choose their bins independently)
  3. $E_j^i$ and $E_l^i$ are independent when $j \neq l$ (for the same reason)

Then, $P[E_j^i] = \frac{1}{n}$. So, $P[\bar{E_j^i}] = 1 - \frac{1}{n}$.

Then, the probability that the $i^{th}$ bin is empty: $P[\bar{E_1^i} \cap \bar{E_2^i} \cap \dots \cap \bar{E_m^i}] = P[\bar{E_1^i}] \times P[\bar{E_2^i}] \times \dots \times P[\bar{E_m^i}] = \left( 1 - \frac{1}{n} \right)^m$

Similarly, the probability that the $i^{th}$ and $k^{th}$ bins are empty = the probability that all balls fall into neither of those bins = $\left(1 - \frac{2}{n}\right)^m$

In general, the probability that a specific set of $l$ bins is empty = $\left(1 - \frac{l}{n}\right)^m$

Now, we try to express the event that there is at least one empty bin as the union of some events. Define $A_i$: the $i^{th}$ bin is empty. Then, the event “there is at least one empty bin” = $\bigcup_{i=1}^n A_i$.

Then, we can just apply inclusion-exclusion to solve this problem! (Obviously this will give a very complicated expression - because the probability itself is not so simple to calculate - there is no other simple way of doing this)

For now, let’s look at a different question: In the execution of randomised QuickSort, what is the probability that $e_i$ is compared with $e_j$? We can assume that all elements are distinct (if not, break the ties arbitrarily). Recall that $e_i$ denotes the $i^{th}$ smallest element in the array $A$.

Obviously we cannot explore the sample space associated with Randomised QuickSort because the sample space consists of all recursion trees (rooted binary trees on $n$ nodes). We’d have to add the probability of each recursion tree in which $e_i$ is compared with $e_j$, which is not feasible!

Instead, we’ll try to view randomised quicksort from the perspective of $e_i$ and $e_j$. To do this, we visualize elements $e_1, \dots, e_n$ arranged from left to right in increasing order of values. This visualization ensures that the two subarrays which we sort recursively lie to the left and right of the pivot. In this way, we can focus on the subarray containing $e_i$ and $e_j$ easily. Note that this visualisation is just for the sake of analysis. It will be grossly wrong if you interpret this as if we are sorting an already sorted array.

Recall that during the partitioning subroutine of QuickSort, elements are only compared with the pivot. In particular, if we are partitioning an array of size $n$, there will be exactly $n - 1$ comparisons performed - between the pivot and each of the other elements (to decide where that element should be placed).

Also, notice that if we pick a pivot $e_k$ such that $i < k < j$, then after partitioning, $e_i$ and $e_j$ will be on opposite sides of the pivot and so they will never be compared.

Let $S_{ij}$ denote the set of elements $e_i, e_{i+1}, \dots, e_j$ (i.e., the elements with ranks $i$ through $j$, inclusive). Then, we observe that $e_i$ and $e_j$ get compared during an instance of randomised quicksort if and only if the first pivot from $S_{ij}$ is either $e_i$ or $e_j$.

Another observation (which should be trivial) is that any 2 elements are compared at most once (because after partitioning, the pivot is not included in either subarray and so it can never be compared again - which makes sense since its absolute position in the sorted array is already fixed, so there’s no point in comparing it to other elements after partitioning anyway).

We define $B_i$: the first pivot element selected from $S_{ij}$ during randomised quicksort is $e_i$, and we define $B_j$: the first pivot element selected from $S_{ij}$ during randomised quicksort is $e_j$. Then, (as we’ve seen), $P[e_i \text{ and } e_j \text{ get compared}] = P[B_i \cup B_j]$

Obviously $B_i$ and $B_j$ are disjoint (since only one of $e_i$ and $e_j$ can be chosen as the first pivot in the subarray $S_{ij}$). So, $P[B_i \cup B_j] = P[B_i] + P[B_j]$.

We’re assuming that pivots are chosen uniformly at random. There are $j - i + 1$ elements in $S_{ij}$ and so $P[B_i] = \frac{1}{j - i + 1} = P[B_j]$. Then, we have $P[B_i \cup B_j] = \frac{2}{j - i + 1}$

In conclusion, $P[e_i \text{ and } e_j \text{ get compared}] = \frac{2}{j - i + 1}$. This is quite an interesting result! It suggests that:

  1. The probability depends upon the rank separation $j - i + 1$ (how far the elements are from each other in the sorted order)
  2. The probability does NOT depend on the size of the array
  3. The probability that $e_i$ and $e_{i+1}$ are compared is $1$ (which should seem obvious because there is no other way to tell which element is smaller than the other, i.e., there is no middle element that can serve as a pivot to separate $e_i$ and $e_{i+1}$)
  4. The probability of comparison of $e_1$ and $e_n$ is $\frac{2}{n}$ (if, in the first partitioning, any pivot other than $e_1$ or $e_n$ is chosen, they will definitely be on opposite sides of the pivot and will never be compared subsequently)

Notice that the probability that 2 elements are directly compared decreases as the rank separation increases, since there is an increasing chance that their ordering can be inferred by comparing each of them to one of the elements between them. This is the key intuition.

We’ve now found an expression for the probability that any 2 elements get compared. But we haven’t found the expected number of comparisons performed by QuickSort. Let’s proceed with our analysis by considering 2 similar problems:

  1. Balls and bins Experiment: What is the expected number of empty bins if $m$ balls are thrown into $n$ bins?
  2. Randomised QuickSort: What is the expected number of comparisons performed?

Both these problems are equivalent in the sense that solving one will give us an idea on how to solve the other. So, let’s start with the balls and bins problem.

We can directly apply the definition of expectation, but we’ll get stuck because we cannot calculate the probability that exactly $i$ bins are empty in simple terms - due to inclusion-exclusion, the formula gets really messy. So, we need to come up with a clever idea.

Since viewing the random experiment in its entirety looks so complex, let’s take a microscopic view. Let’s focus on one particular bin at a time → this is the key shift in perspective that we need to solve this problem!

We define $X_i$ to be a random variable as follows:

$$X_i = \begin{cases} 1 \quad \text{if bin } i \text{ is empty} \\ 0 \quad \text{otherwise} \end{cases}$$

We call such variables that take only the values $0$ and $1$ indicator random variables (as they indicate the presence or absence of a certain event).

Then, $E[X_i] = 1 \times P(i^{th} \text{ bin is empty}) + 0 \times P(i^{th} \text{ bin is not empty})$. We already know that $P(i^{th} \text{ bin is empty}) = \left( 1 - \frac{1}{n} \right)^m$.

So, $E[X_i] = \left( 1 - \frac{1}{n} \right)^m$.

Also observe by linearity of expectation that $E[X] = E[X_1 + X_2 + \dots + X_n] = E[X_1] + E[X_2] + \dots + E[X_n]$, where $X =$ total number of empty bins.

Recall that for any 2 random variables $X, Y$ (it does not matter whether they are dependent or independent), $E[X + Y] = E[X] + E[Y]$

So, $E[X] = n \left( 1 - \frac{1}{n} \right)^m$ (since each of the bins has the same probability of being empty)

We solved our balls and bins problem! (Observe that we found $E[X]$, but to find the actual probability of there being at least one empty bin, we cannot escape using the messy inclusion-exclusion principle.)
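
A quick simulation (our own illustrative script) that compares the empirical number of empty bins with the formula $E[X] = n\left(1 - \frac{1}{n}\right)^m$:

```python
import random

def empty_bins_once(m, n):
    """Throw m balls into n bins uniformly at random; return the number of empty bins."""
    occupied = [False] * n
    for _ in range(m):
        occupied[random.randrange(n)] = True
    return occupied.count(False)

m, n, trials = 1000, 500, 2000
empirical = sum(empty_bins_once(m, n) for _ in range(trials)) / trials
theoretical = n * (1 - 1 / n) ** m      # E[X] = n (1 - 1/n)^m

print(empirical, theoretical)           # both should be close to ~67.5 for these parameters
```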

We return to solve our main problem of calculating the expected number of comparisons, $E[Y]$ (where $Y$ is the number of comparisons). Using our newfound perspective, we view randomised quicksort from the perspective of $e_i$ and $e_j$.

Recall that we’ve already established a key observation: $e_i$ and $e_j$ get compared during an instance of randomised quicksort if and only if the first pivot from $S_{ij}$ is either $e_i$ or $e_j$. And we’ve also seen that this probability is $P[e_i \text{ and } e_j \text{ get compared}] = \frac{2}{j-i+1}$.

Now, we define $Y_{ij}$ for any $1 \leq i < j \leq n$ to be an indicator random variable that indicates whether $e_i$ and $e_j$ are compared during the execution of randomised quicksort. So,

$$Y_{ij} = \begin{cases} 1 \quad \text{if } e_i \text{ and } e_j \text{ are compared during randomised quicksort} \\ 0 \quad \text{otherwise} \end{cases}$$

Then, $E[Y_{ij}] = \frac{2}{j - i + 1}$.

By definition of $Y$, we have: $Y = \sum_{i=1}^{n} \sum_{j=i+1}^n Y_{ij}$.

By linearity of expectation, $E[Y] = \sum_{i=1}^{n} \sum_{j=i+1}^n E[Y_{ij}]$

We can rewrite the sum as: $E[Y] = 2 \sum_{i=1}^n \left[ \frac{1}{2} + \frac{1}{3} + \dots + \frac{1}{n-i+1} \right]$. We can add and subtract $2n$ (adding a $1$ inside each of the $n$ brackets) so that we get the harmonic series, whose bound we know. We get, $E[Y] = 2 \sum_{i=1}^n \left[ 1 + \frac{1}{2} + \frac{1}{3} + \dots + \frac{1}{n-i+1} \right] - 2n$.

Now we use the fact that $\sum_{i=1}^n \frac{1}{i} \approx \log_e n + \gamma$, where $\gamma \approx 0.58$ is the Euler-Mascheroni constant. We can neglect the constant since we’re looking for a $\Theta(\cdot)$ bound. So, $E[Y] \approx 2[\log_e n + \log_e(n-1) + \log_e(n-2) + \dots + \log_e 1] - 2n$.

Let’s get a tight bound on $\sum_{i=1}^n \log_e i$. Using Stirling’s approximation, $\left(\frac{n}{e}\right)^n \leq n! \leq n^n$.

$$\sum_{i=1}^n \log_e i = \log_e(n!)$$

Now, $\log_e\left(\frac{n}{e}\right)^n \leq \log_e n! \leq \log_e n^n \implies n\log_e n - n \leq \log_e n! \leq n \log_e n$. We can see that $\log_e n!$ is upper-bounded by $n\log_e n$ and lower-bounded by $n\log_e n - n$, both of which are $\Theta(n \log n)$. In particular, $\log_e n! = \Theta(n\log n)$.

Hence, we get $E[Y] \approx 2n \log n - 2n = \Theta(n\log n)$

Hence, we see that in expectation, randomised QuickSort performs $\Theta(n \log n)$ comparisons and hence runs in expected $\Theta(n \log n)$ time. Notice that we didn’t consider any particular kind of input here → because the input really doesn’t matter, only the choice of pivot matters, which is randomised anyway.

There’s a subtle leap from “QuickSort performs $\Theta(n \log n)$ comparisons in expectation” to “QuickSort runs in $\Theta(n \log n)$ expected time”, but it’s easy to see once you realise that essentially all the time spent by QuickSort is in comparing elements (and performing swaps, but those can be charged to the corresponding comparisons), so there’s no other significant time spent.
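
To tie the two results together, here is a small instrumented version of randomised quicksort (our own sketch) that counts element-pivot comparisons and compares the empirical average against $\sum_{i<j} \frac{2}{j-i+1}$:

```python
import random

def quicksort_comparisons(data):
    """Sort a copy of data with random pivots; return the number of element-pivot comparisons."""
    arr = list(data)
    comparisons = 0

    def sort(lo, hi):
        nonlocal comparisons
        if lo >= hi:
            return
        p = random.randint(lo, hi)                  # random pivot
        arr[p], arr[hi] = arr[hi], arr[p]
        pivot, store = arr[hi], lo
        for k in range(lo, hi):                     # (hi - lo) comparisons with the pivot
            comparisons += 1
            if arr[k] <= pivot:
                arr[store], arr[k] = arr[k], arr[store]
                store += 1
        arr[store], arr[hi] = arr[hi], arr[store]
        sort(lo, store - 1)
        sort(store + 1, hi)

    sort(0, len(arr) - 1)
    return comparisons

n, trials = 200, 500
data = list(range(n))                               # the input permutation does not matter
empirical = sum(quicksort_comparisons(data) for _ in range(trials)) / trials
expected = sum(2 / (j - i + 1) for i in range(1, n + 1) for j in range(i + 1, n + 1))

print(empirical, expected)                          # the empirical average should be close to the theoretical sum
```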

We’re done with the analysis of randomised QuickSort! We’ll now look at another common problem - order statistics - and some common algorithms to solve it.

Order Statistics

Problem Definition: Given an unsorted list of $n$ elements, the $i^{th}$ order statistic is the $i^{th}$ smallest element (that element is said to have rank $i$)

Algorithms we can use:

  1. Naive: Sort the array and then report the $i^{th}$ element → $\Theta(n \log n)$

It can be shown that finding an order statistic requires $\Omega(n)$ time because you need to “see” each element at least once: what if you miss an element and that happened to be the required order statistic?

So, can we do it in $O(n)$? YES!

We’ll look at 2 solutions - one is a worst-case linear time algorithm, called median of medians, and the other is a randomised divide-and-conquer algorithm that takes expected $O(n)$ time, QuickSelect.

QuickSelect

  1. Pick a pivot at random and partition about it (all elements on the left side of the pivot are less than the pivot, and all elements on the right side of the pivot are greater than the pivot)
  2. If the rank (final position) of the pivot is $>$ the target rank, recurse on the left subarray; if it equals the target rank, return the pivot
  3. If the rank of the pivot is $<$ the target rank, recurse on the right subarray (searching for the correspondingly reduced rank)

Each partitioning is $O(n)$ and it can be shown that if our choice of pivot is “decent”, the algorithm runs in $O(n)$, which is why the expected running time is $O(n)$. In the worst case, say the array is already sorted and our algorithm keeps picking the first element as the pivot and our target rank is $n$, then it will take us $O(n^2)$ time.

It’s very easy to implement → the partitioning algorithm can be slightly tweaked to handle duplicates too (just have 3 regions instead of 2 → < pivot, = pivot, > pivot).
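
A minimal sketch of QuickSelect as described above (our own illustration; here the partitioning is done in place over the original index range, so the target rank stays absolute rather than being reduced on the right-hand recursion):

```python
import random

def quickselect(data, rank):
    """Return the rank-th smallest element (1-indexed) of data in expected O(n) time."""
    arr = list(data)
    lo, hi = 0, len(arr) - 1
    while True:
        if lo == hi:
            return arr[lo]
        p = random.randint(lo, hi)                  # random pivot
        arr[p], arr[hi] = arr[hi], arr[p]
        pivot, store = arr[hi], lo
        for k in range(lo, hi):                     # partition around the pivot
            if arr[k] < pivot:
                arr[store], arr[k] = arr[k], arr[store]
                store += 1
        arr[store], arr[hi] = arr[hi], arr[store]
        pivot_rank = store + 1                      # 1-indexed rank of the pivot
        if pivot_rank == rank:
            return arr[store]
        elif pivot_rank > rank:
            hi = store - 1                          # the answer lies in the left part
        else:
            lo = store + 1                          # the answer lies in the right part

print(quickselect([7, 1, 5, 9, 3], 2))   # 3 (the 2nd smallest element)
```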

Median of Medians

Worst case time complexity: $O(n)$

Median of medians solves the problem of “bad pivots” in QuickSelect. Before partitioning, we ensure that our chosen pivot is “good”, i.e., it divides the array into 2 subarrays of roughly similar sizes, so that no matter which side we recurse on, our subproblem is much smaller than before.

Algorithm:

Say we’re searching for the $i^{th}$ order statistic.

  1. Divide the $n$ elements into groups of $5$ (Why $5$? Why not $3$ or $7$? We’ll discuss this later)
  2. Find the median of each 5-element group by brute force (just use if-else, or you can even sort those 5 elements and report the median)
  3. Recursively use the median of medians algorithm to find the median $x$ of the $\lfloor n/5 \rfloor$ group medians, to be used as the pivot
  4. Partition around the pivot $x$. Let $k = rank(x)$
  5. If $k = i$, return $x$.
  6. If $k > i$, recursively use median of medians on the left subarray to find the $i^{th}$ smallest element
  7. If $k < i$, recursively use median of medians on the right subarray to find the $(i-k)^{th}$ smallest element.

How can we prove that this algorithm works in $O(n)$?

It’s best to visualise the size of the subarrays based on our chosen pivot:

Median of Medians

So, the size of each subarray is no more than $7n/10$. Since $7n/10 < 3n/4$, we can write our recurrence relation as:

$$T(n) \leq T(n/5) + T(3n/4) + \Theta(n)$$
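
Where does the $7n/10$ bound come from? Roughly (ignoring floor/ceiling details), at least half of the $n/5$ group medians are $\geq x$, and each such median brings along 2 more elements of its group that are $\geq x$, so

$$\#\{\text{elements} \geq x\} \ \geq\ 3 \cdot \frac{1}{2} \cdot \frac{n}{5} = \frac{3n}{10},$$

and symmetrically at least about $3n/10$ elements are $\leq x$. Hence each recursive call receives at most about $7n/10$ elements.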

Explanation of the time complexity:

  1. Finding the median of each group of $5$ is $O(1)$ and there are about $n/5$ such groups → which takes $O(n)$ time in total
  2. We use median of medians to find the median of the $n/5$ group medians → which takes $T(n/5)$ time
  3. Partitioning about the chosen pivot takes $O(n)$
  4. Recursively calling median of medians on one of the two sides - since we’re considering the worst case, we always take the bigger side, which has at most $3n/4$ elements → $T(3n/4)$

We can solve the above recurrence relation by using the substitution method: guess that $T(n) \leq cn$ for some constant $c$.

After substituting, we get $T(n) \leq \frac{1}{5}cn + \frac{3}{4}cn + \Theta(n) = \frac{19}{20}cn + \Theta(n) = cn - \left(\frac{1}{20}cn - \Theta(n)\right) \leq cn$, if our chosen $c$ is large enough to handle both the $\Theta(n)$ term as well as the initial conditions.

So, we have shown that $T(n) = O(n)$ in the worst case (notice that this is a stronger guarantee on the running time as compared to QuickSelect). In practice, this algorithm runs slowly because of the large constant factor in front of $n$, and the randomised QuickSelect algorithm is far more practical (and easier to code).
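
For completeness, here is a compact (unoptimised, not in-place) Python sketch of the median of medians selection described above; it favours clarity over the constant factor:

```python
def median_of_medians_select(arr, i):
    """Return the i-th smallest element (1-indexed) of arr in worst-case O(n) time."""
    if len(arr) <= 5:
        return sorted(arr)[i - 1]                   # brute force on tiny inputs

    # Steps 1-3: medians of groups of 5, then recursively find their median as the pivot.
    groups = [arr[k:k + 5] for k in range(0, len(arr), 5)]
    medians = [sorted(g)[len(g) // 2] for g in groups]
    pivot = median_of_medians_select(medians, (len(medians) + 1) // 2)

    # Step 4: three-way partition around the pivot (also handles duplicates).
    less = [x for x in arr if x < pivot]
    equal = [x for x in arr if x == pivot]
    greater = [x for x in arr if x > pivot]

    # Steps 5-7: recurse on the side that contains the i-th smallest element.
    if i <= len(less):
        return median_of_medians_select(less, i)
    if i <= len(less) + len(equal):
        return pivot
    return median_of_medians_select(greater, i - len(less) - len(equal))

print(median_of_medians_select([9, 1, 8, 2, 7, 3, 6, 4, 5], 4))   # 4
```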

Coming back to the question of group-size of $5$:

  1. Can we use a group-size of $3$ instead? No! Then the algorithm will run in $O(n \lg n)$ as the recurrence relation becomes: $T(n) = T(n/3) + T(2n/3) + O(n)$, which should remind you of MergeSort (except that the halves are not equal) → notice that $n/3 + 2n/3$ sum to $n$ and, in general, recurrences of this form take $O(n\lg n)$ time.
  2. Can we use a group-size of $7$ instead? Yes! The algorithm will still run in $O(n)$ time (but the constant factor may be large since you need to find the median of 7 elements by using if-else statements or by sorting them)