Kth order statistic
Wikipedia defines the kth order statistic as follows:
"In statistics, the kth order statistic of a statistical sample is equal to its kth-smallest value."
A trivial way to find the kth order statistic is to simply sort the input vector and take the kth element. This approach takes \( O(n \log{n}) \) time.
A better approach exploits the fact that heaps can be constructed in linear time from an input array.
Efficient O(n) heapify
There are two ways to recursively construct a heap: siftDown or siftUp.
- siftUp is the simpler of the two: it adds one element at a time to the heap and moves it up the tree to its correct position (possibly all the way to position 0, the root). Since each insertion can require up to \( \log{n} \) swaps, this approach takes \( O(n \log{n}) \) for \( n \) elements.
- siftDown is slightly more involved but performs better. It starts from a broken heap in array form and, since the child at index \( i \) has its parent at index \( p = i / 2 \), calling siftDown on each parent index \( p \) recursively until the heap property holds yields a different complexity.
Suppose we have a complete binary tree of height \( h \), and count levels from the bottom, so the leaves sit at level \( j = 0 \) and the root at level \( j = h \). There are \( 2^{h-j} \) nodes at level \( j \), and each of them can descend up to \( j \) levels during siftDown. The total number of swaps is therefore
\[T(n) = \sum_{j=0}^{h}{j2^{h-j}} = \sum_{j=0}^{h}{j \frac{2^{h}}{2^{j}}} = 2^{h}\sum_{j=0}^{h}{ \frac{j}{2^{j}}}\]To bound this sum, a couple of observations are needed. The geometric series for any constant \( |x| < 1 \) is
\[\sum_{j=0}^\infty{x^j} = \frac{1}{1-x}\]Now if we take the derivative of this formula and multiply both sides by \( x \) we get
\[\sum_{j=0}^\infty{jx^{j-1}} = \frac{1}{(1-x)^2} \quad\Longrightarrow\quad \sum_{j=0}^\infty{jx^{j}} = \frac{x}{(1-x)^2}\]Plugging \( x = 1/2 \) into this formula gives exactly the sum we were looking for:
\[\sum_{j=0}^\infty{\frac{j}{2^j}} = 2\]Since the finite sum is bounded above by its infinite counterpart, we get
\[T(n) = 2^{h}\sum_{j=0}^{h}{ \frac{j}{2^{j}}} \le 2^{h}\sum_{j=0}^{\infty}{ \frac{j}{2^{j}}} \le 2^h \cdot 2 = 2^{h+1}\]Since the number of nodes in the tree is \( 2^{h+1} - 1 \) we have \( T(n) \le n + 1 \in O(n) \).
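As a quick numeric sanity check (illustrative only, not part of the proof), the exact swap bound \( \sum_{j=0}^{h} j \, 2^{h-j} \) can be computed for small heights and compared against \( 2^{h+1} \):

```cpp
#include <cassert>

// Illustrative check: the total siftDown swap count
// sum_{j=0}^{h} j * 2^(h-j) never exceeds 2^(h+1).
long long swapBound(int h) {
    long long total = 0;
    for (int j = 0; j <= h; ++j)
        total += static_cast<long long>(j) * (1LL << (h - j));
    return total;
}
```

For example, a tree of height 3 needs at most \( 0 \cdot 8 + 1 \cdot 4 + 2 \cdot 2 + 3 \cdot 1 = 11 \) swaps, comfortably below \( 2^{4} = 16 \).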
Bottom line is: heapify by siftDown is faster. Also notice that, since the child at index \( i \) has parent \( p = i / 2 \), the siftDown recursion only needs to run from index \( \lfloor n/2 \rfloor \) down to \( 0 \).
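A minimal sketch of this heapify strategy might look as follows. Note that it uses 0-based indexing (parent of \( i \) at \( (i-1)/2 \)) rather than the 1-based \( p = i/2 \) convention used above, and builds a max-heap; the function names are illustrative:

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Restore the heap property for the subtree rooted at index i,
// assuming both of i's subtrees are already valid max-heaps.
void siftDown(std::vector<int>& a, std::size_t i, std::size_t n) {
    while (true) {
        std::size_t largest = i;
        std::size_t l = 2 * i + 1, r = 2 * i + 2;
        if (l < n && a[l] > a[largest]) largest = l;
        if (r < n && a[r] > a[largest]) largest = r;
        if (largest == i) return;   // heap property holds here
        std::swap(a[i], a[largest]);
        i = largest;                // keep descending
    }
}

// O(n) heapify: only indices n/2 - 1 .. 0 can have children,
// so siftDown starts from the last parent and walks back to the root.
void heapify(std::vector<int>& a) {
    for (std::size_t i = a.size() / 2; i-- > 0;)
        siftDown(a, i, a.size());
}
```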
Heap-powered kth order statistic
Exploiting this fact, the kth order statistic can be extracted by building a heap in \( O(n) \) and then repeatedly extracting the minimum and reheapifying. Each extraction from the root costs \( O(\log{n}) \), so the total time is \( O(n + k \log{n}) \).
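This scheme can be sketched with the standard library's heap primitives (a sketch, assuming \( k \) is 1-based): `std::make_heap` is the linear-time heapify, and each `std::pop_heap` is the \( O(\log{n}) \) reheapify step.

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// kth-smallest element via a min-heap: O(n) build + k * O(log n) pops.
// k is 1-based here (k = 1 returns the minimum) - an assumption of this sketch.
int kthSmallest(std::vector<int> v, std::size_t k) {
    // std::greater turns the default max-heap into a min-heap. O(n).
    std::make_heap(v.begin(), v.end(), std::greater<int>());
    for (std::size_t i = 1; i < k; ++i) {
        // Move the current minimum to the back, then discard it. O(log n).
        std::pop_heap(v.begin(), v.end(), std::greater<int>());
        v.pop_back();
    }
    return v.front();   // the kth-smallest element
}
```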
Efficient linear-time Kth order statistic
One last method for the kth order statistic proves even faster than the ones analyzed a few paragraphs ago. The algorithm builds on quickselect and improves on its \( O(n^2) \) worst case.
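A plain quickselect could be sketched as follows (the names and the Lomuto partition with the last element as pivot are illustrative choices; \( k \) is a 0-based order statistic here):

```cpp
#include <utility>
#include <vector>

// Lomuto partition around the last element; returns the pivot's
// final (sorted) position within [lo, hi].
int partition(std::vector<int>& v, int lo, int hi) {
    int pivot = v[hi], i = lo;
    for (int j = lo; j < hi; ++j)
        if (v[j] < pivot) std::swap(v[i++], v[j]);
    std::swap(v[i], v[hi]);   // pivot lands at its final position
    return i;
}

// Average O(n); degrades to O(n^2) when the pivots are consistently bad.
int quickSelect(std::vector<int>& v, int lo, int hi, int k) {
    if (lo == hi) return v[lo];
    int mid = partition(v, lo, hi);
    if (k == mid) return v[mid];
    return k < mid ? quickSelect(v, lo, mid - 1, k)
                   : quickSelect(v, mid + 1, hi, k);
}
```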
The critical point of the algorithm regards the choice of the pivot element. A bad pivot element would increase running time while a good pivot element (i.e. one that evenly divides the elements) would reduce it.
The median of medians variant of the above algorithm exploits the ability to find an approximate median of the elements in linear time, yielding a good pivot from which to start partitioning.
The pseudocode of the algorithm is roughly as follows (note that when recursing into the upper half, k must be shifted past the pivot's position):

quickSelect(vector, k)
    subgroups = divide vector into subgroups of 5 elements each
    medians = for every subgroup find its median
    // Grab the median of medians
    medianOfMedians = quickSelect(medians, medians.size() / 2)
    mid = partition(vector, medianOfMedians)
    if mid is k
        return vector[mid]
    if k is less than mid
        return quickSelect(lower_half(vector), k)
    else
        return quickSelect(upper_half(vector), k - mid - 1)
Since finding the medians of all the 5-element subgroups takes \( O(n) \) overall and partitioning also takes \( O(n) \), the interesting analysis lies in the recursive steps. The question to ask is: what is the maximum number of elements that can be greater (or less) than the medianOfMedians?
Let’s focus on the number of elements guaranteed to be greater than it: approximately half of the \( \lceil \frac{n}{5} \rceil \) groups have a median greater than medianOfMedians, and each such 5-element group contributes at least 3 elements greater than medianOfMedians (its median plus the two elements above it). Thus at least
\[3 \Bigg (\Bigg \lceil \frac{1}{2} \bigg \lceil \frac{n}{5} \bigg \rceil \Bigg \rceil \Bigg) - lg \ge \frac{3n}{10} - lg\]elements are greater, where \( lg \) denotes the number of elements in a final group with fewer than 5 elements. In the worst case the function therefore recurs on \( n - (\frac{3n}{10} - lg) = \frac{7n}{10} + lg \) elements, which is bounded by \( \frac{7n}{10} \) (up to a constant absorbed in the \( O(n) \) term) once the number of elements exceeds a constant \( c \), so
\[T(n) \le T\!\left(\frac{n}{5}\right) + T\!\left(\frac{7n}{10}\right) + O(n) \quad \text{if } n \ge c\]which, by the substitution method, solves to \( O(n) \) in the worst case (intuitively, \( \frac{1}{5} + \frac{7}{10} < 1 \), so the recursive work shrinks geometrically). Complete code follows.
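A possible implementation of the full median-of-medians quickselect is sketched below (the names are illustrative, not the original code; \( k \) is 0-based, and subvector copies are used for clarity rather than in-place index arithmetic):

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Partition v around a given pivot value; return the pivot's final index.
static int partitionByValue(std::vector<int>& v, int pivotValue) {
    // Move one occurrence of the pivot to the end, then Lomuto-partition.
    auto it = std::find(v.begin(), v.end(), pivotValue);
    std::iter_swap(it, v.end() - 1);
    int i = 0, hi = static_cast<int>(v.size()) - 1;
    for (int j = 0; j < hi; ++j)
        if (v[j] < pivotValue) std::swap(v[i++], v[j]);
    std::swap(v[i], v[hi]);
    return i;
}

// Worst-case O(n) selection of the 0-based kth-smallest element.
int kth(std::vector<int> v, int k) {
    if (v.size() <= 5) {            // base case: sort the tiny vector
        std::sort(v.begin(), v.end());
        return v[k];
    }
    // Median of each 5-element subgroup (the last group may be smaller).
    std::vector<int> medians;
    for (std::size_t i = 0; i < v.size(); i += 5) {
        std::size_t end = std::min(i + 5, v.size());
        std::vector<int> group(v.begin() + i, v.begin() + end);
        std::sort(group.begin(), group.end());
        medians.push_back(group[group.size() / 2]);
    }
    // Recursively grab the median of medians as the pivot.
    int pivot = kth(medians, static_cast<int>(medians.size()) / 2);
    int mid = partitionByValue(v, pivot);
    if (k == mid) return v[mid];
    if (k < mid) {
        std::vector<int> lower(v.begin(), v.begin() + mid);
        return kth(lower, k);
    }
    std::vector<int> upper(v.begin() + mid + 1, v.end());
    return kth(upper, k - mid - 1); // shift k into the subvector
}
```

In production C++ one would normally reach for `std::nth_element`, which provides the same selection semantics with average linear time.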