For the second edition of the Weekend Warrior Series, we have a project that is slightly more technical – a couple of implementations of some random sampling techniques! I was trying to learn about database query optimizations and came across an algorithm called reservoir sampling. Since I haven’t coded anything outside of work in a while and this looked pretty straightforward, I decided to implement the technique and its weighted companion.
Projected Time: 5 hours for research/implementation + 3 hours for the blog post
When the data is BIG
In today’s world, there exists an astronomical amount of data, and efficiently analyzing such large volumes is virtually impossible with the tools we currently have. Because of that, a common approach is to use random samples to make estimates or projections, and for the most part, these approximations suffice. However, the approximation is only as good as the sample data, so how do we efficiently and truly randomly sample from enormous amounts of data? Enter reservoir sampling.
This algorithm allows us to randomly choose k samples from a dataset whose size is unknown, or from a stream of elements. It’s commonly used in places like database query planning, search engine request optimization, audits of financial records, medical studies, and more!
Implementation
Even though it is far easier to implement these algorithms in Python, I decided to go with Go just out of familiarity. One nice part of Go is that `go get` supports installing a project as an executable as well as a package in code. I took advantage of that and wrote the project in a way that lets users do both: run the `rsampling` binary, or use the `Reservoir` and `WeightedReservoir` structs in their own projects.
Before I talk about the algorithms, I want to touch quickly on iterators. Random sampling is most often used on a dataset of unknown size, oftentimes sampling from a stream, so I felt that having the reservoir sample from an `Iterator` fit best. Supporting iterators of many types was a bit cumbersome, so I’m quite excited about the latest proposal for introducing generics into the language.
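To make the rest of the post concrete, here is a minimal sketch of the kind of iterator I mean. The names here are hypothetical (the interface in the repo may look different), and everything flows through `interface{}` since this predates generics.

```go
package rsampling // hypothetical package name used for these sketches

// Iterator is a minimal sketch of the stream abstraction the reservoirs
// consume; Next returns the next element and false once the stream ends.
type Iterator interface {
	Next() (interface{}, bool)
}

// sliceIterator adapts an in-memory slice to Iterator, handy for testing.
type sliceIterator struct {
	items []interface{}
	pos   int
}

func (s *sliceIterator) Next() (interface{}, bool) {
	if s.pos >= len(s.items) {
		return nil, false
	}
	item := s.items[s.pos]
	s.pos++
	return item, true
}
```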
Simple Reservoir Sampling
The most obvious, naive approach to retrieving k samples from N elements is to store all of the elements in an array and then sample from random indices in that array. However, once the size of the dataset exceeds memory or the input is a stream of data, we need a more scalable solution. In comes the reservoir. Since we only care about k samples, we can build a reservoir of size k and randomly decide whether each element from the data stream should enter it. This stays within memory limits and is quite fast, with a time complexity of O(n).
The full algorithm is as follows:
create a reservoir of size k and load it with the first k elements of the stream
for each remaining element at (0-based) index i:
    j := a uniform random integer in [0 .. i] (inclusive)
    if j is between [0 .. k-1], replace reservoir[j] with stream[i]
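As a rough sketch of what that looks like in Go (using the hypothetical `Iterator` from earlier; this is an illustration, not the exact code from the repo):

```go
package rsampling

import "math/rand"

// Sample draws k uniformly random elements from the stream behind it.
// A sketch of the simple reservoir algorithm above.
func Sample(it Iterator, k int) []interface{} {
	reservoir := make([]interface{}, 0, k)
	for i := 0; ; i++ {
		item, ok := it.Next()
		if !ok {
			break
		}
		if i < k {
			// Load the reservoir with the first k elements.
			reservoir = append(reservoir, item)
			continue
		}
		// Element i survives into the reservoir with probability k/(i+1).
		if j := rand.Intn(i + 1); j < k {
			reservoir[j] = item
		}
	}
	return reservoir
}
```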
We can also show that every element in the stream has an equal probability k/n of ending up in the reservoir, with a simple proof by induction:
let n = number of elements seen so far in the stream
let x(i) = element at index i of the stream

base case (n = k):
    the first k elements are all placed in the reservoir, so Pr(element is in reservoir) = k/k = 1

inductive hypothesis:
    after n elements, Pr(element is in reservoir) = k/n

inductive step (n -> n + 1): prove that Pr(element is in reservoir) = k/(n+1)
    x(n+1) enters the reservoir with probability k/(n+1) and, if it does, it evicts
    a slot chosen uniformly at random, so for an element already in the reservoir:
        Pr(element is replaced by x(n+1)) = k/(n+1) * 1/k = 1/(n+1)
    therefore, Pr(element stays in reservoir) = k/n * (1 - 1/(n+1)) = k/n * n/(n+1) = k/(n+1)
There is also a slightly optimized version of this algorithm that I implemented, which brings the theoretical runtime down to O(k * (1 + log(n/k))) by only computing random values when an element could actually enter the reservoir. Feel free to check out the implementation and read more about the algorithm.
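A skip-based variant with that complexity is commonly known as Algorithm L. Here is a rough sketch of it under the same assumptions as the snippets above (it also glosses over the rare case where `rand.Float64` returns exactly 0); the repo's actual optimized version may differ in the details:

```go
package rsampling

import (
	"math"
	"math/rand"
)

// SampleSkip sketches the skip-based optimization (Algorithm L).
// Instead of drawing a random number for every element, it computes how
// many elements to skip before the next one enters the reservoir.
func SampleSkip(it Iterator, k int) []interface{} {
	reservoir := make([]interface{}, 0, k)
	// Fill the reservoir with the first k elements.
	for len(reservoir) < k {
		item, ok := it.Next()
		if !ok {
			return reservoir
		}
		reservoir = append(reservoir, item)
	}
	w := math.Exp(math.Log(rand.Float64()) / float64(k))
	for {
		// Elements inside the skipped gap can never enter the reservoir,
		// so no random draws are spent on them.
		skip := int(math.Floor(math.Log(rand.Float64()) / math.Log(1-w)))
		var item interface{}
		var ok bool
		for s := 0; s <= skip; s++ {
			if item, ok = it.Next(); !ok {
				return reservoir
			}
		}
		// The element right after the gap replaces a random reservoir slot.
		reservoir[rand.Intn(k)] = item
		w *= math.Exp(math.Log(rand.Float64()) / float64(k))
	}
}
```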
Weighted Reservoir Sampling
There are some situations where elements from a stream have a weight associated with their sampling probability. In the real world, we might be required to sample search engine queries or look into cache activity to analyze user experience and service performance. We can achieve an efficient solution by extending the algorithm from above with a heap. The idea is that giving each element a key derived from a uniformly generated random number and its weight produces a weighted random sample; the heap always holds the k elements with the largest (or smallest) keys seen so far.
Here is an overview of the algorithm:
let node = (key, element)
let h = min-heap<node> ordered by key
for each element in the stream:
    key = (uniform random from [0,1)) ^ (1/weight)
    if there are fewer than k elements in h:
        h.push(node(key, element))
    else if key is larger than the smallest key in h:
        pop h's top element and push node(key, element)
the samples are all the elements left in h
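Here is a sketch of that in Go using the standard library's container/heap. The types and names are illustrative rather than the repo's actual `WeightedReservoir` API, and it takes a slice instead of an iterator to keep the example short:

```go
package rsampling

import (
	"container/heap"
	"math"
	"math/rand"
)

// Weighted pairs an element with its sampling weight.
type Weighted struct {
	Item   interface{}
	Weight float64
}

// node holds an element together with its random key.
type node struct {
	key  float64
	item interface{}
}

// minHeap keeps the k nodes with the largest keys; the smallest key is on top.
type minHeap []node

func (h minHeap) Len() int            { return len(h) }
func (h minHeap) Less(i, j int) bool  { return h[i].key < h[j].key }
func (h minHeap) Swap(i, j int)       { h[i], h[j] = h[j], h[i] }
func (h *minHeap) Push(x interface{}) { *h = append(*h, x.(node)) }
func (h *minHeap) Pop() interface{} {
	old := *h
	n := len(old)
	x := old[n-1]
	*h = old[:n-1]
	return x
}

// SampleWeighted draws k samples, where an element's chance of being chosen
// is proportional to its weight. A sketch of the heap-based algorithm above.
func SampleWeighted(stream []Weighted, k int) []interface{} {
	h := &minHeap{}
	for _, w := range stream {
		// key = u^(1/weight), with u uniform in [0,1)
		key := math.Pow(rand.Float64(), 1/w.Weight)
		if h.Len() < k {
			heap.Push(h, node{key, w.Item})
		} else if key > (*h)[0].key {
			// The new key beats the smallest key currently in the heap.
			heap.Pop(h)
			heap.Push(h, node{key, w.Item})
		}
	}
	samples := make([]interface{}, 0, h.Len())
	for _, n := range *h {
		samples = append(samples, n.item)
	}
	return samples
}
```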
There are also a handful of variations on this weighted reservoir sampling algorithm, but I feel like it’s best to simply read the original white paper.
References
Cover photo: not exactly a reservoir but the Charles Bridge, Prague, Czechia