# Reservoir Sampling

22 Sep 2024I have a little library called sampling floating around for
general-purpose sampling from arbitrary foldable collections. It’s a
bit of a funny project: I originally hacked it together quickly, just
to get something done, so it’s not a very good library *qua* library –
it has plenty of unnecessary dependencies, and it’s not at all tuned
for performance. *But* it was always straightforward to use, and good
enough for my needs, so I’ve never felt any particular urge to update
it. Lately it caught my attention again, though, and I started thinking
about possible ways to revamp it, as well as giving it some much-needed
quality-of-life improvements, in order to make it more generally useful.

The library supports sampling with and without replacement in both the
equal and unequal-probability cases, from collections such as lists,
maps, vectors, etc. – again, anything with a Foldable instance.
In particular: the equal-probability, sampling-without-replacement
case depends on some code that Gabriella Gonzalez wrote for
*reservoir sampling*, a common online method for sampling from a
potentially-unbounded stream. I started looking into alternative
algorithms for reservoir sampling, to see what else was out there, and
discovered one that was *extremely* fast, *extremely* clever, and that
I’d never heard of before. It doesn’t seem to be that well-known, so I
simply want to illustrate it here, just so others are aware of it.

I learned about the algorithm from Erik Erlandson’s blog, but, as he points out, it goes back almost forty years to one J.S Vitter, who apparently also popularised the “basic” reservoir sampling algorithm that everyone uses today (though it was not invented by him).

The basic reservoir sampling algorithm is simple. We’re sampling
some number of elements from a stream of unknown size: if we haven’t
collected enough elements yet, we just dump everything we see into our
“reservoir” (i.e., the collected sample); otherwise, we just generate a
random number and do a basic comparison to determine whether or not to
eject a value in our reservoir in favour of the new element we’ve just
seen. This is evidently also known as *Algorithm R*, and can apparently
be found in Knuth’s *Art of Computer Programming*. Here’s a basic
implementation that treats the reservoir as a mutable vector:

```
algo_r len stream prng = do
reservoir <- VM.new len
loop reservoir 0 stream
where
loop !sample index = \case
[] ->
V.freeze sample
(h:t)
| index < len -> do
VM.write sample index h
loop sample (succ index) t
| otherwise -> do
j <- S.uniformRM (0, index - 1) prng
when (j < len)
(VM.write sample j h)
loop sample (succ index) t
```

Here’s a quick, informal benchmark of it. Sampling 100 elements from a stream of 10M 64-bit integers, using Marsaglia’s MWC256 PRNG, yields the following runtime statistics on my mid-2020 MacBook Air:

```
73,116,027,160 bytes allocated in the heap
14,279,384 bytes copied during GC
45,960 bytes maximum residency (2 sample(s))
31,864 bytes maximum slop
6 MiB total memory in use (0 MiB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 17639 colls, 0 par 0.084s 0.123s 0.0000s 0.0004s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0002s 0.0003s
INIT time 0.007s ( 0.006s elapsed)
MUT time 18.794s ( 18.651s elapsed)
GC time 0.084s ( 0.123s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 18.885s ( 18.781s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,890,333,621 bytes per MUT second
Productivity 99.5% of total user, 99.3% of total elapsed
```

It’s fairly slow, and pretty much all we’re doing is generating 10M random numbers.

In my experience, the most effective optimisations that can be made to a numerical algorithm like this tend to be “mechanical” in nature – avoiding allocation, cache misses, branch prediction failures, etc. I find it exceptionally pleasing when there’s some domain-specific intuition that admits substantial optimisation of an algorithm.

Vitter’s optimisation is along these lines. I find it as ingenious as
e.g. the kernel trick in the support vector machine context. The
crucial observation Vitter made is that one doesn’t need to consider
whether or not to add *every single element* he encounters to the
reservoir; the “gap” *between* entries also follows a well-defined
probability distribution, and one can just instead sample *that* in
order to determine the next element to add. Erik Erlandson points out
that when the size of the reservoir is small relative to the size of the
stream, as is typically the case, this distribution is well-approximated
by the geometric distribution – that which describes the number of coin
flips required before one observes a head (it has come up in a
couple of my previous log posts).

So: the algorithm is adjusted so that, while processing the stream, we sample how many elements to skip, and do so, then adding the next element encountered after that to the reservoir. Here’s a version of that, using Erlandson’s fast geometric approximation for sampling what Vitter calls the skip distance:

```
data Loop =
Next
| Skip !Int
algo_r_optim len stream prng = do
reservoir <- VM.new len
go Next reservoir 0 stream
where
go what !sample !index = \case
[] ->
V.freeze sample
s@(h:t) -> case what of
Next
-- below the reservoir size, just write elements
| index < len -> do
VM.write sample index h
go Next sample (succ index) t
-- sample skip distance
| otherwise -> do
u <- S.uniformDouble01M prng
let p = fi len / fi index
skip = floor (log u / log (1 - p))
go (Skip skip) sample index s
Skip skip
-- stop skipping, use this element
| skip == 0 -> do
j <- S.uniformRM (0, len - 1) prng
VM.write sample j h
go Next sample (succ index) t
-- skip (d - 1) more elements
| otherwise ->
go (Skip (pred d)) sample (succ index) t
```

And now the same simple benchmark:

```
1,852,883,584 bytes allocated in the heap
210,440 bytes copied during GC
45,960 bytes maximum residency (2 sample(s))
31,864 bytes maximum slop
6 MiB total memory in use (0 MiB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 445 colls, 0 par 0.002s 0.003s 0.0000s 0.0002s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0002s 0.0003s
INIT time 0.007s ( 0.007s elapsed)
MUT time 0.286s ( 0.283s elapsed)
GC time 0.002s ( 0.003s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.296s ( 0.293s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 6,477,345,638 bytes per MUT second
Productivity 96.8% of total user, 96.4% of total elapsed
```

Both Vitter and Erlandson estimated orders of magnitude improvement in sampling time, given we need to spend much less time iterating our PRNG; here we see a 65x performance gain, with 40x less allocation. Very impressive, and again, the optimisation is entirely probabilistic, rather than “mechanical,” in nature (indeed, I haven’t tested any mechanical optimisations to these implementations at all).

It turns out there are extensions in this spirit to unequal-probability
reservoir sampling as well, as is the method of “exponential jumps”
described in a 2006 paper by Efraimidis and Spirakis. I’ll
probably benchmark that algorithm too, and, if it fits the bill, and I
ever really *do* get around to properly updating my ‘sampling’ library,
I’ll refactor things to make use of these high-performance algorithms.
Let me know if you could use this sort of thing!