07 Oct 2024
Just FYI, I’ve dropped a few simple libraries supporting SHA-{256,512}, HMAC-SHA{256,512}, and HMAC-DRBG. You can find the repos here:
Each is packaged there as a Nix flake, and each is also
available on Hackage.
This is the first battery of a series of libraries I’m writing that were
primarily inspired by noble-cryptography after the death (or
at least deprecation) of cryptonite. The libraries are pure,
readable, concise GHC Haskell with minimal dependencies, and they aim
for clarity, security, performance, and user-friendliness.
I finally got around to going through most of the famous cryptopals
challenges last year and have since felt like writing some
“foundational” cryptography (and cryptography-adjacent) libraries that
I myself would want to use. I’d like to understand them well, test and
benchmark them myself, eke out performance and UX wins where I can get
them, etc. etc.
ppad-hmac-drbg is a case in point – I want to use
this DRBG in the manner I’m accustomed to using generators from e.g.
mwc-random, in which I can utilise any PrimMonad to handle the
generator state:
ghci> :set -XOverloadedStrings
ghci> import qualified Crypto.DRBG.HMAC as DRBG
ghci> import qualified Crypto.Hash.SHA256 as SHA256
ghci>
ghci> let entropy = "very random"
ghci> let nonce = "very unused"
ghci> let personalization_string = "very personal"
ghci>
ghci> drbg <- DRBG.new SHA256.hmac entropy nonce personalization_string
ghci> bytes <- DRBG.gen mempty 32 drbg
ghci> more_bytes <- DRBG.gen mempty 16 drbg
I haven’t actually tried the other DRBG library I found on
Hackage, but it has different UX, a lot of dependencies, and has since
apparently been deprecated. The generator in ppad-hmac-drbg matches my
preferred UX, passes the official DRBGVS vectors, depends only
on ‘base’, ‘bytestring’, and ‘primitive’, and can be used with arbitrary
appropriate HMAC functions (and is maintained!).
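For example, swapping in HMAC-SHA512 should just be a matter of passing a
different HMAC function. (This sketch assumes ppad-sha512 exposes a
Crypto.Hash.SHA512.hmac analogous to the SHA-256 one used above.)
ghci> -- assumes ppad-sha512's Crypto.Hash.SHA512 module
ghci> import qualified Crypto.Hash.SHA512 as SHA512
ghci> drbg <- DRBG.new SHA512.hmac entropy nonce personalization_string
ghci> bytes <- DRBG.gen mempty 64 drbg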
The ppad-sha256 and ppad-sha512 libraries depend only on ‘base’ and
‘bytestring’ and are faster than any other pure Haskell SHA-2
implementations I’m aware of, even if the performance differences are
moral, rather than practical, victories:
ppad-sha256’s SHA-256, vs SHA’s, on a 32B input:
benchmarking ppad-sha256/SHA256 (32B input)/hash
time 1.898 μs (1.858 μs .. 1.941 μs)
0.997 R² (0.996 R² .. 0.999 R²)
mean 1.874 μs (1.856 μs .. 1.902 μs)
std dev 75.90 ns (60.30 ns .. 101.8 ns)
variance introduced by outliers: 55% (severely inflated)
benchmarking SHA/SHA256 (32B input)/sha256
time 2.929 μs (2.871 μs .. 2.995 μs)
0.997 R² (0.995 R² .. 0.998 R²)
mean 2.879 μs (2.833 μs .. 2.938 μs)
std dev 170.4 ns (130.4 ns .. 258.9 ns)
variance introduced by outliers: 71% (severely inflated)
And the same, but now an HMAC-SHA256 battle:
benchmarking ppad-sha256/HMAC-SHA256 (32B input)/hmac
time 7.287 μs (7.128 μs .. 7.424 μs)
0.996 R² (0.995 R² .. 0.998 R²)
mean 7.272 μs (7.115 μs .. 7.455 μs)
std dev 565.2 ns (490.9 ns .. 689.7 ns)
variance introduced by outliers: 80% (severely inflated)
benchmarking SHA/HMAC-SHA256 (32B input)/hmacSha256
time 11.42 μs (11.09 μs .. 11.80 μs)
0.994 R² (0.992 R² .. 0.997 R²)
mean 11.36 μs (11.09 μs .. 11.61 μs)
std dev 903.5 ns (766.5 ns .. 1.057 μs)
variance introduced by outliers: 79% (severely inflated)
The performance differential is larger on larger inputs; I think the
difference between the two on a contrived 1GB input was 22 vs 32s, all
on my mid-2020 MacBook Air. I haven’t bothered to implement e.g. SHA-224
and SHA-384, which are trivial adjustments of SHA-256 and SHA-512, but
if anyone could use them for some reason, please just let me know.
Anyway: enjoy, and let me know if you get any use out of these. Expect
more releases in this spirit!
22 Sep 2024
I have a little library called sampling floating around for
general-purpose sampling from arbitrary foldable collections. It’s a
bit of a funny project: I originally hacked it together quickly, just
to get something done, so it’s not a very good library qua library –
it has plenty of unnecessary dependencies, and it’s not at all tuned
for performance. But it was always straightforward to use, and good
enough for my needs, so I’ve never felt any particular urge to update
it. Lately it caught my attention again, though, and I started thinking
about possible ways to revamp it and give it some much-needed
quality-of-life improvements, in order to make it more generally useful.
The library supports sampling with and without replacement in both the
equal and unequal-probability cases, from collections such as lists,
maps, vectors, etc. – again, anything with a Foldable instance.
In particular: the equal-probability, sampling-without-replacement
case depends on some code that Gabriella Gonzalez wrote for
reservoir sampling, a common online method for sampling from a
potentially-unbounded stream. I started looking into alternative
algorithms for reservoir sampling, to see what else was out there, and
discovered one that was extremely fast, extremely clever, and that
I’d never heard of before. It doesn’t seem to be that well-known, so I
simply want to illustrate it here, just so others are aware of it.
I learned about the algorithm from Erik Erlandson’s blog, but,
as he points out, it goes back almost forty years to one J. S.
Vitter, who apparently also popularised the “basic” reservoir sampling
algorithm that everyone uses today (though it was not invented by him).
The basic reservoir sampling algorithm is simple. We’re sampling
some number of elements from a stream of unknown size: if we haven’t
collected enough elements yet, we just dump everything we see into our
“reservoir” (i.e., the collected sample); otherwise, we just generate a
random number and do a basic comparison to determine whether or not to
eject a value in our reservoir in favour of the new element we’ve just
seen. This is evidently also known as Algorithm R, and can apparently
be found in Knuth’s Art of Computer Programming. Here’s a basic
implementation that treats the reservoir as a mutable vector:
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE LambdaCase #-}

import Control.Monad (when)
import qualified Data.Vector as V
import qualified Data.Vector.Mutable as VM
import qualified System.Random.Stateful as S

algo_r len stream prng = do
    reservoir <- VM.new len
    loop reservoir 0 stream
  where
    loop !sample index = \case
      [] ->
        V.freeze sample
      (h:t)
        | index < len -> do
            -- reservoir not yet full: take the element unconditionally
            VM.write sample index h
            loop sample (succ index) t
        | otherwise -> do
            -- keep the element with probability len / (index + 1), ejecting
            -- a uniformly-chosen occupant (bounds are inclusive)
            j <- S.uniformRM (0, index) prng
            when (j < len)
              (VM.write sample j h)
            loop sample (succ index) t
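A small driver is enough to exercise it. This is a sketch of mine, not
necessarily the exact harness behind the numbers below; it assumes the
‘algo_r’ above is in scope and uses mwc-random (Marsaglia’s MWC256) for
the generator:

import qualified System.Random.MWC as MWC

main :: IO ()
main = do
  -- initialise an MWC256 generator from system entropy
  gen <- MWC.createSystemRandom
  let stream = [1 .. 10000000] :: [Int]
  -- sample 100 elements from the 10M-element stream
  sample <- algo_r 100 stream gen
  print sample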
Here’s a quick, informal benchmark of it. Sampling 100 elements from a
stream of 10M 64-bit integers, using Marsaglia’s MWC256 PRNG,
yields the following runtime statistics on my mid-2020 MacBook Air:
73,116,027,160 bytes allocated in the heap
14,279,384 bytes copied during GC
45,960 bytes maximum residency (2 sample(s))
31,864 bytes maximum slop
6 MiB total memory in use (0 MiB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 17639 colls, 0 par 0.084s 0.123s 0.0000s 0.0004s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0002s 0.0003s
INIT time 0.007s ( 0.006s elapsed)
MUT time 18.794s ( 18.651s elapsed)
GC time 0.084s ( 0.123s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 18.885s ( 18.781s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,890,333,621 bytes per MUT second
Productivity 99.5% of total user, 99.3% of total elapsed
It’s fairly slow, and pretty much all we’re doing is generating 10M
random numbers.
In my experience, the most effective optimisations that can be made to
a numerical algorithm like this tend to be “mechanical” in nature –
avoiding allocation, cache misses, branch prediction failures,
etc. So I find it exceptionally pleasing when instead there’s some
domain-specific intuition that admits a substantial optimisation of the
algorithm itself.
Vitter’s optimisation is along these lines. I find it as ingenious as
e.g. the kernel trick in the support vector machine context. The
crucial observation Vitter made is that one doesn’t need to consider
whether or not to add every single element he encounters to the
reservoir; the “gap” between entries also follows a well-defined
probability distribution, and one can just instead sample that in
order to determine the next element to add. Erik Erlandson points out
that when the size of the reservoir is small relative to the size of the
stream, as is typically the case, this distribution is well-approximated
by the geometric distribution – that which describes the number of coin
flips required before one observes a head (it has come up in a
couple of my previous blog posts).
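Concretely: if each element is currently being accepted with probability
p, then the number of elements skipped before the next acceptance is
(approximately) geometric with parameter p, and can be sampled by
inversion from a single uniform draw. A sketch (the names here are mine):

-- geometric by inversion: the number of failures before the first
-- success, given success probability p and u uniform on (0,1)
skipLength :: Double -> Double -> Int
skipLength p u = floor (log u / log (1 - p))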
So: the algorithm is adjusted so that, while processing the stream, we
sample how many elements to skip, skip them, and then add the next
element encountered to the reservoir. Here’s a version of
that, using Erlandson’s fast geometric approximation for sampling what
Vitter calls the skip distance:
data Loop =
    Next
  | Skip !Int

algo_r_optim len stream prng = do
    reservoir <- VM.new len
    go Next reservoir 0 stream
  where
    fi = fromIntegral

    go what !sample !index = \case
      [] ->
        V.freeze sample
      s@(h:t) -> case what of
        Next
          -- below the reservoir size, just write elements
          | index < len -> do
              VM.write sample index h
              go Next sample (succ index) t
          -- sample skip distance
          | otherwise -> do
              u <- S.uniformDouble01M prng
              let p    = fi len / fi index
                  skip = floor (log u / log (1 - p))
              go (Skip skip) sample index s
        Skip skip
          -- stop skipping, use this element
          | skip == 0 -> do
              j <- S.uniformRM (0, len - 1) prng
              VM.write sample j h
              go Next sample (succ index) t
          -- skip (skip - 1) more elements
          | otherwise ->
              go (Skip (pred skip)) sample (succ index) t
And now the same simple benchmark:
1,852,883,584 bytes allocated in the heap
210,440 bytes copied during GC
45,960 bytes maximum residency (2 sample(s))
31,864 bytes maximum slop
6 MiB total memory in use (0 MiB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 445 colls, 0 par 0.002s 0.003s 0.0000s 0.0002s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0002s 0.0003s
INIT time 0.007s ( 0.007s elapsed)
MUT time 0.286s ( 0.283s elapsed)
GC time 0.002s ( 0.003s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.296s ( 0.293s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 6,477,345,638 bytes per MUT second
Productivity 96.8% of total user, 96.4% of total elapsed
Both Vitter and Erlandson estimated orders of magnitude improvement
in sampling time, given we need to spend much less time iterating our
PRNG; here we see a 65x performance gain, with 40x less allocation.
Very impressive, and again, the optimisation is entirely probabilistic,
rather than “mechanical,” in nature (indeed, I haven’t tested any
mechanical optimisations to these implementations at all).
It turns out there are extensions in this spirit to unequal-probability
reservoir sampling as well, such as the method of “exponential jumps”
described in a 2006 paper by Efraimidis and Spirakis. I’ll
probably benchmark that algorithm too, and, if it fits the bill, and I
ever really do get around to properly updating my ‘sampling’ library,
I’ll refactor things to make use of these high-performance algorithms.
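(In that scheme each item i is given a key u_i ** (1 / w_i) for weight
w_i, the reservoir keeps the items with the largest keys, and the
analogue of Vitter’s skip is computed from the current minimum key. A
sketch of just that jump, from my reading of the paper – names mine:)

-- exponential jump: given the smallest key t currently in the reservoir
-- and u uniform on (0,1), the amount of cumulative item weight to skip
-- before the next insertion
expJump :: Double -> Double -> Double
expJump u t = log u / log t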
Let me know if you could use this sort of thing!
01 Sep 2024
Some years ago I wrote about using recursion schemes
to encode stochastic processes in an embedded probabilistic
programming setting. The crux of it was that recursion schemes
allow one to “factor out” the probabilistic phenomena from the recursive
structure of the process; the probabilistic stuff typically sits in
the so-called coalgebra of the recursion scheme, while the recursion
scheme itself dictates the manner in which the process evolves.
While it’s arguable as to whether this stuff is useful from a strictly
practical perspective (I would suggest “not”), I think the intuition one
gleans from it can be somewhat worthwhile, and my curious brain finds
itself wandering back to the topic from time to time.
I happened to take a look at this sort of framework again recently
and discovered that I couldn’t easily seem to implement a Chinese
Restaurant Process (CRP) – a stochastic process famous from the
setting of nonparametric Bayesian models – via either of the “standard”
patterns I used throughout my Recursive Stochastic Processes
post. This indicated that the recursive structure of the CRP differs
from the others I studied previously, at least in the manner I was
attempting to encode it in my particular embedded language setting.
So let’s take a look to see what the initial problem was, how one can
resolve it, and what insights we can take away from it all.
Framework
Here’s a simple and slightly more refined version of the embedded
probabilistic programming framework I introduced in some of my older
posts. I’ll elide incidental details, such as common imports and noisy
code that might distract from the main points.
First, some unfamiliar imports and minimal core types:
import qualified Control.Monad.Trans.Free as TF
import qualified System.Random.MWC.Probability as MWC
data ModelF r =
    BernoulliF Double (Bool -> r)
  | UniformF (Double -> r)
  deriving Functor

type Model = Free ModelF
As a quick refresher, an expression of type ‘Model’ denotes a
probability distribution over some carrier type, and terms that
construct, manipulate, or interpret values of type ‘Model’ constitute
those of a simple embedded probabilistic programming language.
Importantly, here we’re using a very minimal such language, consisting
only of two primitives: the Bernoulli distribution (which is a
probability distribution over a coin flip) and the uniform distribution
over the interval [0, 1]. A trivial model that draws a probability of
success from a uniform distribution, and then flips a coin conditional
on that probability, would look like this, for example:
model :: Model Bool
model = Free (UniformF (\u ->
  Free (BernoulliF u pure)))
(We could create helper functions such that programs in this embedded
language would be nicer to write, but that’s not the focus of this
post.)
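Just to illustrate what those might look like, here are a couple of
hypothetical smart constructors built with ‘liftF’ – they aren’t used
elsewhere in this post:

-- hypothetical helpers, defined via liftF from Control.Monad.Free
bernoulli :: Double -> Model Bool
bernoulli p = liftF (BernoulliF p id)

uniform01 :: Model Double
uniform01 = liftF (UniformF id)

-- the example model again, written monadically
coin :: Model Bool
coin = uniform01 >>= bernoulli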
We can construct expressions that denote stochastic processes by using
the normal salad of recursion schemes. Here’s how we can encode a
geometric distribution, for example, in which one counts the number of
coin flips required to observe a head:
geometric :: Double -> Model Int
geometric p = apo coalg 1 where
  coalg n = TF.Free (BernoulliF p (\accept ->
    if accept
      then Left (pure n)
      else Right (n + 1)))
The coalgebra isolates the probabilistic phenomena (a Bernoulli draw,
i.e. a coin flip), and the recursion scheme determines how the process
evolves (halting if a Bernoulli proposal is accepted). The coalgebra is
defined in terms of the so-called base or pattern functor of the
free monad type, defined in ‘Control.Monad.Trans.Free’ (see Practical
Recursion Schemes for a refresher on base functors if you’re
rusty).
The result is an expression in our embedded language, of course, and is
completely abstract. If we want to sample from the model it encodes, we
can use an interpreter like the following:
prob :: Model a -> MWC.Prob IO a
prob = iterM $ \case
  BernoulliF p f -> MWC.bernoulli p >>= f
  UniformF f     -> MWC.uniform >>= f
where the ‘MWC’-prefixed functions are sampling functions from the
mwc-probability library, and ‘iterM’ is the familiar monadic
catamorphism-like recursion scheme over the free monad. This will
produce a function that can be sampled with ‘MWC.sample’ when provided
with a PRNG.
Here’s what some samples look like:
ghci> gen <- MWC.create
ghci> replicateM 10 (MWC.sample (prob (geometric 0.1)) gen)
[1,9,13,3,4,3,13,1,4,17]
Chinese Restaurant Process
The CRP is described technically as a stochastic process over “finite
integer partitions,” and, more memorably, over configurations of a
particular sort of indefinitely-large Chinese restaurant. One is to
imagine customers entering the restaurant sequentially; the first
sits at the first table available, and each additional customer is
either seated at a new table with probability proportional to some
dispersion parameter, or is seated at an occupied table with probability
proportional to the number of other customers already sitting there.
If each customer is labelled by how many others were in the restaurant
when they entered, the result is, for ‘n’ total customers, a random
partition of the natural numbers up to ‘n’. A particular realisation of
the process, following the arrival of 10 customers, might look like the
following:
[[9,8,7,6,4,2,0], [1], [5,3]]
This is a restaurant configuration after 10 arrivals, where each element
is a table populated by the labelled customers.
One of the rules of explaining the CRP is that you always have to
include a visualization like this:
[figure: the realised restaurant configuration, drawn as occupied tables]
It corresponds to the realised sample, and we certainly wouldn’t want to
break any pedagogical regulations.
Encoding, First Attempt
A natural way to encode the process is to seat each arriving customer at
a new table with the appropriate probability, and then, if it turns out
he is to be sat at an occupied table, to do that with the appropriate
conditional probability (i.e., conditional on the fact that he’s not
going to be seated at a new table).
So let’s imagine encoding a CRP with dispersion parameter ‘a’ and total
number of customers ‘n’. Our first attempt might go something like this:
crp n a = ana coalg (1, <initial restaurant>) where
  coalg (customer, tables)
    | customer >= n = TF.Pure tables
    | otherwise     =
        let p = <probability of seating customer at new table>
        in  TF.Free (BernoulliF p (\accept ->
              if accept
                then (succ customer, <seat at new table>)
                else ???
              ))
We run into a problem when we hit the ‘else’ branch of the conditional.
Here we want to express another random choice – viz., at which occupied
table do we seat the arriving customer. We’d want to do something like
‘TF.Free (UniformF (\u -> …))’ in that ‘else’ branch and then use the
produced uniform value to choose between the occupied tables based on
their conditional probabilities. But that will prove to be impossible,
given the type requirements.
There’s similarly no way to restructure things by producing the
desired uniform value earlier, before the conditional expression. For
appropriate type ‘t’, the coalgebra used for the anamorphism must have
type:
coalg :: a -> Base t a
which in our case reduces to:
coalg :: a -> TF.FreeF ModelF a a
You’ll find that there’s simply no way to add another TF.Free
constructor to the mix while satisfying the above type. So it seems that
with an anamorphism (or apomorphism, which encounters the same problem)
we’re limited to denoting a single probabilistic operation on any
recursive call.
Encoding, Correctly
We thus need a recursion scheme that allows us to add multiple levels of
the base functor at a time. The most appropriate scheme appears to me
to be the futumorphism, which I also wrote about in Time Traveling
Recursion Schemes.
(As Patrick Thomson pointed out in his sublime series on
recursion schemes, both anamorphisms and apomorphisms are special cases
of the futumorphism.)
The coalgebra used by a futumorphism has a different type than that used
by an ana- or apomorphism, namely:
coalg :: a -> Base t (Free (Base t) a)
In our case, this is:
coalg :: a -> TF.FreeF ModelF a (Free (TF.FreeF ModelF a) a)
Note that here we’ll be able to use additional ‘TF.Free’ constructors
inside other expressions involving the base functor. In Time Traveling
Recursion Schemes I referred to this as “working with values that
don’t exist yet, while pretending like they do” – but better would
be to say that one is simply defining values to be produced during
(co)recursion, using separate monadic code.
Here’s how we’d encode the CRP using a futumorphism, in pseudocode:
crp n a = futu coalg (1, <initial restaurant>) where
  coalg (customer, tables)
    | customer >= n = TF.Pure tables
    | otherwise     =
        let p = <probability of seating customer at new table>
        in  TF.Free (BernoulliF p (\accept ->
              if accept
                then pure (succ customer, <seat at new table>)
                else do
                  res <- liftF (TF.Free (UniformF (\u ->
                    <seat amongst occupied tables using 'u'>
                    )))
                  pure (succ customer, res)))
Note that recursive points are expressed under a ‘TF.Free’ constructor
by returning a value (using ‘pure’) in the free monad itself. This
effectively allows us to use more than one embedded language construct
on each recursive call – as many as we want, as a matter of fact. The
familiar ‘liftF’ function lifts each such expression into the free
monad for us.
I mentioned in Time Traveling Recursion Schemes that this sort of
thing can look a little bit nicer if you have the appropriate embedded
language terms floating around. If we define the following:
uniform :: Free (TF.FreeF ModelF a) Double
uniform = liftF (TF.Free (UniformF id))
then we can use it to tidy things up a bit:
crp n a = futu coalg (1, <initial restaurant>) where
  coalg (customer, tables)
    | customer >= n = TF.Pure tables
    | otherwise     =
        let p = <probability of seating customer at new table>
        in  TF.Free (BernoulliF p (\accept ->
              if accept
                then pure (succ customer, <seat at new table>)
                else do
                  u <- uniform
                  let res = <seat amongst occupied tables using 'u'>
                  pure (succ customer, res)))
In any case, the futumorphism gets us to a faithfully-encoded CRP.
It allows us to express multiple probabilistic operations in every
recursive call, which is what’s required here due to our use of
conditional probabilities.
Alternative Encodings
Now. Recall that I mentioned the following:
A natural way to encode the process is to seat each arriving customer
at a new table with the appropriate probability, and then, if it
turns out he is to be sat at an occupied table, to do that with the
appropriate conditional probability (i.e., conditional on the fact
that he’s not going to be seated at a new table).
This is probably the most natural way to encode the process, and indeed,
if we only have Bernoulli and uniform language terms (or beta, Gaussian,
etc. in place of uniform – something that can at least be transformed
to produce a uniform, in any case), this seems to be the best we can do.
But it is not the only way to encode the CRP. If we have a language
term corresponding to a categorical distribution, then we can instead
choose between a new table and any of the occupied tables simultaneously
using the unconditional probabilities for each.
Let’s adjust our ModelF functor, adding a language term corresponding
to a categorical distribution. It will have as its parameter a list of
probabilities, one for each possible outcome under consideration:
data ModelF r =
    BernoulliF Double (Bool -> r)
  | UniformF (Double -> r)
  | CategoricalF [Double] (Int -> r)
  deriving Functor
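To actually sample models that use the new term, the ‘prob’ interpreter
needs a corresponding case as well. A minimal sketch, assuming
mwc-probability’s ‘categorical’ sampler (which takes a container of
probabilities and yields an index):

prob :: Model a -> MWC.Prob IO a
prob = iterM $ \case
  BernoulliF p f    -> MWC.bernoulli p >>= f
  UniformF f        -> MWC.uniform >>= f
  CategoricalF ps f -> MWC.categorical ps >>= f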
With this we’ll be able to encode the CRP using a mere anamorphism, as
the following pseudocode describes:
crp n a = ana coalg (1, <initial restaurant>) where
  coalg (customer, tables)
    | customer >= n = TF.Pure tables
    | otherwise     =
        let ps = <unconditional categorical probabilities>
        in  TF.Free (CategoricalF ps (\i ->
              if i == 0
                then (succ customer, <seat at new table>)
                else (succ customer, <seat at occupied table 'i - 1'>)
              ))
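(For reference, the elided probabilities here are the standard CRP
weights: the dispersion parameter ‘a’ normalised against the total number
of seated customers, and each table’s occupancy likewise. An illustrative
helper of my own, assuming tables are represented as lists of customer
labels:)

-- probability of opening a new table, followed by one probability per
-- occupied table, proportional to its occupancy
crpWeights :: Double -> [[Int]] -> [Double]
crpWeights a tables = (a / z) : fmap occupancy tables where
  n           = fromIntegral (sum (fmap length tables))
  z           = n + a
  occupancy t = fromIntegral (length t) / z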
Since we only ever express a single probabilistic term during recursion
here, the “standard” pattern we’ve used previously applies.
But it’s worth noting that we could still employ the “unconditional,
then conditional” approach using a categorical distribution – we’d just
use the conditional probabilities to sample from the occupied tables
directly, rather than doing it in a more low-level fashion using the
single uniform draw. Given an appropriate ‘categorical’ term, that would
look more like:
crp n a = futu coalg (1, <initial restaurant>) where
  coalg (customer, tables)
    | customer >= n = TF.Pure tables
    | otherwise     =
        let p = <probability of seating customer at new table>
        in  TF.Free (BernoulliF p (\accept ->
              if accept
                then pure (succ customer, <seat at new table>)
                else do
                  let ps = <conditional categorical probabilities>
                  i <- categorical ps
                  pure (succ customer, <seat at occupied table 'i'>)))
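(The ‘categorical’ term assumed here can be defined analogously to the
‘uniform’ helper from earlier:)

categorical :: [Double] -> Free (TF.FreeF ModelF a) Int
categorical ps = liftF (TF.Free (CategoricalF ps id))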
The recursive structure in this case is more complicated, requiring a
futumorphism instead of an anamorphism, but typically the conditional
probabilities are easier to compute.
Fin
So, some takeaways here, if one wants to indulge in this sort of
framework:
If we want to express a stochastic process using multiple probabilistic
operations in any given recursive call, we may need to employ a
scheme that supports richer (co)recursion than a plain anamorphism
or apomorphism. Here a futumorphism captures what we’re looking for
quite naturally.
The same might be true if our functor is sufficiently limited, as
was the case here when we supported only the Bernoulli and uniform
distributions. There we had no option but to express the recursion using
conditional probabilistic operations, and so needed the richer recursive
structure that a futumorphism provides.
On the other hand, it may not be the case that one needs the richer
structure provided by a futumorphism, if instead one can express the
coalgebra using only a single layer of the base functor. Adding a
primitive categorical distribution to our embedded language eliminated
the need to use conditional probabilities when describing the recursion,
allowing us to drop back to a “basic” anamorphism.
Here is a gist containing a fleshed-out version of the code
above, if you’d like to play with it. Enjoy!