19 Oct 2024
I’ve released a library supporting BIP340 Schnorr signatures
and deterministic ECDSA on the elliptic curve secp256k1.
Get it while it’s hot – for when you just aren’t feeling
libsecp256k1!
This is another “minimal” library in the ppad suite of libraries I’m
working on. Minimal in the sense that it is pure Haskell (no FFI
– you can check out ppad-csecp256k1 if you want that) and
depends only on ‘base’, ‘bytestring’, and my own HMAC-DRBG and SHA256
libraries. The feature set also intentionally remains rather lean
for the time being (though if you could use other features in there, let
me know!).
Performance is decent, though unsurprisingly it still pales in
comparison to the low-level and battle-hardened libsecp256k1 (think
5ms vs 50μs to create a Schnorr signature, for example). Still,
there’s ample room for optimisation. Probably the lowest-hanging
fruit is
that scalar multiplication on secp256k1 can seemingly be made much
more efficient via the so-called wNAF method that relies on
precomputed points, such that we might be looking at more like 500μs
to create a Schnorr signature, with a similar improvement for ECDSA. It
would require slightly more annoying UX, probably warranting its own set
of user-facing functions that would also accept a context argument, but
does not appear difficult to implement.
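For the curious, the digit-recoding half of wNAF looks something like
the following (a textbook sketch, not code from the library). Every
digit is either zero or odd, so only the odd multiples ±P, ±3P, …
need to be precomputed, and the long runs of zero digits mean most
iterations of the multiplication loop are pure doublings:
import Data.Bits ((.&.), shiftL, shiftR)

-- width-w non-adjacent form, least-significant digit first: every
-- digit is zero or odd, with absolute value below 2^(w-1)
wnaf :: Int -> Integer -> [Integer]
wnaf w = go where
  go 0 = []
  go k
    | odd k =
        let m  = 1 `shiftL` w
            d0 = k .&. (m - 1)            -- k mod 2^w
            d  = if d0 >= m `shiftR` 1    -- signed residue
                   then d0 - m
                   else d0
        in  d : go ((k - d) `shiftR` 1)
    | otherwise = 0 : go (k `shiftR` 1)
Scalar multiplication then performs one doubling per digit, and an
addition of a precomputed point only at the sparse nonzero digits.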
A few things I observed or noted while writing this library:
- The modular arithmetic functions for arbitrary-precision Integers
contained in GHC.Num.Integer can be extremely fast compared to
hand-rolled alternatives. Things like integerPowMod# and
integerRecipMod# absolutely fly, and will probably beat any
hand-rolled variant (see the first sketch following this list).
- Arbitrary-precision Integers are still slow compared to the
fixed-width stuff that modern computers can positively chew through
(this is not news to me, but still). I achieved a staggering speedup
on some basic integer parsing by using a custom Word256 type (built
from a bunch of Word64 values) under the hood, and converting to
Integer only at the end (a sketch of the idea follows this list).
They can also be annoying when one wants to achieve constant-time
execution of cryptographically-sensitive functions, for the obvious
reason that the time an Integer operation takes varies with the
magnitude of its operands. It would be nice to have good fixed-width
support for stuff like Word128, Word256, Word512, and so on – I
briefly considered implementing and using a custom Word256 type for
everything, but this would be a ton of work, and I’m not sure I could
beat GHC’s native bigint support for e.g. modular multiplication,
exponentiation, and inversion anyway. We’ll stick with plain-old
Integer for the time being – it’s still no slouch.
- Algorithmically constant-time operations can still fail to be
constant-time in practice due to factors largely outside the
programmer’s control. A good example is found in bit operations:
looping through a bytestring and performing some “equivalent-looking”
work on every byte may still result in execution-time discrepancies
between zero and nonzero bytes, and these can be very hard to
eliminate. This stuff can depend on the compiler or runtime, the
architecture/processor used, etc. (The constant-time equality sketch
following this list shows the kind of loop in question.)
- The necessary organization of this kind of “catch-all” library is
kind of unsatisfying. Rather than picking a single curve and then
implementing every feature one can possibly think of for it, it would
intuitively be better to implement a generic curve library or
libraries (for Weierstrass, Edwards, Montgomery, etc.), and then
implement e.g. ECDSA or EdDSA or ECDH or whatever as separate
libraries depending on those generic curves as appropriate. One could
then use everything in more of a plug-and-play fashion – this might
be a design I’ll explore further in the future.
- ByteString seems to provide a good user-facing “interface” for libraries
like this. It’s reliable, familiar to Haskellers, has a great API, and
is very well-tuned. It’s possible one might want to standardize on
something else for internals, though; unboxed vectors are an obvious
choice, though I would actually be inclined to use the primitive
library’s PrimArrays directly, favouring simplicity, and eschewing
vector’s harder-core optimisations.
The idea here in any case would be that one would use ByteString
only at the user-facing layer, and then work with PrimArrays (or
whatever) everywhere internally. It’s perhaps worth exploring further
– bytestring is very fast (strict bytestrings are, after all,
merely pointers to cstrings), but so are PrimArrays, and mutation à
la MutablePrimArray could be very helpful to have here and there.
(FWIW, though, I’ve benchmarked PrimArray/MutablePrimArray in
ppad-sha256 and found them to yield equivalent-or-slower performance
compared to ByteString in that setting.)
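To illustrate the first point above, here’s a minimal sketch of how
one might wrap the ghc-bignum primitives in friendlier types. The
unboxed-sum return types are the only awkward bit (the signatures
below are from recent ghc-bignum versions; check your GHC):
{-# LANGUAGE MagicHash #-}
{-# LANGUAGE UnboxedSums #-}

import GHC.Num.Integer (integerPowMod#, integerRecipMod#)
import Numeric.Natural (Natural)

-- b ^ e mod m, where defined
pow_mod :: Integer -> Integer -> Natural -> Maybe Natural
pow_mod b e m = case integerPowMod# b e m of
  (# n | #)  -> Just n
  (# | () #) -> Nothing

-- the inverse of x mod m, where one exists
recip_mod :: Integer -> Natural -> Maybe Natural
recip_mod x m = case integerRecipMod# x m of
  (# n | #)  -> Just n
  (# | () #) -> Nothing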
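The Word256 trick from the second point, meanwhile, looks roughly
like the following (a hypothetical sketch; the real thing needs
arithmetic, parsing, and so on):
import Data.Bits ((.|.), shiftL)
import Data.Word (Word64)

-- a 256-bit word as four 64-bit limbs, most significant first
data Word256 = Word256
  {-# UNPACK #-} !Word64
  {-# UNPACK #-} !Word64
  {-# UNPACK #-} !Word64
  {-# UNPACK #-} !Word64

-- convert to Integer only once, at the very end
word256_to_integer :: Word256 -> Integer
word256_to_integer (Word256 w3 w2 w1 w0) =
      fromIntegral w3 `shiftL` 192
  .|. fromIntegral w2 `shiftL` 128
  .|. fromIntegral w1 `shiftL` 64
  .|. fromIntegral w0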
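And apropos of the third point, the canonical “algorithmically
constant-time” loop is an equality check that accumulates a
difference across every byte rather than short-circuiting on the
first mismatch. A sketch (not the library’s actual code):
import Data.Bits (xor, (.|.))
import Data.List (foldl')
import qualified Data.ByteString as BS

-- examine every byte unconditionally; no early exit on mismatch
ct_eq :: BS.ByteString -> BS.ByteString -> Bool
ct_eq a b =
     BS.length a == BS.length b
  && foldl' (.|.) 0 (BS.zipWith xor a b) == 0
Even code like this, though, remains at the mercy of the compiler,
runtime, and hardware where actual execution timing is concerned.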
The library has been tested on the Project Wycheproof and
official BIP340 test vectors, as well as noble-secp256k1’s test
suite, and care has been taken around timing of functions that operate
on secret data. Kick the tires on it, if you feel so inclined!
(I mentioned this in my last post as well, but I’m indebted to Paul
Miller’s noble cryptography project for this work, both as
inspiration and also as a reference.)
07 Oct 2024
Just FYI, I’ve dropped a few simple libraries supporting SHA-{256,512}, HMAC-SHA-{256,512}, and HMAC-DRBG. You can find the repos here:
Each is packaged there as a Nix flake, and each is also
available on Hackage.
This is the first battery of a series of libraries I’m writing that were
primarily inspired by noble-cryptography after the death (or
at least deprecation) of cryptonite. The libraries are pure,
readable, concise GHC Haskell; they have minimal dependencies and aim
for clarity, security, performance, and user-friendliness.
I finally got around to going through most of the famous cryptopals
challenges last year and have since felt like writing some
“foundational” cryptography (and cryptography-adjacent) libraries that
I myself would want to use. I’d like to understand them well, test and
benchmark them myself, eke out performance and UX wins where I can get
them, etc. etc.
An example is found in the case of ppad-hmac-drbg – I want to use
this DRBG in the manner I’m accustomed to using generators from e.g.
mwc-random, in which I can utilise any PrimMonad to handle the
generator state:
ghci> :set -XOverloadedStrings
ghci> import qualified Crypto.DRBG.HMAC as DRBG
ghci> import qualified Crypto.Hash.SHA256 as SHA256
ghci>
ghci> let entropy = "very random"
ghci> let nonce = "very unused"
ghci> let personalization_string = "very personal"
ghci>
ghci> drbg <- DRBG.new SHA256.hmac entropy nonce personalization_string
ghci> bytes <- DRBG.gen mempty 32 drbg
ghci> more_bytes <- DRBG.gen mempty 16 drbg
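Because the generator state is handled via PrimMonad, the same calls
work in ST just as well as in IO. Something like the following should
do, for example (a sketch; ‘pure_bytes’ is a hypothetical name):
{-# LANGUAGE OverloadedStrings #-}

import Control.Monad.ST (runST)
import qualified Crypto.DRBG.HMAC as DRBG
import qualified Crypto.Hash.SHA256 as SHA256
import qualified Data.ByteString as BS

-- run a DRBG computation in ST, extracting bytes purely
pure_bytes :: BS.ByteString
pure_bytes = runST $ do
  drbg <- DRBG.new SHA256.hmac "very random" "very unused" mempty
  DRBG.gen mempty 32 drbg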
I haven’t actually tried the other DRBG library I found on
Hackage, but it has different UX, a lot of dependencies, and has since
apparently been deprecated. The generator in ppad-hmac-drbg matches my
preferred UX, passes the official DRBGVS vectors, depends only
on ‘base’, ‘bytestring’, and ‘primitive’, and can be used with arbitrary
appropriate HMAC functions (and is maintained!).
The ppad-sha256 and ppad-sha512 libraries depend only on ‘base’ and
‘bytestring’ and are faster than any other pure Haskell SHA-2
implementations I’m aware of, even if the performance differences are
moral, rather than practical, victories:
ppad-sha256’s SHA-256, vs SHA’s, on a 32B input:
benchmarking ppad-sha256/SHA256 (32B input)/hash
time 1.898 μs (1.858 μs .. 1.941 μs)
0.997 R² (0.996 R² .. 0.999 R²)
mean 1.874 μs (1.856 μs .. 1.902 μs)
std dev 75.90 ns (60.30 ns .. 101.8 ns)
variance introduced by outliers: 55% (severely inflated)
benchmarking SHA/SHA256 (32B input)/sha256
time 2.929 μs (2.871 μs .. 2.995 μs)
0.997 R² (0.995 R² .. 0.998 R²)
mean 2.879 μs (2.833 μs .. 2.938 μs)
std dev 170.4 ns (130.4 ns .. 258.9 ns)
variance introduced by outliers: 71% (severely inflated)
And the same, but now an HMAC-SHA256 battle:
benchmarking ppad-sha256/HMAC-SHA256 (32B input)/hmac
time 7.287 μs (7.128 μs .. 7.424 μs)
0.996 R² (0.995 R² .. 0.998 R²)
mean 7.272 μs (7.115 μs .. 7.455 μs)
std dev 565.2 ns (490.9 ns .. 689.7 ns)
variance introduced by outliers: 80% (severely inflated)
benchmarking SHA/HMAC-SHA256 (32B input)/hmacSha256
time 11.42 μs (11.09 μs .. 11.80 μs)
0.994 R² (0.992 R² .. 0.997 R²)
mean 11.36 μs (11.09 μs .. 11.61 μs)
std dev 903.5 ns (766.5 ns .. 1.057 μs)
variance introduced by outliers: 79% (severely inflated)
The performance differential is larger on larger inputs; I think the
difference between the two on a contrived 1GB input was 22s vs 32s,
all on my mid-2020 MacBook Air. I haven’t bothered to implement e.g. SHA-224
and SHA-384, which are trivial adjustments of SHA-256 and SHA-512, but
if anyone could use them for some reason, please just let me know.
Anyway: enjoy, and let me know if you get any use out of these. Expect
more releases in this spirit!
22 Sep 2024
I have a little library called sampling floating around for
general-purpose sampling from arbitrary foldable collections. It’s a
bit of a funny project: I originally hacked it together quickly, just
to get something done, so it’s not a very good library qua library –
it has plenty of unnecessary dependencies, and it’s not at all tuned
for performance. But it was always straightforward to use, and good
enough for my needs, so I’ve never felt any particular urge to update
it. Lately it caught my attention again, though, and I started
thinking about ways to revamp it and give it some much-needed
quality-of-life improvements, in order to make it more generally
useful.
The library supports sampling with and without replacement in both the
equal and unequal-probability cases, from collections such as lists,
maps, vectors, etc. – again, anything with a Foldable instance.
In particular: the equal-probability, sampling-without-replacement
case depends on some code that Gabriella Gonzalez wrote for
reservoir sampling, a common online method for sampling from a
potentially-unbounded stream. I started looking into alternative
algorithms for reservoir sampling, to see what else was out there, and
discovered one that was extremely fast, extremely clever, and that
I’d never heard of before. It doesn’t seem to be that well-known, so I
simply want to illustrate it here, just so others are aware of it.
I learned about the algorithm from Erik Erlandson’s blog but, as he
points out, it goes back almost forty years to one J.S. Vitter, who
apparently also popularised the “basic” reservoir sampling algorithm
that everyone uses today (though he did not invent it).
The basic reservoir sampling algorithm is simple. We’re sampling
some number of elements from a stream of unknown size: if we haven’t
collected enough elements yet, we just dump everything we see into our
“reservoir” (i.e., the collected sample); otherwise, we just generate a
random number and do a basic comparison to determine whether or not to
eject a value in our reservoir in favour of the new element we’ve just
seen. This is evidently also known as Algorithm R, and can apparently
be found in Knuth’s Art of Computer Programming. Here’s a basic
implementation that treats the reservoir as a mutable vector:
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE LambdaCase #-}

import Control.Monad (when)
import qualified Data.Vector as V
import qualified Data.Vector.Mutable as VM
import qualified System.Random.Stateful as S

algo_r len stream prng = do
    reservoir <- VM.new len
    loop reservoir 0 stream
  where
    loop !sample index = \case
      [] ->
        V.freeze sample
      (h:t)
        -- the reservoir isn't yet full; take the element unconditionally
        | index < len -> do
            VM.write sample index h
            loop sample (succ index) t
        -- keep the element with probability len / (index + 1)
        | otherwise -> do
            j <- S.uniformRM (0, index) prng
            when (j < len)
              (VM.write sample j h)
            loop sample (succ index) t
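Invocation looks something like the following (a sketch, assuming a
recent mwc-random for the PRNG):
import qualified System.Random.MWC as MWC

main :: IO ()
main = do
  prng <- MWC.createSystemRandom
  sample <- algo_r 100 [1 .. 10000000 :: Int] prng
  print sample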
Here’s a quick, informal benchmark of it. Sampling 100 elements from a
stream of 10M 64-bit integers, using Marsaglia’s MWC256 PRNG,
yields the following runtime statistics on my mid-2020 MacBook Air:
73,116,027,160 bytes allocated in the heap
14,279,384 bytes copied during GC
45,960 bytes maximum residency (2 sample(s))
31,864 bytes maximum slop
6 MiB total memory in use (0 MiB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 17639 colls, 0 par 0.084s 0.123s 0.0000s 0.0004s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0002s 0.0003s
INIT time 0.007s ( 0.006s elapsed)
MUT time 18.794s ( 18.651s elapsed)
GC time 0.084s ( 0.123s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 18.885s ( 18.781s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,890,333,621 bytes per MUT second
Productivity 99.5% of total user, 99.3% of total elapsed
It’s fairly slow, and pretty much all we’re doing is generating 10M
random numbers.
In my experience, the most effective optimisations that can be made
to a numerical algorithm like this tend to be “mechanical” in nature
– avoiding allocation, cache misses, branch prediction failures, and
so on – so I find it exceptionally pleasing when some domain-specific
intuition admits substantial optimisation of an algorithm instead.
Vitter’s optimisation is along these lines. I find it as ingenious as
e.g. the kernel trick in the support vector machine context. The
crucial observation Vitter made is that one doesn’t need to decide,
for every single element encountered, whether or not to add it to the
reservoir; the “gap” between entries also follows a well-defined
probability distribution, and one can instead simply sample that in
order to determine the next element to add. Erik Erlandson points out
that when the size of the reservoir is small relative to the size of the
stream, as is typically the case, this distribution is well-approximated
by the geometric distribution – that which describes the number of coin
flips required before one observes a head (it has come up in a
couple of my previous blog posts).
So: the algorithm is adjusted such that, while processing the stream,
we sample how many elements to skip, skip them, and then add the next
element encountered to the reservoir. Here’s a version of that, using
Erlandson’s fast geometric approximation for sampling what Vitter
calls the skip distance:
data Loop =
    Next
  | Skip !Int

fi :: Int -> Double
fi = fromIntegral

algo_r_optim len stream prng = do
    reservoir <- VM.new len
    go Next reservoir 0 stream
  where
    go what !sample !index = \case
      [] ->
        V.freeze sample
      s@(h:t) -> case what of
        Next
          -- below the reservoir size, just write elements
          | index < len -> do
              VM.write sample index h
              go Next sample (succ index) t
          -- sample a skip distance from the (approximately geometric)
          -- gap distribution; the approximation is good once index is
          -- large relative to len
          | otherwise -> do
              u <- S.uniformDouble01M prng
              let p    = fi len / fi index
                  skip = floor (log u / log (1 - p))
              go (Skip skip) sample index s
        Skip skip
          -- stop skipping; replace a random reservoir element
          | skip == 0 -> do
              j <- S.uniformRM (0, len - 1) prng
              VM.write sample j h
              go Next sample (succ index) t
          -- keep skipping
          | otherwise ->
              go (Skip (pred skip)) sample (succ index) t
And now the same simple benchmark:
1,852,883,584 bytes allocated in the heap
210,440 bytes copied during GC
45,960 bytes maximum residency (2 sample(s))
31,864 bytes maximum slop
6 MiB total memory in use (0 MiB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 445 colls, 0 par 0.002s 0.003s 0.0000s 0.0002s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0002s 0.0003s
INIT time 0.007s ( 0.007s elapsed)
MUT time 0.286s ( 0.283s elapsed)
GC time 0.002s ( 0.003s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.296s ( 0.293s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 6,477,345,638 bytes per MUT second
Productivity 96.8% of total user, 96.4% of total elapsed
Both Vitter and Erlandson estimated orders of magnitude improvement
in sampling time, given we need to spend much less time iterating our
PRNG; here we see a 65x performance gain, with 40x less allocation.
Very impressive, and again, the optimisation is entirely probabilistic,
rather than “mechanical,” in nature (indeed, I haven’t tested any
mechanical optimisations to these implementations at all).
It turns out there are extensions in this spirit to
unequal-probability reservoir sampling as well, such as the method of
“exponential jumps” described in a 2006 paper by Efraimidis and
Spirakis. I’ll
probably benchmark that algorithm too, and, if it fits the bill, and I
ever really do get around to properly updating my ‘sampling’ library,
I’ll refactor things to make use of these high-performance algorithms.
Let me know if you could use this sort of thing!