Faster Signatures on secp256k1

In my last post I mentioned that both the Schnorr and ECDSA signature schemes on secp256k1 could be made faster via the so-called wNAF method for elliptic curve point multiplication (short for the cumbersome “w-ary non-adjacent form”). I implemented wNAF for ppad-secp256k1 afterwards, adding a bunch of functions that use it internally.

The only “downside” to the use of wNAF is that one needs to supply a context argument that consists of a bunch of precomputed multiples of the secp256k1 base point (this is at least one of the reasons why libsecp256k1, as well as any bindings to it, generally require you to pass a context argument everywhere). It’s not much of a downside, but it does complicate the API to some degree, so I added these functions on top of the existing ones – the original functions are preserved for when terse code, rather than raw speed, is desired.

The wNAF method really shines when it comes to ECDSA. Here are some benchmarks that illustrate the improvement – the wNAF-powered functions are differentiated by a trailing apostrophe:

benchmarking ecdsa/sign_ecdsa
time                 1.795 ms   (1.767 ms .. 1.822 ms)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 1.806 ms   (1.785 ms .. 1.849 ms)
std dev              93.43 μs   (58.86 μs .. 163.7 μs)

benchmarking ecdsa/sign_ecdsa'
time                 243.1 μs   (237.6 μs .. 249.6 μs)
                     0.996 R²   (0.993 R² .. 0.999 R²)
mean                 241.4 μs   (238.1 μs .. 245.3 μs)
std dev              12.06 μs   (9.492 μs .. 16.17 μs)

benchmarking ecdsa/verify_ecdsa
time                 2.473 ms   (2.409 ms .. 2.541 ms)
                     0.995 R²   (0.991 R² .. 0.997 R²)
mean                 2.432 ms   (2.396 ms .. 2.480 ms)
std dev              140.2 μs   (110.4 μs .. 187.0 μs)

benchmarking ecdsa/verify_ecdsa'
time                 1.460 ms   (1.418 ms .. 1.497 ms)
                     0.994 R²   (0.989 R² .. 0.997 R²)
mean                 1.419 ms   (1.398 ms .. 1.446 ms)
std dev              80.76 μs   (66.73 μs .. 104.9 μs)

So, a 7.5x improvement on ECDSA signature creation, and almost a 2x improvement on signature verification. Not bad.

For Schnorr signatures the improvements are less pronounced, but still substantial:

benchmarking schnorr/sign_schnorr
time                 5.368 ms   (5.327 ms .. 5.423 ms)
                     0.999 R²   (0.999 R² .. 1.000 R²)
mean                 5.436 ms   (5.410 ms .. 5.464 ms)
std dev              84.83 μs   (72.17 μs .. 101.4 μs)

benchmarking schnorr/sign_schnorr'
time                 2.557 ms   (2.534 ms .. 2.596 ms)
                     0.998 R²   (0.997 R² .. 0.999 R²)
mean                 2.579 ms   (2.556 ms .. 2.605 ms)
std dev              83.75 μs   (69.50 μs .. 100.5 μs)

benchmarking schnorr/verify_schnorr
time                 2.338 ms   (2.309 ms .. 2.382 ms)
                     0.998 R²   (0.995 R² .. 0.999 R²)
mean                 2.339 ms   (2.316 ms .. 2.366 ms)
std dev              82.73 μs   (65.83 μs .. 105.9 μs)

benchmarking schnorr/verify_schnorr'
time                 1.429 ms   (1.381 ms .. 1.482 ms)
                     0.993 R²   (0.987 R² .. 0.998 R²)
mean                 1.372 ms   (1.355 ms .. 1.396 ms)
std dev              67.72 μs   (50.44 μs .. 110.5 μs)

So here about a 2x improvement, plus or minus change, across the board. I hope to hammer these down a good bit further in the future if at all possible!

Signatures on secp256k1

I’ve released a library supporting BIP340 Schnorr signatures and deterministic ECDSA on the elliptic curve secp256k1. Get it while it’s hot – for when you just aren’t feeling libsecp256k1!

This is another “minimal” library in the ppad suite of libraries I’m working on. Minimal in the sense that it is pure Haskell (no FFI – you can check out ppad-csecp256k1 if you want that) and depends only on ‘base’, ‘bytestring’, and my own HMAC-DRBG and SHA256 libraries. The feature set also intentionally remains rather lean for the time being (though if you could use other features in there, let me know!).

Performance is decent, though unsurprisingly it still pales in comparison to the low-level and battle-hardened libsecp256k1 (think 5ms vs 50μs to create a Schnorr signature, for example). There’s ample room for optimisation, though. Probably the lowest-hanging fruit is that scalar multiplication on secp256k1 can seemingly be made much more efficient via the so-called wNAF method that relies on precomputed points, such that we might be looking at more like 500μs to create a Schnorr signature, with a similar improvement for ECDSA. It would require slightly more annoying UX, probably warranting its own set of user-facing functions that would also accept a context argument, but does not appear difficult to implement.

A few things I observed or noted while writing this library:

  • The modular arithmetic functions for arbitrary-precision Integers contained in GHC.Num.Integer can be extremely fast, compared to hand-rolled alternatives. Things like integerPowMod# and integerRecipMod# absolutely fly, and will probably beat any hand-rolled variant.

  • Arbitrary-precision Integers are still slow, compared to fixed-width stuff that modern computers can positively chew through (this is not news to me, but still). I achieved a staggering speedup on some basic integer parsing by using a custom Word256 type (built from a bunch of Word64 values) under the hood, and converting to Integer only at the end.

    They can also be annoying when one wants to achieve constant-time execution of cryptographically-sensitive functions, for obvious reasons. It would be nice to have good fixed-width support for stuff like Word128, Word256, Word512, and so on – I briefly considered implementing and using a custom Word256 type for everything, but this would be a ton of work, and I’m not sure I could beat GHC’s native bigint support for e.g. modular multiplication, exponentiation, and inversion anyway. We’ll stick with plain-old Integer for the time being – it’s still no slouch.

  • Algorithmically constant-time operations can still fail to be constant-time in practice due to factors largely outside the programmer’s control. A good example is found in bit operations; looping through a bytestring and performing some “equivalent-looking” work on every byte may still result in execution time discrepancies between zero and nonzero bytes, for example, and these can be very hard to eliminate. This stuff can depend on the compiler or runtime, the architecture/processor used, etc.

  • The necessary organization of this kind of “catch-all” library is kind of unsatisfying. Rather than picking a single curve, and then implementing every feature one can possibly think of for it, it would intuitively be better to implement a generic curve library or libraries (for Weierstrass, Edwards, Montgomery, etc.), and then implement e.g. ECDSA or EdDSA or ECDH or or whatever as separate libraries depending on those generic curves as appropriate. One could then use everything in more of a plug-and-play fashion – this might be a design I’ll explore further in the future.

  • ByteString seems to provide a good user-facing “interface” for libraries like this. It’s reliable, familiar to Haskellers, has a great API, and is very well-tuned. It’s possible one might want to standardize on something else for internals, though; unboxed vectors are an obvious choice, though I would actually be inclined to use the primitive library’s PrimArrays directly, favouring simplicity, and eschewing vector’s harder-core optimisations.

    The idea here in any case would be that one would use ByteString only at the user-facing layer, and then work with PrimArrays (or whatever) everywhere internally. It’s perhaps worth exploring further – bytestring is very fast (strict bytestrings are, after all, merely pointers to cstrings), but so are PrimArrays, and mutation à la MutablePrimArray could be very helpful to have here and there.

    (FWIW, though, I’ve benchmarked PrimArray/MutablePrimArray in ppad-sha256 and found them to yield equivalent-or-slower performance compared to ByteString in that setting.)

The library has been tested on the Project Wycheproof and official BIP340 test vectors, as well as noble-secp256k1’s test suite, and care has been taken around timing of functions that operate on secret data. Kick the tires on it, if you feel so inclined!

(I mentioned this in my last post as well, but I’m indebted to Paul Miller’s noble cryptography project for this work, both as inspiration and also as a reference.)

New HMAC-DRBG and SHA-2 Libraries

Just FYI, I’ve dropped a few simple libraries supporting SHA-{256,512}, HMAC-SHA{256, 512}, and HMAC-DRBG. You can find the repos here:

Each is packaged there as a Nix flake, and each is also available on Hackage.

This is the first battery of a series of libraries I’m writing that were primarily inspired by noble-cryptography after the death (or at least deprecation) of cryptonite. The libraries are pure, readable, concise GHC Haskell, having minimal dependencies, and aim for clarity, security, performance, and user-friendliness.

I finally got around to going through most of the famous cryptopals challenges last year and have since felt like writing some “foundational” cryptography (and cryptography-adjacent) libraries that I myself would want to use. I’d like to understand them well, test and benchmark them myself, eke out performance and UX wins where I can get them, etc. etc.

An example is found in the case of ppad-hmac-drbg – I want to use this DRBG in the manner I’m accustomed to using generators from e.g. mwc-random, in which I can utilise any PrimMonad to handle the generator state:

ghci> :set -XOverloadedStrings
ghci> import qualified Crypto.DRBG.HMAC as DRBG
ghci> import qualified Crypto.Hash.SHA256 as SHA256
ghci>
ghci> let entropy = "very random"
ghci> let nonce = "very unused"
ghci> let personalization_string = "very personal"
ghci>
ghci> drbg <- DRBG.new SHA256.hmac entropy nonce personalization_string
ghci> bytes <- DRBG.gen mempty 32 drbg
ghci> more_bytes <- DRBG.gen mempty 16 drbg

I haven’t actually tried the other DRBG library I found on Hackage, but it has different UX, a lot of dependencies, and has since apparently been deprecated. The generator in ppad-hmac-drbg matches my preferred UX, passes the official DRBGVS vectors, depends only on ‘base’, ‘bytestring’, and ‘primitive’, and can be used with arbitrary appropriate HMAC functions (and is maintained!).

The ppad-sha256 and ppad-sha512 libraries depend only on ‘base’ and ‘bytestring’ and are faster than any other pure Haskell SHA-2 implementations I’m aware of, even if the performance differences are moral, rather than practical, victories:

ppad-sha256’s SHA-256, vs SHA’s, on a 32B input:

  benchmarking ppad-sha256/SHA256 (32B input)/hash
  time                 1.898 μs   (1.858 μs .. 1.941 μs)
                       0.997 R²   (0.996 R² .. 0.999 R²)
  mean                 1.874 μs   (1.856 μs .. 1.902 μs)
  std dev              75.90 ns   (60.30 ns .. 101.8 ns)
  variance introduced by outliers: 55% (severely inflated)

  benchmarking SHA/SHA256 (32B input)/sha256
  time                 2.929 μs   (2.871 μs .. 2.995 μs)
                       0.997 R²   (0.995 R² .. 0.998 R²)
  mean                 2.879 μs   (2.833 μs .. 2.938 μs)
  std dev              170.4 ns   (130.4 ns .. 258.9 ns)
  variance introduced by outliers: 71% (severely inflated)

And the same, but now a HMAC-SHA256 battle:

  benchmarking ppad-sha256/HMAC-SHA256 (32B input)/hmac
  time                 7.287 μs   (7.128 μs .. 7.424 μs)
                       0.996 R²   (0.995 R² .. 0.998 R²)
  mean                 7.272 μs   (7.115 μs .. 7.455 μs)
  std dev              565.2 ns   (490.9 ns .. 689.7 ns)
  variance introduced by outliers: 80% (severely inflated)

  benchmarking SHA/HMAC-SHA256 (32B input)/hmacSha256
  time                 11.42 μs   (11.09 μs .. 11.80 μs)
                       0.994 R²   (0.992 R² .. 0.997 R²)
  mean                 11.36 μs   (11.09 μs .. 11.61 μs)
  std dev              903.5 ns   (766.5 ns .. 1.057 μs)
  variance introduced by outliers: 79% (severely inflated)

The performance differential is larger on larger inputs; I think the difference between the two on a contrived 1GB input was 22 vs 32s, all on my mid-2020 MacBook Air. I haven’t bothered to implement e.g. SHA-224 and SHA-384, which are trivial adjustments of SHA-256 and SHA-512, but if anyone could use them for some reason, please just let me know.

Anyway: enjoy, and let me know if you get any use out of these. Expect more releases in this spirit!