Runtime Analysis (ppad-fixed)

As noted in my last post, I recently released ppad-fixed, a high-performance fixed-width word library meant for use in cryptographic applications.

At present I use this library to drive elliptic curve arithmetic in ppad-secp256k1, as well as stuff that depends on that, e.g. ppad-bip32. It’s safe to say I have few users at the moment, but, as they say, “the way you do anything is the way you do everything,” so in this post I want to rigorously analyze some of its runtime and execution properties, to increase assurance both in general, and especially to myself, that it’s really fit for purpose.

My goal here is to illustrate constant-resource execution of this library on aarch64 when compiled with optimizations via GHC 9.8.1’s LLVM backend, first by presentation of the code in general, then some analysis of the STG IR, and then analysis of the produced assembly. I’ll also look at overall sanity checks, i.e. Criterion benchmarks and allocation measurement via ‘weigh’.

Data Representation

To start, ppad-fixed rolls its own four-limb word type (i.e. 256-bit, on a 64-bit system), consisting of a heap-allocated data constructor wrapping a strict, unboxed tuple:

data Wider = Wider !(# Limb, Limb, Limb, Limb #)

A ‘Limb’ here is just a newtype over an unboxed word, i.e. ‘Word#’, which is register-allocated. Such a ‘Wider’ allocates precisely 40 bytes on a 64-bit system: 8 bytes for each limb, and another 8 bytes for the ‘Wider’ data constructor itself. Library functions that return a ‘Wider’ value should allocate 40 bytes exactly, no matter what intermediate calculations they perform.

Operations on individual ‘Limbs’ use the available primitives from GHC.Exts where possible, e.g. the following carrying addition function, and are otherwise written from scratch:

add_c#
  :: Limb             -- ^ augend
  -> Limb             -- ^ addend
  -> Limb             -- ^ carry
  -> (# Limb, Limb #) -- ^ (# sum, new carry #)
add_c# (Limb a) (Limb b) (Limb c) =
  let !(# c0, s0 #) = Exts.plusWord2# a b
      !(# c1,  s #) = Exts.plusWord2# s0 c
  in  (# Limb s, Limb (Exts.or# c0 c1) #)
{-# INLINE add_c# #-}
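For illustration, the carry logic above can be modelled in Python, treating a limb as a 64-bit integer (a model of the semantics only; note that, as in the Haskell, ‘plusWord2#’ yields its high word first):

```python
MASK = (1 << 64) - 1  # limbs are 64-bit words

def plus_word2(a, b):
    """Model of plusWord2#: full 64x64-bit addition, high word first."""
    s = a + b
    return s >> 64, s & MASK  # (# carry, sum #)

def add_c(a, b, c):
    """Model of add_c#: add two limbs plus an incoming carry bit."""
    c0, s0 = plus_word2(a, b)
    c1, s = plus_word2(s0, c)
    return s, c0 | c1  # (# sum, new carry #)

# overflow propagates: (2^64 - 1) + 1 wraps to 0 with carry 1
assert add_c(2**64 - 1, 1, 0) == (0, 1)
```

The two partial carries can never both be set (if the first addition overflows, the low word is small enough that adding a single carry bit cannot overflow again), which is why the single ‘or#’ suffices to combine them.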

For ‘Wider’ and friends, the library provides two APIs; each function has an unboxed variant that works with unboxed tuples, avoiding heap allocation entirely, and a boxed variant that allocates data constructors only at the end of the computation, simply to box up the result. Internal computations are always performed exclusively in terms of unboxed tuples.

For example, building on the above carrying addition function on single limbs, we have the following overflowing addition function on four-limb tuples:

add_o#
  :: (# Limb, Limb, Limb, Limb #)             -- ^ augend
  -> (# Limb, Limb, Limb, Limb #)             -- ^ addend
  -> (# (# Limb, Limb, Limb, Limb #), Limb #) -- ^ (# sum, carry bit #)
add_o# (# a0, a1, a2, a3 #) (# b0, b1, b2, b3 #) =
  let !(# s0, c0 #) = L.add_o# a0 b0
      !(# s1, c1 #) = L.add_c# a1 b1 c0
      !(# s2, c2 #) = L.add_c# a2 b2 c1
      !(# s3, c3 #) = L.add_c# a3 b3 c2
  in  (# (# s0, s1, s2, s3 #), c3 #)
{-# INLINE add_o# #-}

and then the boxed variant, which just uses the above function internally:

add_o
  :: Wider
  -> Wider
  -> (Wider, Word)
add_o (Wider a) (Wider b) =
  let !(# s, Limb c #) = add_o# a b
  in  (Wider s, W# c)

This is actually the same pattern that one gets when using GHC’s UNPACK pragma, and you can typically see this kind of separation between boxed and unboxed variants in compiled Core. For this work I wanted to write in terms of unboxed primitives specifically, just to be sure I knew exactly what was going on (I mentioned that one may want to take this approach in Fast Haskell Redux, as well).
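To tie the limb-level picture together, here’s a Python model of the four-limb overflowing addition (a sketch only; ‘to_limbs’ and ‘from_limbs’ are hypothetical helpers for moving between integers and little-endian limb tuples, not part of the library):

```python
MASK = (1 << 64) - 1  # limbs are 64-bit words

def add_c(a, b, c):
    """Limb-level add with carry-in; returns (sum, carry-out)."""
    s = a + b + c
    return s & MASK, s >> 64

def add_o(a, b):
    """Model of add_o#: 256-bit addition with an overflow bit.
    a, b are little-endian 4-tuples of 64-bit limbs."""
    s0, c0 = add_c(a[0], b[0], 0)
    s1, c1 = add_c(a[1], b[1], c0)
    s2, c2 = add_c(a[2], b[2], c1)
    s3, c3 = add_c(a[3], b[3], c2)
    return (s0, s1, s2, s3), c3

def to_limbs(n):
    return tuple((n >> (64 * i)) & MASK for i in range(4))

def from_limbs(t):
    return sum(limb << (64 * i) for i, limb in enumerate(t))

# (2^256 - 1) + 1 wraps to zero and sets the overflow bit
s, c = add_o(to_limbs(2**256 - 1), to_limbs(1))
assert from_limbs(s) == 0 and c == 1
```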

The crux here in any case is that, unlike GHC’s native unbounded-size ‘Integer’ or ‘Natural’, ‘Wider’ values are always represented in the exact same manner, whether they have value 2 or 2 ^ 256 - 1, just like ‘Word’ or ‘Int’ are. The same is true for the ‘Montgomery’ types that are used to represent words in Montgomery form:

data Montgomery = Montgomery !(# Limb, Limb, Limb, Limb #)

(N.b., Montgomery form is a representation typically used to avoid expensive modular reductions when multiplying over a prime field. The primes that define the elliptic curve secp256k1 apparently have some special structure that allows fast reductions anyway, eschewing the need for something like Montgomery form, but Montgomery form works well with them regardless.)
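Concretely, with R = 2^256 and p a prime modulus, Montgomery form represents x as x·R mod p; for the secp256k1 field prime, the Montgomery ‘one’ that appears later in the code, 0x1000003D1, is exactly R mod p. A quick Python sanity check of those relationships (the conversions here use a modular inverse for clarity, not whatever reduction the library actually performs):

```python
p = 2**256 - 2**32 - 977  # the secp256k1 field prime
R = 2**256                # the Montgomery radix for four 64-bit limbs

def to_mont(x):
    """Enter Montgomery form: x -> x * R mod p."""
    return (x * R) % p

def from_mont(x):
    """Leave Montgomery form: x -> x * R^-1 mod p."""
    return (x * pow(R, -1, p)) % p

# the Montgomery representation of 1 is R mod p = 2^32 + 977
assert to_mont(1) == 0x1000003D1

# a product of Montgomery-form values carries an extra factor of R,
# which Montgomery multiplication (x * y * R^-1 mod p) cancels
a, b = 12345, 67890
ab_mont = (to_mont(a) * to_mont(b) * pow(R, -1, p)) % p
assert from_mont(ab_mont) == (a * b) % p
```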

ppad-fixed defines two of these types at present, one for each of the secp256k1 primes. The unboxed/boxed API split present for ‘Wider’ words also exists for the ‘Montgomery’ types.

In any case: the data representation ensures that there should be absolutely no input-dependent variability in allocation when producing values. Using the “unboxed API” internally means we should allocate solely when boxing results, and nowhere else.

Algorithm-Level Semantics

The algorithm-level invariant upheld in constant-time code is that it be “straight-line” code, avoiding any branching or indexing on its inputs. In ppad-fixed, following crypto-bigint, all functions are constant-time by default unless marked otherwise (and again, following crypto-bigint’s convention, I suffix all variable-time functions with ‘vartime’).

A good example of this kind of algorithmic invariant is seen in the implementation of modular inverse via Fermat inversion, in which there’s an unrolled loop of Montgomery squares and multiplies that proceeds according to a fixed schedule:

-- generated by etc/generate_inv.sh
inv#
  :: (# Limb, Limb, Limb, Limb #)
  -> (# Limb, Limb, Limb, Limb #)
inv# a =
  let -- montgomery 'one'
      !t0 = (# Limb 0x1000003D1##, Limb 0##, Limb 0##, Limb 0## #)
      !t1 = sqr# t0
      !t2 = mul# a t1
      !t3 = sqr# t2
      !t4 = mul# a t3
      --
      -- many intermediate steps omitted..
      --
      !t502 = mul# a t501
      !t503 = sqr# t502
      !t504 = sqr# t503
      !t505 = mul# a t504
      !r = t505
  in  r
{-# INLINE inv# #-}

As you can see from the comment, I generated this function via a bash script that loops through the binary expansion of the (constant, public, known) exponent used for Fermat inversion, printing ‘mul#’ and ‘sqr#’ bindings as appropriate. Since the schedule of multiplies and squares is fixed, then, by construction, the whole sequence runs in constant time, assuming that ‘sqr#’ and ‘mul#’ also do.
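The underlying schedule is just left-to-right binary exponentiation by p - 2 (Fermat’s little theorem gives a^(p - 2) ≡ a^(-1) mod p for prime p). Here’s a Python sketch of the same square/multiply schedule, on plain integers rather than Montgomery-form words:

```python
p = 2**256 - 2**32 - 977  # the secp256k1 field prime

def fermat_inv(a, p):
    """Left-to-right square-and-multiply by p - 2: one square per bit of
    the exponent, one multiply per set bit. Since p is public, the
    schedule is fixed and input-independent."""
    r = 1
    for bit in bin(p - 2)[2:]:
        r = (r * r) % p      # the sqr# step, every iteration
        if bit == "1":
            r = (r * a) % p  # the mul# step, on set bits only
    return r

assert (fermat_inv(7, p) * 7) % p == 1
```

The branch here is on a bit of the (constant, public) exponent, so unrolling it at generation time, as the bash script does, bakes the whole thing into straight-line code.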

It’s also illustrative to look at a generic Montgomery exponentiation function:

exp#
  :: (# Limb, Limb, Limb, Limb #)
  -> (# Limb, Limb, Limb, Limb #)
  -> (# Limb, Limb, Limb, Limb #)
exp# b e =
  let !o = (# Limb 0x1000003D1##, Limb 0##, Limb 0##, Limb 0## #)
      loop !r !m !ex n = case n of
        0 -> r
        _ ->
          let !(# ne, bit #) = Wider.shr1_c# ex
              !candidate = mul# r m
              !nr = select# r candidate bit
              !nm = sqr# m
          in  loop nr nm ne (n - 1)
  in  loop o b e (256 :: Word)
{-# INLINE exp# #-}

This one has no fixed schedule: naively, a ‘mul#’ is only needed when, looping through the bits of the exponent, the least-significant bit is set, so we don’t know at compile time how many ‘mul#’ calls would be performed. Here, then, instead of computing one or the other branch depending on the LSB, which would introduce argument-dependent variability to the function’s runtime, we compute both branches unconditionally, and perform a constant-time selection using ‘select#’ to continue the computation with the appropriate value.
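A Python model of the loop’s shape, with the mask-based select written out explicitly (Python bigints give no real constant-time guarantees, of course; this only illustrates the algorithm’s structure):

```python
W = 1 << 256  # four 64-bit limbs

def select(a, b, bit):
    """Branchless select: a if bit == 0, b if bit == 1.
    (0 - bit) mod 2^256 is all-zeros or all-ones, so the and/xor
    pair swaps in b without any data-dependent branch."""
    m = (0 - bit) % W
    return a ^ ((a ^ b) & m)

def exp_mod(b, e, p):
    """Fixed 256-iteration right-to-left binary exponentiation: every
    round performs both the multiply and the square, and select keeps
    or discards the multiply based on the exponent bit."""
    r, m = 1, b
    for _ in range(256):
        bit = e & 1                    # cf. shr1_c#: expose the low bit
        e >>= 1
        candidate = (r * m) % p
        r = select(r, candidate, bit)  # keep r, or take the product
        m = (m * m) % p                # square unconditionally
    return r

assert select(5, 9, 1) == 9 and select(5, 9, 0) == 5
```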

(N.b., I use a ‘Data.Choice’ module to implement constant-time primitives like ‘select#’, again mostly following the lead of crypto-bigint. It might be worth splitting this out into its own library at some point.)

So, you get the point: looking at this code at a high, algorithmic level, it’s written explicitly to run in constant time. There is no input-dependent branching or indexing to be found, except in functions that are clearly marked as running in variable time.

Evaluation-Level Semantics

So, the constant-time semantics check out at the algorithm level, but the code needs to pass through GHC, which is well-known to optimize aggressively. It’s thus worth checking out an intermediate representation to verify that GHC doesn’t plan on doing anything undesirable to the code during compilation.

The usual thing to examine when doing low-level optimization is Core, which is basically stripped-down Haskell with some optimizations applied, but for execution semantics it’s more useful to analyze an STG dump (where STG, as I’m sure everyone knows, means “Spineless Tagless G-machine,” the GHC IR that makes operational semantics explicit).

As far as we’re concerned, raw STG differs from Core primarily in that 1) it is fully normalized to A-normal form, 2) thunks are explicit (in that every let binding corresponds directly to a heap allocation), and 3) case expressions represent both sequencing, i.e. forcing of evaluation, and branching.

An STG dump can be procured by building with the ‘-ddump-stg-final’ flag. Take a look at the STG corresponding to the overflowing addition function on ‘Wider’ values shown above, for example. First, the unboxed variant:

add_o#
  :: (# Limb, Limb, Limb, Limb #)
     -> (# Limb, Limb, Limb, Limb #)
     -> (# (# Limb, Limb, Limb, Limb #), Limb #) =
    \r [us us us us us us us us]
        case plusWord2# [us us] of {
        (#,#) c s ->
        case plusWord2# [us us] of {
        (#,#) c0 s0 ->
        case plusWord2# [s0 c] of {
        (#,#) c1 s1 ->
        case plusWord2# [us us] of {
        (#,#) c2 s2 ->
        case or# [c0 c1] of sat {
        __DEFAULT ->
        case plusWord2# [s2 sat] of {
        (#,#) c3 s3 ->
        case plusWord2# [us us] of {
        (#,#) c4 s4 ->
        case or# [c2 c3] of sat {
        __DEFAULT ->
        case plusWord2# [s4 sat] of {
        (#,#) c5 s5 ->
        case or# [c4 c5] of sat {
        __DEFAULT -> (#,,,,#) [s s1 s3 s5 sat];
        [..]
        };

Notice that it contains no let bindings, and thus doesn’t allocate (recall that raw limbs are stored in registers), and also that each case expression has only a single branch, so each represents straightforward evaluation sequencing rather than any actual branching. This is exactly what one wants to see.

For the boxed wrapper, in which the above unboxed function has already been inlined:

add_o :: Wider -> Wider -> (Wider, Word) =
    \r [ds ds1]
        case ds of {
        Wider us us us us ->
        case ds1 of {
        Wider us us us us ->
        case plusWord2# [us us] of {
        (#,#) c s ->
        case plusWord2# [us us] of {
        (#,#) c0 s0 ->
        case plusWord2# [s0 c] of {
        (#,#) c1 s1 ->
        case plusWord2# [us us] of {
        (#,#) c2 s2 ->
        case or# [c0 c1] of sat {
        __DEFAULT ->
        case plusWord2# [s2 sat] of {
        (#,#) c3 s3 ->
        case plusWord2# [us us] of {
        (#,#) c4 s4 ->
        case or# [c2 c3] of sat {
        __DEFAULT ->
        case plusWord2# [s4 sat] of {
        (#,#) c5 s5 ->
        case or# [c4 c5] of sat {
        __DEFAULT ->
        let { sat :: Word = W#! [sat]; } in
        let { sat :: Wider = Wider! [s s1 s3 s5]; } in  (,) [sat sat];
        [..]
        };

This function returns a boxed value, so it allocates, as can be seen by the let bindings at the end. Recall that this function returns a value of type (Wider, Word), so it needs to allocate precisely one ‘Wider’ data constructor, one ‘Word’ data constructor, and one tuple constructor. This is all constant with respect to the function’s inputs, and there is no additional allocation to be found anywhere.

It’s also worth taking a look at the STG for the loop in the exponentiation function shown previously:

$wloop
  :: (# Limb, Limb, Limb, Limb #)
     -> (# Limb, Limb, Limb, Limb #)
     -> (# Limb, Limb, Limb, Limb #)
     -> Word#
     -> (# Limb, Limb, Limb, Limb #) =
    \r [us us us us us us us us us us us us ww]
        case ww<TagProper> of wild2 {
          __DEFAULT ->
              case sqr# us us us us of nm {
              (#,,,#) _ _ _ _ ->
              case mul# us us us us us us us us of {
              (#,,,#) ipv4 ipv5 ipv6 ipv7 ->
              case uncheckedShiftL# [us 63#] of sat {
              __DEFAULT ->
              case uncheckedShiftRL# [sat 63#] of sat {
              __DEFAULT ->
              case not# [sat] of sat {
              __DEFAULT ->
              case plusWord# [sat 1##] of ipv8 {
              __DEFAULT ->
              case minusWord# [wild2 1##] of sat {
              __DEFAULT ->
              case xor# [us ipv7] of sat {
              __DEFAULT ->
              case and# [ipv8 sat] of sat {
              __DEFAULT ->
              case xor# [us sat] of sat {
              __DEFAULT ->
              [..]
              case or# [sat sat] of sat {
              __DEFAULT ->
              exp_$s$wloop1
                  sat sat sat sat ipv ipv1 ipv2 ipv3 sat sat sat sat sat;
              };
              [..]
              };
          0## -> (#,,,#) [us us us us];
        };

This contains no let bindings at all, but you can see two branches in the first case expression. These correspond to whether the loop continues or stops (it stops when the looping index hits 0), which is deterministic and constant with respect to inputs, so this branch is unproblematic. Note in the meantime that the constant-time select operation used in the loop produces no such branching, but instead is captured by the salad of ‘xor#’ and ‘and#’ expressions in the body of the function (many omitted above), as intended.

For an example of something that does not run in constant time, take a look at the STG for the aptly-named eq_vartime function, which compares for equality in variable time:

eq_vartime :: Montgomery -> Montgomery -> Bool =
    \r [ds ds1]
        case ds of {
        Montgomery us us us us ->
        case ds1 of {
        Montgomery us us us us ->
        case eqWord# [us us] of {
          __DEFAULT -> False [];
          1# ->
              case eqWord# [us us] of {
                __DEFAULT -> False [];
                1# ->
                    case eqWord# [us us] of {
                      __DEFAULT -> False [];
                      1# -> eq_vartime# us us;
                    };
              };
        };
        };
        };

Here you can see a number of branches, each of which depends on the inputs to the function. This is what one would not want to see in code that’s supposed to run in constant time.

Assembly Analysis

All is well thus far: the data representation is fixed-size, the algorithms used have constant-time and constant-allocation semantics, and GHC’s evaluation model demonstrably doesn’t want to introduce any unexpected variability to them. But there’s still the matter of the assembly produced – one wants to be sure that the machine instructions the code ultimately compiles down to are free of any unexpected input-dependent variation.

By my observation, GHC’s LLVM backend produces dramatically faster assembly than does the native code generator, so I’ve adopted it for all of my ppad libraries. To inspect the assembly, we can dump the LLVM IR, and then run it through ‘llc’ with optimizations, examining the result.

Assuming we’re compiling with the LLVM backend, an LLVM IR dump can be obtained via the ‘-ddump-llvm’ flag, and, after removing some headers GHC inserts, we can run it through ‘llc -O3 -mcpu=native -filetype=asm’ to produce assembly.

Let’s first take a look at the (aarch64) assembly for the overflowing addition function. I’ll paste the entire thing, for completeness:

; %bb.0:                                ; %naoh
  stp x25, x26, [sp, #40]
  stp x23, x24, [sp, #24]
  stp x20, x22, [sp, #8]
  ldp x8, x9, [x20]
  stp x9, x8, [sp, #176]
  ldr x10, [x20, #16]
  str x10, [sp, #168]
  adds  x8, x24, x8
  cset  w10, hs
  adds  x11, x23, x27
  cset  w12, hs
  stp x11, x12, [sp, #152]
  stp x8, x10, [sp, #136]
  adcs  x8, x8, xzr
  cset  w11, hs
  stp x8, x11, [sp, #120]
  adds  x8, x25, x9
  cset  w9, hs
  stp x8, x9, [sp, #104]
  orr x10, x10, x11
  adds  x24, x8, x10
  cset  w8, hs
  stp x24, x8, [sp, #88]
  ldr x10, [sp, #48]
  ldr x11, [sp, #168]
  adds  x10, x10, x11
  cset  w11, hs
  stp x10, x11, [sp, #72]
  orr x8, x9, x8
  adds  x25, x10, x8
  cset  w8, hs
  stp x25, x8, [sp, #56]
  orr x26, x11, x8
  stp x25, x26, [sp, #40]
  ldr x23, [sp, #120]
  stp x23, x24, [sp, #24]
  ldr x22, [sp, #152]
  ldr x8, [sp, #8]
  add x20, x8, #24
  stp x20, x22, [sp, #8]
  ldr x8, [x8, #24]
  blr x8
  ret

The ‘str’ and ‘ldr’ instructions mean store register and load register, while ‘stp’ and ‘ldp’ mean store pair of registers and load pair of registers, respectively; they all indicate stuff being shuffled around between registers and the stack. These all run in constant time; note that Jon’s Arm Reference contains the following blurb for each:

This instruction is a data-independent-time instruction

The branch with link to register (blr) and return from subroutine (ret) instructions at the end of the computation are “whole program” control-flow instructions that branch unconditionally, and not on secret data. BLR’s target register contains a GHC continuation pointer (i.e., “what we’re doing next”) and RET indicates that this is a subroutine return.

So, that’s all administrative stuff. The actual overflowing addition algorithm is better-captured by the following core sequence of instructions:

  adds  x8, x24, x8
  cset  w10, hs
  adds  x11, x23, x27
  cset  w12, hs
  adcs  x8, x8, xzr
  cset  w11, hs
  adds  x8, x25, x9
  cset  w9, hs
  orr x10, x10, x11
  adds  x24, x8, x10
  cset  w8, hs
  adds  x10, x10, x11
  cset  w11, hs
  orr x8, x9, x8
  adds  x25, x10, x8
  cset  w8, hs
  orr x26, x11, x8
  add x20, x8, #24

The add (‘add’), add and set flags (‘adds’), add with carry and set flags (‘adcs’), and bitwise or (‘orr’) instructions are all “data-independent-time,” and ‘cset’ is actually an alias for conditional select increment (‘csinc’), which is a constant-time branchless select. This is exactly what we want to see: every instruction in this program runs in time constant with respect to its inputs.

The assembly for the variable-time equality function makes for an excellent comparison. Here’s the meat of it, with the “administrative” instructions omitted:

; %bb.0:
  cmp x8, x9
  b.ne  LBB31_4
; %bb.1:
  cmp x8, x9
  b.ne  LBB31_4
; %bb.2:
  cmp x8, x9
  b.ne  LBB31_4
; %bb.3:
  add x20, x8, #40

The core variable-time instruction here is branch conditionally (b.cond), where the condition of interest is “not equal to” (thus ‘b.ne’). This returns from the subroutine early if a limb inequality check (performed by ‘cmp’) succeeds. If the subroutine can be timed reliably, especially with chosen inputs, then the timing differential will expose which limb differs first.
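The leak is easy to make concrete by counting limb comparisons instead of measuring time. In this Python model (hypothetical names; each count stands in for one ‘cmp’/‘b.ne’ pair), the operation count is a proxy for runtime:

```python
def eq_vartime(a, b):
    """Early-exit limb equality over limb tuples: returns (equal?, limbs
    examined). The count, a stand-in for runtime, reveals the index of
    the first differing limb."""
    count = 0
    for x, y in zip(a, b):
        count += 1
        if x != y:
            return False, count
    return True, count

base = (1, 2, 3, 4)
# differ in limb 0: bails after a single comparison
assert eq_vartime(base, (9, 2, 3, 4)) == (False, 1)
# differ in limb 2: three comparisons, a timing tell
assert eq_vartime(base, (1, 2, 9, 4)) == (False, 3)
# equal: all four limbs examined
assert eq_vartime(base, base) == (True, 4)
```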

Benchmarks and Allocation Measurement

With the code, IR, and assembly all appearing to be in order, the benchmark suites serve mostly as sanity checks. They’re immensely useful, though. I’m reminded a bit of that quote attributed to Knuth, which I’ll paraphrase as “beware of bugs – I’ve only proved the program correct, not tested it.”

ppad-fixed has a Criterion suite that benchmarks functions of interest on varying-sized inputs. We expect variation in wall-clock time to be readily attributable to noise; here’s an example excerpt of the output, for a Montgomery modular exponentiation function:

benchmarking exp/curve:  M(2) ^ 2
time                 5.217 μs   (5.206 μs .. 5.224 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 5.208 μs   (5.198 μs .. 5.215 μs)
std dev              27.24 ns   (20.87 ns .. 37.18 ns)

benchmarking exp/curve:  M(2 ^ 255 - 19) ^ (2 ^ 255 - 19)
time                 5.214 μs   (5.207 μs .. 5.220 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 5.211 μs   (5.201 μs .. 5.217 μs)
std dev              27.82 ns   (16.79 ns .. 51.12 ns)

With a three-nanosecond average differential, the function runs in the same time (up to noise) whether called with small or large inputs, as expected.

A ‘weigh’ suite similarly ensures that there are no surprises when it comes to allocation. Allocations match exactly what is expected across the board, e.g.:

exp

  Case                            Allocated  GCs
  curve:  M(2) ^ 2                       40    0
  curve:  M(2) ^ (2 ^ 255 - 19)          40    0
  scalar:  M(2) ^ 2                      40    0
  scalar:  M(2) ^ (2 ^ 255 - 19)         40    0

You’ll recall that 40 bytes is what a four-limb strict, unboxed tuple behind a single constructor (‘Montgomery’, in this case) should allocate on a 64-bit system.

For functions that return something more complicated, we can always account for the additional allocation precisely. For the variable-time modular square root function on one of the Montgomery domains, for example:

sqrt_vartime

  Case                                  Allocated  GCs
  curve:  sqrt_vartime M(2)                    56    0
  curve:  sqrt_vartime M(2 ^ 255 - 19)         56    0

Here the function returns a ‘Maybe Montgomery’ value, so the additional 16 bytes are for the ‘Just’ constructor and the pointer to its payload.

Conclusion

So: I’ve analyzed the entirety of this library in the above manner, and by my appraisal, all functions in this library will run in constant time on aarch64 when compiled with optimizations with GHC 9.8.1, unless marked otherwise. I haven’t performed the same analysis for other compiler versions or flags, or for other target architectures, but the above gives me confidence that the results are likely to hold for those as well. The Criterion suite should be particularly useful as a sanity check for those cases.

I’ve also performed the same analysis for ppad-secp256k1, and intend to do it for all my other libraries where constant-time execution is expected. Feel free to reproduce my analysis yourself, if curious – and let me know if you do!

Much Faster Signatures on secp256k1

I’ve just released a new version of ppad-secp256k1 that supports dramatically faster Schnorr and ECDSA signature schemes, powered by faster elliptic curve arithmetic under the hood.

This follows on my previous posts in what are starting to constitute a sort of secp256k1 series. In the last post I used the wNAF method for elliptic curve scalar multiplication to gain a large performance boost across the board. This time there have been two significant changes that have resulted in performance wins: first, the use of fixed-width, register-allocated words exclusively, in place of GHC’s heap-allocated ‘Integer’, and second, the use of Montgomery-form arithmetic to avoid performing expensive division operations when working over the secp256k1 fields.

Both of these features are provided by a new library I’ve released, ppad-fixed, which provides such stuff, as well as constant-time selection and arithmetic. ppad-fixed is essentially a Haskell translation of parts of Rust’s crypto-bigint library, except specialized to ‘wide’ (two-limb) and ‘wider’ (four-limb) words, as, unlike in Rust or Golang, we don’t have stack-allocated arrays in Haskell, and so everything is implemented in terms of fixed-size unboxed tuples. I’ve carefully verified that the assembly for many of the arithmetic operations is effectively optimal, at least for darwin-aarch64, when produced with GHC’s LLVM backend, so expect speediness – typically a few nanoseconds for addition and subtraction, a few more for multiplication, and a few microseconds for modular exponentiation, inversion, and square root.

I’d originally mentioned my desire for better fixed-width support in my first post on secp256k1:

Arbitrary-precision Integers are still slow, compared to fixed-width stuff that modern computers can positively chew through. I achieved a staggering speedup on some basic integer parsing by using a custom Word256 type (built from a bunch of Word64 values) under the hood, and converting to Integer only at the end.

They can also be annoying when one wants to achieve constant-time execution of cryptographically-sensitive functions, for obvious reasons. It would be nice to have good fixed-width support for stuff like Word128, Word256, Word512, and so on – I briefly considered implementing and using a custom Word256 type for everything, but this would be a ton of work, and I’m not sure I could beat GHC’s native bigint support for e.g. modular multiplication, exponentiation, and inversion anyway. We’ll stick with plain-old Integer for the time being – it’s still no slouch.

Well, at this point we beat Integer very handily, and provide solid guarantees of constant-time and constant-allocation execution to boot. We’re now getting into the realm of libsecp256k1-tier performance:

  benchmarking schnorr/sign_schnorr' (large)
  time                 43.34 μs   (43.21 μs .. 43.47 μs)
                       1.000 R²   (1.000 R² .. 1.000 R²)
  mean                 43.21 μs   (43.13 μs .. 43.30 μs)
  std dev              280.8 ns   (221.0 ns .. 357.9 ns)

  benchmarking schnorr/verify_schnorr'
  time                 94.57 μs   (94.27 μs .. 94.90 μs)
                       1.000 R²   (1.000 R² .. 1.000 R²)
  mean                 94.13 μs   (93.92 μs .. 94.38 μs)
  std dev              777.4 ns   (644.8 ns .. 925.8 ns)

  benchmarking ecdsa/sign_ecdsa' (large)
  time                 35.21 μs   (35.11 μs .. 35.32 μs)
                       1.000 R²   (1.000 R² .. 1.000 R²)
  mean                 35.14 μs   (35.09 μs .. 35.20 μs)
  std dev              184.4 ns   (135.9 ns .. 241.5 ns)

  benchmarking ecdsa/verify_ecdsa'
  time                 88.70 μs   (88.44 μs .. 88.93 μs)
                       1.000 R²   (1.000 R² .. 1.000 R²)
  mean                 88.93 μs   (88.75 μs .. 89.13 μs)
  std dev              635.0 ns   (517.3 ns .. 963.9 ns)

And yeah, I also implemented Montgomery arithmetic for two of the secp256k1 domains of interest, i.e. those with moduli defined by the secp256k1 field prime and scalar group order, respectively. This allowed me to cut out a bunch of expensive modP/modQ operations everywhere in the Schnorr and ECDSA implementations.
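For reference, the reason Montgomery form avoids those expensive operations is that the reduction step after a multiply needs only multiplications, masks, shifts, and at most one subtraction, with no division by the modulus. Here is a Python sketch of textbook Montgomery reduction (REDC); it may well differ from the exact reduction ppad-fixed performs:

```python
p = 2**256 - 2**32 - 977      # the secp256k1 field prime
R_BITS = 256
R = 1 << R_BITS
P_INV = (-pow(p, -1, R)) % R  # -p^-1 mod R, precomputed once

def redc(t):
    """Montgomery reduction: given t < p * R, return t * R^-1 mod p.
    Only masks, shifts, multiplies, and one subtraction; a real
    constant-time version would make the final subtraction branchless."""
    m = ((t & (R - 1)) * P_INV) & (R - 1)
    u = (t + m * p) >> R_BITS  # exactly divisible by R by construction
    return u - p if u >= p else u

def mont_mul(x, y):
    """Multiply two Montgomery-form values, staying in Montgomery form."""
    return redc(x * y)

def to_mont(x):
    return (x * R) % p

def from_mont(x):
    return redc(x)

a, b = 0xDEADBEEF, 0xCAFEBABE
assert from_mont(mont_mul(to_mont(a), to_mont(b))) == (a * b) % p
```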

Haskell’s much-maligned ‘Num’ typeclass actually makes these Montgomery form numbers a joy to work with. To crib an illustrative GHCi session from the ppad-fixed README:

  > import qualified Numeric.Montgomery.Secp256k1.Curve as C
  >
  > let one = 1 :: C.Montgomery
  > one
  1
  > putStrLn (C.render one)
  (4294968273, 0, 0, 0)
  >
  > let two = one + one
  > putStrLn (C.render two)
  (8589936546, 0, 0, 0)
  >
  > let big = 2 ^ (128 :: Int) :: C.Montgomery
  > big
  340282366920938463463374607431768211456
  > putStrLn (C.render big)
  (0, 0, 4294968273, 0)
  >
  > let inv = C.inv big
  > inv
  85349562743316995932995116683053049354367560536510302240860302699983992117553
  > putStrLn (C.render inv)
  (0, 0, 1, 0)
  >
  > inv * big
  1

(N.b., I’ve been writing Haskell for something like fourteen years, but to this day I’m still amazed by what you can do with Num. It’s great to be able to write in terms of plain integers, but get fixed-width Montgomery stuff automatically handled under the hood, similar to the application I described for automatic differentiation some years ago.)

There’s still room to squeeze out further performance gains here. Some stuff that comes to mind: using Barrett reduction for my modQ operation, in place of the cute Montgomery round-trip I presently do to calculate it; employing the GLV endomorphism that secp256k1 admits in wNAF scalar multiplication; probably going for Bernstein-Yang inversion instead of the fixed-schedule Fermat inversion that I presently use. I’ll get to all of them eventually, as the need or interest arises.

Check it out for all of your Haskell secp256k1 needs, and do let me know if there are any additional features you could use!

Reproducible Build Artifacts

Taproot Assets endeavours to provide completely reproducible and verifiable build artifacts for all releases. These include binaries for a number of target architectures/platforms, along with source code and vendored dependencies, all packaged up in gzipped tarballs, or in zip(1) files. Each release comes with a manifest of these artifacts along with accompanying SHA256 digests.

To attest the artifacts for a release, trusted contributors verify that they can produce the same manifest, and then they cryptographically sign it:

$ gpg --verify manifest-jtobin-v0.7.0.sig manifest-v0.7.0.txt
gpg: Signature made Thu 20 Nov 14:07:09 2025 +04
gpg:                using EDDSA key 7363F82F43B1324E1B0761310E4647D58F8A69E4
gpg: Good signature from "Jared Tobin (https://ppad.tech) <jared@ppad.tech>" [ultimate]
gpg:                 aka "Jared Tobin (jtobin) <jared@jtobin.io>" [ultimate]

If one is operating at an appropriate paranoia level, he can thus personally verify that, first, a quorum of trusted contributors has attested to a certain set of artifacts, and, second, that he can produce, from source and his own tool set, those very same artifacts, before running the software.

Interestingly, the binaries themselves are the least tricky part to verify, at least in principle. Modern Go compilers are able to produce completely reproducible binaries for any target, independent of the build host architecture. That is: given appropriate environment variables and linker flags (namely CGO_ENABLED="0" and -trimpath, probably -s and -w too), a modern Go compiler will produce, bit-for-bit, the same output, given consistent source, other flags, and so on.

You have to do a bit more legwork for the tarballs, etc. Putting together reproducible tarballs involves a salad of flags, e.g.:

  export TZ=UTC

  find "${dir}" -print0 | LC_ALL=C sort -r -z | tar \
    "--mtime=${BUILD_DATE}" --no-recursion --null --mode=u+rw,go+r-w,a+X \
    --owner=0 --group=0 --numeric-owner -c -T - | gzip -9n > "${dir}.tar.gz"

You have to be careful to lock down a bunch of stuff, like timestamps and user/group permissions. What’s more, you have to ensure you’re using the right variant of tar (BSD vs GNU), and ditto for at least gzip and zip.
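The same normalization can be sketched in Python with the standard tarfile and gzip modules (a model of the idea, not the project’s actual build code): fix the entry order, zero timestamps and ownership, normalize modes, and zero the gzip header timestamp, the equivalent of ‘gzip -n’:

```python
import gzip
import io
import tarfile

def deterministic_targz(files):
    """Build a byte-for-byte reproducible .tar.gz from a name -> bytes map."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for name in sorted(files):          # fixed entry order
            info = tarfile.TarInfo(name)
            data = files[name]
            info.size = len(data)
            info.mtime = 0                  # no file timestamps
            info.uid = info.gid = 0         # numeric owner 0:0
            info.uname = info.gname = ""
            info.mode = 0o644               # normalized permissions
            tar.addfile(info, io.BytesIO(data))
    # the gzip header embeds its own timestamp; zero it (cf. gzip -n)
    return gzip.compress(buf.getvalue(), mtime=0)

# identical inputs yield identical bytes, regardless of insertion order
one = deterministic_targz({"a.txt": b"foo", "b.txt": b"bar"})
two = deterministic_targz({"b.txt": b"bar", "a.txt": b"foo"})
assert one == two
```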

The problem of producing a hermetic build environment is solved by Nix, of course (or Docker, more on which another time). But some interesting problems can actually pop up when using Nix to build the kinds of reproducible Go binaries I mentioned above.

Let’s start with the tarballs and zip files. Timestamps and permissions are normalized within the Nix store, so, easy: we merely need to put the following in the right place in a Nix flake:

  { nativeBuildInputs = with pkgs; [ coreutils gnutar gzip zip ]; }

Add a minimal set of reproducibility flags to the tar call, and then anyone building the output with the same Nix flake metadata should be able to produce consistent gzipped tarballs containing consistent compiled binaries.

Or so it would seem. When I first tried this, I was able to produce reproducible tarballs for both the project source and its vendored dependencies. The source is declared as follows:

  package = "taproot-assets";
  tag = "v0.7.0";

  # source info
  org = "lightninglabs";
  repo = package;
  tap_src = pkgs.fetchFromGitHub {
    owner = org;
    repo = repo;
    rev = tag;
    hash = "sha256-z4lNeVmy0AEDM8ivdfXfXqi/V1LIDHV2KtS5/19+Jlk=";
  };

and given a ‘tap’ derivation built by Nixpkgs’s buildGoModule, the following, in the buildPhase, can later get us the tarballs of interest:

  tar_gz $out/vendor.tar.gz ${tap.goModules} .
  tar_gz $out/taproot-assets-source-${tag}.tar.gz ${tap_src} .

But strangely, to me, I couldn’t generate reproducible binaries. That is: even though I pinned to a specific Go toolchain, I would produce binaries for different targets that differed depending on the architecture of the host. This was strange, since I know that the Go compiler shouldn’t have any issue doing this, and I was clearly using the right settings:

  # tapd builder
  build_tap = {
        goos ? null
      , goarch ? null
      , goarm ? null
      , tags ? tags_base
      , ldflags ? ldflags_base
      , gcflags ? [ ]
      }: pkgs.buildGoModule {
    pname = package;
    version = tag;
    src = tap_src;
    vendorHash = "sha256-p75eZoM6tSayrxcKTCoXR8Jlc3y2UyzfPCtRJmDy9ew=";
    subPackages = [ "cmd/tapd" "cmd/tapcli" ];

    inherit tags ldflags gcflags;

    preBuild = with pkgs; ''
      export CGO_ENABLED="0"
      export GOAMD64="v1"
      ${lib.optionalString (goos != null) "export GOOS=${goos}"}
      ${lib.optionalString (goarch != null) "export GOARCH=${goarch}"}
      ${lib.optionalString (goarm != null) "export GOARM=${goarm}"}
    '';

    doCheck = false;
  };

(‘-trimpath’, another requirement for reproducible builds, is added by default when using Nixpkgs’s buildGoModule, as is ‘-buildid=’, ensuring no rogue metadata creeps in.)

I dug into this by inspecting two binaries produced for the same (arbitrarily chosen) target architecture, one built on an amd64 Linux machine and the other on an aarch64 Darwin one. I dumped each one’s ELF section headers via readelf -S $MY_BINARY and diffed the dumps:

$ diff sections-linux.txt sections-darwin.txt
1c1
< There are 14 section headers, starting at offset 0xd4:
---
> There are 14 section headers, starting at offset 0x386ed30:
18c18
<   [13] .shstrtab         STRTAB          00000000 3870000 000094 00      0   0  1
---
>   [13] .shstrtab         STRTAB          00000000 386ecb0 00007d 00      0   0  1

The .shstrtab (“section name string table”) sections, which are pure ELF metadata, differed, so I produced hexdumps of those, this time via readelf -x .shstrtab $MY_BINARY, and diffed them:

$ diff shstrtab-linux.txt shstrtab-darwin.txt
3,12c3,10
<   0x00000000 002e7465 7874002e 6e6f7074 72646174 ..text..noptrdat
<   0x00000010 61002e64 61746100 2e627373 002e6e6f a..data..bss..no
<   0x00000020 70747262 7373002e 676f2e66 757a7a63 ptrbss..go.fuzzc
<   0x00000030 6e747273 002e676f 2e627569 6c64696e ntrs..go.buildin
<   0x00000040 666f002e 676f2e66 69707369 6e666f00 fo..go.fipsinfo.
<   0x00000050 2e656c66 64617461 002e726f 64617461 .elfdata..rodata
<   0x00000060 002e7479 70656c69 6e6b002e 69746162 ..typelink..itab
<   0x00000070 6c696e6b 002e676f 73796d74 6162002e link..gosymtab..
<   0x00000080 676f7063 6c6e7461 62002e73 68737472 gopclntab..shstr
<   0x00000090 74616200                            tab.
---
>   0x00000000 002e7465 7874002e 6e6f7074 72627373 ..text..noptrbss
>   0x00000010 002e6273 73002e67 6f2e6669 7073696e ..bss..go.fipsin
>   0x00000020 666f002e 676f2e62 75696c64 696e666f fo..go.buildinfo
>   0x00000030 002e7479 70656c69 6e6b002e 69746162 ..typelink..itab
>   0x00000040 6c696e6b 002e7368 73747274 6162002e link..shstrtab..
>   0x00000050 676f7063 6c6e7461 62002e67 6f73796d gopclntab..gosym
>   0x00000060 74616200 2e6e6f70 74726461 7461002e tab..noptrdata..
>   0x00000070 726f6461 7461002e 64617461 00       rodata..data.

Not much to go on there, but I eventually traced this to the fact that mkDerivation in Nixpkgs’s stdenv runs a “fixupPhase” by default, which performs its own stripping of ELF metadata, separate from what the Go linker does when passed the ‘-s’ and ‘-w’ flags (which omit the symbol table and DWARF debugging information, respectively). The stripping done by the fixupPhase seems to vary depending on the package set used, e.g. between aarch64-darwin and x86_64-linux.

I disabled the fixupPhase’s stripping by adding

  dontStrip = true;

to my builder function, and tried again. This time the resulting artifacts’ file sizes normalized, but they still produced different SHA256 digests. I produced another hexdump of each, this time using good old xxd(1), and diffed them again:

2028124,2028126c2028124,2028126
< 01ef25b0: 7265 2f32 7334 6870 7137 3368 6e34 396a  re/2s4hpq73hn49j
< 01ef25c0: 6438 3478 3937 366d 3361 6370 3372 6433  d84x976m3acp3rd3
< 01ef25d0: 6b31 782d 747a 6461 7461 2d32 3032 3562  k1x-tzdata-2025b
---
> 01ef25b0: 7265 2f78 686e 6231 6434 7767 6a33 3878  re/xhnb1d4wgj38x
> 01ef25c0: 6833 3833 7973 3162 7969 676e 3079 6861  h383ys1byign0yha
> 01ef25d0: 6371 682d 747a 6461 7461 2d32 3032 3562  cqh-tzdata-2025b
2028308,2028310c2028308,2028310
< 01ef3130: 2069 740a 2f6e 6978 2f73 746f 7265 2f34   it./nix/store/4
< 01ef3140: 6b77 3938 7138 3470 6276 326e 7771 3361  kw98q84pbv2nwq3a
< 01ef3150: 7931 3261 6176 6661 397a 6934 6439 7a2d  y12aavfa9zi4d9z-
---
> 01ef3130: 2069 740a 2f6e 6978 2f73 746f 7265 2f7a   it./nix/store/z
> 01ef3140: 6a62 3467 7763 7971 7670 3862 3236 716b  jb4gwcyqvp8b26qk
> 01ef3150: 7832 7163 3033 3776 6a6a 3631 7838 302d  x2qc037vjj61x80-
2028473,2028475c2028473,2028475
< 01ef3b80: 7374 6f72 652f 7279 6d64 7764 386b 3461  store/rymdwd8k4a
< 01ef3b90: 6268 3430 6233 6d77 3470 6636 3967 3739  bh40b3mw4pf69g79
< 01ef3ba0: 336d 6471 6269 2d69 616e 612d 6574 632d  3mdqbi-iana-etc-
---
> 01ef3b80: 7374 6f72 652f 7038 6369 3239 7637 6b37  store/p8ci29v7k7
> 01ef3b90: 6777 7773 786a 3138 3170 3035 6a63 6362  gwwsxj181p05jccb
> 01ef3ba0: 6e71 6236 3439 2d69 616e 612d 6574 632d  nqb649-iana-etc-
2028626,2028628c2028626,2028628
< 01ef4510: 6f72 652f 7279 6d64 7764 386b 3461 6268  ore/rymdwd8k4abh
< 01ef4520: 3430 6233 6d77 3470 6636 3967 3739 336d  40b3mw4pf69g793m
< 01ef4530: 6471 6269 2d69 616e 612d 6574 632d 3230  dqbi-iana-etc-20
---
> 01ef4510: 6f72 652f 7038 6369 3239 7637 6b37 6777  ore/p8ci29v7k7gw
> 01ef4520: 7773 786a 3138 3170 3035 6a63 6362 6e71  wsxj181p05jccbnq
> 01ef4530: 6236 3439 2d69 616e 612d 6574 632d 3230  b649-iana-etc-20

The difference here was immediately discernible: different binaries contained different hard-coded paths to the Nix store. This was at first immensely confusing to me, but it turns out that the default Go toolchains in Nixpkgs are not “official” Go toolchains, but patched ones. You can see this in the Go derivation for e.g. the nixos-25.05 package set, which contains the following ‘patches’ attribute:

  patches = [
    (replaceVars ./iana-etc-1.25.patch {
      iana = iana-etc;
    })
    # Patch the mimetype database location which is missing on NixOS.
    # but also allow static binaries built with NixOS to run outside nix
    (replaceVars ./mailcap-1.17.patch {
      inherit mailcap;
    })
    # prepend the nix path to the zoneinfo files but also leave the
    # original value for static binaries that run outside a nix server
    (replaceVars ./tzdata-1.19.patch {
      inherit tzdata;
    })
    ./remove-tools-1.11.patch
    ./go_no_vendor_checks-1.23.patch
  ];

The idea here is that the Go compiler bakes a number of fallback runtime paths into the binaries it produces, for things like ‘tzdata’ (e.g. ‘/usr/share/zoneinfo’). On NixOS these paths don’t exist, so the ‘tzdata-1.19.patch’, for example, hard-codes a path to the tzdata package in the Nix store, a path that depends on the package set used. Since the package sets differ between platforms, these paths will also differ, and so will any binaries produced by the Go compilers they contain.
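A quick way to check a binary for this class of problem is to grep its printable strings for store paths; any match ties the artifact to the exact package set that built it. A hypothetical helper (the function name is mine, not from the flake):

```shell
# Hypothetical helper: list the Nix store paths baked into a binary.
# A store hash is 32 base-32 characters; any match means the output
# depends on the package set that produced the toolchain.
baked_store_paths() {
  strings "$1" \
    | grep -oE '/nix/store/[0-9a-z]{32}-[A-Za-z0-9._+-]+' \
    | sort -u
}
```

Run against the binaries above, this would have surfaced the divergent tzdata and iana-etc paths directly, without any hexdump spelunking.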

The solution to this is easy in Nix, provided you know you actually have to do it. Just build with an unpatched Go toolchain:

  # nixpkgs applies various patches to go toolchains; we need to remove
  # them to produce host-architecture-independent binaries.
  go-unpatched = pkgs.go.overrideAttrs (_: {
    patches = [ ];
  });

  buildGoModule = pkgs.buildGoModule.override {
    go = go-unpatched;
  };

  # tapd builder
  build_tap = {
        goos ? null
      , goarch ? null
      , ..
      , gcflags ? [ ]
      }: buildGoModule {
    ..
  };

This works as expected. The produced manifests look like this:

$ head -3 manifest-v0.7.0-darwin.txt | cut -f1 -d' '
ebebc38f71c7a902341bf4247d8ee2752d0d233deec2034953dab4b44ecb622a
737eb475c5435beedcdb49ae7adf0716c9980523e89682fea4622b59c5d132df
8570b3c659fc2df9cff365fc06bd95459d903b6a3e7c8fd2e5e4efc1a77f7adf

and diffing the manifests produced on both macOS and Linux hosts illustrates that the artifacts match, bit-for-bit, precisely:

$ diff manifest-v0.7.0-darwin.txt manifest-v0.7.0-linux.txt
<empty>
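For completeness, a manifest like the one above can be produced with something along these lines. This is a sketch, not the flake’s actual code; the helper name and artifact filenames are hypothetical:

```shell
# Hypothetical sketch: write a SHA256 manifest for a directory of
# release artifacts. sha256sum's "DIGEST  FILENAME" lines are sorted
# by filename so two hosts emit byte-identical manifests.
make_manifest() {
  local dir="$1" out="$2"
  ( cd "$dir" && sha256sum * | LC_ALL=C sort -k2 ) > "$out"
}
```

Since the manifest itself is deterministic given deterministic artifacts, comparing manifests (as above) suffices to compare entire releases.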

Here’s a gist containing a flake for producing various tapd builds, as well as release artifacts, if you’re curious.