Runtime Analysis (ppad-fixed)

As noted in my last post, I recently released ppad-fixed, a high-performance fixed-width word library meant for use in cryptographic applications.

At present I use this library to drive elliptic curve arithmetic in ppad-secp256k1, as well as stuff that depends on that, e.g. ppad-bip32. It’s safe to say I have few users at the moment, but, as they say, “the way you do anything is the way you do everything,” so in this post I want to rigorously analyze some of its runtime and execution properties, to increase assurance both in general, and especially to myself, that it’s really fit for purpose.

My goal here is to illustrate constant-resource execution of this library on aarch64 when compiled with optimizations via GHC 9.8.1’s LLVM backend, first by presentation of the code in general, then some analysis of the STG IR, and then analysis of the produced assembly. I’ll also look at overall sanity checks, i.e. Criterion benchmarks and allocation measurement via ‘weigh’.

Data Representation

To start, ppad-fixed rolls its own four-limb word type (i.e. 256-bit, on a 64-bit system), consisting of a heap-allocated data constructor wrapping a strict, unboxed tuple:

data Wider = Wider !(# Limb, Limb, Limb, Limb #)

A ‘Limb’ here is just a newtype over an unboxed word, i.e. ‘Word#’, which is register-allocated. Such a ‘Wider’ allocates precisely 40 bytes on a 64-bit system: 8 bytes for each limb, and another 8 bytes for the ‘Wider’ data constructor itself. Library functions that return a ‘Wider’ value should allocate 40 bytes exactly, no matter what intermediate calculations they perform.

Operations on individual ‘Limbs’ just use available primitives from GHC.Exts, e.g. the following carrying addition function, and otherwise are written from scratch:

add_c#
  :: Limb             -- ^ augend
  -> Limb             -- ^ addend
  -> Limb             -- ^ carry
  -> (# Limb, Limb #) -- ^ (# sum, new carry #)
add_c# (Limb a) (Limb b) (Limb c) =
  let !(# c0, s0 #) = Exts.plusWord2# a b
      !(# c1,  s #) = Exts.plusWord2# s0 c
  in  (# Limb s, Limb (Exts.or# c0 c1) #)
{-# INLINE add_c# #-}
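
To make the carry handling concrete, here's a boxed model of the same logic using Word64 in place of Word# (purely illustrative: 'addC' is a hypothetical name, the comparisons stand in for the hardware carry that plusWord2# exposes, and a boxed model like this makes no constant-time claims of its own):

```haskell
import Data.Bits ((.|.))
import Data.Word (Word64)

-- | Carrying addition on a single limb: (sum, new carry), carry in {0, 1}.
--   The 'if' comparisons model the carry out of plusWord2#; the real
--   primitive gets this from the hardware carry flag.
addC :: Word64 -> Word64 -> Word64 -> (Word64, Word64)
addC a b c =
  let s0 = a + b
      c0 = if s0 < a then 1 else 0   -- overflow of a + b
      s  = s0 + c
      c1 = if s < s0 then 1 else 0   -- overflow of s0 + c
  in  (s, c0 .|. c1)
```

Note that at most one of the two partial additions can overflow, so or-ing the carries together yields a carry that is again 0 or 1.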

For ‘Wider’ and friends, the library provides two APIs: each function has an unboxed variant that works with unboxed tuples, avoiding heap allocation entirely, and a boxed variant that allocates data constructors only at the end of the computation, simply to box up the result. Internal computations are always performed exclusively in terms of unboxed tuples.

For example, building on the above carrying addition function on single limbs, we have the following overflowing addition function on four-limb tuples:

add_o#
  :: (# Limb, Limb, Limb, Limb #)             -- ^ augend
  -> (# Limb, Limb, Limb, Limb #)             -- ^ addend
  -> (# (# Limb, Limb, Limb, Limb #), Limb #) -- ^ (# sum, carry bit #)
add_o# (# a0, a1, a2, a3 #) (# b0, b1, b2, b3 #) =
  let !(# s0, c0 #) = L.add_o# a0 b0
      !(# s1, c1 #) = L.add_c# a1 b1 c0
      !(# s2, c2 #) = L.add_c# a2 b2 c1
      !(# s3, c3 #) = L.add_c# a3 b3 c2
  in  (# (# s0, s1, s2, s3 #), c3 #)
{-# INLINE add_o# #-}

and then the boxed variant, which just uses the above function internally:

add_o
  :: Wider
  -> Wider
  -> (Wider, Word)
add_o (Wider a) (Wider b) =
  let !(# s, Limb c #) = add_o# a b
  in  (Wider s, W# c)

This is actually the same pattern that one gets when using GHC’s UNPACK pragma, and you can typically see this kind of separation between boxed and unboxed variants generated in compiled Core. For this work I wanted to write in terms of unboxed primitives specifically, just to be sure I knew exactly what was going on (I mentioned that one may want to take this approach in Fast Haskell Redux, as well).
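
The same structure can be modeled with boxed Word64 limbs (again purely illustrative; ‘Wider4’, ‘addC’, and ‘addO’ are hypothetical names, with limbs ordered least-significant first):

```haskell
import Data.Bits ((.|.))
import Data.Word (Word64)

-- Four-limb word, least-significant limb first.
type Wider4 = (Word64, Word64, Word64, Word64)

-- | Carrying addition on a single limb (see the earlier add_c# discussion).
addC :: Word64 -> Word64 -> Word64 -> (Word64, Word64)
addC a b c =
  let s0 = a + b
      c0 = if s0 < a then 1 else 0
      s  = s0 + c
      c1 = if s < s0 then 1 else 0
  in  (s, c0 .|. c1)

-- | Overflowing addition: ripple the carry through all four limbs.
addO :: Wider4 -> Wider4 -> (Wider4, Word64)
addO (a0, a1, a2, a3) (b0, b1, b2, b3) =
  let (s0, c0) = addC a0 b0 0
      (s1, c1) = addC a1 b1 c0
      (s2, c2) = addC a2 b2 c1
      (s3, c3) = addC a3 b3 c2
  in  ((s0, s1, s2, s3), c3)
```

The fixed four-step carry ripple mirrors add_o#: every input takes the same path through the same four limb additions.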

The crux here in any case is that, unlike GHC’s native unbounded-size ‘Integer’ or ‘Natural’, ‘Wider’ values are always represented in the exact same manner, whether they have value 2 or 2 ^ 256 - 1, just like ‘Word’ or ‘Int’ are. The same is true for the ‘Montgomery’ types that are used to represent words in Montgomery form:

data Montgomery = Montgomery !(# Limb, Limb, Limb, Limb #)

(N.b., Montgomery form is a representation typically used to avoid expensive modular reductions when working with multiplication over a prime field. The primes used to define the elliptic curve secp256k1 apparently have some special structure that allows fast reductions anyway, obviating the need for something like Montgomery form, but it works well with them regardless.)

ppad-fixed defines two of these types at present, one for each of the secp256k1 primes. The unboxed/boxed API split present for ‘Wider’ words also exists for the ‘Montgomery’ types.

In any case: the data representation ensures that there should be absolutely no input-dependent variability in allocation when producing values. Using the “unboxed API” internally means we should allocate solely when boxing results, and nowhere else.

Algorithm-Level Semantics

The algorithm-level invariant upheld in constant-time code is that it is “straight-line” code that avoids any branching or indexing on inputs. In ppad-fixed, following crypto-bigint, all functions are constant-time by default unless marked otherwise (and again, following crypto-bigint’s convention, I suffix all variable-time functions with ‘vartime’).

A good example of this kind of algorithmic invariant is seen in the implementation of modular inverse via Fermat inversion, in which there’s an unrolled loop of Montgomery squares and multiplies that proceeds according to a fixed schedule:

-- generated by etc/generate_inv.sh
inv#
  :: (# Limb, Limb, Limb, Limb #)
  -> (# Limb, Limb, Limb, Limb #)
inv# a =
  let -- montgomery 'one'
      !t0 = (# Limb 0x1000003D1##, Limb 0##, Limb 0##, Limb 0## #)
      !t1 = sqr# t0
      !t2 = mul# a t1
      !t3 = sqr# t2
      !t4 = mul# a t3
      --
      -- many intermediate steps omitted..
      --
      !t502 = mul# a t501
      !t503 = sqr# t502
      !t504 = sqr# t503
      !t505 = mul# a t504
      !r = t505
  in  r
{-# INLINE inv# #-}

As you can see from the comment, I generated this function via a bash script that loops through the binary expansion of the (constant, public, known) exponent used for Fermat inversion, printing ‘mul#’ and ‘sqr#’ bindings as appropriate. Since the schedule of multiplies and squares is fixed, then, by construction, the whole sequence runs in constant time, assuming that ‘sqr#’ and ‘mul#’ also do.
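
The generator's logic can be sketched in Haskell (the real script is bash; ‘schedule’ and ‘p’ here are illustrative names, assuming the standard left-to-right square-and-multiply recoding):

```haskell
import Data.Bits (testBit)

-- The secp256k1 field prime; Fermat inversion exponentiates by p - 2.
p :: Integer
p = 2 ^ (256 :: Int) - 2 ^ (32 :: Int) - 977

-- | Left-to-right square-and-multiply schedule for a fixed public
--   exponent: starting from one, every bit costs a square ('S'), plus
--   a multiply ('M') when the bit is set.
schedule :: Integer -> Int -> String
schedule e nbits =
  concat [ 'S' : [ 'M' | testBit e i ] | i <- [nbits - 1, nbits - 2 .. 0] ]
```

For the inversion exponent p - 2 this yields 256 squares and 249 multiplies, 505 operations in total, consistent with the final binding t505 in the generated function above.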

It’s also illustrative to look at a generic Montgomery exponentiation function:

exp#
  :: (# Limb, Limb, Limb, Limb #)
  -> (# Limb, Limb, Limb, Limb #)
  -> (# Limb, Limb, Limb, Limb #)
exp# b e =
  let !o = (# Limb 0x1000003D1##, Limb 0##, Limb 0##, Limb 0## #)
      loop !r !m !ex n = case n of
        0 -> r
        _ ->
          let !(# ne, bit #) = Wider.shr1_c# ex
              !candidate = mul# r m
              !nr = select# r candidate bit
              !nm = sqr# m
          in  loop nr nm ne (n - 1)
  in  loop o b e (256 :: Word)
{-# INLINE exp# #-}

This one has no fixed schedule, and so we don’t know at compile time how many ‘mul#’ and ‘sqr#’ calls we’ll need to perform (we introduce a ‘mul#’ call when, looping through the bits of the exponent, the least-significant bit is set). Here, then, instead of computing one or the other branch depending on the LSB, which would introduce argument-dependent variability to the function’s runtime, we compute both branches, and perform a constant-time selection using ‘select#’ to continue the computation with the appropriate value.

(N.b., I use a ‘Data.Choice’ module to implement constant-time primitives like ‘select#’, again mostly following the lead of crypto-bigint. It might be worth splitting this out into its own library at some point.)
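
To make the selection concrete, here's a boxed sketch of the usual branchless-select technique on a single limb (Word64 standing in for Word#; ‘select’ is an illustrative name, not necessarily the exact implementation of ‘select#’):

```haskell
import Data.Bits (complement, xor, (.&.))
import Data.Word (Word64)

-- | Branchless select on a single limb: returns b when bit is 1 and a
--   when bit is 0, with no data-dependent branch.  'bit' must be
--   exactly 0 or 1.
select :: Word64 -> Word64 -> Word64 -> Word64
select a b bit =
  let mask = complement bit + 1           -- 0 -> 0x00..00, 1 -> 0xFF..FF
  in  a `xor` ((a `xor` b) .&. mask)
```

The mask is all-ones exactly when the bit is set, so the xor/and/xor dance swaps in the alternative value without ever branching; this is the kind of pattern that surfaces as ‘not#’, ‘plusWord#’, ‘xor#’, and ‘and#’ operations in the STG.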

So, you get the point: looking at this code at a high, algorithmic level, it’s written explicitly to run in constant time. There is no input-dependent branching or indexing to be found, except in functions that are clearly marked as running in variable time.

Evaluation-Level Semantics

So, the constant-time semantics check out at the algorithm level, but the code needs to pass through GHC, which is well-known to optimize aggressively. It’s thus worth checking out an intermediate representation to verify that GHC doesn’t plan on doing anything undesirable to the code during compilation.

The usual thing that one wants to examine when doing low-level optimisation is Core, which is basically stripped-down Haskell with some optimizations applied, but for execution semantics it’s more useful to analyse an STG dump (where STG, as I’m sure everyone knows, means “Spineless Tagless G-machine,” GHC’s IR that makes operational semantics explicit).

As far as we’re concerned, raw STG differs from Core primarily in that 1) it is fully normalized to A-normal form, 2) thunks are explicit (in that every let binding corresponds directly to a heap allocation), and 3) case expressions represent both sequencing, i.e. forcing of evaluation, and branching.

An STG dump can be procured by building with the ‘-ddump-stg-final’ flag. Take a look at the STG corresponding to the overflowing addition function on ‘Wider’ values shown above, for example. First, the unboxed variant:

add_o#
  :: (# Limb, Limb, Limb, Limb #)
     -> (# Limb, Limb, Limb, Limb #)
     -> (# (# Limb, Limb, Limb, Limb #), Limb #) =
    \r [us us us us us us us us]
        case plusWord2# [us us] of {
        (#,#) c s ->
        case plusWord2# [us us] of {
        (#,#) c0 s0 ->
        case plusWord2# [s0 c] of {
        (#,#) c1 s1 ->
        case plusWord2# [us us] of {
        (#,#) c2 s2 ->
        case or# [c0 c1] of sat {
        __DEFAULT ->
        case plusWord2# [s2 sat] of {
        (#,#) c3 s3 ->
        case plusWord2# [us us] of {
        (#,#) c4 s4 ->
        case or# [c2 c3] of sat {
        __DEFAULT ->
        case plusWord2# [s4 sat] of {
        (#,#) c5 s5 ->
        case or# [c4 c5] of sat {
        __DEFAULT -> (#,,,,#) [s s1 s3 s5 sat];
        [..]
        };

Notice it contains no let bindings, and thus doesn’t allocate (recall raw limbs are stored in registers), and also that each case expression has only a single branch, such that each represents straightforward evaluation sequencing. This is exactly what one wants to see.

For the boxed wrapper, in which the above unboxed function has already been inlined:

add_o :: Wider -> Wider -> (Wider, Word) =
    \r [ds ds1]
        case ds of {
        Wider us us us us ->
        case ds1 of {
        Wider us us us us ->
        case plusWord2# [us us] of {
        (#,#) c s ->
        case plusWord2# [us us] of {
        (#,#) c0 s0 ->
        case plusWord2# [s0 c] of {
        (#,#) c1 s1 ->
        case plusWord2# [us us] of {
        (#,#) c2 s2 ->
        case or# [c0 c1] of sat {
        __DEFAULT ->
        case plusWord2# [s2 sat] of {
        (#,#) c3 s3 ->
        case plusWord2# [us us] of {
        (#,#) c4 s4 ->
        case or# [c2 c3] of sat {
        __DEFAULT ->
        case plusWord2# [s4 sat] of {
        (#,#) c5 s5 ->
        case or# [c4 c5] of sat {
        __DEFAULT ->
        let { sat :: Word = W#! [sat]; } in
        let { sat :: Wider = Wider! [s s1 s3 s5]; } in  (,) [sat sat];
        [..]
        };

This function returns a boxed value, so it allocates, as can be seen by the let bindings at the end. Recall that this function returns a value of type (Wider, Word), so it needs to allocate precisely one ‘Wider’ data constructor, one ‘Word’ data constructor, and one tuple constructor. This is all constant with respect to the function’s inputs, and there is no additional allocation to be found anywhere.

It’s also worth taking a look at the STG for the loop in the exponentiation function shown previously:

$wloop
  :: (# Limb, Limb, Limb, Limb #)
     -> (# Limb, Limb, Limb, Limb #)
     -> (# Limb, Limb, Limb, Limb #)
     -> Word#
     -> (# Limb, Limb, Limb, Limb #) =
    \r [us us us us us us us us us us us us ww]
        case ww<TagProper> of wild2 {
          __DEFAULT ->
              case sqr# us us us us of nm {
              (#,,,#) _ _ _ _ ->
              case mul# us us us us us us us us of {
              (#,,,#) ipv4 ipv5 ipv6 ipv7 ->
              case uncheckedShiftL# [us 63#] of sat {
              __DEFAULT ->
              case uncheckedShiftRL# [sat 63#] of sat {
              __DEFAULT ->
              case not# [sat] of sat {
              __DEFAULT ->
              case plusWord# [sat 1##] of ipv8 {
              __DEFAULT ->
              case minusWord# [wild2 1##] of sat {
              __DEFAULT ->
              case xor# [us ipv7] of sat {
              __DEFAULT ->
              case and# [ipv8 sat] of sat {
              __DEFAULT ->
              case xor# [us sat] of sat {
              __DEFAULT ->
              [..]
              case or# [sat sat] of sat {
              __DEFAULT ->
              exp_$s$wloop1
                  sat sat sat sat ipv ipv1 ipv2 ipv3 sat sat sat sat sat;
              };
              [..]
              };
          0## -> (#,,,#) [us us us us];
        };

This contains no let bindings at all, but you can see two branches in the first case expression. These correspond to whether the loop continues or stops (it stops when the looping index hits 0), which is deterministic and constant with respect to inputs, so this branch is unproblematic. Note in the meantime that the constant-time select operation used in the loop produces no such branching, but instead is captured by the salad of ‘xor#’ and ‘and#’ expressions in the body of the function (many omitted above), as intended.

For an example of something that does not run in constant time, take a look at the STG for the aptly-named eq_vartime function, which compares for equality in variable time:

eq_vartime :: Montgomery -> Montgomery -> Bool =
    \r [ds ds1]
        case ds of {
        Montgomery us us us us ->
        case ds1 of {
        Montgomery us us us us ->
        case eqWord# [us us] of {
          __DEFAULT -> False [];
          1# ->
              case eqWord# [us us] of {
                __DEFAULT -> False [];
                1# ->
                    case eqWord# [us us] of {
                      __DEFAULT -> False [];
                      1# -> eq_vartime# us us;
                    };
              };
        };
        };
        };

Here you can see a number of branches, each of which depends on the inputs to the function. This is what one would not want to see in code that’s supposed to run in constant time.
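
For contrast, a constant-time equality check avoids this early exit by accumulating every limb difference before producing its answer. Here's a boxed sketch of that technique (illustrative only; the library's own constant-time comparisons are built on its ‘Data.Choice’ primitives):

```haskell
import Data.Bits (xor, (.|.))
import Data.Word (Word64)

-- | Constant-time-style equality on four limbs: fold the differences of
--   all limbs together before inspecting anything, so the work performed
--   does not depend on which limb differs first.
eqCT :: (Word64, Word64, Word64, Word64)
     -> (Word64, Word64, Word64, Word64)
     -> Bool
eqCT (a0, a1, a2, a3) (b0, b1, b2, b3) =
  let d = (a0 `xor` b0) .|. (a1 `xor` b1) .|. (a2 `xor` b2) .|. (a3 `xor` b3)
  in  d == 0
```

The accumulated word d is zero exactly when all four limbs match, and every limb is processed on every call, equal or not.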

Assembly Analysis

All is well thus far, in that the data representation is fixed-size, the algorithms used have constant-time and constant-allocation semantics, and GHC’s evaluation model demonstrably doesn’t want to introduce any unexpected variability to them. But there’s still the matter of the assembly produced – one wants to be sure that the machine instructions the code ultimately compiles down to are free of any unexpected input-dependent variation.

By my observation, GHC’s LLVM backend produces dramatically faster assembly than does the native code generator, so I’ve adopted it for all of my ppad libraries. To inspect the assembly, we can dump the LLVM IR, and then run it through ‘llc’ with optimizations, examining the result.

Assuming we’re compiling with the LLVM backend, an LLVM IR dump can be obtained via the ‘-ddump-llvm’ flag, and, after removing some headers GHC inserts, we can run it through ‘llc -O3 -mcpu=native -filetype=asm’ to produce assembly.

Let’s first take a look at the (aarch64) assembly for the overflowing addition function. I’ll paste the entire thing, for completeness:

; %bb.0:                                ; %naoh
  stp x25, x26, [sp, #40]
  stp x23, x24, [sp, #24]
  stp x20, x22, [sp, #8]
  ldp x8, x9, [x20]
  stp x9, x8, [sp, #176]
  ldr x10, [x20, #16]
  str x10, [sp, #168]
  adds  x8, x24, x8
  cset  w10, hs
  adds  x11, x23, x27
  cset  w12, hs
  stp x11, x12, [sp, #152]
  stp x8, x10, [sp, #136]
  adcs  x8, x8, xzr
  cset  w11, hs
  stp x8, x11, [sp, #120]
  adds  x8, x25, x9
  cset  w9, hs
  stp x8, x9, [sp, #104]
  orr x10, x10, x11
  adds  x24, x8, x10
  cset  w8, hs
  stp x24, x8, [sp, #88]
  ldr x10, [sp, #48]
  ldr x11, [sp, #168]
  adds  x10, x10, x11
  cset  w11, hs
  stp x10, x11, [sp, #72]
  orr x8, x9, x8
  adds  x25, x10, x8
  cset  w8, hs
  stp x25, x8, [sp, #56]
  orr x26, x11, x8
  stp x25, x26, [sp, #40]
  ldr x23, [sp, #120]
  stp x23, x24, [sp, #24]
  ldr x22, [sp, #152]
  ldr x8, [sp, #8]
  add x20, x8, #24
  stp x20, x22, [sp, #8]
  ldr x8, [x8, #24]
  blr x8
  ret

The ‘str’ and ‘ldr’ instructions mean store register and load register, and ‘stp’ and ‘ldp’ mean store pair of registers and load pair of registers, respectively; they all indicate stuff being shuffled around between registers and the stack. These all run in constant time; note that Jon’s Arm Reference contains the following blurb for each:

This instruction is a data-independent-time instruction

The branch with link to register (blr) and return from subroutine (ret) instructions at the end of the computation are “whole program” control-flow instructions that branch unconditionally, and not on secret data: BLR’s target register contains a GHC continuation pointer (i.e., “what we’re doing next”), and RET simply hands control back to the caller.

So, that’s all administrative stuff. The actual overflowing addition algorithm is better captured by the following core sequence of instructions:

  adds  x8, x24, x8
  cset  w10, hs
  adds  x11, x23, x27
  cset  w12, hs
  adcs  x8, x8, xzr
  cset  w11, hs
  adds  x8, x25, x9
  cset  w9, hs
  orr x10, x10, x11
  adds  x24, x8, x10
  cset  w8, hs
  adds  x10, x10, x11
  cset  w11, hs
  orr x8, x9, x8
  adds  x25, x10, x8
  cset  w8, hs
  orr x26, x11, x8
  add x20, x8, #24

The add (add), add and set flags (adds), add with carry and set flags (adcs), and bitwise or (orr) instructions are all “data-independent-time,” and ‘cset’ is actually an alias for conditional select increment (csinc), which is a constant-time branchless select. This is exactly what we want to see: every instruction in this program runs in time constant with respect to its inputs.

The assembly for the variable-time equality function makes for an excellent comparison. Here’s the meat of it, with the “administrative” instructions omitted:

; %bb.0:
  cmp x8, x9
  b.ne  LBB31_4
; %bb.1:
  cmp x8, x9
  b.ne  LBB31_4
; %bb.2:
  cmp x8, x9
  b.ne  LBB31_4
; %bb.3:
  add x20, x8, #40

The core variable-time instruction here is branch conditionally (b.cond), where the condition of interest is “not equal” (thus ‘b.ne’). This returns from the subroutine early if a limb inequality check (performed by ‘cmp’) succeeds. If the subroutine can be timed reliably, especially with chosen inputs, then the timing differential will expose which limb differs first.

Benchmarks and Allocation Measurement

With the code, IR, and assembly all appearing to be in order, the benchmark suites serve mostly as sanity checks. They’re immensely useful, though. I’m reminded a bit of that quote attributed to Knuth, which I’ll paraphrase as “beware of bugs – I’ve only proved the program correct, not tested it.”

ppad-fixed has a Criterion suite that benchmarks functions of interest on varying-sized inputs. We expect variation in wall-clock time to be readily attributable to noise; here’s an example excerpt of the output, for a Montgomery modular exponentiation function:

benchmarking exp/curve:  M(2) ^ 2
time                 5.217 μs   (5.206 μs .. 5.224 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 5.208 μs   (5.198 μs .. 5.215 μs)
std dev              27.24 ns   (20.87 ns .. 37.18 ns)

benchmarking exp/curve:  M(2 ^ 255 - 19) ^ (2 ^ 255 - 19)
time                 5.214 μs   (5.207 μs .. 5.220 μs)
                     1.000 R²   (1.000 R² .. 1.000 R²)
mean                 5.211 μs   (5.201 μs .. 5.217 μs)
std dev              27.82 ns   (16.79 ns .. 51.12 ns)

With a three-nanosecond average differential, the function runs in the same time (up to noise) whether called with small or large inputs, as expected.

A ‘weigh’ suite similarly ensures that there are no surprises when it comes to allocation. Allocations match exactly what is expected across the board, e.g.:

exp

  Case                            Allocated  GCs
  curve:  M(2) ^ 2                       40    0
  curve:  M(2) ^ (2 ^ 255 - 19)          40    0
  scalar:  M(2) ^ 2                      40    0
  scalar:  M(2) ^ (2 ^ 255 - 19)         40    0

You’ll recall that 40 bytes is what a four-limb strict, unboxed tuple behind a single constructor (‘Montgomery’, in this case) should allocate on a 64-bit system.

For functions that return something more complicated, we can always account for the additional allocation precisely. For the variable-time modular square root function on one of the Montgomery domains, for example:

sqrt_vartime

  Case                                  Allocated  GCs
  curve:  sqrt_vartime M(2)                    56    0
  curve:  sqrt_vartime M(2 ^ 255 - 19)         56    0

Here the function returns a ‘Maybe Montgomery’ value, so the additional 16 bytes are for the ‘Just’ constructor and the pointer to its payload.

Conclusion

So: I’ve analysed the entirety of this library in the above manner, and by my appraisal, all functions in this library will run in constant time on aarch64 when compiled with optimizations with GHC 9.8.1, unless marked otherwise. I haven’t performed the same analysis for other compiler versions or flags, or for other target architectures, but the above gives me confidence that the results are likely to hold for those as well. The Criterion suite should be particularly useful as a sanity check for those cases.

I’ve also performed the same analysis for ppad-secp256k1, and intend to do it for all my other libraries where constant-time execution is expected. Feel free to reproduce my analysis yourself, if curious – and let me know if you do!