28 Nov 2025
Taproot Assets
endeavours to provide completely reproducible and verifiable build
artifacts for all releases. These include binaries for a number of
target architectures/platforms, along with source code and vendored
dependencies, all packaged up in gzipped tarballs or zip(1) files.
Each release comes with a manifest of these artifacts along with
accompanying SHA256 digests.
To attest to the artifacts for a release, trusted contributors verify that
they can produce the same manifest, and then they cryptographically
sign it:
$ gpg --verify manifest-jtobin-v0.7.0.sig manifest-v0.7.0.txt
gpg: Signature made Thu 20 Nov 14:07:09 2025 +04
gpg: using EDDSA key 7363F82F43B1324E1B0761310E4647D58F8A69E4
gpg: Good signature from "Jared Tobin (https://ppad.tech) <jared@ppad.tech>" [ultimate]
gpg: aka "Jared Tobin (jtobin) <jared@jtobin.io>" [ultimate]
If one is operating at an appropriate paranoia level, he can thus
personally verify that, first, a quorum of trusted contributors has
attested to a certain set of artifacts, and, second, that he can
produce, from source and his own tool set, those very same artifacts,
before running the software.
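The digest side of that verification is plain coreutils. For instance, with a stand-in artifact and manifest (the filenames here are illustrative, not the release's actual naming):

```shell
# a release manifest is just "sha256-digest  filename" lines, so
# coreutils can check local artifacts against it directly
cd "$(mktemp -d)"
printf 'demo artifact' > tapd-demo.bin
sha256sum tapd-demo.bin > manifest-demo.txt
sha256sum --check manifest-demo.txt   # prints: tapd-demo.bin: OK
```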
Interestingly, the binaries themselves are the least tricky part to
verify, at least in principle. Modern Go compilers are able to produce
completely reproducible binaries for any target, independent
of the build host architecture. That is: given appropriate environment
variables and linker flags (namely CGO_ENABLED="0" and -trimpath,
probably -s and -w too), a modern Go compiler will produce,
bit-for-bit, the same output, given consistent source, other flags, and
so on.
You have to do a bit more legwork for the tarballs, etc. Putting
together reproducible tarballs involves a salad of flags, e.g.:
export TZ=UTC
find "${dir}" -print0 | LC_ALL=C sort -r -z | tar \
"--mtime=${BUILD_DATE}" --no-recursion --null --mode=u+rw,go+r-w,a+X \
--owner=0 --group=0 --numeric-owner -c -T - | gzip -9n > "${dir}.tar.gz"
You have to be careful to lock down a bunch of stuff, like timestamps,
ownership, and permissions. What’s more, you have to ensure you’re using
the right variant of tar (BSD vs GNU), and ditto for at least gzip and
zip.
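The gzip flag is illustrative of the general problem: gzip embeds a timestamp in its header unless told otherwise, while -n makes the output depend only on the input bytes (for a given gzip version). A quick check, assuming GNU gzip and coreutils:

```shell
# with -n, gzip omits the embedded name/timestamp, so two runs over
# identical input produce bit-for-bit identical output
a=$(printf 'tapd' | gzip -9n | sha256sum)
sleep 1
b=$(printf 'tapd' | gzip -9n | sha256sum)
[ "$a" = "$b" ] && echo deterministic
```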
The problem of producing a hermetic build environment is solved by Nix,
of course (or Docker, more on which another time). But some interesting
problems can actually pop up when using Nix to build the kinds of
reproducible Go binaries I mentioned above.
Let’s start with the tarballs and zip files. Timestamps and permissions
are normalized within the Nix store, so, easy: we merely need to put the
following in the right place in a Nix flake:
{ nativeBuildInputs = with pkgs; [ coreutils gnutar gzip zip ]; }
Add a minimal set of reproducibility flags to the tar call, and then
anyone building the output with the same Nix flake metadata should
be able to produce consistent gzipped tarballs containing consistent
compiled binaries.
Or so it would seem. When I first tried this, I was able to produce
reproducible tarballs for both the project source and its vendored
dependencies. The source is declared as follows:
package = "taproot-assets";
tag = "v0.7.0";

# source info
org = "lightninglabs";
repo = package;

tap_src = pkgs.fetchFromGitHub {
  owner = org;
  repo = repo;
  rev = tag;
  hash = "sha256-z4lNeVmy0AEDM8ivdfXfXqi/V1LIDHV2KtS5/19+Jlk=";
};
and given a ‘tap’ derivation built by Nixpkgs’s buildGoModule, the
following, in the buildPhase, can later get us the tarballs of interest:
tar_gz $out/vendor.tar.gz ${tap.goModules} .
tar_gz $out/taproot-assets-source-${tag}.tar.gz ${tap_src} .
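(tar_gz here is a small shell helper. Its body is my reconstruction, not the flake's actual definition, but given the flags discussed earlier it presumably looks something like:)

```shell
# hypothetical reconstruction of the tar_gz helper used above:
#   tar_gz OUT SRC DIR
# packs DIR (relative to SRC) into OUT using the reproducibility flags
# discussed earlier; BUILD_DATE defaults to the epoch if unset
tar_gz() {
  out="$1"; src="$2"; dir="$3"
  ( cd "$src" && find "$dir" -print0 | LC_ALL=C sort -r -z | \
      tar "--mtime=${BUILD_DATE:-1970-01-01 00:00:00 UTC}" \
        --no-recursion --null --mode=u+rw,go+r-w,a+X \
        --owner=0 --group=0 --numeric-owner -cf - -T - ) \
    | gzip -9n > "$out"
}
```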
But, strangely, I couldn’t generate reproducible binaries.
That is: even though I pinned to a specific Go toolchain, I would
produce binaries for different targets that differed depending on the
architecture of the host. This was strange, since I know that the Go
compiler shouldn’t have any issue doing this, and I was clearly using
the right settings:
# tapd builder
build_tap = {
    goos ? null
  , goarch ? null
  , goarm ? null
  , tags ? tags_base
  , ldflags ? ldflags_base
  , gcflags ? [ ]
  }: pkgs.buildGoModule {
    pname = package;
    version = tag;
    src = tap_src;

    vendorHash = "sha256-p75eZoM6tSayrxcKTCoXR8Jlc3y2UyzfPCtRJmDy9ew=";

    subPackages = [ "cmd/tapd" "cmd/tapcli" ];

    inherit tags ldflags gcflags;

    preBuild = with pkgs; ''
      export CGO_ENABLED="0"
      export GOAMD64="v1"

      ${lib.optionalString (goos != null) "export GOOS=${goos}"}
      ${lib.optionalString (goarch != null) "export GOARCH=${goarch}"}
      ${lib.optionalString (goarm != null) "export GOARM=${goarm}"}
    '';

    doCheck = false;
  };
(‘-trimpath’, another requirement for reproducible builds, is added by
default when using Nixpkgs’s buildGoModule, as is ‘-buildid=’, ensuring
no rogue metadata creeps in.)
I dug into this by inspecting a couple of the produced binaries for a
random architecture, one built on an amd64 Linux machine and the other
on aarch64 Darwin. I produced a couple of dumps of ELF section headers
via readelf -S $MY_BINARY and diffed them:
$ diff sections-linux.txt sections-darwin.txt
1c1
< There are 14 section headers, starting at offset 0xd4:
---
> There are 14 section headers, starting at offset 0x386ed30:
18c18
< [13] .shstrtab STRTAB 00000000 3870000 000094 00 0 0 1
---
> [13] .shstrtab STRTAB 00000000 386ecb0 00007d 00 0 0 1
The .shstrtab (“section name string table”) sections, which are pure
ELF metadata, differed, so I produced hexdumps of those, this time via
readelf -x .shstrtab $MY_BINARY, and diffed them:
$ diff shstrtab-linux.txt shstrtab-darwin.txt
3,12c3,10
< 0x00000000 002e7465 7874002e 6e6f7074 72646174 ..text..noptrdat
< 0x00000010 61002e64 61746100 2e627373 002e6e6f a..data..bss..no
< 0x00000020 70747262 7373002e 676f2e66 757a7a63 ptrbss..go.fuzzc
< 0x00000030 6e747273 002e676f 2e627569 6c64696e ntrs..go.buildin
< 0x00000040 666f002e 676f2e66 69707369 6e666f00 fo..go.fipsinfo.
< 0x00000050 2e656c66 64617461 002e726f 64617461 .elfdata..rodata
< 0x00000060 002e7479 70656c69 6e6b002e 69746162 ..typelink..itab
< 0x00000070 6c696e6b 002e676f 73796d74 6162002e link..gosymtab..
< 0x00000080 676f7063 6c6e7461 62002e73 68737472 gopclntab..shstr
< 0x00000090 74616200 tab.
---
> 0x00000000 002e7465 7874002e 6e6f7074 72627373 ..text..noptrbss
> 0x00000010 002e6273 73002e67 6f2e6669 7073696e ..bss..go.fipsin
> 0x00000020 666f002e 676f2e62 75696c64 696e666f fo..go.buildinfo
> 0x00000030 002e7479 70656c69 6e6b002e 69746162 ..typelink..itab
> 0x00000040 6c696e6b 002e7368 73747274 6162002e link..shstrtab..
> 0x00000050 676f7063 6c6e7461 62002e67 6f73796d gopclntab..gosym
> 0x00000060 74616200 2e6e6f70 74726461 7461002e tab..noptrdata..
> 0x00000070 726f6461 7461002e 64617461 00 rodata..data.
Not much to go on there, but I eventually traced this to the fact that
mkDerivation in Nixpkgs’s stdenv runs a “fixupPhase” by default, which
performs its own stripping of ELF metadata, separate from what the Go
compiler does when passed the ‘-s’ and ‘-w’ linker flags (for stripping
symbol and debug information, plus DWARF debugging info, respectively).
The stripping done by the fixupPhase seems to vary depending on the
package set used, e.g. between aarch64-darwin and x86_64-linux.
I disabled the fixupPhase’s stripping by adding ‘dontStrip = true;’
to my builder function, and tried again. This time the resulting
artifacts’ file sizes normalized, but they still produced different
SHA256 digests. I produced another hexdump of each, this time using good
old xxd(1), and diffed them again:
2028124,2028126c2028124,2028126
< 01ef25b0: 7265 2f32 7334 6870 7137 3368 6e34 396a re/2s4hpq73hn49j
< 01ef25c0: 6438 3478 3937 366d 3361 6370 3372 6433 d84x976m3acp3rd3
< 01ef25d0: 6b31 782d 747a 6461 7461 2d32 3032 3562 k1x-tzdata-2025b
---
> 01ef25b0: 7265 2f78 686e 6231 6434 7767 6a33 3878 re/xhnb1d4wgj38x
> 01ef25c0: 6833 3833 7973 3162 7969 676e 3079 6861 h383ys1byign0yha
> 01ef25d0: 6371 682d 747a 6461 7461 2d32 3032 3562 cqh-tzdata-2025b
2028308,2028310c2028308,2028310
< 01ef3130: 2069 740a 2f6e 6978 2f73 746f 7265 2f34 it./nix/store/4
< 01ef3140: 6b77 3938 7138 3470 6276 326e 7771 3361 kw98q84pbv2nwq3a
< 01ef3150: 7931 3261 6176 6661 397a 6934 6439 7a2d y12aavfa9zi4d9z-
---
> 01ef3130: 2069 740a 2f6e 6978 2f73 746f 7265 2f7a it./nix/store/z
> 01ef3140: 6a62 3467 7763 7971 7670 3862 3236 716b jb4gwcyqvp8b26qk
> 01ef3150: 7832 7163 3033 3776 6a6a 3631 7838 302d x2qc037vjj61x80-
2028473,2028475c2028473,2028475
< 01ef3b80: 7374 6f72 652f 7279 6d64 7764 386b 3461 store/rymdwd8k4a
< 01ef3b90: 6268 3430 6233 6d77 3470 6636 3967 3739 bh40b3mw4pf69g79
< 01ef3ba0: 336d 6471 6269 2d69 616e 612d 6574 632d 3mdqbi-iana-etc-
---
> 01ef3b80: 7374 6f72 652f 7038 6369 3239 7637 6b37 store/p8ci29v7k7
> 01ef3b90: 6777 7773 786a 3138 3170 3035 6a63 6362 gwwsxj181p05jccb
> 01ef3ba0: 6e71 6236 3439 2d69 616e 612d 6574 632d nqb649-iana-etc-
2028626,2028628c2028626,2028628
< 01ef4510: 6f72 652f 7279 6d64 7764 386b 3461 6268 ore/rymdwd8k4abh
< 01ef4520: 3430 6233 6d77 3470 6636 3967 3739 336d 40b3mw4pf69g793m
< 01ef4530: 6471 6269 2d69 616e 612d 6574 632d 3230 dqbi-iana-etc-20
---
> 01ef4510: 6f72 652f 7038 6369 3239 7637 6b37 6777 ore/p8ci29v7k7gw
> 01ef4520: 7773 786a 3138 3170 3035 6a63 6362 6e71 wsxj181p05jccbnq
> 01ef4530: 6236 3439 2d69 616e 612d 6574 632d 3230 b649-iana-etc-20
The difference here was immediately discernible: different
binaries contained different hard-coded paths to the Nix
store. This was at first immensely confusing to me, but
it turns out that the default Go toolchains in Nixpkgs are
not “official” Go toolchains, but patched ones. You can see
this in the Go derivation for e.g. the nixos-25.05 package
set, which contains the following ‘patches’
attribute:
patches = [
  (replaceVars ./iana-etc-1.25.patch {
    iana = iana-etc;
  })
  # Patch the mimetype database location which is missing on NixOS.
  # but also allow static binaries built with NixOS to run outside nix
  (replaceVars ./mailcap-1.17.patch {
    inherit mailcap;
  })
  # prepend the nix path to the zoneinfo files but also leave the
  # original value for static binaries that run outside a nix server
  (replaceVars ./tzdata-1.19.patch {
    inherit tzdata;
  })
  ./remove-tools-1.11.patch
  ./go_no_vendor_checks-1.23.patch
];
The idea here is that the Go compiler bakes into the binaries it
produces a number of fallback runtime paths for stuff like ‘tzdata’,
e.g. ‘/usr/share/zoneinfo’. On NixOS this, and other paths, do not
exist, so the ‘tzdata-1.19.patch’, for example, hard codes a path to
the tzdata package in the Nix store that depends on the package set
used. Since the package sets differ between platforms, these paths will
also differ, and so will any binaries produced by the Go compilers they
contain.
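A cheap way to check a binary for this class of problem is to scan it for store paths directly; grep’s -a flag treats binary data as text. A sketch:

```shell
# scan a file (e.g. a produced binary) for baked-in Nix store paths
check_store_paths() {
  if grep -aq '/nix/store/' "$1"; then
    echo "store paths present"
  else
    echo "clean"
  fi
}
```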
The solution to this is easy in Nix, provided you know you actually have
to do it. Just build with an unpatched Go toolchain:
# nixpkgs applies various patches to go toolchains; we need to remove
# them to produce host-architecture-independent binaries.
go-unpatched = pkgs.go.overrideAttrs (_: {
  patches = [ ];
});

buildGoModule = pkgs.buildGoModule.override {
  go = go-unpatched;
};

# tapd builder
build_tap = {
    goos ? null
  , goarch ? null
  , ..
  , gcflags ? [ ]
  }: buildGoModule {
    ..
  };
This works as expected. The produced manifests look like this:
$ head -3 manifest-v0.7.0-darwin.txt | cut -f1 -d' '
ebebc38f71c7a902341bf4247d8ee2752d0d233deec2034953dab4b44ecb622a
737eb475c5435beedcdb49ae7adf0716c9980523e89682fea4622b59c5d132df
8570b3c659fc2df9cff365fc06bd95459d903b6a3e7c8fd2e5e4efc1a77f7adf
and diffing the manifests produced on both macOS and Linux hosts
illustrates that the artifacts match, bit-for-bit, precisely:
$ diff manifest-v0.7.0-darwin.txt manifest-v0.7.0-linux.txt
<empty>
Here’s a gist of a flake for producing various tapd builds, as well as
release artifacts, if you’re curious.
17 Jan 2025
I wanted to make a quick addendum to my last post, as I did some
further experimenting with Data.ByteString.Builder
on both the base16 encoding and decoding tasks, comparing results with
the impure direct-allocation-and-write-based implementations I mentioned
previously.
I had remarked that builders can be pretty efficient if you’re careful
to pack your data aggressively, such that you don’t wind up needing to
use too many builders in the first place. I decided to try minimising,
as in objectively minimising, the number of builders required in both
base16 encoding & decoding, to see what kind of performance I could
squeeze out while sticking to the pure API. The builders didn’t
disappoint.

How would one “objectively minimise” the number of builders required
here? Simply by processing the biggest-sized chunk possible at a time,
given we always want to write a Word64. If we can’t do that, we’ll
write a Word32 and a Word16 and a Word8. If we can’t do that, we’ll
write a Word32 and a Word16. And so on. We can figure all this out
just by doing some checks on the length of the input bytestring when
starting out: we trade some additional cheap arithmetic operations
on machine integers up-front for fewer expensive builder allocations
further down the line.
For encoding, this means doing the following checks:
encode :: BS.ByteString -> BS.ByteString
encode bs@(BI.PS _ _ l)
    | l < 64 = to_strict_small loop
    | otherwise = to_strict loop
  where
    loop
      | l `rem` 4 == 0 = go64 bs
      | (l - 3) `rem` 4 == 0 = case BS.splitAt (l - 3) bs of
          (chunk, etc) ->
               go64 chunk
            <> go32 (BU.unsafeTake 2 etc)
            <> go16 (BU.unsafeDrop 2 etc)
      | (l - 2) `rem` 4 == 0 = case BS.splitAt (l - 2) bs of
          (chunk, etc) -> go64 chunk <> go32 etc
      | (l - 1) `rem` 4 == 0 = case BS.splitAt (l - 1) bs of
          (chunk, etc) -> go64 chunk <> go16 etc
      | l `rem` 2 == 0 = go32 bs
      | (l - 1) `rem` 2 == 0 = case BS.splitAt (l - 1) bs of
          (chunk, etc) -> go32 chunk <> go16 etc
      | otherwise = go16 bs
where each ‘go’ function writes words with the indicated number of bits
at a time, e.g.:
go64 b = case BS.splitAt 4 b of
  (chunk, etc)
    | BS.null chunk -> mempty
    | otherwise ->
        let !w16_0 = expand_w8 (BU.unsafeIndex chunk 0)
            !w16_1 = expand_w8 (BU.unsafeIndex chunk 1)
            !w16_2 = expand_w8 (BU.unsafeIndex chunk 2)
            !w16_3 = expand_w8 (BU.unsafeIndex chunk 3)
            !w64 = fi w16_0 `B.shiftL` 48
               .|. fi w16_1 `B.shiftL` 32
               .|. fi w16_2 `B.shiftL` 16
               .|. fi w16_3
        in  BSB.word64BE w64 <> go64 etc
and where expand_w8 is just a variant of the previous ‘hilo’ function
that returns a Word16 directly, rather than a pair of Word8’s.
‘go32’ works on a chunk of size two, writing a single Word32, and ‘go16’
on a chunk of size one, writing a Word16. The point of having all these
functions is that we can now always write the largest-sized word that we
can, instead of writing Word8’s or Word16’s exclusively.
(Decoding works similarly, except we need more checks, and an additional
‘go8’ function to handle the one-byte case. I won’t paste the salad of
conditionals required here.)
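For concreteness, here is a compilable sketch of the smaller writers, reconstructed from the descriptions above. The go64 case appears in the post proper; expand_w8 is rebuilt here from the earlier ‘hilo’, and ‘fi’ abbreviates fromIntegral:

```haskell
{-# LANGUAGE BangPatterns #-}
{-# LANGUAGE BinaryLiterals #-}
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Bits as B
import Data.Bits ((.&.), (.|.))
import qualified Data.ByteString as BS
import qualified Data.ByteString.Builder as BSB
import qualified Data.ByteString.Unsafe as BU
import Data.Word (Word8, Word16, Word32)

hex_charset :: BS.ByteString
hex_charset = "0123456789abcdef"

fi :: (Integral a, Num b) => a -> b
fi = fromIntegral

-- 'hilo', but packing both hex characters into a single Word16
expand_w8 :: Word8 -> Word16
expand_w8 b =
  let !hi = BU.unsafeIndex hex_charset (fi b `B.shiftR` 4)
      !lo = BU.unsafeIndex hex_charset (fi b .&. 0b00001111)
  in  fi hi `B.shiftL` 8 .|. fi lo

-- write a Word32 per two-byte chunk (precondition: even-length input)
go32 :: BS.ByteString -> BSB.Builder
go32 b = case BS.splitAt 2 b of
  (chunk, etc)
    | BS.null chunk -> mempty
    | otherwise ->
        let !w16_0 = expand_w8 (BU.unsafeIndex chunk 0)
            !w16_1 = expand_w8 (BU.unsafeIndex chunk 1)
            !w32 = fi w16_0 `B.shiftL` 16 .|. fi w16_1 :: Word32
        in  BSB.word32BE w32 <> go32 etc

-- write a Word16 for a single trailing byte
go16 :: BS.ByteString -> BSB.Builder
go16 b
  | BS.null b = mempty
  | otherwise = BSB.word16BE (expand_w8 (BU.unsafeIndex b 0))
```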
In any case, by employing this strategy we can come pretty close to
the performance of the impure implementations. Here are some benchmark
results for encoding a 1kb input:
benchmarking encode/ppad-base16
time 5.929 μs (5.847 μs .. 6.013 μs)
0.999 R² (0.998 R² .. 0.999 R²)
mean 5.975 μs (5.913 μs .. 6.057 μs)
std dev 233.1 ns (172.4 ns .. 310.0 ns)
benchmarking encode/base16-bytestring
time 3.246 μs (3.233 μs .. 3.262 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 3.284 μs (3.268 μs .. 3.303 μs)
std dev 61.05 ns (47.83 ns .. 76.77 ns)
benchmarking encode/base16
time 3.236 μs (3.221 μs .. 3.253 μs)
1.000 R² (1.000 R² .. 1.000 R²)
mean 3.244 μs (3.233 μs .. 3.256 μs)
std dev 37.31 ns (31.87 ns .. 47.21 ns)
Case                          Allocated  GCs
ppad-base16 (encode)             53,704    0
base16-bytestring (encode)        2,272    0
base16 (encode)                   2,256    0
We’re allocating 25x more than the impure versions, but are only 2x
slower or less. Here’s the decoding story:
benchmarking decode/ppad-base16
time 4.942 μs (4.884 μs .. 4.995 μs)
0.999 R² (0.998 R² .. 0.999 R²)
mean 4.908 μs (4.854 μs .. 4.964 μs)
std dev 176.8 ns (150.3 ns .. 214.3 ns)
variance introduced by outliers: 46% (moderately inflated)
benchmarking decode/base16-bytestring
time 540.8 ns (533.7 ns .. 548.2 ns)
0.999 R² (0.999 R² .. 0.999 R²)
mean 541.6 ns (536.9 ns .. 547.5 ns)
std dev 17.64 ns (13.87 ns .. 22.24 ns)
variance introduced by outliers: 47% (moderately inflated)
benchmarking decode/base16
time 555.8 ns (549.7 ns .. 560.9 ns)
0.999 R² (0.999 R² .. 1.000 R²)
mean 550.7 ns (546.7 ns .. 555.5 ns)
std dev 15.46 ns (13.11 ns .. 18.97 ns)
variance introduced by outliers: 39% (moderately inflated)
Case                          Allocated  GCs
ppad-base16 (decode)             21,960    0
base16-bytestring (decode)          128    0
base16 (decode)                   2,440    0
We’re allocating less (we’re writing less), but are closer to 10x
slower. Not too bad, all things considered!
(N.b., the impure decoding functions also use
what appears to be a more efficient lookup table to convert from
hex characters back to Word8, so that may account for some of the
differential there.)
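For reference, the character-to-nibble conversion that such a lookup table accelerates can be written naively like so (a hypothetical word4 helper, not the actual table either package uses):

```haskell
import Data.Word (Word8)

-- map an ASCII hex character to its four-bit value, if any
word4 :: Word8 -> Maybe Word8
word4 c
  | c >= 0x30 && c <= 0x39 = Just (c - 0x30) -- '0'..'9'
  | c >= 0x61 && c <= 0x66 = Just (c - 0x57) -- 'a'..'f'
  | c >= 0x41 && c <= 0x46 = Just (c - 0x37) -- 'A'..'F'
  | otherwise = Nothing
```

A precomputed 256-entry table replaces the chain of comparisons with a single read per character, which adds up over a long input.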
16 Jan 2025
In this post I’m going to incrementally optimise a simple base16
(hexadecimal) encoding routine and illustrate what sort of performance
boost each optimisation yields. Hopefully it can be used to glean a bit
about what tends to make Haskell code fast – especially code that deals
with bytestrings.

You can think of this as a kind of supplement to Chris
Done’s Fast Haskell: Competing with C at parsing
XML post from
a few years ago. Here, like in Chris’s example, we’re going to focus
a lot on bytestring handling, though we’ll deal with some different
issues than he faced, and also eventually go a little lower-level on the
bytestring side of things.
Base16 Encoding
Let’s get right into it. The basic idea here is: for each byte
(Word8) in an input, extract its high and low bits, and then map each
(effectively a Word4) to a character from the hex alphabet:
{-# LANGUAGE BinaryLiterals #-}
{-# LANGUAGE OverloadedStrings #-}

import qualified Data.Bits as B
import Data.Bits ((.&.))
import qualified Data.ByteString as BS
import Data.Word (Word8)

hex_charset :: BS.ByteString
hex_charset = "0123456789abcdef"

hilo :: Word8 -> (Word8, Word8)
hilo b =
  let hi = BS.index hex_charset (fromIntegral b `B.shiftR` 4)
      lo = BS.index hex_charset (fromIntegral b .&. 0b00001111)
  in  (hi, lo)
You then get a base16-encoded output by gluing the resulting characters
together in the appropriate fashion:
encode :: BS.ByteString -> BS.ByteString
encode bs = go mempty 0
  where
    l = BS.length bs
    go !acc j
      | j == l = BS.reverse acc
      | otherwise =
          let (hi, lo) = hilo (BS.index bs j)
          in  go (BS.cons lo (BS.cons hi acc)) (succ j)
There are some things here that might stick out to someone
accustomed to writing performant Haskell code, but it’s an otherwise
reasonable-looking first take. How does it perform on a 1kb input?
benchmarking base16/basic
time 114.2 μs (113.1 μs .. 115.4 μs)
0.999 R² (0.998 R² .. 1.000 R²)
mean 116.1 μs (115.1 μs .. 118.3 μs)
std dev 4.453 μs (2.609 μs .. 8.263 μs)
variance introduced by outliers: 38% (moderately inflated)
Allocated GCs
basic 2,326,160 0
Well, according to weigh it
seems to allocate a ton. The primary issue is that every invocation of
the ‘cons’ function creates a copy of its input bytestring; this is the
main thing that would scream out at an experienced Haskeller if they
were to glance at the above code. We’re not dealing with O(1) ‘cons’ in
bytestring-land, as we are when we use lists.
(As a side note: although ‘BS.reverse’ might raise an eyebrow, it’s
actually really fast. It’s a FFI call to a C reverse routine.)
Builders
A more efficient way to construct bytestrings is via
Data.ByteString.Builder, which supports constant-time concatenation of
sequences of bytes. Here’s a version of ‘encode’ that uses builders:
import qualified Data.ByteString.Builder as BSB
to_strict :: BSB.Builder -> BS.ByteString
to_strict = BS.toStrict . BSB.toLazyByteString
{-# INLINE to_strict #-}
encode :: BS.ByteString -> BS.ByteString
encode bs = to_strict (go 0)
  where
    l = BS.length bs
    go j
      | j == l = mempty
      | otherwise =
          let (hi, lo) = hilo (BS.index bs j)
          in  BSB.word8 hi <> BSB.word8 lo <> go (succ j)
There’s a new function to convert the builder back to a strict
bytestring, and now we concatenate builder singletons in order. Simple
enough. How does it compare in terms of performance?
benchmarking base16/builder
time 42.54 μs (42.01 μs .. 43.27 μs)
0.999 R² (0.999 R² .. 0.999 R²)
mean 42.88 μs (42.57 μs .. 43.22 μs)
std dev 1.105 μs (946.6 ns .. 1.387 μs)
variance introduced by outliers: 24% (moderately inflated)
Allocated GCs
builder 397,768 0
Much better. It allocates about 6x less and is almost 3x faster.
Builders are definitely worth knowing about when dealing with
bytestrings, as they’re easy to use, and allow one to write pure code
that performs reasonably well. There’s also some fine-tuning you can do
in order to squeeze additional performance out of them in certain cases.
For small inputs, you can use a custom strategy to more efficiently
convert builders to lazy bytestrings en route to a strict one, e.g.:
import qualified Data.ByteString.Builder.Extra as BE
to_strict_small :: BSB.Builder -> BS.ByteString
to_strict_small = BS.toStrict
. BE.toLazyByteStringWith (BE.safeStrategy 128 BE.smallChunkSize) mempty
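As a quick sanity check of that conversion (the strategy affects only chunk allocation, never the bytes produced), something like the following should hold:

```haskell
import qualified Data.ByteString as BS
import qualified Data.ByteString.Builder as BSB
import qualified Data.ByteString.Builder.Extra as BE

to_strict_small :: BSB.Builder -> BS.ByteString
to_strict_small = BS.toStrict
  . BE.toLazyByteStringWith (BE.safeStrategy 128 BE.smallChunkSize) mempty

-- both conversions yield the same strict bytestring
check :: Bool
check =
  let b = BSB.word16BE 0x6869 -- the bytes "hi"
  in  to_strict_small b == BS.toStrict (BSB.toLazyByteString b)
```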
Using fewer builders helps as well, probably even more so. Consider
the following, in which the loop writes a single Word16 at a time
instead of two Word8’s:
encode :: BS.ByteString -> BS.ByteString
encode bs@(BI.PS _ _ l)
    | l < 128 = to_strict_small (go 0)
    | otherwise = to_strict (go 0)
  where
    go j
      | j == l = mempty
      | otherwise =
          let (hi, lo) = hilo (BS.index bs j)
              w16 = fromIntegral hi `B.shiftL` 8
                .|. fromIntegral lo
          in  BSB.word16BE w16 <> go (succ j)
It allocates slightly less, and is a microsecond peppier, because
there’s just less building going on:
benchmarking base16/better builder
time 40.96 μs (40.64 μs .. 41.33 μs)
0.999 R² (0.999 R² .. 1.000 R²)
mean 41.12 μs (40.79 μs .. 41.48 μs)
std dev 1.163 μs (969.0 ns .. 1.494 μs)
variance introduced by outliers: 28% (moderately inflated)
Allocated GCs
better builder 389,592 0
(Note that I’ve also introduced the use of the ‘BI.PS’ pattern synonym
here, but only to more easily grab the input bytestring’s length. It has
nothing to do with performance.)
Unsafe Functions
Another easy gain can be won by replacing the calls to bytestring’s
‘index’ with its ‘unsafeIndex’ variant:
import qualified Data.ByteString.Unsafe as BU

hilo :: Word8 -> (Word8, Word8)
hilo b =
  let hi = BU.unsafeIndex hex_charset (fromIntegral b `B.shiftR` 4)
      lo = BU.unsafeIndex hex_charset (fromIntegral b .&. 0b00001111)
  in  (hi, lo)

encode :: BS.ByteString -> BS.ByteString
encode bs@(BI.PS _ _ l)
    | l < 128 = to_strict_small (go 0)
    | otherwise = to_strict (go 0)
  where
    go j
      | j == l = mempty
      | otherwise =
          let (hi, lo) = hilo (BU.unsafeIndex bs j)
              w16 = fromIntegral hi `B.shiftL` 8
                .|. fromIntegral lo
          in  BSB.word16BE w16 <> go (succ j)
It often makes sense to do this so long as you can prove that the call
is actually safe (the compiler, of course, can’t), as ‘unsafeIndex’
reliably yields a decent performance boost. In this case, the unsafe
indexing into the hex alphabet in ‘hilo’ is being done with what are
effectively four-bit words, which will always be safe to use as indices
in a 16-length bytestring (there are 2^4 = 16 distinct Word4’s). The
unsafe index called in the body of the loop can similarly be verified
to remain safely within the input bytestring’s bounds, since the loop
terminates when its index argument hits the end.
The performance now:
benchmarking base16/unsafe index
time 25.58 μs (25.25 μs .. 25.89 μs)
0.998 R² (0.997 R² .. 0.999 R²)
mean 25.69 μs (25.41 μs .. 26.03 μs)
std dev 1.051 μs (852.3 ns .. 1.396 μs)
variance introduced by outliers: 47% (moderately inflated)
Allocated GCs
unsafe index 233,944 0
Another substantial win. Much less allocation and a great reduction in
wall-clock time.
Note however that not all unsafe functions will yield a boost as
impressive as bytestring’s ‘unsafeIndex’. I’ve never found the
‘unsafeShiftL’ and ‘unsafeShiftR’ functions in Data.Bits to ever really
seem to do much at all, for example, so we’ll keep the plain ‘B.shiftR’
call in ‘hilo’ above.
Unboxed Primitives
Next up is another optimisation that any seasoned Haskeller should know
about: use unboxed types, unless there’s some unusual reason not to.
Unboxed values require no allocation. To quote the GHC user
guide:
The representation of a Haskell Int, for example, is a two-word heap
object. An unboxed type, however, is represented by the value itself, no
pointers or heap allocation are involved.
Unboxed types correspond to the “raw machine” types you would use
in C: Int# (long int), Double# (double), Addr# (void *), etc.
The primitive operations (PrimOps) on these types are what you might
expect; e.g., (+#) is addition on Int#s, and is the machine-addition
that we all know and love—usually one instruction.
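That explicit style looks like the following in practice (a toy sketch, not something the encoder needs):

```haskell
{-# LANGUAGE MagicHash #-}

import GHC.Exts (Int (I#), (+#))

-- machine addition on the unboxed Int# inside a boxed Int
succ_int :: Int -> Int
succ_int (I# n#) = I# (n# +# 1#)
```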
You can work with unboxed types and values explicitly by using the
appropriate imports and the MagicHash pragma, which can actually be
pretty nice, because you express what’s really going on, but more
commonly, unboxed types are denoted via strictness annotations and the
UNPACK pragma, like so:
data W8Pair = Pair
  {-# UNPACK #-} !Word8
  {-# UNPACK #-} !Word8

hilo :: Word8 -> W8Pair
hilo b =
  let !hi = BU.unsafeIndex hex_charset (fromIntegral b `B.shiftR` 4)
      !lo = BU.unsafeIndex hex_charset (fromIntegral b .&. 0b00001111)
  in  Pair hi lo

encode :: BS.ByteString -> BS.ByteString
encode bs@(BI.PS _ _ l)
    | l < 128 = to_strict_small (go 0)
    | otherwise = to_strict (go 0)
  where
    go j
      | j == l = mempty
      | otherwise =
          let !(Pair hi lo) = hilo (BU.unsafeIndex bs j)
          in  BSB.word8 hi <> BSB.word8 lo <> go (succ j)
Now ‘hilo’ as a whole – so long as one compiles with optimisation
– simply doesn’t allocate at all, and the overall allocation and
wall-clock time are trimmed by a decent chunk yet again:
benchmarking base16/strict, unpack
time 20.90 μs (20.56 μs .. 21.25 μs)
0.998 R² (0.998 R² .. 0.999 R²)
mean 20.97 μs (20.76 μs .. 21.19 μs)
std dev 742.1 ns (619.9 ns .. 938.1 ns)
variance introduced by outliers: 41% (moderately inflated)
Allocated GCs
strict, unpack 176,600 0
At this point the allocation is absolutely dominated by the builders, so
if we want to do better we’ll need to do something about them.
Direct Allocation and Writes
A strict ByteString is just a wrapper around some memory. It’s defined
in recent versions via:
data ByteString = BS
  {-# UNPACK #-} !(ForeignPtr Word8) -- payload
  {-# UNPACK #-} !Int                -- length
So, a “foreign pointer” to some memory location and a length (where a
foreign pointer means a pointer to something that isn’t managed by the
RTS in the same way normal data is). To most efficiently create a
bytestring, one can thus do it in the same way one would do so in C:
allocate some memory and write to it directly.
This is the line beyond which one probably can’t consider his code
to be “pure Haskell” anymore. An ‘unsafePerformIO’ call or similar
isn’t technically going to be required (it could be masked via
‘Data.ByteString.Internal.unsafeCreate’, for example), and one still
doesn’t need to type the {-# LANGUAGE ForeignFunctionInterface #-} pragma at the top of his
module. But it’s safe to say that any assertions of purity would at this
point be at least somewhat controversial.
But if one does want to wield this particular Ring of Power, it
can be done like so (this is basically what Bryan O’Sullivan’s
base16-bytestring or Emily Pillmore’s base16 package do, for example):
import Foreign.Ptr
import Foreign.Storable
import GHC.ForeignPtr
import GHC.Word
import System.IO.Unsafe
encode :: BS.ByteString -> BS.ByteString
encode (BI.PS bs _ l) = unsafeDupablePerformIO $ do
    buffer <- mallocPlainForeignPtrBytes (l * 2)
    withForeignPtr buffer $ \p_buf ->
      withForeignPtr bs $ \p_src ->
        go p_buf p_src (p_src `plusPtr` l)
    pure (BI.BS buffer (l * 2))
  where
    go !buf !src !end
      | src == end = pure ()
      | otherwise = do
          !b <- peek src
          let !(Pair hi lo) = hilo b
          poke buf hi
          poke (buf `plusPtr` 1) lo
          go (buf `plusPtr` 2) (src `plusPtr` 1) end
Here we allocate a buffer explicitly and loop through the input
bytestring’s allocated memory, rather than its Haskell representation,
in order to populate it, wrapping the result up in a new bytestring via
the raw ‘BI.BS’ constructor. It’s worth noting that we could still just
make use of ‘unsafeIndex’ on the input bytestring if we wanted, rather
than making direct use of its foreign pointer, but since we’re already
going big, why not go all the way?
This function allocates minimally, and indeed is at parity with the
relevant “encode” functions found in both the base16 and
base16-bytestring packages in terms of wall-clock time. It’s roughly
40x faster than the initial naïve version (114.2 μs down to 2.929 μs),
and around 7x faster than the most efficient version that used builders:
benchmarking base16/pointer ops
time 2.929 μs (2.903 μs .. 2.959 μs)
1.000 R² (0.999 R² .. 1.000 R²)
mean 2.950 μs (2.936 μs .. 2.967 μs)
std dev 52.87 ns (43.32 ns .. 64.89 ns)
variance introduced by outliers: 18% (moderately inflated)
Allocated GCs
pointer ops 120 0
I’m guessing this is probably about as fast as one can make this
function via “normal” means. Other techniques that I tried while
preparing this article don’t seem to move the needle much on this
problem, if at all. I’d be pretty impressed by anything that produced
another order-of-magnitude performance boost, though – if anyone can
achieve that, I’d love to hear about it!
Recap
So to conclude with a quick, advice-laden summary of the techniques used
here:
-
Avoid using BS.cons, BS.append, and so on, as each requires making a
copy of their input bytestring or bytestrings. Instead, prefer
builders, and try to pack your data so that you can get away with as
few of them as possible. If you can write an occasional Word64, that’s
much better than writing many Word8’s.
It’s worth noting, though, that occasional uses of cons, append, etc.
on small inputs can be very benign, and even hard to beat by clever
rewriting. You don’t always need to reach for builders or
harder-core optimisations every time you want to glue a few bytestrings
together. Consider this HMAC implementation from
ppad-sha256:
hmac
  :: BS.ByteString
  -> BS.ByteString
  -> BS.ByteString
hmac mk@(BI.PS _ _ l) text =
    let step1 = k <> BS.replicate (64 - lk) 0x00
        step2 = BS.map (B.xor 0x36) step1
        step3 = step2 <> text
        step4 = hash step3
        step5 = BS.map (B.xor 0x5C) step1
        step6 = step5 <> step4
    in  hash step6
  where
    !(KeyAndLen k lk)
      | l > 64 = KeyAndLen (hash mk) 32
      | otherwise = KeyAndLen mk l
It looks like there’s a lot of unnecessary copying going on here,
but in the grand scheme of things, it’s simply not that much, and
the bytestring functions are highly optimised, after all. I wasn’t
able to improve this function’s performance either by builders or
even low-level allocations and writes, including manually minimising
allocations, reusing memory areas, and so on. The story would be
different if I were calling these functions many times in a loop,
though, at which point builders or direct writes would definitely
prove useful.
-
Don’t necessarily shy away from unsafe functions if said use can be
proven to be perfectly safe. Aside from ‘unsafeIndex’, other things can
be useful, particularly ‘unsafeTake’ and ‘unsafeDrop’. These can
provide great performance boosts, though if one is not gunning for
absolute, maximum performance, then sure, it might be better to avoid
them.
It’s also worth noting that there are some useful unsafe functions
that don’t exist in the bytestring API, but that can be assembled
manually. Take this home-cooked unsafe splitAt, for example, that does
no bounds checking:
data SSPair = SSPair
  {-# UNPACK #-} !BS.ByteString
  {-# UNPACK #-} !BS.ByteString

unsafe_splitAt :: Int -> BS.ByteString -> SSPair
unsafe_splitAt n (BI.BS x l) =
  SSPair (BI.BS x n) (BI.BS (plusForeignPtr x n) (l - n))
-
Make data strict unless you for some reason need it to be lazy, and
always unpack strict fields when it’s possible to do so. You can also
use -funbox-strict-fields or -funbox-small-strict-fields to unbox
all strict fields at the module level, though I generally prefer the
more explicit local use of UNPACK pragmas.
-
Don’t be afraid to use low-level allocation and writes where it makes
sense to do so. It may be unnecessary, even in
highly-optimised library code (I chose to use builders in
ppad-bech32 because we’re never
dealing with large outputs there, for example), but it’s a great
technique to have around for when performance really counts.
If you’d like to play with the code yourself, I’ve pushed it to this
GitHub repo. Presuming you have
flake support enabled, you can use ‘nix develop’ to enter a Nix shell,
and then e.g. ‘cabal repl’ to open GHCi, or ‘cabal bench’ to run the
benchmark suite. Enjoy!