jtobin.io2024-10-20T00:26:29+04:00https://jtobin.ioJared TobinSignatures on secp256k12024-10-19T00:00:00+04:00https://jtobin.io/signatures-on-secp256k1<p>I’ve released a library supporting <a href="https://github.com/bitcoin/bips/blob/master/bip-0340.mediawiki">BIP340 Schnorr signatures</a>
and <a href="https://www.rfc-editor.org/rfc/rfc6979">deterministic ECDSA</a> on the elliptic curve secp256k1.
<a href="https://git.ppad.tech/secp256k1">Get it</a> while it’s hot – for when you just aren’t feeling
libsecp256k1!</p>
<p>This is another “minimal” library in the ppad suite of libraries I’m
working on. Minimal in the sense that it is pure Haskell (no FFI
– you can check out <a href="https://git.ppad.tech/csecp256k1">ppad-csecp256k1</a> if you want that) and
depends only on ‘base’, ‘bytestring’, and my own <a href="/first-ppad-libraries">HMAC-DRBG and SHA256
libraries</a>. The feature set also intentionally remains rather lean
for the time being (though if you could use other features in there, let
me know!).</p>
<p>Performance is decent, though unsurprisingly it still pales in
comparison to the low-level and battle-hardened libsecp256k1 (think 5ms
vs 50μs to create a Schnorr signature, for example). There’s ample
room for optimisation, though. Probably the lowest-hanging fruit is
that scalar multiplication on secp256k1 can seemingly be made much
more efficient via the so-called <a href="https://en.wikipedia.org/wiki/Elliptic_curve_point_multiplication">wNAF method</a> that relies on
precomputed points, such that we might be looking at more like 500μs
to create a Schnorr signature, with a similar improvement for ECDSA. It
would require slightly more annoying UX, probably warranting its own set
of user-facing functions that would also accept a context argument, but
does not appear difficult to implement.</p>
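<p>To make the wNAF idea a bit more concrete, here is a minimal sketch of just the scalar recoding step, using only 'base'. It is not how the library does things today; the names <code>wnaf</code> and <code>fromWnaf</code> are made up for illustration, and the window width is assumed to be at least 2. Every nonzero digit comes out odd and small, which is why only a handful of odd multiples of a point would need to be precomputed.</p>

```haskell
import Data.Bits (shiftL, shiftR, testBit, (.&.))

-- | Width-w non-adjacent form of a non-negative scalar, least-significant
--   digit first. Each digit is zero, or odd with absolute value < 2^(w-1).
--   Assumes w >= 2. (Hypothetical helper, not part of any library.)
wnaf :: Int -> Integer -> [Integer]
wnaf w = go
  where
    modulus = 1 `shiftL` w        -- 2^w
    half    = 1 `shiftL` (w - 1)  -- 2^(w-1)
    go n
      | n <= 0      = []
      | testBit n 0 =
          let r = n .&. (modulus - 1)  -- n mod 2^w
              d = if r >= half then r - modulus else r
          in  d : go ((n - d) `shiftR` 1)
      | otherwise   = 0 : go (n `shiftR` 1)

-- | Recombine digits into the scalar they represent.
fromWnaf :: [Integer] -> Integer
fromWnaf = foldr (\d acc -> d + 2 * acc) 0
```

<p>A multiplication routine built on this would walk the digits from the most significant end, doubling once per digit and adding or subtracting a precomputed odd multiple whenever the digit is nonzero.</p>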
<p>A few things I observed or noted while writing this library:</p>
<ul>
<li>
<p>The modular arithmetic functions for arbitrary-precision Integers
contained in GHC.Num.Integer can be extremely fast, compared
to hand-rolled alternatives. Things like integerPowMod# and
integerRecipMod# absolutely fly, and will probably beat any
hand-rolled variant.</p>
</li>
<li>
<p>Arbitrary-precision Integers are still slow, compared to fixed-width
stuff that modern computers can positively chew through (this is <a href="https://github.com/jtobin/dates">not
news to me</a>, but still). I achieved a staggering speedup on
some basic integer parsing by using a custom Word256 type (built from
a bunch of Word64 values) under the hood, and converting to Integer
only at the end.</p>
<p>They can also be annoying when one wants to achieve constant-time
execution of cryptographically-sensitive functions, for obvious
reasons. It would be nice to have good fixed-width support for stuff
like Word128, Word256, Word512, and so on – I briefly considered
implementing and using a custom Word256 type for <em>everything</em>, but
this would be a ton of work, and I’m not sure I could beat GHC’s
native bigint support for e.g. modular multiplication, exponentiation,
and inversion anyway. We’ll stick with plain-old Integer for the time
being – it’s still no slouch.</p>
</li>
<li>
<p>Algorithmically constant-time operations can still fail to be
constant-time in practice due to factors largely outside the
programmer’s control. A good example is found in bit operations;
looping through a bytestring and performing some “equivalent-looking”
work on every byte may still result in execution time discrepancies
between zero and nonzero bytes, for example, and these can be very
hard to eliminate. This stuff can depend on the compiler or runtime,
the architecture/processor used, etc.</p>
</li>
<li>
<p>The necessary organization of this kind of “catch-all” library is kind
of unsatisfying. Rather than picking a single curve, and then
implementing every feature one can possibly think of for it, it
would intuitively be better to implement a generic curve library or
libraries (for Weierstrass, Edwards, <a href="https://www.youtube.com/watch?v=FJ3oHpup-pk">Montgomery</a>, etc.), and
then implement e.g. ECDSA or EdDSA or ECDH or whatever as separate
libraries depending on those generic curves as appropriate. One could
then use everything in more of a plug-and-play fashion – this might
be a design I’ll explore further in the future.</p>
</li>
<li>
<p>ByteString seems to provide a good user-facing “interface” for libraries
like this. It’s reliable, familiar to Haskellers, has a great API, and
is very well-tuned. It’s possible one might want to standardize on
something else for internals, though; unboxed vectors are an obvious
choice, though I would actually be inclined to use the <em>primitive</em>
library’s PrimArrays directly, favouring simplicity, and eschewing
<em>vector</em>’s harder-core optimisations.</p>
<p>The idea here in any case would be that one would use ByteString
only at the user-facing layer, and then work with PrimArrays (or
whatever) everywhere internally. It’s perhaps worth exploring further
– bytestring is <em>very</em> fast (strict bytestrings are, after all,
merely pointers to cstrings), but so are PrimArrays, and mutation à
la MutablePrimArray could be very helpful to have here and there.</p>
<p>(FWIW, though, I’ve benchmarked PrimArray/MutablePrimArray in
<em>ppad-sha256</em> and found them to yield equivalent-or-slower performance
compared to ByteString in that setting.)</p>
</li>
</ul>
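<p>For the curious, a fixed-width type "built from a bunch of Word64 values" might look something like the following. This is only a sketch under my own assumptions (the names <code>Word256</code>, <code>addc</code>, and <code>add256</code> are illustrative, and the limb order is arbitrary), and it makes no constant-time claims: the carry checks branch on data.</p>

```haskell
import Data.Bits (shiftL, shiftR, (.&.), (.|.))
import Data.Word (Word64)

-- | A hypothetical 256-bit word as four 64-bit limbs, least significant first.
data Word256 = Word256 !Word64 !Word64 !Word64 !Word64
  deriving (Eq, Show)

toInteger256 :: Word256 -> Integer
toInteger256 (Word256 a b c d) =
      fromIntegral a
  .|. (fromIntegral b `shiftL` 64)
  .|. (fromIntegral c `shiftL` 128)
  .|. (fromIntegral d `shiftL` 192)

-- | Truncates to the low 256 bits; assumes a non-negative input.
fromInteger256 :: Integer -> Word256
fromInteger256 n = Word256 (limb 0) (limb 64) (limb 128) (limb 192)
  where
    limb s = fromIntegral ((n `shiftR` s) .&. 0xffffffffffffffff)

-- | Add two limbs plus an incoming carry, returning (sum, outgoing carry).
addc :: Word64 -> Word64 -> Word64 -> (Word64, Word64)
addc x y cin =
  let s0 = x + y
      c0 = if s0 < x then 1 else 0   -- did x + y wrap around?
      s1 = s0 + cin
      c1 = if s1 < s0 then 1 else 0  -- did adding the carry wrap around?
  in  (s1, c0 + c1)

-- | Addition modulo 2^256, rippling the carry through the limbs.
add256 :: Word256 -> Word256 -> Word256
add256 (Word256 a0 a1 a2 a3) (Word256 b0 b1 b2 b3) =
  let (r0, c0) = addc a0 b0 0
      (r1, c1) = addc a1 b1 c0
      (r2, c2) = addc a2 b2 c1
      (r3, _)  = addc a3 b3 c2
  in  Word256 r0 r1 r2 r3
```

<p>Addition is the easy part, of course; it's modular multiplication, exponentiation, and inversion over such a type that would be the real work.</p>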
<p>The library has been tested on the <a href="https://github.com/C2SP/wycheproof">Project Wycheproof</a> and
official BIP340 test vectors, as well as <a href="https://github.com/paulmillr/noble-secp256k1">noble-secp256k1’s</a> test
suite, and care has been taken around timing of functions that operate
on secret data. Kick the tires on it, if you feel so inclined!</p>
<p>(I mentioned this in my last post as well, but I’m indebted to Paul
Miller’s <a href="https://paulmillr.com/noble/">noble cryptography</a> project for this work, both as
inspiration and also as a reference.)</p>
New HMAC-DRBG and SHA-2 Libraries2024-10-07T00:00:00+04:00https://jtobin.io/first-ppad-libraries<p>Just FYI, I’ve dropped a few simple libraries supporting <a href="https://datatracker.ietf.org/doc/html/rfc6234">SHA-{256,512}</a>, <a href="https://datatracker.ietf.org/doc/html/rfc2104">HMAC-SHA{256, 512}</a>, and <a href="https://nvlpubs.nist.gov/nistpubs/SpecialPublications/NIST.SP.800-90Ar1.pdf">HMAC-DRBG</a>. You can find the repos here:</p>
<ul>
<li><a href="https://git.ppad.tech/sha256">ppad-sha256</a></li>
<li><a href="https://git.ppad.tech/sha512">ppad-sha512</a></li>
<li><a href="https://git.ppad.tech/hmac-drbg">ppad-hmac-drbg</a></li>
</ul>
<p>Each is packaged there as a Nix <a href="https://zero-to-nix.com/concepts/flakes">flake</a>, and each is also
available on Hackage.</p>
<p>This is the first battery of a series of libraries I’m writing that were
primarily inspired by <a href="https://paulmillr.com/noble/">noble-cryptography</a> after the death (or
at least deprecation) of <a href="https://hackage.haskell.org/package/cryptonite">cryptonite</a>. The libraries are pure,
readable, concise GHC Haskell, having minimal dependencies, and aim
for clarity, security, performance, and user-friendliness.</p>
<p>I finally got around to going through most of the famous <a href="https://github.com/jtobin/cryptopals">cryptopals
challenges</a> last year and have since felt like writing some
“foundational” cryptography (and cryptography-adjacent) libraries that
I myself would want to use. I’d like to understand them well, test and
benchmark them myself, eke out performance and UX wins where I can get
them, etc. etc.</p>
<p>An example is found in the case of <em>ppad-hmac-drbg</em> – I want to use
this DRBG in the manner I’m accustomed to using generators from e.g.
<a href="https://hackage.haskell.org/package/mwc-random">mwc-random</a>, in which I can utilise any PrimMonad to handle the
generator state:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ghci</span><span class="o">></span> <span class="o">:</span><span class="n">set</span> <span class="o">-</span><span class="kt">XOverloadedStrings</span>
<span class="n">ghci</span><span class="o">></span> <span class="kr">import</span> <span class="k">qualified</span> <span class="nn">Crypto.DRBG.HMAC</span> <span class="k">as</span> <span class="n">DRBG</span>
<span class="n">ghci</span><span class="o">></span> <span class="kr">import</span> <span class="k">qualified</span> <span class="nn">Crypto.Hash.SHA256</span> <span class="k">as</span> <span class="n">SHA256</span>
<span class="n">ghci</span><span class="o">></span>
<span class="n">ghci</span><span class="o">></span> <span class="kr">let</span> <span class="n">entropy</span> <span class="o">=</span> <span class="s">"very random"</span>
<span class="n">ghci</span><span class="o">></span> <span class="kr">let</span> <span class="n">nonce</span> <span class="o">=</span> <span class="s">"very unused"</span>
<span class="n">ghci</span><span class="o">></span> <span class="kr">let</span> <span class="n">personalization_string</span> <span class="o">=</span> <span class="s">"very personal"</span>
<span class="n">ghci</span><span class="o">></span>
<span class="n">ghci</span><span class="o">></span> <span class="n">drbg</span> <span class="o"><-</span> <span class="kt">DRBG</span><span class="o">.</span><span class="n">new</span> <span class="kt">SHA256</span><span class="o">.</span><span class="n">hmac</span> <span class="n">entropy</span> <span class="n">nonce</span> <span class="n">personalization_string</span>
<span class="n">ghci</span><span class="o">></span> <span class="n">bytes</span> <span class="o"><-</span> <span class="kt">DRBG</span><span class="o">.</span><span class="n">gen</span> <span class="n">mempty</span> <span class="mi">32</span> <span class="n">drbg</span>
<span class="n">ghci</span><span class="o">></span> <span class="n">more_bytes</span> <span class="o"><-</span> <span class="kt">DRBG</span><span class="o">.</span><span class="n">gen</span> <span class="n">mempty</span> <span class="mi">16</span> <span class="n">drbg</span>
</code></pre></div></div>
<p>I haven’t actually tried the <a href="https://hackage.haskell.org/package/DRBG">other DRBG library</a> I found on
Hackage, but it has different UX, a lot of dependencies, and has since
apparently been deprecated. The generator in <em>ppad-hmac-drbg</em> matches my
preferred UX, passes the official <a href="https://github.com/coruus/nist-testvectors/blob/master/csrc.nist.gov/groups/STM/cavp/documents/drbg/drbgtestvectors/drbgvectors_pr_false/HMAC_DRBG.txt">DRBGVS vectors</a>, depends only
on ‘base’, ‘bytestring’, and ‘primitive’, and can be used with arbitrary
appropriate HMAC functions (and is maintained!).</p>
<p>The <em>ppad-sha256</em> and <em>ppad-sha512</em> libraries depend only on ‘base’ and
‘bytestring’ and are faster than any other pure Haskell SHA-2
implementations I’m aware of, even if the performance differences are
moral, rather than practical, victories:</p>
<p><em>ppad-sha256</em>’s SHA-256, vs <a href="https://hackage.haskell.org/package/SHA">SHA’s</a>, on a 32B input:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code> benchmarking ppad-sha256/SHA256 (32B input)/hash
time 1.898 μs (1.858 μs .. 1.941 μs)
0.997 R² (0.996 R² .. 0.999 R²)
mean 1.874 μs (1.856 μs .. 1.902 μs)
std dev 75.90 ns (60.30 ns .. 101.8 ns)
variance introduced by outliers: 55% (severely inflated)
benchmarking SHA/SHA256 (32B input)/sha256
time 2.929 μs (2.871 μs .. 2.995 μs)
0.997 R² (0.995 R² .. 0.998 R²)
mean 2.879 μs (2.833 μs .. 2.938 μs)
std dev 170.4 ns (130.4 ns .. 258.9 ns)
variance introduced by outliers: 71% (severely inflated)
</code></pre></div></div>
<p>And the same, but now an HMAC-SHA256 battle:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> benchmarking ppad-sha256/HMAC-SHA256 (32B input)/hmac
time 7.287 μs (7.128 μs .. 7.424 μs)
0.996 R² (0.995 R² .. 0.998 R²)
mean 7.272 μs (7.115 μs .. 7.455 μs)
std dev 565.2 ns (490.9 ns .. 689.7 ns)
variance introduced by outliers: 80% (severely inflated)
benchmarking SHA/HMAC-SHA256 (32B input)/hmacSha256
time 11.42 μs (11.09 μs .. 11.80 μs)
0.994 R² (0.992 R² .. 0.997 R²)
mean 11.36 μs (11.09 μs .. 11.61 μs)
std dev 903.5 ns (766.5 ns .. 1.057 μs)
variance introduced by outliers: 79% (severely inflated)
</code></pre></div></div>
<p>The performance differential is larger on larger inputs; I think the
difference between the two on a contrived 1GB input was 22 vs 32s, all
on my mid-2020 MacBook Air. I haven’t bothered to implement e.g. SHA-224
and SHA-384, which are trivial adjustments of SHA-256 and SHA-512, but
if anyone could use them for some reason, please just let me know.</p>
<p>Anyway: enjoy, and let me know if you get any use out of these. Expect
more releases in this spirit!</p>
Reservoir Sampling2024-09-22T00:00:00+04:00https://jtobin.io/reservoir-sampling<p>I have a little library called <a href="https://hackage.haskell.org/package/sampling">sampling</a> floating around for
general-purpose sampling from arbitrary foldable collections. It’s a
bit of a funny project: I originally hacked it together quickly, just
to get something done, so it’s not a very good library <em>qua</em> library –
it has plenty of unnecessary dependencies, and it’s not at all tuned
for performance. <em>But</em> it was always straightforward to use, and good
enough for my needs, so I’ve never felt any particular urge to update
it. Lately it caught my attention again, though, and I started thinking
about possible ways to revamp it, as well as giving it some much-needed
quality-of-life improvements, in order to make it more generally useful.</p>
<p>The library supports sampling with and without replacement in both the
equal and unequal-probability cases, from collections such as lists,
maps, vectors, etc. – again, anything with a Foldable instance.
In particular: the equal-probability, sampling-without-replacement
case depends on some code that Gabriella Gonzalez <a href="https://hackage.haskell.org/package/foldl-1.4.17/docs/src/Control.Foldl.html#randomN">wrote</a> for
<em>reservoir sampling</em>, a common online method for sampling from a
potentially-unbounded stream. I started looking into alternative
algorithms for reservoir sampling, to see what else was out there, and
discovered one that was <em>extremely</em> fast, <em>extremely</em> clever, and that
I’d never heard of before. It doesn’t seem to be that well-known, so I
simply want to illustrate it here, just so others are aware of it.</p>
<p>I learned about the algorithm from <a href="https://erikerlandson.github.io/blog/2015/11/20/very-fast-reservoir-sampling/">Erik Erlandson’s blog</a>, but,
as he points out, it goes back <a href="http://www.ittc.ku.edu/~jsv/Papers/Vit87.RandomSampling.pdf">almost forty years</a> to one J.S.
Vitter, who apparently also popularised the “basic” reservoir sampling
algorithm that everyone uses today (though it was not invented by him).</p>
<p>The basic reservoir sampling algorithm is simple. We’re sampling
some number of elements from a stream of unknown size: if we haven’t
collected enough elements yet, we just dump everything we see into our
“reservoir” (i.e., the collected sample); otherwise, we just generate a
random number and do a basic comparison to determine whether or not to
eject a value in our reservoir in favour of the new element we’ve just
seen. This is evidently also known as <em>Algorithm R</em>, and <a href="https://richardstartin.github.io/posts/reservoir-sampling">can apparently
be found</a> in Knuth’s <em>Art of Computer Programming</em>. Here’s a basic
implementation that treats the reservoir as a mutable vector:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">algo_r</span> <span class="n">len</span> <span class="n">stream</span> <span class="n">prng</span> <span class="o">=</span> <span class="kr">do</span>
<span class="n">reservoir</span> <span class="o"><-</span> <span class="kt">VM</span><span class="o">.</span><span class="n">new</span> <span class="n">len</span>
<span class="n">loop</span> <span class="n">reservoir</span> <span class="mi">0</span> <span class="n">stream</span>
<span class="kr">where</span>
<span class="n">loop</span> <span class="o">!</span><span class="n">sample</span> <span class="n">index</span> <span class="o">=</span> <span class="nf">\</span><span class="kr">case</span>
<span class="kt">[]</span> <span class="o">-></span>
<span class="kt">V</span><span class="o">.</span><span class="n">freeze</span> <span class="n">sample</span>
<span class="p">(</span><span class="n">h</span><span class="o">:</span><span class="n">t</span><span class="p">)</span>
<span class="o">|</span> <span class="n">index</span> <span class="o"><</span> <span class="n">len</span> <span class="o">-></span> <span class="kr">do</span>
<span class="kt">VM</span><span class="o">.</span><span class="n">write</span> <span class="n">sample</span> <span class="n">index</span> <span class="n">h</span>
<span class="n">loop</span> <span class="n">sample</span> <span class="p">(</span><span class="n">succ</span> <span class="n">index</span><span class="p">)</span> <span class="n">t</span>
<span class="o">|</span> <span class="n">otherwise</span> <span class="o">-></span> <span class="kr">do</span>
<span class="n">j</span> <span class="o"><-</span> <span class="kt">S</span><span class="o">.</span><span class="n">uniformRM</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">index</span><span class="p">)</span> <span class="n">prng</span>
<span class="n">when</span> <span class="p">(</span><span class="n">j</span> <span class="o"><</span> <span class="n">len</span><span class="p">)</span>
<span class="p">(</span><span class="kt">VM</span><span class="o">.</span><span class="n">write</span> <span class="n">sample</span> <span class="n">j</span> <span class="n">h</span><span class="p">)</span>
<span class="n">loop</span> <span class="n">sample</span> <span class="p">(</span><span class="n">succ</span> <span class="n">index</span><span class="p">)</span> <span class="n">t</span>
</code></pre></div></div>
<p>Here’s a quick, informal benchmark of it. Sampling 100 elements from a
stream of 10M 64-bit integers, using <a href="https://hackage.haskell.org/package/mwc-random">Marsaglia’s MWC256 PRNG</a>,
yields the following runtime statistics on my mid-2020 MacBook Air:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 73,116,027,160 bytes allocated in the heap
14,279,384 bytes copied during GC
45,960 bytes maximum residency (2 sample(s))
31,864 bytes maximum slop
6 MiB total memory in use (0 MiB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 17639 colls, 0 par 0.084s 0.123s 0.0000s 0.0004s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0002s 0.0003s
INIT time 0.007s ( 0.006s elapsed)
MUT time 18.794s ( 18.651s elapsed)
GC time 0.084s ( 0.123s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 18.885s ( 18.781s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 3,890,333,621 bytes per MUT second
Productivity 99.5% of total user, 99.3% of total elapsed
</code></pre></div></div>
<p>It’s fairly slow, and pretty much all we’re doing is generating 10M
<a href="/randomness-in-haskell">random</a> numbers.</p>
<p>In my experience, the most effective optimisations that can be made to
a numerical algorithm like this tend to be “mechanical” in nature –
avoiding <a href="https://github.com/jtobin/dates">allocation</a>, cache misses, branch prediction failures,
etc. I find it exceptionally pleasing when there’s some domain-specific
intuition that admits substantial optimisation of an algorithm.</p>
<p>Vitter’s optimisation is along these lines. I find it as ingenious as
e.g. the <a href="https://www.youtube.com/watch?v=3liCbRZPrZA">kernel trick</a> in the support vector machine context. The
crucial observation Vitter made is that one doesn’t need to consider
whether or not to add <em>every single element</em> one encounters to the
reservoir; the “gap” <em>between</em> entries also follows a well-defined
probability distribution, and one can just instead sample <em>that</em> in
order to determine the next element to add. Erik Erlandson points out
that when the size of the reservoir is small relative to the size of the
stream, as is typically the case, this distribution is well-approximated
by the geometric distribution – that which describes the number of coin
flips required before one observes a head (it has <a href="/recursive-stochastic-processes">come up</a> in a
couple of my previous blog posts).</p>
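<p>Sampling from a geometric distribution takes just one uniform variate. As a small sketch (the name <code>geometricSkip</code> is mine, and I'm assuming both arguments lie strictly between 0 and 1): if G counts the failures before the first success, then P(G ≥ k) = (1 - p)^k, and inverting that tail probability gives the following.</p>

```haskell
-- | Inverse-transform sample of the number of failures before the first
--   success, given success probability p in (0, 1) and a uniform variate
--   u in (0, 1). Since log (1 - p) is negative, the result is >= k exactly
--   when u <= (1 - p)^k, which is the geometric tail probability.
--   (Hypothetical helper for illustration.)
geometricSkip :: Double -> Double -> Int
geometricSkip p u = floor (log u / log (1 - p))
```

<p>With p = 0.5, for instance, a variate of 0.9 means "no skip", while a variate of 0.1 means skipping three elements.</p>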
<p>So: the algorithm is adjusted so that, while processing the stream,
we sample how many elements to skip, skip them, and then add the next
element encountered after that to the reservoir. Here’s a version of
that, using Erlandson’s fast geometric approximation for sampling what
Vitter calls the skip distance:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">Loop</span> <span class="o">=</span>
<span class="kt">Next</span>
<span class="o">|</span> <span class="kt">Skip</span> <span class="o">!</span><span class="kt">Int</span>
<span class="n">algo_r_optim</span> <span class="n">len</span> <span class="n">stream</span> <span class="n">prng</span> <span class="o">=</span> <span class="kr">do</span>
<span class="n">reservoir</span> <span class="o"><-</span> <span class="kt">VM</span><span class="o">.</span><span class="n">new</span> <span class="n">len</span>
<span class="n">go</span> <span class="kt">Next</span> <span class="n">reservoir</span> <span class="mi">0</span> <span class="n">stream</span>
<span class="kr">where</span>
<span class="n">go</span> <span class="n">what</span> <span class="o">!</span><span class="n">sample</span> <span class="o">!</span><span class="n">index</span> <span class="o">=</span> <span class="nf">\</span><span class="kr">case</span>
<span class="kt">[]</span> <span class="o">-></span>
<span class="kt">V</span><span class="o">.</span><span class="n">freeze</span> <span class="n">sample</span>
<span class="n">s</span><span class="o">@</span><span class="p">(</span><span class="n">h</span><span class="o">:</span><span class="n">t</span><span class="p">)</span> <span class="o">-></span> <span class="kr">case</span> <span class="n">what</span> <span class="kr">of</span>
<span class="kt">Next</span>
<span class="c1">-- below the reservoir size, just write elements</span>
<span class="o">|</span> <span class="n">index</span> <span class="o"><</span> <span class="n">len</span> <span class="o">-></span> <span class="kr">do</span>
<span class="kt">VM</span><span class="o">.</span><span class="n">write</span> <span class="n">sample</span> <span class="n">index</span> <span class="n">h</span>
<span class="n">go</span> <span class="kt">Next</span> <span class="n">sample</span> <span class="p">(</span><span class="n">succ</span> <span class="n">index</span><span class="p">)</span> <span class="n">t</span>
<span class="c1">-- sample skip distance</span>
<span class="o">|</span> <span class="n">otherwise</span> <span class="o">-></span> <span class="kr">do</span>
<span class="n">u</span> <span class="o"><-</span> <span class="kt">S</span><span class="o">.</span><span class="n">uniformDouble01M</span> <span class="n">prng</span>
<span class="kr">let</span> <span class="n">p</span> <span class="o">=</span> <span class="n">fi</span> <span class="n">len</span> <span class="o">/</span> <span class="n">fi</span> <span class="n">index</span>
<span class="n">skip</span> <span class="o">=</span> <span class="n">floor</span> <span class="p">(</span><span class="n">log</span> <span class="n">u</span> <span class="o">/</span> <span class="n">log</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">p</span><span class="p">))</span>
<span class="n">go</span> <span class="p">(</span><span class="kt">Skip</span> <span class="n">skip</span><span class="p">)</span> <span class="n">sample</span> <span class="n">index</span> <span class="n">s</span>
<span class="kt">Skip</span> <span class="n">skip</span>
<span class="c1">-- stop skipping, use this element</span>
<span class="o">|</span> <span class="n">skip</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">-></span> <span class="kr">do</span>
<span class="n">j</span> <span class="o"><-</span> <span class="kt">S</span><span class="o">.</span><span class="n">uniformRM</span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="n">len</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="n">prng</span>
<span class="kt">VM</span><span class="o">.</span><span class="n">write</span> <span class="n">sample</span> <span class="n">j</span> <span class="n">h</span>
<span class="n">go</span> <span class="kt">Next</span> <span class="n">sample</span> <span class="p">(</span><span class="n">succ</span> <span class="n">index</span><span class="p">)</span> <span class="n">t</span>
<span class="c1">-- keep skipping, with one fewer element to go</span>
<span class="o">|</span> <span class="n">otherwise</span> <span class="o">-></span>
<span class="n">go</span> <span class="p">(</span><span class="kt">Skip</span> <span class="p">(</span><span class="n">pred</span> <span class="n">skip</span><span class="p">))</span> <span class="n">sample</span> <span class="p">(</span><span class="n">succ</span> <span class="n">index</span><span class="p">)</span> <span class="n">t</span>
</code></pre></div></div>
<p>And now the same simple benchmark:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code> 1,852,883,584 bytes allocated in the heap
210,440 bytes copied during GC
45,960 bytes maximum residency (2 sample(s))
31,864 bytes maximum slop
6 MiB total memory in use (0 MiB lost due to fragmentation)
Tot time (elapsed) Avg pause Max pause
Gen 0 445 colls, 0 par 0.002s 0.003s 0.0000s 0.0002s
Gen 1 2 colls, 0 par 0.000s 0.000s 0.0002s 0.0003s
INIT time 0.007s ( 0.007s elapsed)
MUT time 0.286s ( 0.283s elapsed)
GC time 0.002s ( 0.003s elapsed)
RP time 0.000s ( 0.000s elapsed)
PROF time 0.000s ( 0.000s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 0.296s ( 0.293s elapsed)
%GC time 0.0% (0.0% elapsed)
Alloc rate 6,477,345,638 bytes per MUT second
Productivity 96.8% of total user, 96.4% of total elapsed
</code></pre></div></div>
<p>Both Vitter and Erlandson estimated orders of magnitude improvement
in sampling time, given we need to spend much less time iterating our
PRNG; here we see a 65x performance gain, with 40x less allocation.
Very impressive, and again, the optimisation is entirely probabilistic,
rather than “mechanical,” in nature (indeed, I haven’t tested any
mechanical optimisations to these implementations at all).</p>
<p>It turns out there are extensions in this spirit to unequal-probability
reservoir sampling as well, such as the method of “exponential jumps”
described in <a href="https://doi.org/10.1016/j.ipl.2005.11.003">a 2006 paper by Efraimidis and Spirakis</a>. I’ll
probably benchmark that algorithm too, and, if it fits the bill, and I
ever really <em>do</em> get around to properly updating my ‘sampling’ library,
I’ll refactor things to make use of these high-performance algorithms.
Let me know if you could use this sort of thing!</p>
More Recursive Stochastic Processes2024-09-01T00:00:00+04:00https://jtobin.io/more-recursive-stochastic-processes<p>Some years ago I <a href="/recursive-stochastic-processes">wrote</a> about using <a href="/practical-recursion-schemes">recursion schemes</a>
to encode stochastic processes in an <a href="/simple-probabilistic-programming">embedded probabilistic
programming</a> setting. The crux of it was that recursion schemes
allow one to “factor out” the probabilistic phenomena from the recursive
structure of the process; the probabilistic stuff typically sits in
the so-called <em>coalgebra</em> of the recursion scheme, while the recursion
scheme itself dictates the manner in which the process evolves.</p>
<p>While it’s arguable as to whether this stuff is useful from a strictly
practical perspective (I would suggest “not”), I think the intuition one
gleans from it can be somewhat worthwhile, and my curious brain finds
itself wandering back to the topic from time to time.</p>
<p>I happened to take a look at this sort of framework again recently
and discovered that I couldn’t easily seem to implement a <a href="https://en.wikipedia.org/wiki/Chinese_restaurant_process">Chinese
Restaurant Process</a> (CRP) – a stochastic process famous from the
setting of nonparametric Bayesian models – via either of the “standard”
patterns I used throughout my <a href="/recursive-stochastic-processes">Recursive Stochastic Processes</a>
post. This indicated that the recursive structure of the CRP differs
from the others I studied previously, at least in the manner I was
attempting to encode it in my particular embedded language setting.</p>
<p>So let’s take a look to see what the initial problem was, how one can
resolve it, and what insights we can take away from it all.</p>
<h2 id="framework">Framework</h2>
<p>Here’s a simple and slightly more refined version of the embedded
probabilistic programming framework I introduced in some of my older
posts. I’ll elide incidental details, such as common imports and noisy
code that might distract from the main points.</p>
<p>First, some unfamiliar imports and minimal core types:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">import</span> <span class="k">qualified</span> <span class="nn">Control.Monad.Trans.Free</span> <span class="k">as</span> <span class="n">TF</span>
<span class="kr">import</span> <span class="k">qualified</span> <span class="nn">System.Random.MWC.Probability</span> <span class="k">as</span> <span class="n">MWC</span>
<span class="kr">data</span> <span class="kt">ModelF</span> <span class="n">r</span> <span class="o">=</span>
<span class="kt">BernoulliF</span> <span class="kt">Double</span> <span class="p">(</span><span class="kt">Bool</span> <span class="o">-></span> <span class="n">r</span><span class="p">)</span>
<span class="o">|</span> <span class="kt">UniformF</span> <span class="p">(</span><span class="kt">Double</span> <span class="o">-></span> <span class="n">r</span><span class="p">)</span>
<span class="kr">deriving</span> <span class="kt">Functor</span>
<span class="kr">type</span> <span class="kt">Model</span> <span class="o">=</span> <span class="kt">Free</span> <span class="kt">ModelF</span>
</code></pre></div></div>
<p>As a quick refresher, an expression of type ‘Model’ denotes a
probability distribution over some carrier type, and terms that
construct, manipulate, or interpret values of type ‘Model’ constitute
those of a simple embedded probabilistic programming language.</p>
<p>Importantly, here we’re using a very minimal such language, consisting
only of two primitives: the Bernoulli distribution (which is a
probability distribution over a coin flip) and the uniform distribution
over the interval [0, 1]. A trivial model that draws a probability of
success from a uniform distribution, and then flips a coin conditional
on that probability, would look like this, for example:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">model</span> <span class="o">::</span> <span class="kt">Model</span> <span class="kt">Bool</span>
<span class="n">model</span> <span class="o">=</span> <span class="kt">Free</span> <span class="p">(</span><span class="kt">UniformF</span> <span class="p">(</span><span class="nf">\</span><span class="n">u</span> <span class="o">-></span>
  <span class="kt">Free</span> <span class="p">(</span><span class="kt">BernoulliF</span> <span class="n">u</span> <span class="n">pure</span><span class="p">)))</span>
</code></pre></div></div>
<p>(We could create helper functions such that programs in this embedded
language would be nicer to write, but that’s not the focus of this
post.)</p>
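<p>For what it’s worth, such helpers take only a few lines. The following
is a minimal sketch rather than anything the rest of the post depends on:
the names ‘bernoulliTerm’ and ‘uniformTerm’ are hypothetical, chosen so as
not to clash with the ‘uniform’ term defined later for use inside a
futumorphism, and the functor is restated so that the block stands alone.</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}

import Control.Monad.Free (Free (..), liftF)

data ModelF r =
    BernoulliF Double (Bool -> r)
  | UniformF (Double -> r)
  deriving Functor

type Model = Free ModelF

-- | Lift a single Bernoulli draw into the embedded language.
--   (Hypothetical helper name.)
bernoulliTerm :: Double -> Model Bool
bernoulliTerm p = liftF (BernoulliF p id)

-- | Lift a single uniform draw on [0, 1] into the embedded language.
--   (Hypothetical helper name.)
uniformTerm :: Model Double
uniformTerm = liftF (UniformF id)

-- | Draw a success probability uniformly, then flip a coin with it.
model :: Model Bool
model = do
  p <- uniformTerm
  bernoulliTerm p
</code></pre></div></div>
<p>Written this way, ‘model’ reads like ordinary monadic code, but it
still denotes a uniform draw followed by a Bernoulli draw that depends
on it.</p>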
<p>We can construct expressions that denote stochastic processes by using
the normal salad of recursion schemes. Here’s how we can encode a
geometric distribution, for example, in which one counts the number of
coin flips required to observe a head:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">geometric</span> <span class="o">::</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Model</span> <span class="kt">Int</span>
<span class="n">geometric</span> <span class="n">p</span> <span class="o">=</span> <span class="n">apo</span> <span class="n">coalg</span> <span class="mi">1</span> <span class="kr">where</span>
<span class="n">coalg</span> <span class="n">n</span> <span class="o">=</span> <span class="kt">TF</span><span class="o">.</span><span class="kt">Free</span> <span class="p">(</span><span class="kt">BernoulliF</span> <span class="n">p</span> <span class="p">(</span><span class="nf">\</span><span class="n">accept</span> <span class="o">-></span>
<span class="kr">if</span> <span class="n">accept</span>
<span class="kr">then</span> <span class="kt">Left</span> <span class="p">(</span><span class="n">pure</span> <span class="n">n</span><span class="p">)</span>
<span class="kr">else</span> <span class="kt">Right</span> <span class="p">(</span><span class="n">n</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)))</span>
</code></pre></div></div>
<p>The coalgebra isolates the probabilistic phenomena (a Bernoulli draw,
i.e. a coin flip), and the recursion scheme determines how the process
evolves (halting if a Bernoulli proposal is accepted). The coalgebra is
defined in terms of the so-called <em>base</em> or <em>pattern functor</em> of the
free monad type, defined in ‘Control.Monad.Trans.Free’ (see <a href="/practical-recursion-schemes">Practical
Recursion Schemes</a> for a refresher on base functors if you’re
rusty).</p>
<p>The result is an expression in our embedded language, of course, and is
completely abstract. If we want to sample from the model it encodes, we
can use an interpreter like the following:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">prob</span> <span class="o">::</span> <span class="kt">Model</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">MWC</span><span class="o">.</span><span class="kt">Prob</span> <span class="kt">IO</span> <span class="n">a</span>
<span class="n">prob</span> <span class="o">=</span> <span class="n">iterM</span> <span class="o">$</span> <span class="nf">\</span><span class="kr">case</span>
<span class="kt">BernoulliF</span> <span class="n">p</span> <span class="n">f</span> <span class="o">-></span> <span class="kt">MWC</span><span class="o">.</span><span class="n">bernoulli</span> <span class="n">p</span> <span class="o">>>=</span> <span class="n">f</span>
<span class="kt">UniformF</span> <span class="n">f</span> <span class="o">-></span> <span class="kt">MWC</span><span class="o">.</span><span class="n">uniform</span> <span class="o">>>=</span> <span class="n">f</span>
</code></pre></div></div>
<p>where the ‘MWC’-prefixed functions are sampling functions from the
<em>mwc-probability</em> library, and ‘iterM’ is the familiar monadic
catamorphism-like recursion scheme over the free monad. This will
produce a function that can be sampled with ‘MWC.sample’ when provided
with a PRNG.</p>
<p>Here’s what some samples look like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>ghci> gen <- MWC.create
ghci> replicateM 10 (MWC.sample (prob (geometric 0.1)) gen)
[1,9,13,3,4,3,13,1,4,17]
</code></pre></div></div>
<h2 id="chinese-restaurant-process">Chinese Restaurant Process</h2>
<p>The CRP is described technically as a stochastic process over “finite
integer partitions,” and, more memorably, over configurations of a
particular sort of indefinitely-large Chinese restaurant. One is to
imagine customers entering the restaurant sequentially; the first
sits at the first table available, and each additional customer is
either seated at a new table with probability proportional to some
dispersion parameter, or is seated at an occupied table with probability
proportional to the number of other customers already sitting there.</p>
<p>If each customer is labelled by how many others were in the restaurant
when they entered, the result is, for ‘n’ total customers, a random
partition of the natural numbers from 0 to ‘n - 1’. A particular realisation of
the process, following the arrival of 10 customers, might look like the
following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[[9,8,7,6,4,2,0], [1], [5,3]]
</code></pre></div></div>
<p>This is a restaurant configuration after 10 arrivals, where each element
is a table populated by the labelled customers.</p>
<p>One of the rules of explaining the CRP is that you always have to
include a visualization like this:</p>
<p><img src="images/crp_seq.png" alt="" class="center-image" /></p>
<p>It corresponds to the realised sample, and we certainly wouldn’t want to
break any pedagogical regulations.</p>
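<p>To make the seating rule concrete, here is a small sketch that
normalises the weights described above into probabilities. With
dispersion parameter ‘a’ and ‘n’ customers already seated, the next
customer opens a new table with probability a / (a + n), and joins an
occupied table holding ‘k’ customers with probability k / (a + n). The
function names and the list-of-table-sizes representation are assumptions
made purely for illustration.</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- | Unconditional seating probabilities for the next arriving customer.
--   The head is the probability of opening a new table; the remaining
--   entries line up with the occupied tables, in order.
seatingProbs :: Double -> [Int] -> [Double]
seatingProbs a sizes =
    a / denom : [fromIntegral k / denom | k <- sizes]
  where
    denom = a + fromIntegral (sum sizes)

-- | Seating probabilities over the occupied tables only, conditional on
--   the customer not opening a new table. Assumes at least one customer
--   is already seated.
occupiedProbs :: [Int] -> [Double]
occupiedProbs sizes = [fromIntegral k / total | k <- sizes]
  where
    total = fromIntegral (sum sizes)
</code></pre></div></div>
<p>For the configuration shown above, the table sizes are 7, 1, and 2.
With ‘a’ equal to 1, an eleventh customer would open a new table with
probability 1/11, and would join the three occupied tables with
probabilities 7/11, 1/11, and 2/11 respectively.</p>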
<h2 id="encoding-first-attempt">Encoding, First Attempt</h2>
<p>A natural way to encode the process is to seat each arriving customer at
a new table with the appropriate probability, and then, if it turns out
he is to be sat at an occupied table, to do that with the appropriate
<em>conditional</em> probability (i.e., conditional on the fact that he’s not
going to be seated at a new table).</p>
<p>So let’s imagine encoding a CRP with dispersion parameter ‘a’ and total
number of customers ‘n’. Our first attempt might go something like this:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">crp</span> <span class="n">n</span> <span class="n">a</span> <span class="o">=</span> <span class="n">ana</span> <span class="n">coalg</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o"><</span><span class="n">initial</span> <span class="n">restaurant</span><span class="o">></span><span class="p">)</span> <span class="kr">where</span>
<span class="n">coalg</span> <span class="p">(</span><span class="n">customer</span><span class="p">,</span> <span class="n">tables</span><span class="p">)</span>
<span class="o">|</span> <span class="n">customer</span> <span class="o">>=</span> <span class="n">n</span> <span class="o">=</span> <span class="kt">TF</span><span class="o">.</span><span class="kt">Pure</span> <span class="n">tables</span>
<span class="o">|</span> <span class="n">otherwise</span> <span class="o">=</span>
<span class="kr">let</span> <span class="n">p</span> <span class="o">=</span> <span class="o"><</span><span class="n">probability</span> <span class="kr">of</span> <span class="n">seating</span> <span class="n">customer</span> <span class="n">at</span> <span class="n">new</span> <span class="n">table</span><span class="o">></span>
<span class="kr">in</span> <span class="kt">TF</span><span class="o">.</span><span class="kt">Free</span> <span class="p">(</span><span class="kt">BernoulliF</span> <span class="n">p</span> <span class="p">(</span><span class="nf">\</span><span class="n">accept</span> <span class="o">-></span>
<span class="kr">if</span> <span class="n">accept</span>
<span class="kr">then</span> <span class="p">(</span><span class="n">succ</span> <span class="n">customer</span><span class="p">,</span> <span class="o"><</span><span class="n">seat</span> <span class="n">at</span> <span class="n">new</span> <span class="n">table</span><span class="o">></span><span class="p">)</span>
<span class="kr">else</span> <span class="o">???</span>
<span class="p">))</span>
</code></pre></div></div>
<p>We run into a problem when we hit the ‘else’ branch of the conditional.
Here we want to express another random choice – viz., at which occupied
table do we seat the arriving customer. We’d want to do something like
‘TF.Free (UniformF (\u -> …))’ in that ‘else’ branch and then use the
produced uniform value to choose between the occupied tables based on
their conditional probabilities. But that will prove to be impossible,
given the type requirements.</p>
<p>There’s similarly no way to restructure things by producing the
desired uniform value earlier, before the conditional expression. For
appropriate type ‘t’, the coalgebra used for the anamorphism must have
type:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">coalg</span> <span class="o">::</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">Base</span> <span class="n">t</span> <span class="n">a</span>
</code></pre></div></div>
<p>which in our case reduces to:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">coalg</span> <span class="o">::</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">TF</span><span class="o">.</span><span class="kt">FreeF</span> <span class="kt">ModelF</span> <span class="n">a</span> <span class="n">a</span>
</code></pre></div></div>
<p>You’ll find that there’s simply no way to add another TF.Free
constructor to the mix while satisfying the above type. So it seems that
with an anamorphism (or apomorphism, which encounters the same problem)
we’re limited to denoting a single probabilistic operation on any
recursive call.</p>
<h2 id="encoding-correctly">Encoding, Correctly</h2>
<p>We thus need a recursion scheme that allows us to add multiple levels of
the base functor at a time. The most appropriate scheme appears to me
to be the <em>futumorphism</em>, which I also wrote about in <a href="/time-traveling-recursion">Time Traveling
Recursion Schemes</a>.</p>
<p>(As Patrick Thomson <a href="https://blog.sumtypeofway.com/posts/recursion-schemes-part-4.html">pointed out</a> in his sublime series on
recursion schemes, both anamorphisms and apomorphisms are special cases
of the futumorphism.)</p>
<p>The coalgebra used by a futumorphism has a different type than that used
by an ana- or apomorphism, namely:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">coalg</span> <span class="o">::</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">Base</span> <span class="n">t</span> <span class="p">(</span><span class="kt">Free</span> <span class="p">(</span><span class="kt">Base</span> <span class="n">t</span><span class="p">)</span> <span class="n">a</span><span class="p">)</span>
</code></pre></div></div>
<p>In our case, this is:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">coalg</span> <span class="o">::</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">TF</span><span class="o">.</span><span class="kt">FreeF</span> <span class="kt">ModelF</span> <span class="n">a</span> <span class="p">(</span><span class="kt">Free</span> <span class="p">(</span><span class="kt">TF</span><span class="o">.</span><span class="kt">FreeF</span> <span class="kt">ModelF</span> <span class="n">a</span><span class="p">)</span> <span class="n">a</span><span class="p">)</span>
</code></pre></div></div>
<p>Note that here we’ll be able to use additional ‘TF.Free’ constructors
inside other expressions involving the base functor. In <em>Time Traveling
Recursion Schemes</em> I referred to this as “working with values that
don’t exist yet, while pretending like they do” – but better would
be to say that one is simply <em>defining</em> values to be produced during
(co)recursion, using separate monadic code.</p>
<p>Here’s how we’d encode the CRP using a futumorphism, in pseudocode:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">crp</span> <span class="n">n</span> <span class="n">a</span> <span class="o">=</span> <span class="n">futu</span> <span class="n">coalg</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o"><</span><span class="n">initial</span> <span class="n">restaurant</span><span class="o">></span><span class="p">)</span> <span class="kr">where</span>
<span class="n">coalg</span> <span class="p">(</span><span class="n">customer</span><span class="p">,</span> <span class="n">tables</span><span class="p">)</span>
<span class="o">|</span> <span class="n">customer</span> <span class="o">>=</span> <span class="n">n</span> <span class="o">=</span> <span class="kt">TF</span><span class="o">.</span><span class="kt">Pure</span> <span class="n">tables</span>
<span class="o">|</span> <span class="n">otherwise</span> <span class="o">=</span>
<span class="kr">let</span> <span class="n">p</span> <span class="o">=</span> <span class="o"><</span><span class="n">probability</span> <span class="kr">of</span> <span class="n">seating</span> <span class="n">customer</span> <span class="n">at</span> <span class="n">new</span> <span class="n">table</span><span class="o">></span>
<span class="kr">in</span> <span class="kt">TF</span><span class="o">.</span><span class="kt">Free</span> <span class="p">(</span><span class="kt">BernoulliF</span> <span class="n">p</span> <span class="p">(</span><span class="nf">\</span><span class="n">accept</span> <span class="o">-></span>
<span class="kr">if</span> <span class="n">accept</span>
<span class="kr">then</span> <span class="n">pure</span> <span class="p">(</span><span class="n">succ</span> <span class="n">customer</span><span class="p">,</span> <span class="o"><</span><span class="n">seat</span> <span class="n">at</span> <span class="n">new</span> <span class="n">table</span><span class="o">></span><span class="p">)</span>
<span class="kr">else</span> <span class="kr">do</span>
<span class="n">res</span> <span class="o"><-</span> <span class="n">liftF</span> <span class="p">(</span><span class="kt">TF</span><span class="o">.</span><span class="kt">Free</span> <span class="p">(</span><span class="kt">UniformF</span> <span class="p">(</span><span class="nf">\</span><span class="n">u</span> <span class="o">-></span>
<span class="o"><</span><span class="n">seat</span> <span class="n">amongst</span> <span class="n">occupied</span> <span class="n">tables</span> <span class="n">using</span> <span class="sc">'u'</span><span class="o">></span>
<span class="p">)))</span>
<span class="n">pure</span> <span class="p">(</span><span class="n">succ</span> <span class="n">customer</span><span class="p">,</span> <span class="n">res</span><span class="p">)))</span>
</code></pre></div></div>
<p>Note that recursive points are expressed under a ‘TF.Free’ constructor
by returning a value (using ‘pure’) in the free monad itself. This
effectively allows us to use more than one embedded language construct
on each recursive call – as many as we want, as a matter of fact. The
familiar ‘liftF’ function lifts each such expression into the free
monad for us.</p>
<p>I mentioned in <em>Time Traveling Recursion Schemes</em> that this sort of
thing can look a little bit nicer if you have the appropriate embedded
language terms floating around. If we define the following:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">uniform</span> <span class="o">::</span> <span class="kt">Free</span> <span class="p">(</span><span class="kt">TF</span><span class="o">.</span><span class="kt">FreeF</span> <span class="kt">ModelF</span> <span class="n">a</span><span class="p">)</span> <span class="kt">Double</span>
<span class="n">uniform</span> <span class="o">=</span> <span class="n">liftF</span> <span class="p">(</span><span class="kt">TF</span><span class="o">.</span><span class="kt">Free</span> <span class="p">(</span><span class="kt">UniformF</span> <span class="n">id</span><span class="p">))</span>
</code></pre></div></div>
<p>then we can use it to tidy things up a bit:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">crp</span> <span class="n">n</span> <span class="n">a</span> <span class="o">=</span> <span class="n">futu</span> <span class="n">coalg</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o"><</span><span class="n">initial</span> <span class="n">restaurant</span><span class="o">></span><span class="p">)</span> <span class="kr">where</span>
<span class="n">coalg</span> <span class="p">(</span><span class="n">customer</span><span class="p">,</span> <span class="n">tables</span><span class="p">)</span>
<span class="o">|</span> <span class="n">customer</span> <span class="o">>=</span> <span class="n">n</span> <span class="o">=</span> <span class="kt">TF</span><span class="o">.</span><span class="kt">Pure</span> <span class="n">tables</span>
<span class="o">|</span> <span class="n">otherwise</span> <span class="o">=</span>
<span class="kr">let</span> <span class="n">p</span> <span class="o">=</span> <span class="o"><</span><span class="n">probability</span> <span class="kr">of</span> <span class="n">seating</span> <span class="n">customer</span> <span class="n">at</span> <span class="n">new</span> <span class="n">table</span><span class="o">></span>
<span class="kr">in</span> <span class="kt">TF</span><span class="o">.</span><span class="kt">Free</span> <span class="p">(</span><span class="kt">BernoulliF</span> <span class="n">p</span> <span class="p">(</span><span class="nf">\</span><span class="n">accept</span> <span class="o">-></span>
<span class="kr">if</span> <span class="n">accept</span>
<span class="kr">then</span> <span class="n">pure</span> <span class="p">(</span><span class="n">succ</span> <span class="n">customer</span><span class="p">,</span> <span class="o"><</span><span class="n">seat</span> <span class="n">at</span> <span class="n">new</span> <span class="n">table</span><span class="o">></span><span class="p">)</span>
<span class="kr">else</span> <span class="kr">do</span>
<span class="n">u</span> <span class="o"><-</span> <span class="n">uniform</span>
<span class="kr">let</span> <span class="n">res</span> <span class="o">=</span> <span class="o"><</span><span class="n">seat</span> <span class="n">amongst</span> <span class="n">occupied</span> <span class="n">tables</span> <span class="n">using</span> <span class="sc">'u'</span><span class="o">></span>
<span class="n">pure</span> <span class="p">(</span><span class="n">succ</span> <span class="n">customer</span><span class="p">,</span> <span class="n">res</span><span class="p">)))</span>
</code></pre></div></div>
<p>In any case, the futumorphism gets us to a faithfully-encoded CRP.
It allows us to express multiple probabilistic operations in every
recursive call, which is what’s required here due to our use of
conditional probabilities.</p>
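<p>For concreteness, here is one way the placeholders in the pseudocode
might be filled in. Treat it as a sketch under stated assumptions rather
than the definitive version: the ‘Tables’ representation (a list of
tables, each listing its customers newest first), the ‘seatOccupied’
helper, and the choice to start from a restaurant in which customer 0 is
already seated are all mine. It assumes the <em>free</em> and
<em>recursion-schemes</em> libraries.</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}

import Control.Monad.Free (Free (..), liftF)
import qualified Control.Monad.Trans.Free as TF
import Data.Functor.Foldable (futu)

data ModelF r =
    BernoulliF Double (Bool -> r)
  | UniformF (Double -> r)
  deriving Functor

type Model = Free ModelF

-- | Assumed representation: each table lists its customers, newest first.
type Tables = [[Int]]

uniform :: Free (TF.FreeF ModelF a) Double
uniform = liftF (TF.Free (UniformF id))

-- | Seat 'customer' at an occupied table selected by 'u' in [0, 1], with
--   each table weighted by its current size (the conditional probabilities).
seatOccupied :: Double -> Int -> Tables -> Tables
seatOccupied u customer tables = go (u * fromIntegral seated) tables
  where
    seated = sum (map length tables)
    go _ [] = [[customer]]  -- only reached if there are no tables at all
    go x (t : ts)
      | null ts || x < size = (customer : t) : ts
      | otherwise           = t : go (x - size) ts
      where
        size = fromIntegral (length t)

-- | A CRP over 'n' customers with dispersion parameter 'a'. The seed
--   pairs the label of the next customer with the current tables; since
--   customer 0 is seated up front, 'n' is assumed to be at least 1.
crp :: Int -> Double -> Model Tables
crp n a = futu coalg (1, [[0]])
  where
    coalg (customer, tables)
      | customer >= n = TF.Pure tables
      | otherwise =
          -- 'customer' also counts how many people are already seated
          let p = a / (a + fromIntegral customer)
          in  TF.Free (BernoulliF p (\accept ->
                if accept
                  then pure (succ customer, tables ++ [[customer]])
                  else do
                    u <- uniform
                    pure (succ customer, seatOccupied u customer tables)))
</code></pre></div></div>
<p>Given the ‘prob’ interpreter from earlier, something like
‘MWC.sample (prob (crp 10 1)) gen’ should then draw a configuration of
10 customers in the same shape as the realisation shown above.</p>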
<h2 id="alternative-encodings">Alternative Encodings</h2>
<p>Now. Recall that I mentioned the following:</p>
<blockquote>
<p>A natural way to encode the process is to seat each arriving customer
at a new table with the appropriate probability, and then, if it
turns out he is to be sat at an occupied table, to do that with the
appropriate <em>conditional</em> probability (i.e., conditional on the fact
that he’s not going to be seated at a new table).</p>
</blockquote>
<p>This is probably the most natural way to encode the process, and indeed,
if we only have Bernoulli and uniform language terms (or beta, Gaussian,
etc. in place of uniform – something that can at least be transformed
to produce a uniform, in any case), this seems to be the best we can do.
But it is not the <em>only</em> way to encode the CRP. If we have a language
term corresponding to a categorical distribution, then we can instead
choose between a new table and any of the occupied tables simultaneously
using the <em>unconditional</em> probabilities for each.</p>
<p>Let’s adjust our ModelF functor, adding a language term corresponding
to a categorical distribution. It will have as its parameter a list of
probabilities, one for each possible outcome under consideration:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">ModelF</span> <span class="n">r</span> <span class="o">=</span>
<span class="kt">BernoulliF</span> <span class="kt">Double</span> <span class="p">(</span><span class="kt">Bool</span> <span class="o">-></span> <span class="n">r</span><span class="p">)</span>
<span class="o">|</span> <span class="kt">UniformF</span> <span class="p">(</span><span class="kt">Double</span> <span class="o">-></span> <span class="n">r</span><span class="p">)</span>
<span class="o">|</span> <span class="kt">CategoricalF</span> <span class="p">[</span><span class="kt">Double</span><span class="p">]</span> <span class="p">(</span><span class="kt">Int</span> <span class="o">-></span> <span class="n">r</span><span class="p">)</span>
<span class="kr">deriving</span> <span class="kt">Functor</span>
</code></pre></div></div>
<p>With this we’ll be able to encode the CRP using a mere anamorphism, as
the following pseudocode describes:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">crp</span> <span class="n">n</span> <span class="n">a</span> <span class="o">=</span> <span class="n">ana</span> <span class="n">coalg</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o"><</span><span class="n">initial</span> <span class="n">restaurant</span><span class="o">></span><span class="p">)</span> <span class="kr">where</span>
<span class="n">coalg</span> <span class="p">(</span><span class="n">customer</span><span class="p">,</span> <span class="n">tables</span><span class="p">)</span>
<span class="o">|</span> <span class="n">customer</span> <span class="o">>=</span> <span class="n">n</span> <span class="o">=</span> <span class="kt">TF</span><span class="o">.</span><span class="kt">Pure</span> <span class="n">tables</span>
<span class="o">|</span> <span class="n">otherwise</span> <span class="o">=</span>
<span class="kr">let</span> <span class="n">ps</span> <span class="o">=</span> <span class="o"><</span><span class="n">unconditional</span> <span class="n">categorical</span> <span class="n">probabilities</span><span class="o">></span>
<span class="kr">in</span> <span class="kt">TF</span><span class="o">.</span><span class="kt">Free</span> <span class="p">(</span><span class="kt">CategoricalF</span> <span class="n">ps</span> <span class="p">(</span><span class="nf">\</span><span class="n">i</span> <span class="o">-></span>
<span class="kr">if</span> <span class="n">i</span> <span class="o">==</span> <span class="mi">0</span>
<span class="kr">then</span> <span class="p">(</span><span class="n">succ</span> <span class="n">customer</span><span class="p">,</span> <span class="o"><</span><span class="n">seat</span> <span class="n">at</span> <span class="n">new</span> <span class="n">table</span><span class="o">></span><span class="p">)</span>
<span class="kr">else</span> <span class="p">(</span><span class="n">succ</span> <span class="n">customer</span><span class="p">,</span> <span class="o"><</span><span class="n">seat</span> <span class="n">at</span> <span class="n">occupied</span> <span class="n">table</span> <span class="n">'i</span> <span class="o">-</span> <span class="mi">1</span><span class="n">'</span><span class="o">></span><span class="p">)</span>
<span class="p">))</span>
</code></pre></div></div>
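<p>A runnable sketch of that anamorphism might look as follows. As
before, the ‘Tables’ representation, the ‘seatAt’ helper, and the initial
restaurant with customer 0 already seated are assumptions of mine, and an
interpreter is trusted to return an index within the range of the
supplied probabilities.</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}

import Control.Monad.Free (Free (..))
import qualified Control.Monad.Trans.Free as TF
import Data.Functor.Foldable (ana)

data ModelF r =
    BernoulliF Double (Bool -> r)
  | UniformF (Double -> r)
  | CategoricalF [Double] (Int -> r)
  deriving Functor

type Model = Free ModelF

-- | Assumed representation: each table lists its customers, newest first.
type Tables = [[Int]]

-- | Seat 'customer' at the occupied table with zero-based index 'k'.
--   An out-of-range index leaves the tables unchanged.
seatAt :: Int -> Int -> Tables -> Tables
seatAt k customer tables =
  [ if j == k then customer : t else t | (j, t) <- zip [0 ..] tables ]

crp :: Int -> Double -> Model Tables
crp n a = ana coalg (1, [[0]])
  where
    coalg (customer, tables)
      | customer >= n = TF.Pure tables
      | otherwise =
          -- index 0 is the new table; index i is occupied table 'i - 1'
          let denom = a + fromIntegral customer
              ps    = a / denom
                    : [fromIntegral (length t) / denom | t <- tables]
          in  TF.Free (CategoricalF ps (\i ->
                if i == 0
                  then (succ customer, tables ++ [[customer]])
                  else (succ customer, seatAt (i - 1) customer tables)))
</code></pre></div></div>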
<p>Since we only ever express a single probabilistic term on each recursive
call, the “standard” single-layer pattern we’ve used previously applies.
But it’s worth noting that we could still employ the “unconditional,
then conditional” approach using a categorical distribution – we’d just
use the conditional probabilities to sample from the occupied tables
directly, rather than doing it in more low-level fashion using the
single uniform draw. Given an appropriate ‘categorical’ term, that would
look more like:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">crp</span> <span class="n">n</span> <span class="n">a</span> <span class="o">=</span> <span class="n">futu</span> <span class="n">coalg</span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="o"><</span><span class="n">initial</span> <span class="n">restaurant</span><span class="o">></span><span class="p">)</span> <span class="kr">where</span>
<span class="n">coalg</span> <span class="p">(</span><span class="n">customer</span><span class="p">,</span> <span class="n">tables</span><span class="p">)</span>
<span class="o">|</span> <span class="n">customer</span> <span class="o">>=</span> <span class="n">n</span> <span class="o">=</span> <span class="kt">TF</span><span class="o">.</span><span class="kt">Pure</span> <span class="n">tables</span>
<span class="o">|</span> <span class="n">otherwise</span> <span class="o">=</span>
<span class="kr">let</span> <span class="n">p</span> <span class="o">=</span> <span class="o"><</span><span class="n">probability</span> <span class="kr">of</span> <span class="n">seating</span> <span class="n">customer</span> <span class="n">at</span> <span class="n">new</span> <span class="n">table</span><span class="o">></span>
<span class="kr">in</span> <span class="kt">TF</span><span class="o">.</span><span class="kt">Free</span> <span class="p">(</span><span class="kt">BernoulliF</span> <span class="n">p</span> <span class="p">(</span><span class="nf">\</span><span class="n">accept</span> <span class="o">-></span>
<span class="kr">if</span> <span class="n">accept</span>
<span class="kr">then</span> <span class="n">pure</span> <span class="p">(</span><span class="n">succ</span> <span class="n">customer</span><span class="p">,</span> <span class="o"><</span><span class="n">seat</span> <span class="n">at</span> <span class="n">new</span> <span class="n">table</span><span class="o">></span><span class="p">)</span>
<span class="kr">else</span> <span class="kr">do</span>
<span class="kr">let</span> <span class="n">ps</span> <span class="o">=</span> <span class="o"><</span><span class="n">conditional</span> <span class="n">categorical</span> <span class="n">probabilities</span><span class="o">></span>
<span class="n">i</span> <span class="o"><-</span> <span class="n">categorical</span> <span class="n">ps</span>
<span class="n">pure</span> <span class="p">(</span><span class="n">succ</span> <span class="n">customer</span><span class="p">,</span> <span class="o"><</span><span class="n">seat</span> <span class="n">at</span> <span class="n">occupied</span> <span class="n">table</span> <span class="sc">'i'</span><span class="o">></span><span class="p">)))</span>
</code></pre></div></div>
<p>The recursive structure in this case is more complicated, requiring a
futumorphism instead of an anamorphism, but typically the conditional
probabilities are easier to compute.</p>
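<p>The ‘categorical’ term assumed there can be defined in the same way as
‘uniform’ was. This is a minimal sketch, with the extended functor
restated so that the block stands alone:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}

import Control.Monad.Free (Free (..), liftF)
import qualified Control.Monad.Trans.Free as TF

data ModelF r =
    BernoulliF Double (Bool -> r)
  | UniformF (Double -> r)
  | CategoricalF [Double] (Int -> r)
  deriving Functor

-- | A categorical draw for use inside a futumorphism's coalgebra. The
--   result is the index of the selected outcome.
categorical :: [Double] -> Free (TF.FreeF ModelF a) Int
categorical ps = liftF (TF.Free (CategoricalF ps id))
</code></pre></div></div>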
<h2 id="fin">Fin</h2>
<p>So, some takeaways here, if one wants to indulge in this sort of
framework:</p>
<p>If we want to express a stochastic process using multiple probabilistic
operations in any given recursive call, we may need to employ a
scheme that supports richer (co)recursion than a plain anamorphism
or apomorphism. Here a futumorphism fits the bill, since its coalgebra can
emit several layers of the base functor on each recursive call.</p>
<p>The same might be true if our functor is sufficiently limited, as
was the case here if we supported only the Bernoulli and uniform
distributions. There we had no option but to express the recursion using
conditional probabilistic operations, and so needed the richer recursive
structure that a futumorphism provides.</p>
<p>On the other hand, it may not be the case that one <em>needs</em> the richer
structure provided by a futumorphism, if instead one can express the
coalgebra using only a single layer of the base functor. Adding a
primitive categorical distribution to our embedded language eliminated
the need to use conditional probabilities when describing the recursion,
allowing us to drop back to a “basic” anamorphism.</p>
<p><a href="https://gist.github.com/jtobin/8da5c8b46297e4868c25082d74bd1ebf">Here</a> is a gist containing a fleshed-out version of the code
above, if you’d like to play with it. Enjoy!</p>
Kelvin Versioning2020-02-25T00:00:00+04:00https://jtobin.io/kelvin-versioning<p>Long ago, in the distant past, Curtis introduced the idea of <em>kelvin
versioning</em> in an <a href="https://moronlab.blogspot.com/2010/01/urbit-functional-programming-from.html">informal blog post</a> about <a href="https://urbit.org">Urbit</a>. Imagining
the idea of an ancient and long-frozen form of Martian computing, he described
this versioning scheme as follows:</p>
<blockquote>
<p>Some standards are extensible or versionable, but some are not. ASCII, for
instance, is perma-frozen. So is IPv4 (its relationship to IPv6 is little
more than nominal - if they were really the same protocol, they’d have the
same ethertype). Moreover, many standards render themselves incompatible in
practice through excessive enthusiasm for extensibility. They may not be
perma-frozen, but they probably should be.</p>
<p>The true, Martian way to perma-freeze a system is what I call Kelvin
versioning. In Kelvin versioning, releases count down by integer degrees
Kelvin. At absolute zero, the system can no longer be changed. At 1K, one
more modification is possible. And so on. For instance, Nock is at 9K. It
might change, though it probably won’t. Nouns themselves are at 0K - it is
impossible to imagine changing anything about those three sentences.</p>
</blockquote>
<p>Understood in this way, kelvin versioning is very simple. One simply counts
downwards, and at absolute zero (i.e. 0K) no other releases are legal. It is
no more than a versioning scheme designed for abstract components that should
eventually freeze.</p>
<p>Many years later, the Urbit blog described kelvin versioning once more in the
post <a href="https://urbit.org/blog/toward-a-frozen-operating-system/">Toward a Frozen Operating System</a>. This presented a significant
refinement of the original scheme, introducing both recursive and so-called
“telescoping” mechanics to it:</p>
<blockquote>
<p>The right way for this trunk to approach absolute zero is to “telescope” its
Kelvin versions. The rules of telescoping are simple:</p>
<p>If tool B sits on platform A, either both A and B must be at absolute zero,
or B must be warmer than A.</p>
<p>Whenever the temperature of A (the platform) declines, the temperature of B
(the tool) must also decline.</p>
<p>B must state the version of A it was developed against. A, when loading B,
must state its own current version, and the warmest version of itself with
which it’s backward-compatible.</p>
<p>Of course, if B itself is a platform on which some higher-level tool C
depends, it must follow the same constraints recursively.</p>
</blockquote>
<p>This is more or less a complete characterisation of kelvin versioning, but it’s
still not quite precise enough. If one looks at other versioning schemes that
try to communicate some specific semantic content (the most obvious example
being <a href="https://semver.org/">semver</a>), it’s obvious that they take great pains to be formal and
precise about their mechanics.</p>
<p>Experience has demonstrated to me that such formality is necessary. Even the
excerpt above has proven to be ambiguous or underspecified re: the details of
various situations or corner cases that one might run into. These confusions
can be resolved by a rigorous protocol specification, which, in this case, isn’t
very difficult to put together.</p>
<p>Kelvin versioning and its use in Urbit is the subject of the currently-evolving
<a href="https://github.com/urbit/proposals/blob/master/009-arvo-versioning.md">UP9</a>, recent proposed updates to which have not yet been ratified. The
following is my own personal take on and simple formal specification of kelvin
versioning – I believe it resolves any major ambiguities that the original
descriptions may have introduced.</p>
<h2 id="kelvin-versioning-specification">Kelvin Versioning (Specification)</h2>
<p>(The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”,
“SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be
interpreted as described in <a href="https://www.ietf.org/rfc/rfc2119.txt">RFC 2119</a>.)</p>
<p>For any component A following kelvin versioning,</p>
<ol>
<li>
<p>A’s version <strong>SHALL</strong> be a nonnegative integer.</p>
</li>
<li>
<p>A, at any specific version, <strong>MUST NOT</strong> be modified after release.</p>
</li>
<li>
<p>At version 0, new versions of A <strong>MUST NOT</strong> be released.</p>
</li>
<li>
<p>New releases of A <strong>MUST</strong> be assigned a new version, and this version
<strong>MUST</strong> be strictly less than the previous one.</p>
</li>
<li>
<p>If A supports another component B that also follows kelvin versioning, then:</p>
<ul>
<li>Either both A and B <strong>MUST</strong> be at version 0, or B’s version <strong>MUST</strong> be
strictly greater than A’s version.</li>
<li>If a new version of A is released and that version supports B, then a new
version of B <strong>MUST</strong> be released.</li>
</ul>
</li>
</ol>
<p>These rules apply recursively for any kelvin-versioned component C that is
supported by B, and so on.</p>
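<p>As a rough illustration, here’s a minimal Python sketch of how one might
check rule 1 and the first clause of rule 5 mechanically. The <code class="language-plaintext highlighter-rouge">Component</code>
class and <code class="language-plaintext highlighter-rouge">is_legal</code> function are names I’ve made up for this example; they
aren’t part of the spec or of any existing tool.</p>

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Component:
    # Hypothetical model: a name, a kelvin version, and the component
    # (if any) that supports this one.
    name: str
    kelvin: int
    platform: Optional["Component"] = None


def is_legal(c: Component) -> bool:
    """Check rule 1 and the first clause of rule 5 for c and everything below it."""
    if not c.kelvin >= 0:
        return False  # rule 1: versions are nonnegative integers
    p = c.platform
    if p is None:
        return True
    # rule 5: both at absolute zero, or the supported component is strictly warmer
    both_frozen = c.kelvin == 0 and p.kelvin == 0
    return (both_frozen or c.kelvin > p.kelvin) and is_legal(p)


a = Component("A", 10)
b = Component("B", 20, a)
c = Component("C", 21, b)
```

<p>Here <code class="language-plaintext highlighter-rouge">is_legal(c)</code> holds, whereas a component at 20K sitting on B at 20K
would be rejected. The sketch only inspects a single snapshot of the stack; the
rules about releases (2 through 4, and the second clause of 5) concern
transitions between snapshots and aren’t modelled here.</p>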
<h2 id="examples">Examples</h2>
<p>Examples are particularly useful here, so let me go through a few.</p>
<p>Let’s take the following four components, sitting in three layers, as a running
example. Here’s our initial state:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A 10K
B 20K
C 21K
D 30K
</code></pre></div></div>
<p>So we have A at 10K supporting B at 20K. B in turn supports both C at 21K and
D at 30K.</p>
<h3 id="state-1">State 1</h3>
<p>Imagine we have some patches lying around for D and want to release a new
version of it. That’s easy to do; we push out a new version of D. In this
case it will have version one less than 30, i.e. 29K:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A 10K
B 20K
C 21K
D 29K <-- cools from 30K to 29K
</code></pre></div></div>
<p>Easy peasy. This is the most trivial example.</p>
<p>The only possible point of confusion here is: well, what kind of change
warrants a version decrement? And the answer is: any (released) change
whatsoever. Anything with an associated kelvin version is immutable after
being released at that version, analogous to how things are done in any other
versioning scheme.</p>
<h3 id="state-2">State 2</h3>
<p>For a second example, imagine that we now have completed a major refactoring
of A and want to release a new version of that.</p>
<p>Since A supports B, releasing a new version of A obligates us to release a new
version of B as well. And since B supports both C and D, we are obligated,
recursively, to release new versions of those to boot.</p>
<p>The total effect of a new A release is thus the following:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A 9K <-- cools from 10K to 9K
B 19K <-- cools from 20K to 19K
C 20K <-- cools from 21K to 20K
D 28K <-- cools from 29K to 28K
</code></pre></div></div>
<p>This demonstrates the recursive mechanic of kelvin versioning.</p>
<p>An interesting effect of the above mechanic, as described in <a href="https://urbit.org/blog/toward-a-frozen-operating-system/">Toward a Frozen
Operating System</a>, is that anything that depends on (say) A, B, and C only
needs to express its dependency on some version of C. Depending on C at e.g.
20K implicitly specifies a dependency on its supporting component, B, at 19K,
and then A at 9K as well (since any change to A or B must also result in a
change to C).</p>
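<p>The recursive mechanic is easy enough to sketch in Python. The following is
only an illustration under some assumptions of mine: the <code class="language-plaintext highlighter-rouge">versions</code> and
<code class="language-plaintext highlighter-rouge">supports</code> dictionaries and the <code class="language-plaintext highlighter-rouge">release</code> function are hypothetical names, every
release cools by exactly one kelvin, and each component has a single supporting
component.</p>

```python
# Hypothetical representation of the running example after State 1: versions
# by name, plus a map from each component to the components it supports.
versions = {"A": 10, "B": 20, "C": 21, "D": 29}
supports = {"A": ["B"], "B": ["C", "D"], "C": [], "D": []}


def release(name, versions, supports):
    """Return the versions that result from releasing `name`."""
    if versions[name] == 0:
        raise ValueError(name + " is frozen at 0K")
    cooled = dict(versions)  # leave the input untouched
    stack = [name]
    while stack:
        current = stack.pop()
        cooled[current] -= 1  # cool by the minimum of one kelvin
        stack.extend(supports[current])  # everything it supports must cool too
    return cooled


state2 = release("A", versions, supports)
# state2 is {"A": 9, "B": 19, "C": 20, "D": 28}
```

<p>Note that this doesn’t check whether the release is actually permitted; a
real tool would also want to verify that every supported component stays
strictly warmer than what supports it.</p>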
<h3 id="state-3">State 3</h3>
<p>Now imagine that someone has contributed a performance enhancement to C, and
we’d like to release a new version of that.</p>
<p>The interesting thing here is that we’re <em>prohibited</em> from releasing a new
version of C. Recall our current state:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A 9K
B 19K
C 20K <-- one degree K warmer than B
D 28K
</code></pre></div></div>
<p>Releasing a new version of C would require us to cool it by at least one
kelvin, resulting in the warmest possible version of 19K. But since its
supporting component, B, is already at 19K, this would constitute an illegal
state under kelvin versioning. A supporting component must always be strictly
cooler than anything it supports, or be at absolute zero conjointly with
anything it supports.</p>
<p>This illustrates the so-called telescoping mechanic of kelvin versioning – one
is to imagine one of those handheld telescopes made of segments that flatten
into each other when collapsed.</p>
<h3 id="state-4">State 4</h3>
<p>But now, say that we’re finally going to release our new API for B. We release
a new version of B, this one at 18K, which obligates us to in turn release new
versions of C and D:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A 9K
B 18K <-- cools from 19K to 18K
C 19K <-- cools from 20K to 19K
D 27K <-- cools from 28K to 27K
</code></pre></div></div>
<p>In particular, the new version of B gives us the necessary space to release a
new version of C, and, indeed, obligates us to release a new version of it. In
releasing C at 19K, presumably we’d include the performance enhancement that we
were prohibited from releasing in State 3.</p>
<h3 id="state-5">State 5</h3>
<p>A final example that’s simple, but useful to illustrate explicitly, involves
introducing a new component, or replacing a component entirely.</p>
<p>For example: say that we’ve decided to deprecate C and D and replace them with
a single new component, E, supported by B. This is as easy as it sounds:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A 9K
B 18K
E 40K <-- initial release at 40K
</code></pre></div></div>
<p>We just swap in E at the desired initial kelvin version. The initial kelvin
can be chosen arbitrarily; the only restriction is that it be warmer than the
component that supports it (or be at absolute zero conjointly with it).</p>
<p>It’s important to remember that, in this component-resolution of kelvin
versioning, there is no notion of the “total temperature” of the stack. Some
third party could write another component, F, supported by E, with initial
version at 1000K, for example. It doesn’t introduce any extra burden or
responsibility on the maintainers of components A through E.</p>
<h2 id="collective-kelvin-versioning">Collective Kelvin Versioning</h2>
<p>So – all that is well and good for what I’ll call the component-level
mechanics of kelvin versioning. But it’s useful to touch on a related
construct, that of <em>collectively</em> versioning a stack of kelvin-versioned
components. This minor innovation on Curtis’s original idea was put together
by myself and my colleague Philip Monk.</p>
<p>If you have a collection of kelvin-versioned things, e.g. the things in our
initial state from the prior examples:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A 10K
B 20K
C 21K
D 30K
</code></pre></div></div>
<p>then you may want to release all these things, together, as some abstract
thing. Notably, this happens in the case of the Urbit kernel, where the stack
consists of a <a href="/nock">functional VM</a>, an <a href="/basic-hoonery">unapologetically amathematical purely
functional programming language</a>, special-purpose kernel modules, etc.
It’s useful to be able to describe the whole kernel with a single version
number.</p>
<p>To do this in a consistent way, you can select one component in your stack to
serve as a primary index of sorts, and then capture everything it supports via
a patch-like, monotonically decreasing “fractional temperature” suffix.</p>
<p>This is best illustrated via example. If we choose B as our primary index in
the initial state above, for example, we could version the stack collectively
as 20.9K. B provides the 20K, and everything it supports is just lumped into
the “patch version” 9.</p>
<p>If we then consider the example given in State 1, i.e.:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A 10K
B 20K
C 21K
D 29K
</code></pre></div></div>
<p>in which D has cooled by a degree kelvin, then we can version this stack
collectively as 20.8K. If we were to then release another new version of D, at
28K, then we could release the stack collectively as 20.7K. And so on.</p>
<p>There is no strictly prescribed schedule as to how to decrease the fractional
temperature, but the following schedule is recommended:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.9, .8, .7, .., .1, .01, .001, .0001, ..
</code></pre></div></div>
<p>Similarly, the fractional temperature should reset to .9 whenever the primary
index cools. If we consider State 2, for example, where a new release of A
led to every other component in the stack cooling, we had this:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>A 9K
B 19K
C 20K
D 28K
</code></pre></div></div>
<p>Note that B has cooled by a kelvin, so we would version this stack collectively
as 19.9K. The primary index has decreased by a kelvin, and the fractional
temperature has been reset to .9.</p>
<p>While I think examples illustrate this collective scheme most clearly, after my
spiel about the pitfalls of ambiguity it would be remiss of me not to include
a more formal spec:</p>
<h2 id="collective-kelvin-versioning-specification">Collective Kelvin Versioning (Specification)</h2>
<p>(The key words “MUST”, “MUST NOT”, “REQUIRED”, “SHALL”, “SHALL NOT”, “SHOULD”,
“SHOULD NOT”, “RECOMMENDED”, “MAY”, and “OPTIONAL” in this document are to be
interpreted as described in <a href="https://www.ietf.org/rfc/rfc2119.txt">RFC 2119</a>.)</p>
<p>For a collection of kelvin-versioned components K:</p>
<ol>
<li>
<p>K’s version <strong>SHALL</strong> be characterised by a primary index, chosen from a
component in K, and a real number in the interval [0, 1) (the
“fractional temperature”), determined by all components that the primary
index component supports.</p>
<p>The fractional temperature <strong>MAY</strong> be 0 only if the primary index’s version
is 0.</p>
</li>
<li>
<p>K, at any particular version, <strong>MUST NOT</strong> be modified after release.</p>
</li>
<li>
<p>At primary index version 0 and fractional temperature 0, new versions of K
<strong>MUST NOT</strong> be released.</p>
</li>
<li>
<p>New releases of K <strong>MUST</strong> be assigned a new version, and this version
<strong>MUST</strong> be strictly less than the previous one.</p>
</li>
<li>
<p>When a new release of K includes new versions of any component supported by
the primary index, but not a new version of the primary index proper, its
fractional temperature <strong>MUST</strong> be less than the previous version.</p>
<p>Given constant primary index versions, fractional temperatures corresponding
to new releases <strong>SHOULD</strong> decrease according to the following schedule:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.9, .8, .7, .., .1, .01, .001, .0001, ..
</code></pre></div> </div>
</li>
<li>
<p>When a new release of K includes a new version of the primary index, the
fractional temperature <strong>SHOULD</strong> be reset to .9.</p>
</li>
<li>
<p>New versions of K <strong>MAY</strong> be indexed by components other than the primary
index (i.e., K may be “reindexed” at any point). However, the new chosen
component <strong>MUST</strong> either be colder than the primary index it replaces, or
be at version 0 conjointly with the primary index it replaces.</p>
</li>
</ol>
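<p>For what it’s worth, the recommended schedule is simple to generate. Here’s a
small Python sketch; the function names <code class="language-plaintext highlighter-rouge">fractional</code> and <code class="language-plaintext highlighter-rouge">collective</code> are my own
inventions for this example, and I’ve used strings rather than floats so that
something like .001 is rendered exactly.</p>

```python
def fractional(i):
    """Fractional suffix for the i-th release (0-based) at a fixed primary index."""
    if i >= 9:
        # past .1, each release appends another zero: .01, .001, ..
        return "." + "0" * (i - 8) + "1"
    return "." + str(9 - i)  # .9, .8, .., .1


def collective(primary, i):
    """Collective version string for a primary index version and release count."""
    return str(primary) + fractional(i) + "K"


first = collective(20, 0)   # "20.9K"
second = collective(20, 1)  # "20.8K"
reset = collective(19, 0)   # "19.9K", after the primary index cools
```

<p>This only covers the recommended schedule for a warm primary index. It
doesn’t handle the terminal case of a primary index at version 0 with
fractional temperature 0, nor reindexing.</p>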
<h2 id="etc">Etc.</h2>
<p>In my experience, the major concern in adopting a kelvin versioning scheme is
that one will accidentally initialise everything with a set of temperatures
(i.e. versions) that are too cold (i.e. too close to 0), and thus burn through
too many version numbers too quickly on the path to freezing. To alleviate
this, it helps to remember that one has an infinite number of release
candidates available for every component at every temperature.</p>
<p>The convention around release candidates is just to append a suffix to the
next release version along the lines of .rc1, .rc2, etc. One should feel
comfortable using these liberally, iterating through release candidates as
necessary before finally committing to a new version at a properly cooler
temperature.</p>
<p>The applications that might want to adopt kelvin versioning are probably pretty
limited, and may indeed even be restricted to the Urbit kernel itself (Urbit
has been described by some as “that operating system with a kernel that
eventually reaches absolute zero under kelvin versioning”). Nonetheless: I
believe this scheme to certainly be more than a mere marketing gimmick or what
have you, and, at minimum, it makes for an interesting change of pace from
semver.</p>
Email for l33t h4x0rz2020-02-11T00:00:00+04:00https://jtobin.io/email<p>(<strong>UPDATE 2024/09/08</strong>: while hosting your own mailserver is not covered
in this post, I recommend you check out <a href="https://nixos-mailserver.readthedocs.io/en/latest/">Simple NixOS Mailserver</a>
for a borderline trivial way to do it.)</p>
<p>A couple of people recently asked about my email setup, so I figured it might
be best to simply document some of it here.</p>
<p>I run my own mail server for jtobin.io, plus another domain or two, and usually
wind up interacting with gmail for work. I use offlineimap to fetch and sync
mail with these remote servers, msmtp and msmtpq to send mail, mutt as my MUA,
notmuch for search, and <a href="https://www.tarsnap.com/">tarsnap</a> for backups.</p>
<p>There are other details; vim for writing emails, urlview for dealing with URLs,
w3m for viewing HTML, <a href="https://www.passwordstore.org/">pass</a> for password storage, etc. etc. But the
mail setup proper is as above.</p>
<p>I’ll just spell out some of the major config below, and will focus on the
configuration that works with gmail, since that’s probably of broader appeal.
You can get all the software for it via the following in <a href="https://nixos.org/nix/">nixpkgs</a>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mutt offlineimap msmtp notmuch notmuch-mutt
</code></pre></div></div>
<h2 id="offlineimap">offlineimap</h2>
<p>offlineimap is used to sync local and remote email; I use it to manually grab
emails occasionally throughout the day. You could of course set it up to run
automatically as a cron job or what have you, but I like deliberately fetching
my email only when I actually want to deal with it.</p>
<p>Here’s a tweaked version of one of my .offlineimaprc files:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[general]
accounts = work
[Account work]
localrepository = work-local
remoterepository = work-remote
postsynchook = notmuch new
[Repository work-local]
type = Maildir
localfolders = ~/mail/work
sep = /
restoreatime = no
[Repository work-remote]
type = Gmail
remoteuser = FIXME_user@domain.tld
remotepass = FIXME_google_app_password
realdelete = no
ssl = yes
sslcacertfile = /usr/local/etc/openssl/cert.pem
folderfilter = lambda folder: folder not in\
    ['[Gmail]/All Mail', '[Gmail]/Important', '[Gmail]/Starred']
</code></pre></div></div>
<p>You should be able to figure out the gist of this. Pay particular attention to
the ‘remoteuser’, ‘remotepass’, and ‘folderfilter’ options. For ‘remotepass’
in particular you’ll want to generate an app-specific password from Google.
The ‘folderfilter’ option lets you specify the gmail folders that you actually
want to sync; <code class="language-plaintext highlighter-rouge">folder in [..]</code> and <code class="language-plaintext highlighter-rouge">folder not in [..]</code> are probably all you’ll
want here.</p>
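<p>Since ‘folderfilter’ is just a Python predicate on folder names, you can
sanity-check one in a Python REPL before putting it in your config. A quick
sketch follows; the folder names in <code class="language-plaintext highlighter-rouge">folders</code> are made up for illustration, and
what your account actually reports may differ.</p>

```python
# The same predicate as in the config above, bound to a name for illustration.
excluded = ['[Gmail]/All Mail', '[Gmail]/Important', '[Gmail]/Starred']
folderfilter = lambda folder: folder not in excluded

# Hypothetical folder names, roughly as a Gmail account might report them.
folders = ['INBOX', '[Gmail]/All Mail', '[Gmail]/Sent Mail', '[Gmail]/Starred']

# Only folders for which the predicate returns True get synced.
synced = [f for f in folders if folderfilter(f)]
# synced is ['INBOX', '[Gmail]/Sent Mail']
```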
<p>If you <em>don’t</em> want to store your password in cleartext, and instead want to
grab it from an encrypted store, you can use the ‘remotepasseval’ option. I
don’t bother with this for Google accounts that have app-specific passwords,
but do for others.</p>
<p>This involves a little bit of extra setup. First, you can make some Python
functions available to the config file with ‘pythonfile’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[general]
accounts = work
pythonfile = ~/.offlineimap.py
</code></pre></div></div>
<p>Here’s a version of that file that I keep, which grabs the desired password from
<code class="language-plaintext highlighter-rouge">pass(1)</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>#! /usr/bin/env python2
from subprocess import check_output
def get_pass():
    return check_output("pass FIXME_PASSWORD", shell=True).strip("\n")
</code></pre></div></div>
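<p>If you happen to be running offlineimap under Python 3 (an assumption on my
part; the file above targets Python 2), note that <code class="language-plaintext highlighter-rouge">check_output</code> returns bytes
there, so the output needs decoding. A sketch of what that might look like is
below. The <code class="language-plaintext highlighter-rouge">command</code> parameter is something I’ve added purely so that the
function can be exercised without <code class="language-plaintext highlighter-rouge">pass(1)</code> installed.</p>

```python
from subprocess import check_output


def get_pass(command=("pass", "FIXME_PASSWORD")):
    """Run `command` and return its first line of output as text."""
    # check_output returns bytes under Python 3, so decode before use
    output = check_output(list(command)).decode("utf-8")
    # take only the first line, which is where the password is kept by convention
    return output.splitlines()[0]
```

<p>Calling <code class="language-plaintext highlighter-rouge">get_pass()</code> with no arguments behaves like the original, while
something like <code class="language-plaintext highlighter-rouge">get_pass(("echo", "hunter2"))</code> is handy for testing. It will
raise if the command fails or prints nothing.</p>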
<p>Then you can just call the <code class="language-plaintext highlighter-rouge">get_pass</code> function in ‘remotepasseval’ back in
.offlineimaprc:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[Repository work-remote]
type = Gmail
remoteuser = FIXME_user@domain.tld
remotepasseval = get_pass()
realdelete = no
ssl = yes
sslcacertfile = /usr/local/etc/openssl/cert.pem
folderfilter = lambda folder: folder not in\
    ['[Gmail]/All Mail', '[Gmail]/Important', '[Gmail]/Starred']
</code></pre></div></div>
<p>When you’ve got this set up, you should just be able to run <code class="language-plaintext highlighter-rouge">offlineimap</code> to
fetch your email. If you maintain multiple configuration files, it’s helpful
to specify a specific one using <code class="language-plaintext highlighter-rouge">-c</code>, e.g. <code class="language-plaintext highlighter-rouge">offlineimap -c .offlineimaprc-foo</code>.</p>
<h2 id="msmtp-msmtpq">msmtp, msmtpq</h2>
<p>msmtp is used to send emails. It’s a very simple SMTP client. Here’s a
version of my <code class="language-plaintext highlighter-rouge">.msmtprc</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>defaults
auth on
tls on
tls_starttls on
tls_trust_file /usr/local/etc/openssl/cert.pem
logfile ~/.msmtp.log
account work
host smtp.gmail.com
port 587
from FIXME_user@domain.tld
user FIXME_user@domain.tld
password FIXME_google_app_password
account default: work
</code></pre></div></div>
<p>Again, very simple.</p>
<p>You can do a similar thing here if you don’t want to store passwords in
cleartext. Just use ‘passwordeval’ and the desired shell command directly,
e.g.:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>account work
host smtp.gmail.com
port 587
from FIXME_user@domain.tld
user FIXME_user@domain.tld
passwordeval "pass FIXME_PASSWORD"
</code></pre></div></div>
<p>I occasionally like to work offline, so I use msmtpq to queue up emails to send
later. Normally you don’t have to deal with any of this directly, but
occasionally it’s nice to be able to check the queue. You can do that with
<code class="language-plaintext highlighter-rouge">msmtp-queue -d</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ msmtp-queue -d
no mail in queue
</code></pre></div></div>
<p>If there <em>is</em> something stuck in the queue, you can force it to send with
<code class="language-plaintext highlighter-rouge">msmtp-queue -r</code> or <code class="language-plaintext highlighter-rouge">-R</code>. FWIW, this has happened to me while interacting with
gmail under a VPN in the past.</p>
<h2 id="mutt">mutt</h2>
<p>Mutt is a fantastic MUA. Its tagline is “all mail clients suck, this one just
sucks less,” but I really love mutt. It may come as a surprise that working
with email can be a pleasure, especially if you’re accustomed to working with
clunky webmail UIs, but mutt makes it so.</p>
<p>Here’s a pruned-down version of one of my .muttrc files:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set realname = "MyReal Name"
set from = "user@domain.tld"
set use_from = yes
set envelope_from = yes
set mbox_type = Maildir
set sendmail = "~/.nix-profile/bin/msmtpq -a work"
set sendmail_wait = -1
set folder = "~/mail/work"
set spoolfile = "+INBOX"
set record = "+[Gmail]/Sent Mail"
set postponed = "+[Gmail]/Drafts"
set smtp_pass = "FIXME_google_app_password"
set imap_pass = "FIXME_google_app_password"
set signature = "~/.mutt/.signature-work"
set editor = "vim"
set sort = threads
set sort_aux = reverse-last-date-received
set pgp_default_key = "my_default_pgp@key"
set crypt_use_gpgme = yes
set crypt_autosign = yes
set crypt_replysign = yes
set crypt_replyencrypt = yes
set crypt_replysignencrypted = yes
bind index gg first-entry
bind index G last-entry
bind index B imap-fetch-mail
bind index - collapse-thread
bind index _ collapse-all
set alias_file = ~/.mutt/aliases
set sort_alias = alias
set reverse_alias = yes
source $alias_file
auto_view text/html
alternative_order text/plain text/enriched text/html
subscribe my_favourite@mailing.list
macro index <F8> \
"<enter-command>set my_old_pipe_decode=\$pipe_decode my_old_wait_key=\$wait_key nopipe_decode nowait_key<enter>\
<shell-escape>notmuch-mutt -r --prompt search<enter>\
<change-folder-readonly>`echo ${XDG_CACHE_HOME:-$HOME/.cache}/notmuch/mutt/results`<enter>\
<enter-command>set pipe_decode=\$my_old_pipe_decode wait_key=\$my_old_wait_key<enter>i" \
"notmuch: search mail"
macro index <F9> \
"<enter-command>set my_old_pipe_decode=\$pipe_decode my_old_wait_key=\$wait_key nopipe_decode nowait_key<enter>\
<pipe-message>notmuch-mutt -r thread<enter>\
<change-folder-readonly>`echo ${XDG_CACHE_HOME:-$HOME/.cache}/notmuch/mutt/results`<enter>\
<enter-command>set pipe_decode=\$my_old_pipe_decode wait_key=\$my_old_wait_key<enter>" \
"notmuch: reconstruct thread"
macro index l "<enter-command>unset wait_key<enter><shell-escape>read -p 'notmuch query: ' x; echo \$x >~/.cache/mutt_terms<enter><limit>~i \"\`notmuch search --output=messages \$(cat ~/.cache/mutt_terms) | head -n 600 | perl -le '@a=<>;chomp@a;s/\^id:// for@a;$,=\"|\";print@a'\`\"<enter>" "show only messages matching a notmuch pattern"
# patch rendering
# https://karelzak.blogspot.com/2010/02/highlighted-patches-inside-mutt.html
color body green default "^diff \-.*"
color body green default "^index [a-f0-9].*"
color body green default "^\-\-\- .*"
color body green default "^[\+]{3} .*"
color body cyan default "^[\+][^\+]+.*"
color body red default "^\-[^\-]+.*"
color body brightblue default "^@@ .*"
# vim: ft=muttrc:
</code></pre></div></div>
<p>Some comments on all that:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set mbox_type = Maildir
set sendmail = "~/.nix-profile/bin/msmtpq -a work"
set sendmail_wait = -1
set folder = "~/mail/work"
set spoolfile = "+INBOX"
set record = "+[Gmail]/Sent Mail"
set postponed = "+[Gmail]/Drafts"
</code></pre></div></div>
<p>Note here that we’re specifying msmtpq as our sendmail program. The <code class="language-plaintext highlighter-rouge">-a work</code>
option here refers to the account defined in your .msmtprc file, so if you
change the name of it there, you have to do it here as well. Ditto for the
folder.</p>
<p>(If you’re tweaking these config files for your own use, I’d recommend just
substituting all instances of ‘work’ with your own preferred account name.)</p>
<p>The negative ‘sendmail_wait’ value tells mutt to put the sendmail program in
the background rather than wait for it to finish, which suits queueing mails up
when offline.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set smtp_pass = "FIXME_google_app_password"
set imap_pass = "FIXME_google_app_password"
</code></pre></div></div>
<p>Here are the usual cleartext app passwords. If you want to store them
encrypted, there’s a usual method for doing that: add the following to the top
of your .muttrc:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>source "gpg -d ~/.mutt/my-passwords.gpg |"
</code></pre></div></div>
<p>where <code class="language-plaintext highlighter-rouge">.mutt/my-passwords.gpg</code> should be the above <code class="language-plaintext highlighter-rouge">smtp_pass</code> and <code class="language-plaintext highlighter-rouge">imap_pass</code>
assignments, encrypted to your desired GPG key.</p>
<p>Continuing with the file at hand:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set signature = "~/.mutt/.signature-work"
set editor = "vim"
</code></pre></div></div>
<p>These should be self-explanatory. The signature file should just contain the
signature you want appended to your mails (it will be appended under a pair of
dashes). And if you want to use some other editor to compose your emails, just
specify it here.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set pgp_default_key = "my_default_pgp@key"
set crypt_use_gpgme = yes
set crypt_autosign = yes
set crypt_replysign = yes
set crypt_replyencrypt = yes
set crypt_replysignencrypted = yes
</code></pre></div></div>
<p>Mutt is one of the few programs that has great built-in support for PGP. It
can easily encrypt, decrypt, and sign messages, grab public keys, etc. Here
you can see that I’ve set it to autosign messages, reply to encrypted messages
with encrypted messages, and so on.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>bind index gg first-entry
bind index G last-entry
bind index B imap-fetch-mail
bind index - collapse-thread
bind index _ collapse-all
</code></pre></div></div>
<p>These are a few key bindings that I find helpful. The first bunch are familiar
to vim users and are useful for navigating around; the last two are really
useful for compressing or expanding the view of your mailbox.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>set alias_file = ~/.mutt/aliases
set sort_alias = alias
set reverse_alias = yes
source $alias_file
auto_view text/html
alternative_order text/plain text/enriched text/html
</code></pre></div></div>
<p>The alias file lets you define common shortcuts for single or multiple
addresses. I get a lot of use out of multiple address aliases, e.g.:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>alias chums socrates@ago.ra, plato@acade.my, aristotle@lyce.um
</code></pre></div></div>
<p>The MIME type stuff below the alias config is just a sane set of defaults for
viewing common mail formats.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>subscribe my_favourite@mailing.list
</code></pre></div></div>
<p>Mutt makes interacting with mailing lists very easy just by default, but you
can also indicate addresses that you’re subscribed to, as above, to unlock a
few extra features for them (‘list reply’ being a central one). To tell mutt
that you’re subscribed to haskell-cafe, for example, you’d use:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>subscribe haskell-cafe@haskell.org
</code></pre></div></div>
<p>The three longer macros that follow are for notmuch. I really only find myself
using the last one, ‘l’, for search. FWIW, notmuch’s search functionality is
fantastic; I’ve found it to be more useful than gmail’s web UI search, I think.</p>
<p>The patch rendering stuff at the end is just a collection of heuristics for
rendering in-body .patch files well. This is useful if you subscribe to a
patch-heavy mailing list, e.g. LKML or <code class="language-plaintext highlighter-rouge">git@vger.kernel.org</code>, or if you just
want to be able to better communicate about diffs in your day-to-day emails
with your buddies.</p>
<h2 id="fin">Fin</h2>
<p>There are obviously endless ways you can configure all this stuff, especially
mutt, and common usage patterns that you’ll quickly find yourself falling into.
But whatever you find those to be, the above should at least get you up and
running pretty quickly with 80% of the desired feature set.</p>
Basic Hoonery2019-02-04T00:00:00+04:00https://jtobin.io/basic-hoonery<p>In <a href="/nock">my last post</a> I first introduced <a href="https://github.com/jtobin/hnock">hnock</a>, a little interpreter
for <a href="https://urbit.org/docs/learn/arvo/nock/definition/">Nock</a>, and then demonstrated it on a hand-rolled decrement function.
In this post I’ll look at how one can handle the same (contrived, but
illustrative) task in <a href="https://urbit.org/docs/learn/arvo/hoon/">Hoon</a>.</p>
<p>Hoon is the higher- or application-level programming language for working with
<a href="https://urbit.org/docs/learn/arvo/">Arvo</a>, the operating system of <a href="http://urbit.org/">Urbit</a>. The best way I can
describe it is something like “Haskell meets C meets J meets the environment is
always explicit.”</p>
<p>As a typed, functional language, Hoon feels surprisingly low-level. One is
never allocating or deallocating memory explicitly when programming in Hoon,
but the experience somehow feels similar to working in C. The idea is that the
language should be simple and straightforward and support a fairly limited
level of abstraction. There are the usual low-level functional idioms (map,
reduce, etc.), as well as a structural type system to keep the programmer
honest, but at its core, Hoon is something of a functional Go (a language
which, I happen to think, is <a href="http://yager.io/programming/go.html">not good</a>).</p>
<p>It’s not a complex language, like Scala or Rust, nor a language that overtly
supports sky-high abstraction, like Haskell or Idris. Hoon is supposed to
exist at a sweet spot for getting work done. And I am at least willing to buy
the argument that it is pretty good for getting work done in Urbit.</p>
<p>Recall our naïve decrement function in Haskell. It looked like this:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dec</span> <span class="o">::</span> <span class="kt">Integer</span> <span class="o">-></span> <span class="kt">Integer</span>
<span class="n">dec</span> <span class="n">m</span> <span class="o">=</span>
  <span class="kr">let</span> <span class="n">loop</span> <span class="n">n</span>
        <span class="o">|</span> <span class="n">succ</span> <span class="n">n</span> <span class="o">==</span> <span class="n">m</span> <span class="o">=</span> <span class="n">n</span>
        <span class="o">|</span> <span class="n">otherwise</span>   <span class="o">=</span> <span class="n">loop</span> <span class="p">(</span><span class="n">succ</span> <span class="n">n</span><span class="p">)</span>
  <span class="kr">in</span>  <span class="n">loop</span> <span class="mi">0</span>
</code></pre></div></div>
<p>Let’s look at a number of ways to write this in Hoon, showing off some of the
most important Hoon programming concepts in the process.</p>
<h2 id="cores">Cores</h2>
<p>Here’s a Hoon version of decrement. Note that to the uninitiated, Hoon looks
<em>gnarly</em>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>|=  m=@
=/  n=@  0
=/  loop
  |%
  ++  recur
    ?:  =(+(n) m)
      n
    recur(n +(n))
  --
recur:loop
</code></pre></div></div>
<p>We can read it as follows:</p>
<ul>
<li>Define a function that takes an argument, ‘m’, having type atom (recall
that an atom is an unsigned integer).</li>
<li>Define a local variable called ‘n’, having type atom and value 0, and add it
to the environment (or, if you recall our Nock terminology, to the
<em>subject</em>).</li>
<li>Define a local variable called ‘loop’, with precise definition to follow, and
add it to the environment.</li>
<li>‘loop’ is a <em>core</em>, i.e. more or less a named collection of functions.
Define one such function (or <em>arm</em>), ‘recur’, that checks to see if the
increment of ‘n’ is equal to ‘m’, returning ‘n’ if so, and calling itself,
except with the value of ‘n’ in the environment changed to ‘n + 1’, if not.</li>
<li>Evaluate ‘recur’ as defined in ‘loop’.</li>
</ul>
<p>(To test this, you can enter the Hoon line-by-line into the Arvo <a href="https://urbit.org/docs/learn/arvo/arvo-internals/shell/">dojo</a>.
Just preface it with something like <code class="language-plaintext highlighter-rouge">=core-dec</code> to give it a name, and call it
via e.g. <code class="language-plaintext highlighter-rouge">(core-dec 20)</code>.)</p>
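<p>If it helps, here’s a rough Haskell analogue of the reading above, with the
subject made explicit as a record. The names ‘Subject’, ‘subjN’, ‘subjM’, and
‘coreDec’ are mine and purely illustrative; this is a sketch of the idea, and
not a claim about how Hoon represents anything internally:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- The subject: everything 'recur' can see (hypothetical names throughout).
data Subject = Subject
  { subjN :: Integer  -- the counter 'n'
  , subjM :: Integer  -- the argument 'm'
  }

-- Plays the role of the 'recur' arm; it is always evaluated against a subject.
recur :: Subject -> Integer
recur s
  | succ (subjN s) == subjM s = subjN s
  -- like recur(n +(n)): the same arm, against a subject where only 'n' differs
  | otherwise = recur s { subjN = succ (subjN s) }

-- Build the initial subject (n = 0, m = the argument) and evaluate 'recur'.
coreDec :: Integer -> Integer
coreDec m = recur (Subject { subjN = 0, subjM = m })
</code></pre></div></div>
<p>Evaluating <code class="language-plaintext highlighter-rouge">coreDec 20</code> walks ‘n’ up from 0 until its successor equals ‘m’.
As with the Hoon version, an argument of 0 never terminates.</p>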
<p>Hoon may appear to be a write-only language, though I’ve found this to not
necessarily be the case (just to note, at present I’ve read more Hoon code than
I’ve written). Good Hoon has a terse and very <em>vertical</em> style. The principle
that keeps it readable is that, roughly, each line should contain one important
logical operation. These operations are denoted by <em>runes</em>, the <code class="language-plaintext highlighter-rouge">=/</code> and <code class="language-plaintext highlighter-rouge">?:</code>
and similar ASCII digraphs sprinkled along the left hand columns of the above
example. This makes it look similar to e.g. <a href="https://en.wikipedia.org/wiki/J_(programming_language)">J</a> – a language I have
long loved, but never mastered – although in J the rough ‘one operator per
line’ convention is not typically in play.</p>
<p>In addition to the standard digraph runes, there is also a healthy dose of
‘irregular’ syntax in most Hoon code for simple operations that one uses
frequently. Examples used above include <code class="language-plaintext highlighter-rouge">=(a b)</code> for equality testing, <code class="language-plaintext highlighter-rouge">+(n)</code>
for incrementing an atom, and <code class="language-plaintext highlighter-rouge">foo(a b)</code> for evaluating ‘foo’ with the value of
‘a’ in the environment changed to ‘b’. Each of these could be replaced with a
more standard rune-based expression, though for such operations the extra
verbosity is not usually warranted.</p>
<p>Cores like ‘loop’ seem, to me, to be the mainstay workhorse of Hoon
programming. A core is more or less a structure, or object, or dictionary, or
whatever, of functions. One defines them liberally, constructs a subject (i.e.
environment) to suit, and then evaluates them, or some part of them, against
the subject.</p>
<p>To be more precise, a core is a Nock expression; like every non-atomic value in
Nock, it is a tree. Starting from the cell <code class="language-plaintext highlighter-rouge">[l r]</code>, the left subtree, ‘l’, is
a tree of Nock formulas (i.e. the functions, like ‘recur’, defined in the
core). The right subtree, ‘r’, is all the data required to evaluate those Nock
formulas. The traditional name for the left subtree, ‘l’, is the <em>battery</em> of
the core; the traditional name for the right subtree is the <em>payload</em>.</p>
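<p>As a sketch of that layout in Haskell, assuming a bare-bones ‘Noun’ type (and
with ‘slot’, ‘battery’, ‘payload’, and ‘exampleCore’ being hypothetical names of
my own), tree addresses make the split concrete. Address 2 is the battery and
address 3 is the payload:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data Noun = Atom Integer | Cell Noun Noun
  deriving (Eq, Show)

-- Tree addressing, following Nock's '/' operator: 1 is the whole noun,
-- 2 the head, 3 the tail, and 2k / 2k+1 the head / tail of address k.
slot :: Integer -> Noun -> Maybe Noun
slot 1 noun       = Just noun
slot 2 (Cell l _) = Just l
slot 3 (Cell _ r) = Just r
slot a noun
  | a > 3     = slot (a `div` 2) noun >>= slot (2 + a `mod` 2)
  | otherwise = Nothing

battery, payload :: Noun -> Maybe Noun
battery = slot 2
payload = slot 3

-- A stand-in core shaped like [battery [n m]]. The atom 99 is a placeholder
-- where a real battery would hold Nock formulas.
exampleCore :: Noun
exampleCore = Cell (Atom 99) (Cell (Atom 0) (Atom 20))
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">slot 6 exampleCore</code> and <code class="language-plaintext highlighter-rouge">slot 7 exampleCore</code> pick out the two halves
of the payload, which is where our hand-rolled Nock kept ‘n’ and ‘m’.</p>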
<p>One is always building up a local environment in Hoon and then evaluating some
value against it. Aside from the arm ‘recur’, the core ‘loop’ also contains in
its payload the values ‘m’ and ‘n’. The expression ‘recur:loop’ – irregular
syntax for <code class="language-plaintext highlighter-rouge">=< recur loop</code> – means “use ‘loop’ as the environment and
evaluate ‘recur’.” <em>Et voilà</em>, that’s how we get our decrement.</p>
<p>You’ll note that this should feel very similar to the way we defined decrement
in Nock. Our hand-assembled Nock code, slightly cleaned up, looked like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[8
[1 0]
8
[1
6
[5 [4 0 6] [0 7]]
[0 6]
2 [[0 2] [4 0 6] [0 7]] [0 2]
]
2 [0 1] [0 2]
]
</code></pre></div></div>
<p>This formula, when evaluated against an atom subject, creates <em>another</em> subject
from it, defining a ‘loop’ analogue that looks in specific addresses in the
subject for itself, as well as the ‘m’ and ‘n’ variables, such that it produces
the decrement of the original subject. Our Hoon code does much the same –
every ‘top-level’ rune expression adds something to the subject, until we get
to the final expression, ‘recur:loop’, which evaluates ‘recur’ against the
subject, ‘loop’.</p>
<p>The advantage of Hoon, in comparison to Nock, is that we can work with names,
instead of raw tree addresses, as well as with higher-level abstractions like
cores. The difference between Hoon and Nock really is like the difference
between C and assembly!</p>
<p>For what it’s worth, here is the compiled Nock corresponding to our above
decrement function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[8
[1 0]
8
[8
[1
6
[5 [4 0 6] 0 30]
[0 6]
9 2 10 [6 4 0 6] 0 1
]
0 1
]
7 [0 2] 9 2 0 1
]
</code></pre></div></div>
<p>It’s similar, though not identical, to our hand-rolled Nock. In particular,
you can see that it is adding a constant conditional formula, including the
familiar equality check, to the subject (note that the equality check, using
Nock-5, refers to address 30 instead of 7 – presumably this is because I have
more junk floating around in my dojo subject). Additionally, the formulas
using Nock-9 and Nock-10 reduce to Nock-2 and Nock-0, just like our hand-rolled
code does.</p>
<p>But our Hoon is doing more than the bespoke Nock version did, so we’re not
getting quite the same code. Worth noting is the ‘extra’ use of Nock-8, which
is presumably required because I’ve defined both ‘recur’, the looping function,
and ‘loop’, the core to hold it, and the hand-rolled Nock obviously didn’t
involve a core.</p>
<h2 id="doors">Doors</h2>
<p>Here’s another way to write decrement, using another fundamental Hoon
construct, the <em>door</em>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>|=  m=@
=/  loop
  |_  n=@
  ++  recur
    ?:  =(+(n) m)
      n
    ~(recur ..recur +(n))
  --
~(recur loop 0)
</code></pre></div></div>
<p>A door is a core that takes an argument. Here we’ve used the <code class="language-plaintext highlighter-rouge">|_</code> rune,
instead of <code class="language-plaintext highlighter-rouge">|%</code>, to define ‘loop’, and note that it takes ‘n’ as an argument.
So instead of ‘n’ being defined external to the core, as it was in the previous
example, here we have to specify it explicitly when we call ‘recur’. Note that
this is more similar to our Haskell example, in which ‘loop’ was defined as a
function taking ‘n’ as an argument.</p>
<p>The two other novel things here are the <code class="language-plaintext highlighter-rouge">~(recur ..recur +(n))</code> and <code class="language-plaintext highlighter-rouge">~(recur
loop 0)</code> expressions, which actually turn out to be mostly the same thing. The
syntax:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~(arm door argument)
</code></pre></div></div>
<p>is irregular, and means “evaluate ‘arm’ in ‘door’ using ‘argument’”. So in the
last line, <code class="language-plaintext highlighter-rouge">~(recur loop 0)</code> means “evaluate ‘recur’ in ‘loop’ with n set to 0.”
In the definition of ‘recur’, on the other hand, we need to refer to the door
that contains it, but are in the very process of defining that thing. The
‘..recur’ syntax means “the door that contains ‘recur’,” and is useful for
exactly this task, given we can’t yet refer to ‘loop’. The syntax <code class="language-plaintext highlighter-rouge">~(recur
..recur +(n))</code> means “evaluate ‘recur’ in its parent door with n set to n + 1.”</p>
<p>Let’s check the compiled Nock of this version:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[8
[8
[1 0]
[1
6
[5 [4 0 6] 0 30]
[0 6]
8
[0 1]
9 2 10 [6 7 [0 3] 4 0 6] 0 2
]
0 1
]
8
[0 2]
9 2 10 [6 7 [0 3] 1 0] 0 2
]
</code></pre></div></div>
<p>There’s even more going on here than in our core-implemented decrement, but
doors are a generalisation of cores, so that’s to be expected.</p>
<p>Hoon has special support, though, for <em>one-armed</em> doors. This is precisely how
functions (also called <em>gates</em> or <em>traps</em>, depending on the context) are
implemented in Hoon. The following is probably the most idiomatic version of
naïve decrement:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>|=  m=@
=/  n  0
|-
?:  =(+(n) m)
  n
$(n +(n))
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">|=</code> rune that we’ve been using throughout these examples really defines a
door, taking the specified argument, with a single arm called ‘$’. The <code class="language-plaintext highlighter-rouge">|-</code>
rune here does the same, except it immediately calls the ‘$’ arm after defining
it. The last line, <code class="language-plaintext highlighter-rouge">$(n +(n))</code>, is analogous to the <code class="language-plaintext highlighter-rouge">recur(n +(n))</code> line in
our first example: it evaluates the ‘$’ arm, except changing the value of ‘n’
to ‘n + 1’ in the environment.</p>
<p>(Note that there are two ‘$’ arms defined in the above code – one via the use
of <code class="language-plaintext highlighter-rouge">|=</code>, and one via the use of <code class="language-plaintext highlighter-rouge">|-</code>. But there is no confusion as to which
one we mean, since the latter has been the latest to be added to the subject.
Additions to the subject are always <em>prepended</em> in Hoon – i.e. they are
placed at address 2. As the topmost ‘$’ in the subject is the one that
corresponds to <code class="language-plaintext highlighter-rouge">|-</code>, it is resolved first.)</p>
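<p>A toy model of that resolution order in Haskell, treating the subject as a
stack of named bindings. ‘Subject’, ‘extend’, and ‘resolve’ are hypothetical
names, and this is only a loose analogy; it says nothing about how Hoon actually
performs the lookup:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- A toy subject: a stack of (name, description) bindings, newest first.
type Subject = [(String, String)]

-- Adding to the subject prepends, so a new binding sits in front of older ones.
extend :: String -> String -> Subject -> Subject
extend name value subject = (name, value) : subject

-- Resolution returns the first, i.e. most recently added, match.
resolve :: String -> Subject -> Maybe String
resolve = lookup

-- Bindings are applied right to left: the gate's '$' goes in first,
-- then 'n', and the trap's '$' goes in last.
example :: Maybe String
example = resolve "$" subject
  where
    subject = extend "$" "trap from |-"
            . extend "n" "0"
            . extend "$" "gate from |="
            $ []
</code></pre></div></div>
<p>Since the trap’s ‘$’ was added last, it is the one that ‘example’ finds.</p>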
<p>The compiled Nock for this version looks like the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[8
[1 0]
8
[1
6
[5 [4 0 6] 0 30]
[0 6]
9 2 10 [6 4 0 6] 0 1
]
9 2 0 1
]
</code></pre></div></div>
<p>And it is possible (see the appendix) to show that, modulo some different
addressing, this reduces exactly to our hand-rolled Nock code.</p>
<p><strong>UPDATE</strong>: my colleague Ted Blackman, an <em>actual</em> Hoon programmer, recommended
the following as a slightly more idiomatic version of naïve decrement:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>=|  n=@
|=  m=@
^-  @
?:  =(+(n) m)
  n
$(n +(n))
</code></pre></div></div>
<p>Note that here we’re declaring ‘n’ outside of the gate itself by using another
rune, <code class="language-plaintext highlighter-rouge">=|</code>, that gives the variable a default value based on its type (an
atom’s default value is 0). There’s also an explicit type cast via <code class="language-plaintext highlighter-rouge">^- @</code>,
indicating that the gate produces an atom (like type signatures in Haskell, it
is considered good practice to include these, even though they may not strictly
be required).</p>
<p>Declaring ‘n’ outside the gate is interesting. It has an imperative feel,
as if one were writing the code in Python, or were using a monad like ‘State’
or a ‘PrimMonad’ in Haskell. Like in the Haskell case, we aren’t actually
doing any mutation here, of course – we’re creating new subjects to evaluate
each iteration of our Nock formula against. And the resulting Nock is very
succinct:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[6
[5 [4 0 14] 0 6]
[0 14]
9 2 10 [14 4 0 14] 0 1
]
</code></pre></div></div>
<h2 id="basic-generators">Basic Generators</h2>
<p>If you tested the above examples, I instructed you to do so by typing them into
Arvo’s dojo. I’ve come to believe that, in general, this is a poor way to
teach Hoon. It should be reserved for only the most introductory examples
(such as the ones I’ve provided here).</p>
<p>If you’ve learned Haskell, you are familiar with the REPL provided by GHCi, the
Glasgow Haskell Compiler’s interpreter. Code running in GHCi is implicitly
running in the IO monad, and I think this leads to confusion amongst newcomers
who must then mentally separate “Haskell in GHC” from “Haskell in GHCi.”</p>
<p>I think there is a similar problem in Hoon. Expressions entered into the dojo
implicitly grow or shrink or otherwise manipulate the dojo’s subject, which is
not, in general, available to standalone Hoon programs. Such standalone Hoon
programs are called <em>generators</em>. In general, they’re what you will use when
working in Hoon and Arvo.</p>
<p>There are four kinds of generators: naked, <code class="language-plaintext highlighter-rouge">%say</code>, <code class="language-plaintext highlighter-rouge">%ask</code>, and <code class="language-plaintext highlighter-rouge">%get</code>. In this
post we’ll just look at the first two; the last couple are out of scope, for
now.</p>
<h3 id="naked-generators">Naked Generators</h3>
<p>The simplest kind of generator is the ‘naked’ generator, which just exists in
a file somewhere in your Urbit’s “desk.” If you save the following as
<code class="language-plaintext highlighter-rouge">naive-decrement.hoon</code> in an Urbit’s <code class="language-plaintext highlighter-rouge">home/gen</code> directory, for example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>|=  m=@
=/  n  0
|-
?:  =(+(n) m)
  n
$(n +(n))
</code></pre></div></div>
<p>Then you’ll be able to run it in a dojo via:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~zod:dojo> +naive-decrement 20
19
</code></pre></div></div>
<p>A naked generator can only be a simple function (technically, a gate) that
produces a noun. It has no access to any external environment – it’s
basically just a self-contained function in a file. It must have an argument,
and it must have only one argument; to pass multiple values to a naked
generator, one must use a cell.</p>
<h3 id="say-generators">Say Generators</h3>
<p>Hoon is a purely functional language, but, unlike Haskell, it also has no IO
monad to demarcate I/O effects. Hoon programs do not produce effects on their
own at all – instead, they construct nouns that tell <em>Arvo</em> how to produce
some effect or other.</p>
<p>A <code class="language-plaintext highlighter-rouge">%say</code> generator (where <code class="language-plaintext highlighter-rouge">%say</code> is a symbol) produces a noun, but it can also
make use of provided environment data (e.g. date information, entropy, etc.).
The idea is that the generator has a specific structure that Arvo knows how to
handle, in order to supply it with the requisite information. Specifically,
<code class="language-plaintext highlighter-rouge">%say</code> generators have the structure:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>:-  %say
|=  [<environment data> <list of arguments> <list of optional arguments>]
:-  %noun
<code>
</code></pre></div></div>
<p>I’ll avoid discussing what a list is in Hoon at the moment, and we won’t
actually use any environment data in any examples here. But if you dump the
following in <code class="language-plaintext highlighter-rouge">home/gen/naive-decrement.hoon</code>, for example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>:-  %say
|=  [* [m=@ ~] ~]
:-  %noun
=/  n  0
|-
?:  =(+(n) m)
  n
$(n +(n))
</code></pre></div></div>
<p>you can call it from the dojo via the same mechanism as before:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~zod:dojo> +naive-decrement 20
19
</code></pre></div></div>
<p>The generator itself actually returns a particularly-structured noun; a cell
with the symbol <code class="language-plaintext highlighter-rouge">%say</code> as its head, and a gate returning a pair of the symbol
<code class="language-plaintext highlighter-rouge">%noun</code> and a noun as its tail. The <code class="language-plaintext highlighter-rouge">%noun</code> symbol describes the data produced
by the generator. But note that this is not displayed when evaluating the
generator in the dojo – instead, we just get the noun itself, but this
behaviour is dojo-dependent.</p>
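<p>To make that shape a bit more tangible, here is a loose Haskell model of it.
Everything in it (‘Env’, ‘Marked’, ‘Generator’, ‘runInDojo’) is a hypothetical
stand-in of mine and not an Urbit API; Haskell lists stand in for Hoon’s, and the
environment is left empty:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data Noun = Atom Integer | Cell Noun Noun
  deriving (Eq, Show)

-- Stand-in for the environment data (date, entropy, etc.); contents elided.
data Env = Env

-- The gate's product: a symbol such as "noun" paired with the noun produced.
data Marked = Marked { mark :: String, datum :: Noun }
  deriving (Eq, Show)

-- A %say generator: the tag (here, the constructor) paired with a gate over
-- the environment, the arguments, and the optional arguments.
newtype Generator = Say (Env -> [Noun] -> [Noun] -> Marked)

naiveDecrement :: Generator
naiveDecrement = Say gate
  where
    gate _ [Atom m] [] = Marked "noun" (Atom (loop 0))
      where
        loop n
          | succ n == m = n
          | otherwise   = loop (succ n)
    gate _ _ _ = error "expected exactly one atom argument"

-- Roughly what the dojo does: supply the inputs, then display only the noun.
runInDojo :: Generator -> [Noun] -> Noun
runInDojo (Say gate) args = datum (gate Env args [])
</code></pre></div></div>
<p>In this model, <code class="language-plaintext highlighter-rouge">runInDojo naiveDecrement [Atom 20]</code> discards the symbol and
hands back just the noun, mirroring what we saw at the prompt.</p>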
<p>I think one should almost get in the habit of writing <code class="language-plaintext highlighter-rouge">%say</code> generators for
most Hoon code, even if a simple naked generator or throwaway dojo command
would do the trick. They are so important for getting things done in Hoon that
it helps to learn about and start using them sooner rather than later.</p>
<h2 id="fin">Fin</h2>
<p>I’ve introduced Hoon and given a brief tour of what I think are some of the
most important tools for getting work done in the language. Cores, doors, and
gates will get you plenty far, and early exposure to generators, in the form of
the basic naked and <code class="language-plaintext highlighter-rouge">%say</code> variants, will help you avoid the habit of
programming in the dojo, and get you writing more practically-structured Hoon
code from the get-go.</p>
<p>I haven’t had time in this post to describe Hoon’s type system, which is
another very important topic when it comes to getting work done in the
language. I’ll probably write one more to create a small trilogy of sorts –
stay tuned.</p>
<h2 id="appendix">Appendix</h2>
<p>Let’s demonstrate that the compiled Nock code from our idiomatic, trap-based
decrement reduces to the same as our hand-rolled Nock, save different address
use. Recall that our compiled Nock code was:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[8
[1 0]
8
[1
6
[5 [4 0 6] 0 30]
[0 6]
9 2 10 [6 4 0 6] 0 1
]
9 2 0 1
]
</code></pre></div></div>
<p>An easy reduction is from Nock-9 to Nock-2. Note that <code class="language-plaintext highlighter-rouge">*[a 9 b c]</code> is the same
as <code class="language-plaintext highlighter-rouge">*[*[a c] 2 [0 1] 0 b]</code>. When ‘c’ is <code class="language-plaintext highlighter-rouge">[0 1]</code>, we have that <code class="language-plaintext highlighter-rouge">*[a c] = a</code>,
such that <code class="language-plaintext highlighter-rouge">*[a 9 b [0 1]]</code> is the same as <code class="language-plaintext highlighter-rouge">*[a 2 [0 1] 0 b]</code>, i.e. that the
formula <code class="language-plaintext highlighter-rouge">[9 b 0 1]</code> is the same as the formula <code class="language-plaintext highlighter-rouge">[2 [0 1] 0 b]</code>. We can thus
reduce the use of Nock-9 on the last line to:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[8
[1 0]
8
[1
6
[5 [4 0 6] 0 30]
[0 6]
9 2 10 [6 4 0 6] 0 1
]
2 [0 1] 0 2
]
</code></pre></div></div>
<p>The remaining formula involving Nock-9 evaluates <code class="language-plaintext highlighter-rouge">[10 [6 4 0 6] 0 1]</code> against
the subject, and then evaluates <code class="language-plaintext highlighter-rouge">[2 [0 1] [0 2]]</code> against the result. Note
that, for some subject ‘a’, we have:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>*[a 10 [6 4 0 6] 0 1]
= #[6 *[a 4 0 6] *[a 0 1]]
= #[6 *[a 4 0 6] a]
= #[3 [*[a 4 0 6] /[7 a]] a]
= #[1 [/[2 a] [*[a 4 0 6] /[7 a]]] a]
= [/[2 a] [*[a 4 0 6] /[7 a]]]
= [*[a 0 2] [*[a 4 0 6] *[a 0 7]]]
= *[a [0 2] [4 0 6] [0 7]]
</code></pre></div></div>
<p>such that <code class="language-plaintext highlighter-rouge">[10 [6 4 0 6] 0 1] = [[0 2] [4 0 6] [0 7]]</code>. And for
<code class="language-plaintext highlighter-rouge">c = [[0 2] [4 0 6] [0 7]]</code> and some subject ‘a’, we have:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>*[a 9 2 c]
= *[*[a c] 2 [0 1] 0 2]
</code></pre></div></div>
<p>and for <code class="language-plaintext highlighter-rouge">b = [2 [0 1] 0 2]</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>*[*[a c] b]
= *[a 7 c b]
= *[a 7 [[0 2] [4 0 6] [0 7]] [2 [0 1] 0 2]]
</code></pre></div></div>
<p>such that:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[9 2 [0 2] [4 0 6] [0 7]] = [7 [[0 2] [4 0 6] [0 7]] [2 [0 1] 0 2]]
</code></pre></div></div>
<p>Now. Note that for any subject ‘a’ we have:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>*[a 7 [[0 2] [4 0 6] [0 7]] [2 [0 1] 0 2]]
= *[a 7 [[0 2] [4 0 6] [0 7]] *[a 0 2]]
</code></pre></div></div>
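<p>since <code class="language-plaintext highlighter-rouge">*[a 2 [0 1] 0 2] = *[a *[a 0 2]]</code>, and since the subject produced by the
first formula has the same head as ‘a’, so that looking up address 2 in either
one yields the same formula. Thus, we can reduce:</p>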
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>*[a 7 [[0 2] [4 0 6] [0 7]] *[a 0 2]]
= *[*[a [0 2] [4 0 6] [0 7]] *[a 0 2]]
= *[a 2 [[0 2] [4 0 6] [0 7]] [0 2]]
</code></pre></div></div>
<p>such that</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[7 [[0 2] [4 0 6] [0 7]] [2 [0 1] 0 2]] = [2 [[0 2] [4 0 6] [0 7]] [0 2]]
</code></pre></div></div>
<p>and, so that, finally, we can reduce the compiled Nock to:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[8
[1 0]
8
[1
6
[5 [4 0 6] 0 30]
[0 6]
2 [[0 2] [4 0 6] [0 7]] 0 2
]
2 [0 1] 0 2
]
</code></pre></div></div>
<p>which, aside from the use of the dojo-assigned address 30 (and any reduction
errors on this author’s part), is the same as our hand-rolled Nock.</p>
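<p>For a mechanical cross-check of the reduction, here is a minimal sketch of a
Nock evaluator in Haskell that runs both formulas. It is not hnock; the names
(‘parse’, ‘slot’, ‘edit’, ‘nock’, ‘runOn’) are my own, hints (Nock-11) are left
out, and I’ve assumed that address 30 in the compiled code can be swapped for 7
so that it runs against a bare atom subject instead of a dojo subject:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import Data.Char (isDigit)

data Noun = Atom Integer | Cell Noun Noun
  deriving (Eq, Show)

-- Parse Nock text such as "[8 [1 0] 2 [0 1] 0 2]"; cells associate to the right.
parse :: String -> Maybe Noun
parse str = case noun (words (concatMap pad str)) of
    Just (n, []) -> Just n
    _            -> Nothing
  where
    pad ch = if ch `elem` "[]" then [' ', ch, ' '] else [ch]
    noun ("[" : ts) = do
      (ns, rest) <- items ts
      if length ns < 2 then Nothing else Just (foldr1 Cell ns, rest)
    noun (t : ts) | all isDigit t = Just (Atom (read t), ts)
    noun _ = Nothing
    items ("]" : ts) = Just ([], ts)
    items ts = do
      (n, rest)   <- noun ts
      (ns, rest') <- items rest
      Just (n : ns, rest')

-- /[axis noun]: tree address lookup.
slot :: Integer -> Noun -> Maybe Noun
slot 1 n          = Just n
slot 2 (Cell l _) = Just l
slot 3 (Cell _ r) = Just r
slot a n
  | a > 3     = slot (a `div` 2) n >>= slot (2 + a `mod` 2)
  | otherwise = Nothing

-- #[axis new target]: replace the subtree at an address.
edit :: Integer -> Noun -> Noun -> Maybe Noun
edit 1 new _ = Just new
edit a new target
  | a < 2     = Nothing
  | even a    = slot (a + 1) target >>= \sib -> edit (a `div` 2) (Cell new sib) target
  | otherwise = slot (a - 1) target >>= \sib -> edit (a `div` 2) (Cell sib new) target

-- *[subject formula], covering opcodes 0 through 10.
nock :: Noun -> Noun -> Maybe Noun
nock a formula = case formula of
    Cell (Cell b c) d -> Cell <$> nock a (Cell b c) <*> nock a d
    Cell (Atom 0) (Atom b) -> slot b a
    Cell (Atom 1) b -> Just b
    Cell (Atom 2) (Cell b c) -> do
      s <- nock a b
      f <- nock a c
      nock s f
    Cell (Atom 3) b -> isCell <$> nock a b
    Cell (Atom 4) b -> nock a b >>= inc
    Cell (Atom 5) (Cell b c) -> same <$> nock a b <*> nock a c
    Cell (Atom 6) (Cell b (Cell c d)) -> nock a b >>= \t -> case t of
      Atom 0 -> nock a c
      Atom 1 -> nock a d
      _      -> Nothing
    Cell (Atom 7) (Cell b c) -> nock a b >>= \s -> nock s c
    Cell (Atom 8) (Cell b c) -> nock a b >>= \x -> nock (Cell x a) c
    Cell (Atom 9) (Cell (Atom b) c) -> do
      core <- nock a c
      arm  <- slot b core
      nock core arm
    Cell (Atom 10) (Cell (Cell (Atom b) c) d) -> do
      new    <- nock a c
      target <- nock a d
      edit b new target
    _ -> Nothing
  where
    isCell n = Atom (case n of { Cell {} -> 0; Atom {} -> 1 })
    inc n    = case n of { Atom k -> Just (Atom (k + 1)); Cell {} -> Nothing }
    same x y = Atom (if x == y then 0 else 1)

-- The hand-rolled formula from the post, verbatim.
handRolled :: String
handRolled =
  "[8 [1 0] 8 [1 6 [5 [4 0 6] [0 7]] [0 6] 2 [[0 2] [4 0 6] [0 7]] [0 2]] 2 [0 1] [0 2]]"

-- The compiled formula, with address 30 swapped for 7 (an assumption, see above).
compiled :: String
compiled =
  "[8 [1 0] 8 [1 6 [5 [4 0 6] 0 7] [0 6] 9 2 10 [6 4 0 6] 0 1] 9 2 0 1]"

-- Evaluate a formula, given as text, against an atom subject.
runOn :: Integer -> String -> Maybe Noun
runOn m text = parse text >>= nock (Atom m)
</code></pre></div></div>
<p>With these definitions, <code class="language-plaintext highlighter-rouge">runOn 20 handRolled</code> and <code class="language-plaintext highlighter-rouge">runOn 20 compiled</code> should
agree, which is what the reduction above predicts.</p>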
A Nock Interpreter2019-01-31T00:00:00+04:00https://jtobin.io/nock<p>I wrote a little <a href="https://urbit.org/docs/learn/arvo/nock/definition/">Nock</a> interpreter called <a href="https://github.com/jtobin/hnock">hnock</a> some months ago
and just yesterday updated it to support the latest version of Nock, 4K. Nock
– the base layer VM of <a href="https://urbit.org/">Urbit</a> – is a very simple little “functional
assembly language” of sorts. It is of particular interest in that it is
capable of practical computation (indeed, it is Turing complete) but is not
defined in terms of the lambda calculus. So, no variable naming and
capture-avoidance and the like to deal with, which is kind of neat.</p>
<p>Nock (at 4K) itself has a small specification, repeated below. Famously, it
fits on a t-shirt:</p>
<div class="language-text highlighter-rouge"><div class="highlight"><pre class="highlight"><code>nock(a)             *a
[a b c]             [a [b c]]
?[a b]              0
?a                  1
+[a b]              +[a b]
+a                  1 + a
=[a a]              0
=[a b]              1
/[1 a]              a
/[2 a b]            a
/[3 a b]            b
/[(a + a) b]        /[2 /[a b]]
/[(a + a + 1) b]    /[3 /[a b]]
/a                  /a
#[1 a b]            a
#[(a + a) b c]      #[a [b /[(a + a + 1) c]] c]
#[(a + a + 1) b c]  #[a [/[(a + a) c] b] c]
#a                  #a
*[a [b c] d]        [*[a b c] *[a d]]
*[a 0 b]            /[b a]
*[a 1 b]            b
*[a 2 b c]          *[*[a b] *[a c]]
*[a 3 b]            ?*[a b]
*[a 4 b]            +*[a b]
*[a 5 b c]          =[*[a b] *[a c]]
*[a 6 b c d]        *[a *[[c d] 0 *[[2 3] 0 *[a 4 4 b]]]]
*[a 7 b c]          *[*[a b] c]
*[a 8 b c]          *[[*[a b] a] c]
*[a 9 b c]          *[*[a c] 2 [0 1] 0 b]
*[a 10 [b c] d]     #[b *[a c] *[a d]]
*[a 11 [b c] d]     *[[*[a c] *[a d]] 0 3]
*[a 11 b c]         *[a c]
*a                  *a
</code></pre></div></div>
<p>Perhaps you are a neophyte Haskeller and have never implemented an interpreter
for a language before. Nock makes for an excellent target to practice on.</p>
<h3 id="expressions">Expressions</h3>
<p>A <em>noun</em> in Nock is either an <em>atom</em>, i.e. an unsigned integer, or a <em>cell</em>,
i.e. an ordered pair of nouns:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">Noun</span> <span class="o">=</span>
    <span class="kt">Atom</span> <span class="kt">Integer</span>
  <span class="o">|</span> <span class="kt">Cell</span> <span class="kt">Noun</span> <span class="kt">Noun</span>
  <span class="kr">deriving</span> <span class="kt">Eq</span>
</code></pre></div></div>
<p>Very simple. A Nock <em>expression</em> is then a noun, or some <em>operator</em> applied to
a noun. Per the spec, the operators are denoted by <code class="language-plaintext highlighter-rouge">?</code>, <code class="language-plaintext highlighter-rouge">+</code>, <code class="language-plaintext highlighter-rouge">=</code>, <code class="language-plaintext highlighter-rouge">/</code>, <code class="language-plaintext highlighter-rouge">#</code>,
and <code class="language-plaintext highlighter-rouge">*</code>, and the fun, convenient, and authentically Urbit-y way to pronounce
these are <em>wut</em>, <em>lus</em>, <em>tis</em>, <em>net</em>, <em>hax</em>, and <em>tar</em> respectively:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">Expr</span> <span class="o">=</span>
    <span class="kt">Noun</span> <span class="kt">Noun</span>
  <span class="o">|</span> <span class="kt">Wut</span> <span class="kt">Noun</span>
  <span class="o">|</span> <span class="kt">Lus</span> <span class="kt">Noun</span>
  <span class="o">|</span> <span class="kt">Tis</span> <span class="kt">Noun</span>
  <span class="o">|</span> <span class="kt">Net</span> <span class="kt">Noun</span>
  <span class="o">|</span> <span class="kt">Hax</span> <span class="kt">Noun</span>
  <span class="o">|</span> <span class="kt">Tar</span> <span class="kt">Noun</span>
  <span class="kr">deriving</span> <span class="kt">Eq</span>
</code></pre></div></div>
<p>So, equipped with the above definitions, you can write Nock expressions in
Haskell. For example, the following expression:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>*[57 [4 0 1]]
</code></pre></div></div>
<p>maps to:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kt">Tar</span>
  <span class="p">(</span><span class="kt">Cell</span> <span class="p">(</span><span class="kt">Atom</span> <span class="mi">57</span><span class="p">)</span>
    <span class="p">(</span><span class="kt">Cell</span> <span class="p">(</span><span class="kt">Atom</span> <span class="mi">4</span><span class="p">)</span>
      <span class="p">(</span><span class="kt">Cell</span> <span class="p">(</span><span class="kt">Atom</span> <span class="mi">0</span><span class="p">)</span> <span class="p">(</span><span class="kt">Atom</span> <span class="mi">1</span><span class="p">))))</span>
</code></pre></div></div>
<p>Note per the spec that cells associate to the right, thus something like <code class="language-plaintext highlighter-rouge">[4 0
1]</code> maps to <code class="language-plaintext highlighter-rouge">[4 [0 1]]</code>, and thus <code class="language-plaintext highlighter-rouge">Cell (Atom 4) (Cell (Atom 0) (Atom 1))</code>.</p>
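<p>If writing out nested ‘Cell’ constructors by hand gets tedious, a small helper
can do the right-association for you. This is a sketch; ‘cells’ is a
hypothetical name of mine, and I’ve added a ‘Show’ instance purely for
convenience:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data Noun = Atom Integer | Cell Noun Noun
  deriving (Eq, Show)

-- Build a right-nested cell from a list, so that [a b c] becomes [a [b c]].
-- Note that foldr1 is partial: an empty list is an error, and a singleton
-- list just returns its element.
cells :: [Noun] -> Noun
cells = foldr1 Cell

-- [4 0 1], i.e. Cell (Atom 4) (Cell (Atom 0) (Atom 1))
example :: Noun
example = cells [Atom 4, Atom 0, Atom 1]
</code></pre></div></div>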
<h3 id="evaluation-and-semantics">Evaluation and Semantics</h3>
<p>For evaluation one can more or less copy out the production rules from the Nock
spec. We can use an Either type to denote a lack of defined semantics:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">Error</span> <span class="o">=</span> <span class="kt">Error</span> <span class="kt">Noun</span>
  <span class="kr">deriving</span> <span class="kt">Show</span>
<span class="kr">type</span> <span class="kt">Possibly</span> <span class="o">=</span> <span class="kt">Either</span> <span class="kt">Error</span>
</code></pre></div></div>
<p>and then simply define the various evaluation functions appropriately. ‘wut’,
for example, checks if something is a cell:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">wut</span> <span class="o">::</span> <span class="kt">Noun</span> <span class="o">-></span> <span class="kt">Possibly</span> <span class="kt">Noun</span>
<span class="n">wut</span> <span class="n">noun</span> <span class="o">=</span> <span class="n">return</span> <span class="o">$</span> <span class="kr">case</span> <span class="n">noun</span> <span class="kr">of</span>
  <span class="kt">Cell</span> <span class="p">{}</span> <span class="o">-></span> <span class="kt">Atom</span> <span class="mi">0</span>
  <span class="kt">Atom</span> <span class="p">{}</span> <span class="o">-></span> <span class="kt">Atom</span> <span class="mi">1</span>
</code></pre></div></div>
<p>(note that in Nock and Urbit more generally, 0 denotes ‘true’).</p>
<p>‘tis’ checks if the elements of a cell are equal:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tis</span> <span class="o">::</span> <span class="kt">Noun</span> <span class="o">-></span> <span class="kt">Possibly</span> <span class="kt">Noun</span>
<span class="n">tis</span> <span class="n">noun</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">noun</span> <span class="kr">of</span>
  <span class="kt">Atom</span> <span class="p">{}</span> <span class="o">-></span> <span class="kt">Left</span> <span class="p">(</span><span class="kt">Error</span> <span class="n">noun</span><span class="p">)</span>
  <span class="kt">Cell</span> <span class="n">m</span> <span class="n">n</span> <span class="o">-></span> <span class="n">return</span> <span class="o">$</span>
    <span class="kr">if</span> <span class="n">m</span> <span class="o">==</span> <span class="n">n</span>
    <span class="kr">then</span> <span class="kt">Atom</span> <span class="mi">0</span>
    <span class="kr">else</span> <span class="kt">Atom</span> <span class="mi">1</span>
</code></pre></div></div>
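<p>In the same style, here’s a sketch of how ‘lus’ might look, following the
<code class="language-plaintext highlighter-rouge">+[a b]</code> and <code class="language-plaintext highlighter-rouge">+a</code> rules from the spec. The types are repeated so that the
snippet stands alone (with a ‘Show’ instance added to ‘Noun’ so that ‘Error’ can
derive one), and the body is my reading of the spec and not necessarily
hnock’s exact code:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data Noun = Atom Integer | Cell Noun Noun
  deriving (Eq, Show)

data Error = Error Noun
  deriving Show

type Possibly = Either Error

-- 'lus' increments an atom. The spec gives +[a b] no reduction, so a cell
-- is reported as an error.
lus :: Noun -> Possibly Noun
lus noun = case noun of
  Cell {} -> Left (Error noun)
  Atom m  -> return (Atom (1 + m))
</code></pre></div></div>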
<p>And so on. There are other operators for lookups, substitution, <em>et cetera</em>.
The most involved operator is ‘tar’, which constitutes the Nock interpreter
proper. One can simply match on the cases appropriately:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tar</span> <span class="o">::</span> <span class="kt">Noun</span> <span class="o">-></span> <span class="kt">Possibly</span> <span class="kt">Noun</span>
<span class="n">tar</span> <span class="n">noun</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">noun</span> <span class="kr">of</span>
  <span class="kt">Cell</span> <span class="n">a</span> <span class="p">(</span><span class="kt">Cell</span> <span class="p">(</span><span class="kt">Cell</span> <span class="n">b</span> <span class="n">c</span><span class="p">)</span> <span class="n">d</span><span class="p">)</span> <span class="o">-></span> <span class="kr">do</span>
    <span class="n">tard0</span> <span class="o"><-</span> <span class="n">tar</span> <span class="p">(</span><span class="kt">Cell</span> <span class="n">a</span> <span class="p">(</span><span class="kt">Cell</span> <span class="n">b</span> <span class="n">c</span><span class="p">))</span>
    <span class="n">tard1</span> <span class="o"><-</span> <span class="n">tar</span> <span class="p">(</span><span class="kt">Cell</span> <span class="n">a</span> <span class="n">d</span><span class="p">)</span>
    <span class="n">return</span> <span class="p">(</span><span class="kt">Cell</span> <span class="n">tard0</span> <span class="n">tard1</span><span class="p">)</span>
  <span class="kt">Cell</span> <span class="n">a</span> <span class="p">(</span><span class="kt">Cell</span> <span class="p">(</span><span class="kt">Atom</span> <span class="mi">0</span><span class="p">)</span> <span class="n">b</span><span class="p">)</span> <span class="o">-></span>
    <span class="n">net</span> <span class="p">(</span><span class="kt">Cell</span> <span class="n">b</span> <span class="n">a</span><span class="p">)</span>
  <span class="kt">Cell</span> <span class="kr">_</span> <span class="p">(</span><span class="kt">Cell</span> <span class="p">(</span><span class="kt">Atom</span> <span class="mi">1</span><span class="p">)</span> <span class="n">b</span><span class="p">)</span> <span class="o">-></span>
    <span class="n">return</span> <span class="n">b</span>
  <span class="kt">Cell</span> <span class="n">a</span> <span class="p">(</span><span class="kt">Cell</span> <span class="p">(</span><span class="kt">Atom</span> <span class="mi">2</span><span class="p">)</span> <span class="p">(</span><span class="kt">Cell</span> <span class="n">b</span> <span class="n">c</span><span class="p">))</span> <span class="o">-></span> <span class="kr">do</span>
    <span class="n">tard0</span> <span class="o"><-</span> <span class="n">tar</span> <span class="p">(</span><span class="kt">Cell</span> <span class="n">a</span> <span class="n">b</span><span class="p">)</span>
    <span class="n">tard1</span> <span class="o"><-</span> <span class="n">tar</span> <span class="p">(</span><span class="kt">Cell</span> <span class="n">a</span> <span class="n">c</span><span class="p">)</span>
    <span class="n">tar</span> <span class="p">(</span><span class="kt">Cell</span> <span class="n">tard0</span> <span class="n">tard1</span><span class="p">)</span>
  <span class="c1">-- ... and so on</span>
</code></pre></div></div>
<p>It is particularly useful to look at ‘tar’ to get an idea of what is going on
in Nock. One is always evaluating a noun to another noun. Semantics are only
defined when the noun being evaluated is a cell, and that cell is said to have
the structure:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[subject formula]
</code></pre></div></div>
<p>The subject is essentially the environment, reified explicitly. The formula
is the code to execute against the subject.</p>
<p>The resulting noun is called a ‘product’. So, evaluation proceeds as:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[subject formula] -> product
</code></pre></div></div>
<p>In any valid application of ‘tar’, i.e. for some <code class="language-plaintext highlighter-rouge">*[x y]</code>, ‘x’ is the subject,
and ‘y’ is the formula. One case of ‘tar’, for example, is defined via:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>*[a 0 b] /[b a]
</code></pre></div></div>
<p>Recall that cells in Nock are right-associative, so <code class="language-plaintext highlighter-rouge">[a 0 b]</code> is really <code class="language-plaintext highlighter-rouge">[a [0
b]]</code>. In other words: <code class="language-plaintext highlighter-rouge">a</code> is the subject, <code class="language-plaintext highlighter-rouge">[0 b]</code> is the formula. The formula
<code class="language-plaintext highlighter-rouge">[0 b]</code>, read as ‘Nock-zero’, means “look up the value at address ‘b’ in the
subject,” the subject here being ‘a’. So <code class="language-plaintext highlighter-rouge">*[a 0 b]</code> means “look up the value
at address ‘b’ in ‘a’.”</p>
<p>Other formulas in the various ‘tar’ cases denote additional useful language
features. A formula <code class="language-plaintext highlighter-rouge">[1 b]</code> is the constant function, <code class="language-plaintext highlighter-rouge">[6 b c d]</code> is a
conditional, <code class="language-plaintext highlighter-rouge">[7 b c]</code> is function (viz. formula) composition, and so on.
There is a Turing-complete amount of power packed onto that t-shirt!</p>
<p>In any case, once one has constructed all of these requisite evaluation
functions, we can stitch them together in a single evaluator:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">eval</span> <span class="o">::</span> <span class="kt">Expr</span> <span class="o">-></span> <span class="kt">Possibly</span> <span class="kt">Noun</span>
<span class="n">eval</span> <span class="n">expr</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">expr</span> <span class="kr">of</span>
  <span class="kt">Noun</span> <span class="n">noun</span> <span class="o">-></span> <span class="n">return</span> <span class="n">noun</span>
  <span class="kt">Wut</span> <span class="n">e</span> <span class="o">-></span> <span class="n">wut</span> <span class="n">e</span>
  <span class="kt">Lus</span> <span class="n">e</span> <span class="o">-></span> <span class="n">lus</span> <span class="n">e</span>
  <span class="kt">Tis</span> <span class="n">e</span> <span class="o">-></span> <span class="n">tis</span> <span class="n">e</span>
  <span class="kt">Net</span> <span class="n">e</span> <span class="o">-></span> <span class="n">net</span> <span class="n">e</span>
  <span class="kt">Hax</span> <span class="n">e</span> <span class="o">-></span> <span class="n">hax</span> <span class="n">e</span>
  <span class="kt">Tar</span> <span class="n">e</span> <span class="o">-></span> <span class="n">tar</span> <span class="n">e</span>
</code></pre></div></div>
<h3 id="parsing">Parsing</h3>
<p>Writing a combinator-based parser is an easy job. You can use <a href="https://hackage.haskell.org/package/parsec">parsec</a>,
for example, and define your combinators as follows.</p>
<p>Given low-level ‘atom’ and ‘cell’ parsers, a noun is just an atom or a cell:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">noun</span> <span class="o">::</span> <span class="kt">Monad</span> <span class="n">m</span> <span class="o">=></span> <span class="kt">P</span><span class="o">.</span><span class="kt">ParsecT</span> <span class="kt">T</span><span class="o">.</span><span class="kt">Text</span> <span class="n">u</span> <span class="n">m</span> <span class="kt">Noun</span>
<span class="n">noun</span> <span class="o">=</span>
      <span class="kt">P</span><span class="o">.</span><span class="n">try</span> <span class="n">cell</span>
  <span class="o"><|></span> <span class="n">atom</span>
</code></pre></div></div>
<p>and an expression is either an operator followed by a noun, or just a noun:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">expr</span> <span class="o">::</span> <span class="kt">Monad</span> <span class="n">m</span> <span class="o">=></span> <span class="kt">P</span><span class="o">.</span><span class="kt">ParsecT</span> <span class="kt">T</span><span class="o">.</span><span class="kt">Text</span> <span class="n">u</span> <span class="n">m</span> <span class="kt">Expr</span>
<span class="n">expr</span> <span class="o">=</span>
      <span class="kt">P</span><span class="o">.</span><span class="n">try</span> <span class="n">operator</span>
  <span class="o"><|></span> <span class="n">fmap</span> <span class="kt">Noun</span> <span class="n">noun</span>

<span class="n">operator</span> <span class="o">::</span> <span class="kt">Monad</span> <span class="n">m</span> <span class="o">=></span> <span class="kt">P</span><span class="o">.</span><span class="kt">ParsecT</span> <span class="kt">T</span><span class="o">.</span><span class="kt">Text</span> <span class="n">u</span> <span class="n">m</span> <span class="kt">Expr</span>
<span class="n">operator</span> <span class="o">=</span> <span class="kr">do</span>
  <span class="n">op</span> <span class="o"><-</span> <span class="kt">P</span><span class="o">.</span><span class="n">oneOf</span> <span class="s">"?+=/#*"</span>
  <span class="kr">case</span> <span class="n">op</span> <span class="kr">of</span>
    <span class="sc">'?'</span> <span class="o">-></span> <span class="n">fmap</span> <span class="kt">Wut</span> <span class="n">noun</span>
    <span class="sc">'+'</span> <span class="o">-></span> <span class="n">fmap</span> <span class="kt">Lus</span> <span class="n">noun</span>
    <span class="sc">'='</span> <span class="o">-></span> <span class="n">fmap</span> <span class="kt">Tis</span> <span class="n">noun</span>
    <span class="sc">'/'</span> <span class="o">-></span> <span class="n">fmap</span> <span class="kt">Net</span> <span class="n">noun</span>
    <span class="sc">'#'</span> <span class="o">-></span> <span class="n">fmap</span> <span class="kt">Hax</span> <span class="n">noun</span>
    <span class="sc">'*'</span> <span class="o">-></span> <span class="n">fmap</span> <span class="kt">Tar</span> <span class="n">noun</span>
    <span class="kr">_</span> <span class="o">-></span> <span class="n">fail</span> <span class="s">"op: bad token"</span>
</code></pre></div></div>
<p>Note that we don’t allow expressions like the following, taken from the Nock
spec:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>*[a *[[c d] 0 *[[2 3] 0 *[a 4 4 b]]]]
</code></pre></div></div>
<p>These are used to define the production rules, but are not valid Nock
expressions per se.</p>
<p>The end parser is simply:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">parse</span> <span class="o">::</span> <span class="kt">T</span><span class="o">.</span><span class="kt">Text</span> <span class="o">-></span> <span class="kt">Either</span> <span class="kt">P</span><span class="o">.</span><span class="kt">ParseError</span> <span class="kt">Expr</span>
<span class="n">parse</span> <span class="o">=</span> <span class="kt">P</span><span class="o">.</span><span class="n">runParser</span> <span class="n">expr</span> <span class="kt">[]</span> <span class="s">"input"</span>
</code></pre></div></div>
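<p>The low-level ‘atom’ and ‘cell’ parsers aren’t shown above. As a rough
sketch of what they need to do, here is a version written against ReadP from
‘base’ rather than parsec (so it is not the interpreter’s own code), with a
locally redeclared ‘Noun’ type and a hypothetical ‘parseNoun’ helper. The main
point is that a bracketed sequence of nouns is nested to the right:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import Data.Char (isDigit, isSpace)
import Text.ParserCombinators.ReadP
  (ReadP, char, eof, munch1, pfail, readP_to_S, sepBy1, skipSpaces, (+++))

-- Redeclared locally so that the sketch stands alone.
data Noun = Atom Integer | Cell Noun Noun
  deriving (Eq, Show)

-- An atom is a nonempty run of decimal digits.
atom :: ReadP Noun
atom = fmap (Atom . read) (munch1 isDigit)

-- A cell is a bracketed, whitespace-separated sequence of two or more
-- nouns, nested to the right: [a b c] is read as [a [b c]].
cell :: ReadP Noun
cell = do
  _ <- char '['
  skipSpaces
  nouns <- sepBy1 noun (munch1 isSpace)
  skipSpaces
  _ <- char ']'
  case nouns of
    (_ : _ : _) -> pure (foldr1 Cell nouns)
    _           -> pfail

noun :: ReadP Noun
noun = cell +++ atom

-- Hypothetical helper: parse a complete noun, rejecting trailing input.
parseNoun :: String -> Maybe Noun
parseNoun input = case readP_to_S full input of
  [(n, "")] -> Just n
  _         -> Nothing
  where
    full = do
      skipSpaces
      n <- noun
      skipSpaces
      eof
      pure n
</code></pre></div></div>
<p>With this, <code class="language-plaintext highlighter-rouge">parseNoun "[1 2 3]"</code> and <code class="language-plaintext highlighter-rouge">parseNoun "[1 [2 3]]"</code> produce the
same noun, while a one-element “cell” like <code class="language-plaintext highlighter-rouge">[1]</code> is rejected.</p>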
<h3 id="example">Example</h3>
<p>The final interpreter is pretty simple. Here we distinguish between cases in
order to report either parsing or evaluation errors in basic fashion,
respectively:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hnock</span> <span class="o">::</span> <span class="kt">T</span><span class="o">.</span><span class="kt">Text</span> <span class="o">-></span> <span class="kt">Noun</span>
<span class="n">hnock</span> <span class="n">input</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">parse</span> <span class="n">input</span> <span class="kr">of</span>
  <span class="kt">Left</span> <span class="n">perr</span> <span class="o">-></span> <span class="n">error</span> <span class="p">(</span><span class="n">show</span> <span class="n">perr</span><span class="p">)</span>
  <span class="kt">Right</span> <span class="n">ex</span> <span class="o">-></span> <span class="kr">case</span> <span class="n">eval</span> <span class="n">ex</span> <span class="kr">of</span>
    <span class="kt">Left</span> <span class="n">err</span> <span class="o">-></span> <span class="n">error</span> <span class="p">(</span><span class="n">show</span> <span class="n">err</span><span class="p">)</span>
    <span class="kt">Right</span> <span class="n">e</span> <span class="o">-></span> <span class="n">e</span>
</code></pre></div></div>
<p>It’s easy to test. Take the simple Nock-zero example given previously, where
<code class="language-plaintext highlighter-rouge">*[a 0 b]</code> means “look up the value at address ‘b’ in ‘a’”. Nock’s addressing
scheme is simple; address ‘2’ means the leftmost component of a cell, and ‘3’
means the rightmost component. For every additional cell you want to recurse
into, you just multiply by two and add one as is necessary. Address ‘1’ is
just the cell itself.</p>
<p>So, let’s test:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~ hnock "*[[[1 2] [3 4]] [0 1]]"
[[1 2] [3 4]]
</code></pre></div></div>
<p>Again, the formula <code class="language-plaintext highlighter-rouge">[0 1]</code> means look up at address one, which is the cell
itself.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~ hnock "*[[[1 2] [3 4]] [0 2]]"
[1 2]
</code></pre></div></div>
<p>The formula <code class="language-plaintext highlighter-rouge">[0 2]</code> means look up at address 2, which is the leftmost component
of the cell.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~ hnock "*[[[1 2] [3 4]] [0 7]]"
4
</code></pre></div></div>
<p>To understand <code class="language-plaintext highlighter-rouge">[0 7]</code>, we first look up the rightmost component <code class="language-plaintext highlighter-rouge">[3 4]</code> at
address 3, then multiply by two and add one for the rightmost component of
that.</p>
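<p>The addressing rule can be sketched as a small standalone function. This is
only an illustration: the ‘Noun’ type is redeclared so that the snippet stands
alone, ‘lookupAt’ is a hypothetical name playing the role of ‘net’ (taking the
address and the noun as separate arguments rather than as a cell), and ‘Maybe’
stands in for ‘Possibly’:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Redeclared locally so that the sketch stands alone.
data Noun = Atom Integer | Cell Noun Noun
  deriving (Eq, Show)

-- Look up the noun at an address: 1 is the whole noun, address 2n is the
-- head of whatever sits at address n, and 2n + 1 is its tail.
lookupAt :: Integer -> Noun -> Maybe Noun
lookupAt addr noun
  | addr < 1  = Nothing
  | addr == 1 = Just noun
  | otherwise = do
      parent <- lookupAt (addr `div` 2) noun
      case parent of
        Cell l r -> Just (if even addr then l else r)
        Atom _   -> Nothing

-- The subject [[1 2] [3 4]] from the examples above.
subject :: Noun
subject = Cell (Cell (Atom 1) (Atom 2)) (Cell (Atom 3) (Atom 4))
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">lookupAt 7 subject</code> first resolves address 3 to <code class="language-plaintext highlighter-rouge">[3 4]</code>, and then
takes its tail, since 7 is odd.</p>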
<p>The most famous Nock learner’s problem is to implement a function for
<em>decrementing</em> an atom (if you could do this in ~2010 you could receive an
Urbit galaxy for it, which is now probably worth a serious chunk of change).
Nock’s only arithmetic operator is increment, so the trick is to decrement by
actually counting up from zero.</p>
<p>(N.b. in practice this is not how one actually evaluates decrement in Nock.
The compiler is ‘jetted’, such that alternate implementations, e.g. the CPU’s
decrement instruction directly, can be used to evaluate semantically-identical
expressions.)</p>
<p>But the idea is simple enough. To warm up, we can implement this kind of
decrement in Haskell. We won’t handle the underflow condition, just to keep
things simple:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dec</span> <span class="o">::</span> <span class="kt">Integer</span> <span class="o">-></span> <span class="kt">Integer</span>
<span class="n">dec</span> <span class="n">m</span> <span class="o">=</span>
  <span class="kr">let</span> <span class="n">loop</span> <span class="n">n</span>
        <span class="o">|</span> <span class="n">succ</span> <span class="n">n</span> <span class="o">==</span> <span class="n">m</span> <span class="o">=</span> <span class="n">n</span>
        <span class="o">|</span> <span class="n">otherwise</span> <span class="o">=</span> <span class="n">loop</span> <span class="p">(</span><span class="n">succ</span> <span class="n">n</span><span class="p">)</span>
  <span class="kr">in</span> <span class="n">loop</span> <span class="mi">0</span>
</code></pre></div></div>
<p>Look at the language features we’re using to accomplish this. There’s:</p>
<ul>
<li>Variable naming / declaration, for the ‘m’, ‘loop’, and ‘n’ arguments</li>
<li>Integer increment, via the built-in ‘succ’</li>
<li>Equality testing, via ‘==’</li>
<li>Conditional expressions, via the ‘|’ guards</li>
<li>A function call, in ‘loop 0’</li>
</ul>
<p>Let’s try and duplicate this in Nock, step by step. We’ll want analogues for
all of our declared variables, to start.</p>
<p>A Nock formula that simply returns the subject it’s evaluated against is easy:
<code class="language-plaintext highlighter-rouge">[0 1]</code>. We used it above.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~ hnock "*[10 [0 1]]"
10
</code></pre></div></div>
<p>That’s the equivalent of our ‘m’ declaration in the Haskell code.</p>
<p>Similarly, we can just ignore the subject and return a ‘0’ via <code class="language-plaintext highlighter-rouge">[1 0]</code>.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~ hnock "*[10 [1 0]]"
0
</code></pre></div></div>
<p>That’s the initial value, 0, that appears in ‘loop 0’.</p>
<p>We can use Nock-8 to pair these together. It evaluates a formula against the
subject, and then uses the original subject, augmented with the product, to
evaluate another one:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~ hnock "*[10 [8 [1 0] [0 1]]]"
[0 10]
</code></pre></div></div>
<p>The last thing we need to define is the ‘loop’ function itself. We’re going to
place it at the head of the subject, i.e. at address two. Our ‘n’ and ‘m’
variables will then be at addresses six and seven, respectively. When we
define our ‘loop’ analogue, we just need to refer to those addresses, instead
of our variable names.</p>
<p>(N.b. Nock’s addressing scheme seems more or less equivalent to a
tree-flavoured variant of de Bruijn indices.)</p>
<p>The ‘loop’ function in the Haskell code tests to see if the successor of ‘n’ is
equal to ‘m’, and if so, returns ‘n’. Otherwise it loops on the successor of
‘n’.</p>
<p>Nock-4 computes the successor. Remember that ‘n’ is at address six, so our
analogue to ‘succ n’ is <code class="language-plaintext highlighter-rouge">[4 0 6]</code>. We want to check if it’s equal to ‘m’, at
address seven, gotten via <code class="language-plaintext highlighter-rouge">[0 7]</code>. The equality check is then handled by
Nock-5:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[5 [4 0 6] [0 7]]
</code></pre></div></div>
<p>Nock-6 is the conditional operator. If the equality check is true, we just
want to return ‘n’ via <code class="language-plaintext highlighter-rouge">[0 6]</code>. The whole formula will look like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[6 [5 [4 0 6] [0 7]] [0 6] <loop on successor of n>]
</code></pre></div></div>
<p>To loop, we’ll use Nock-2. We’ll compute a new subject – consisting of our
‘loop’ analogue at address two, an updated ‘n’ at address six, and our
original ‘m’ at address seven – and then evaluate the same looping formula
against it. In Nock, the formula is:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[2 [[0 2] [4 0 6] [0 7]] [0 2]]
</code></pre></div></div>
<p>And our ‘loop’ analogue in full is thus:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[6 [5 [4 0 6] [0 7]] [0 6] [2 [[0 2] [4 0 6] [0 7]] [0 2]]]
</code></pre></div></div>
<p>Finally, we now need to actually store this in the subject. Again, we’ll
ignore the subject with Nock-1, and use Nock-8 to tuple the formula up with the
‘n’ and ‘m’ analogues:</p>
<p>(N.b. at this point we’re pretty much at line noise – spacing it out doesn’t
make it that much more readable, but it does avoid word wrap.)</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[8
  [1 0]
  [8
    [1
      [6
        [5 [4 0 6] [0 7]]
        [0 6]
        [2 [[0 2] [4 0 6] [0 7]] [0 2]]
      ]
    ]
    [2 [0 1] [0 2]]
  ]
]
</code></pre></div></div>
<p>Let’s check if it works. We’re embedded in Haskell, after all, so let’s get
some support from our host, for convenience:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>~ let dec = \subj -> "*[" <> subj <> " [8 [1 0] [8 [1 [6 [5 [4 0 6] [0 7]] [0 6] [2 [[0 2] [4 0 6] [0 7]] [0 2]]]] [2 [0 1] [0 2]]]]]"
~ hnock (dec "10")
9
~ hnock (dec "25")
24
~ hnock (dec "200")
199
</code></pre></div></div>
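<p>To see the same thing without going through the parser, here’s a compact
sketch of an evaluator restricted to Nock-0 through Nock-8, together with the
decrement formula built directly as a Haskell value. The names ‘nock’,
‘lookupAt’, ‘cells’, and ‘decrement’ are made up for the illustration, ‘Maybe’
stands in for ‘Possibly’, and Nock-6 is implemented directly as a conditional
rather than via the reduction given in the spec:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Redeclared locally so that the sketch stands alone.
data Noun = Atom Integer | Cell Noun Noun
  deriving (Eq, Show)

-- Address lookup: 1 is the whole noun, 2n the head of n, 2n + 1 its tail.
lookupAt :: Integer -> Noun -> Maybe Noun
lookupAt addr noun
  | addr < 1  = Nothing
  | addr == 1 = Just noun
  | otherwise = do
      parent <- lookupAt (addr `div` 2) noun
      case parent of
        Cell l r -> Just (if even addr then l else r)
        Atom _   -> Nothing

-- Evaluate a formula against a subject (Nock-0 through Nock-8 only).
nock :: Noun -> Noun -> Maybe Noun
nock subj formula = case formula of
  -- a formula whose head is a cell: evaluate both halves, pair the products
  Cell f@(Cell _ _) g -> do
    x <- nock subj f
    y <- nock subj g
    Just (Cell x y)
  Cell (Atom 0) (Atom b) -> lookupAt b subj
  Cell (Atom 1) b -> Just b
  Cell (Atom 2) (Cell b c) -> do
    s <- nock subj b
    f <- nock subj c
    nock s f
  Cell (Atom 3) b -> do
    r <- nock subj b
    Just (case r of { Cell _ _ -> Atom 0; Atom _ -> Atom 1 })
  Cell (Atom 4) b -> do
    r <- nock subj b
    case r of
      Atom n   -> Just (Atom (n + 1))
      Cell _ _ -> Nothing
  Cell (Atom 5) (Cell b c) -> do
    x <- nock subj b
    y <- nock subj c
    Just (if x == y then Atom 0 else Atom 1)
  Cell (Atom 6) (Cell b (Cell c d)) -> do
    t <- nock subj b
    case t of
      Atom 0 -> nock subj c
      Atom 1 -> nock subj d
      _      -> Nothing
  Cell (Atom 7) (Cell b c) -> do
    s <- nock subj b
    nock s c
  Cell (Atom 8) (Cell b c) -> do
    p <- nock subj b
    nock (Cell p subj) c
  _ -> Nothing

-- Build a right-nested cell from a list: cells [a, b, c] is [a [b c]].
-- Partial: only call it with a nonempty list.
cells :: [Noun] -> Noun
cells = foldr1 Cell

-- The decrement formula from above, written out as a noun.
decFormula :: Noun
decFormula =
  cells [Atom 8, cells [Atom 1, Atom 0],
    cells [Atom 8, cells [Atom 1, loop], cells [Atom 2, self, loopAddr]]]
  where
    self     = cells [Atom 0, Atom 1]          -- [0 1]
    loopAddr = cells [Atom 0, Atom 2]          -- [0 2]
    n        = cells [Atom 0, Atom 6]          -- [0 6]
    m        = cells [Atom 0, Atom 7]          -- [0 7]
    succN    = cells [Atom 4, Atom 0, Atom 6]  -- [4 0 6]
    loop     = cells
      [ Atom 6
      , cells [Atom 5, succN, m]                              -- succ n == m?
      , n                                                     -- yes: return n
      , cells [Atom 2, cells [loopAddr, succN, m], loopAddr]  -- no: loop
      ]

-- Like the Haskell warm-up, this never terminates when given 0.
decrement :: Integer -> Maybe Integer
decrement k = case nock (Atom k) decFormula of
  Just (Atom r) -> Just r
  _             -> Nothing
</code></pre></div></div>
<p>Evaluating <code class="language-plaintext highlighter-rouge">decrement 10</code> walks the same path as the hnock session above:
the subject grows to <code class="language-plaintext highlighter-rouge">[loop [0 10]]</code>, and the loop formula keeps rebuilding
it with an incremented ‘n’ until the equality test succeeds.</p>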
<p>Phew!</p>
<h3 id="fin">Fin</h3>
<p>Nock makes for an interesting “functional assembly language” of sorts. Its
spec is so simple that writing a naïve interpreter is more or less a process of
copying out the production rules. You can do it without a lot of thought.</p>
<p>The time-honoured decrement-writing exercise, on the other hand, really makes
one think about Nock: that the environment is always explicit, that you can
access it via a nice addressing scheme, that you can refer to functions via
pointers (viz. addresses), etc. It’s pretty intuitive, when you dig into the mechanics.</p>
<p>In any case – writing a decrement (or similar function) <em>and</em> writing your own
interpreter to test it is a fun exercise. From a practical perspective, it
also gives one a lot of intuition about programming in <a href="https://urbit.org/docs/learn/arvo/hoon/">Hoon</a> – it’s
probably not too far from the truth to compare the relationship between Nock
and Hoon to that between, say, x86 and C. If one is programming in Hoon, it
can’t hurt to know your Nock!</p>
Crushing ISAAC2018-10-07T00:00:00+04:00https://jtobin.io/crushing-isaac<p>(<strong>UPDATE 2020/06/30</strong>: the good people at <a href="https://www.tweag.io/blog/2020-06-29-prng-test/">tweag.io</a> have since
published a <a href="https://github.com/tweag/random-quality">Nix shell environment</a> that appears to make testing
arbitrary PRNGs much less of a pain. I recommend you check it out!)</p>
<p>I recently needed a good cryptographically-secure and seedable pseudorandom
number generator for Javascript. This didn’t turn out to be as trivial a
procedure as I figured it’d be: most Javascript CSPRNGs I found didn’t appear
to be manually seedable, instead automatically seeding themselves
behind-the-scenes using a trusted high-quality entropy source like /dev/urandom
(as per the <a href="https://www.w3.org/TR/WebCryptoAPI/#Crypto-description">WebCryptoAPI spec</a>).</p>
<p>But! I want to use my own entropy source. So I could either implement my own
cryptographically-secure PRNG, which I would obviously need to test rigorously,
or I could find an existing implementation that had already been vetted widely
by use. I settled on the <a href="https://en.wikipedia.org/wiki/ISAAC_(cipher)">ISAAC PRNG/stream cipher</a>, both because it
seemed like a reasonable choice of PRNG (it is used in things like coreutils’
<em>shred</em>, and there are no known practical attacks against it), and also because
there was a Javascript implementation floating around on the internet. But the
interesting thing was that the implementation did <em>not</em> really seem to be
vetted widely – it hadn’t been updated in five years or so, and both the
<a href="https://github.com/StefanoBalocco/isaac.js">Github repository</a> and the <a href="https://www.npmjs.com/package/isaac">npm package</a> seem to show very little
activity around it in general.</p>
<p>(<strong>Update</strong>: Stefano, the author of the fork I used, later emailed me to point
out that the original version of this ISAAC code is located <a href="https://github.com/rubycon/isaac.js">here</a>. His
fork, on the other hand, is node-friendly.)</p>
<p>I was going to be using this thing in an application that made some nontrivial
demands in terms of security. So while the PRNG itself seemed to fit the bill,
I wanted to be assured that the implementation satisfied at least some basic
criteria for pseudorandomness. I assumed that it <em>probably works</em> – I just
wanted to have some reasonable confidence in making that assumption.</p>
<p>There are a number of statistical suites for testing PRNGs out there: I am
aware of at least the <a href="https://nvlpubs.nist.gov/nistpubs/legacy/sp/nistspecialpublication800-22r1a.pdf">NIST statistical test suite</a>, the <a href="https://en.wikipedia.org/wiki/Diehard_tests">DieHard</a>
suite, and <a href="https://en.wikipedia.org/wiki/TestU01">TestU01</a>, being most familiar with the latter (this is not
saying much). The TestU01 suite is implemented in C; you provide it with a
PRNG represented as a function pointer that returns 32-bit integers, and it
will run the desired battery of tests (SmallCrush, Crush, or BigCrush – each
of increasing size and scope) for you. These more or less consist of
frequentist tests involving the chi-square distribution, with more specialised
test statistics appearing on occasion. Results are provided in terms of
p-values on these statistics.</p>
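<p>To give a flavour of the kind of statistic involved, here is a rough sketch
in Haskell of Pearson’s chi-square statistic applied to bucketed PRNG output.
It is not one of TestU01’s tests and doesn’t touch its API; the toy ‘lcg’
generator (using the well-known Numerical Recipes constants), the 16-bucket
scheme, and all of the names are assumptions made for the illustration:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import Data.Bits (shiftR)
import Data.Word (Word32)

-- A toy 32-bit linear congruential generator; the arithmetic wraps
-- modulo 2^32 because of Word32.
lcg :: Word32 -> Word32
lcg x = 1664525 * x + 1013904223

-- Sort an output into one of 16 buckets using its top four bits.
bucket :: Word32 -> Int
bucket x = fromIntegral (x `shiftR` 28)

-- Bucket counts for the first n outputs following the given seed.
bucketCounts :: Int -> Word32 -> [Int]
bucketCounts n seed = [ length (filter (== k) bs) | k <- [0 .. 15] ]
  where
    bs = map bucket (take n (drop 1 (iterate lcg seed)))

-- Pearson's chi-square statistic for observed counts, measured against
-- a uniform expectation across the buckets.
chiSquare :: [Int] -> Double
chiSquare observed =
  sum [ (fromIntegral o - expected) ^ (2 :: Int) / expected | o <- observed ]
  where
    expected = fromIntegral (sum observed) / fromIntegral (length observed)
</code></pre></div></div>
<p>Under the hypothesis that the outputs are uniform, the value of
<code class="language-plaintext highlighter-rouge">chiSquare (bucketCounts n seed)</code> is approximately chi-square distributed with
15 degrees of freedom, and a p-value is read off from that distribution.</p>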
<p>I found <a href="http://www.pcg-random.org/posts/how-to-test-with-testu01.html">this page</a> that gave some instructions for using TestU01, but
the ISAAC implementation I wanted to test is of course written in Javascript.
So I knew there was some FFI hackery to be done in order to get the two
codebases to play nicely together. I also discovered that Jan de Mooij <a href="https://github.com/jandem/TestU01.js">did
some work</a> on testing JS’s basic ‘Math.random’ generator with TestU01
using <a href="https://github.com/kripken/emscripten">Emscripten</a>, an LLVM-to-JS compiler, so these two resources seemed
a useful place to start.</p>
<p>After several hours of <del>bashing my way through emcc compilation and linker
errors</del> careful and methodical programming, I managed to get everything to
work. Since the documentation for these tools can be kind of sparse, I figured
I’d outline the steps to run the tests here, and hopefully save somebody else a
little bit of time and effort in the future. But that said, this is inevitably
a somewhat tricky procedure – when trying to reproduce my steps, I ran into a
few strange errors that required additional manual fiddling. The following
should be <em>about</em> right, but may require some tweaking on your end.</p>
<h2 id="emscripten-and-testu01">Emscripten and TestU01</h2>
<p>Install and activate <a href="https://kripken.github.io/emscripten-site/docs/getting_started/downloads.html">Emscripten</a> in the current terminal. To go from
the official git repo:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ git clone https://github.com/juj/emsdk.git
$ cd emsdk
$ ./emsdk install latest
$ ./emsdk activate latest
$ source ./emsdk_env.sh
</code></pre></div></div>
<p>Emscripten provides a couple of wrappers over existing tools; notably, there’s
<code class="language-plaintext highlighter-rouge">emconfigure</code> and <code class="language-plaintext highlighter-rouge">emmake</code> for specialising make builds for compilation via
<code class="language-plaintext highlighter-rouge">emcc</code>, the Emscripten compiler itself.</p>
<p>In some other directory, grab the TestU01 suite from the <a href="http://simul.iro.umontreal.ca/testu01/tu01.html">Université de
Montréal website</a> and extract it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ wget http://simul.iro.umontreal.ca/testu01/TestU01.zip
$ unzip -q TestU01.zip
</code></pre></div></div>
<p>This is some oldish, gnarly academic C code that uses a very wonky
autoconf/automake-based build system. There is probably a better way to do it,
but the easiest way to get this thing built without too much grief is to build
it <em>twice</em> – once as per normal, specifying the appropriate base directory,
and once again to specialise it for Emscripten’s use:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ basedir=`pwd`
$ cd TestU01-1.2.3
$ ./configure --prefix="$basedir"
$ make -j 2
$ make -j 2 install
</code></pre></div></div>
<p>If all goes well you’ll see ‘bin’, ‘include’, ‘lib’, and ‘share’ directories
pop up in ‘basedir’. Repeat the analogous steps for emscripten; note that
you’ll get some “no input files” errors here when the configure script checks
dynamic linker characteristics, but these are inconsequential:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ emconfigure ./configure --prefix="$basedir"
$ emmake make -j 2
$ emmake make -j 2 install
</code></pre></div></div>
<p>Similarly, you’ll notice some warnings re: dynamic linking when building.
We’ll handle these later. In the meantime, you can return to your ‘basedir’ to
continue working:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ cd $basedir
</code></pre></div></div>
<h2 id="a-test-shim-for-isaac">A Test Shim for ISAAC</h2>
<p>Check out the <a href="https://raw.githubusercontent.com/StefanoBalocco/isaac.js/master/isaac.js">raw ISAAC code</a>. It’s structured in a sort-of-object-y
way; the state of the PRNG is held in a bunch of opaque internal variables, and
the whole thing is initialised by calling the ‘isaac’ function and binding the
result as a variable. One then iterates the PRNG by calling either the ‘rand’
or ‘random’ property of that variable for a random integer or double,
respectively.</p>
<p>We need to write the actual testing code in C. You can get away with the
following, which I’ve adapted from <a href="http://www.pcg-random.org/posts/how-to-test-with-testu01.html">M.E. O’Neill’s code</a> – call it
something like ‘test-isaac.c’:</p>
<div class="language-c highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">#include</span> <span class="cpf"><emscripten.h></span><span class="cp">
#include</span> <span class="cpf">"TestU01.h"</span><span class="cp">
</span>
<span class="k">extern</span> <span class="kt">void</span> <span class="nf">isaac_init</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
<span class="k">extern</span> <span class="kt">unsigned</span> <span class="kt">int</span> <span class="nf">isaac_rand</span><span class="p">(</span><span class="kt">void</span><span class="p">);</span>
<span class="kt">int</span> <span class="nf">main</span><span class="p">()</span>
<span class="p">{</span>
    <span class="n">isaac_init</span><span class="p">();</span>
    <span class="n">unif01_Gen</span><span class="o">*</span> <span class="n">gen</span> <span class="o">=</span> <span class="n">unif01_CreateExternGenBits</span><span class="p">(</span><span class="s">"ISAAC"</span><span class="p">,</span> <span class="n">isaac_rand</span><span class="p">);</span>
    <span class="n">bbattery_SmallCrush</span><span class="p">(</span><span class="n">gen</span><span class="p">);</span>
    <span class="n">unif01_DeleteExternGenBits</span><span class="p">(</span><span class="n">gen</span><span class="p">);</span>
    <span class="k">return</span> <span class="mi">0</span><span class="p">;</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Note the two external functions I’m declaring here: the first mimics calling
the ‘isaac’ function in Javascript and binding it to a variable, ‘isaac’, and
the second mimics a call to <code class="language-plaintext highlighter-rouge">isaac.rand()</code>. The testing code follows the same
pattern: <code class="language-plaintext highlighter-rouge">isaac_init()</code> initialises the generator state, and <code class="language-plaintext highlighter-rouge">isaac_rand</code>
produces a value from it. The surrounding code passes <code class="language-plaintext highlighter-rouge">isaac_rand</code> in as the
generator to use for the SmallCrush battery of tests.</p>
<h2 id="c-to-llvm-bitcode">C to LLVM Bitcode</h2>
<p>We can compile this to LLVM IR as it is, via Emscripten. But first, recall
those dynamic linker warnings from the initial setup step. Emscripten deals
with a lot of files, compile targets, etc. based on file extension. We thus
need to rename all the .dylib files in the ‘lib’ directory, which are in fact
all LLVM bitcode, to have the appropriate .bc extension. From the ‘lib’
directory itself, one can just do the following in bash:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ for old in *.dylib; do mv "$old" "$(basename "$old" .dylib).bc"; done
</code></pre></div></div>
<p>When that’s done, we can compile the C code to LLVM with <code class="language-plaintext highlighter-rouge">emcc</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ emcc -O3 -o test-isaac.bc \
-I/$basedir/include \
-L/$basedir/lib \
-ltestu01 -lprobdist -lmylib -lm -I/usr/local/include \
test-isaac.c
</code></pre></div></div>
<p>Again, Emscripten decides what its compile target should be via the output
file’s extension – thus here, an output with the <code class="language-plaintext highlighter-rouge">.bc</code> extension means we’re
compiling to LLVM IR.</p>
<h2 id="llvm-to-javascript-and-wasm">LLVM to Javascript and WASM</h2>
<p>Now, to provide the requisite <code class="language-plaintext highlighter-rouge">isaac_init</code> and <code class="language-plaintext highlighter-rouge">isaac_rand</code> symbols to the
compiled LLVM bitcode, we need to pass the ISAAC library itself. This is
another finicky procedure, but there is a method to the madness, and one can
find a bit of documentation on it <a href="https://kripken.github.io/emscripten-site/docs/porting/connecting_cpp_and_javascript/Interacting-with-code.html#implement-a-c-api-in-javascript">here</a>.</p>
<p>Helpfully, Evan Wallace at Figma wrote an emscripten JS <a href="https://github.com/evanw/emscripten-library-generator">library generation
helper</a> that makes this task a little less painful. Install that via npm
as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ npm install -g emscripten-library-generator
</code></pre></div></div>
<p>To wrap the ISAAC code up in the appropriate format, one needs to make a few
small modifications to it. I won’t elaborate on this too much, but one needs
to:</p>
<ul>
<li>Change the <code class="language-plaintext highlighter-rouge">String.prototype.toIntArray</code> function declaration to a simple
<code class="language-plaintext highlighter-rouge">function toIntArray(string)</code> declaration, and alter its use in the code
appropriately,</li>
<li>Change the <code class="language-plaintext highlighter-rouge">var isaac = ...</code> function declaration/execution binding to a
simple <code class="language-plaintext highlighter-rouge">function isaac()</code> declaration, and,</li>
<li>
<p>Declare the two functions used in our C shim:</p>
<div class="language-js highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kd">function</span> <span class="nx">isaac_init</span><span class="p">()</span> <span class="p">{</span>
<span class="kd">var</span> <span class="nx">isaac_initialised</span> <span class="o">=</span> <span class="nx">isaac</span><span class="p">();</span>
<span class="p">}</span>
<span class="kd">function</span> <span class="nx">isaac_rand</span><span class="p">()</span> <span class="p">{</span>
<span class="k">return</span> <span class="nx">isaac_initialised</span><span class="p">.</span><span class="nx">rand</span><span class="p">();</span>
<span class="p">}</span>
</code></pre></div> </div>
</li>
</ul>
<p>Then we can package it up in an emscripten-friendly format as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ emscripten-library-generator isaac-mod.js > isaac-lib.js
</code></pre></div></div>
<p>You’ll need to make one last adjustment. In the <code class="language-plaintext highlighter-rouge">isaac-lib.js</code> file just
generated for us, add the following emscripten <a href="https://kripken.github.io/emscripten-site/docs/porting/connecting_cpp_and_javascript/Interacting-with-code.html#javascript-limits-in-library-files">‘postset’ instruction</a>
above the ‘isaac’ property:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>isaac__postset: 'var isaac_initialised = _isaac();', // add me
isaac: function () { // leave me alone
</code></pre></div></div>
<p>This makes sure that the <code class="language-plaintext highlighter-rouge">isaac_initialised</code> variable referred to in
<code class="language-plaintext highlighter-rouge">isaac_rand</code> is accessible.</p>
<p>Whew. Ok, with all that done, we’ll compile our LLVM bitcode to a Javascript
and <a href="https://webassembly.org/">wasm</a> pair. You’ll need to bump up the total memory option in order
to run the resulting code – I think I grabbed the amount I used from Jan de
Mooij’s example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ emcc --js-library isaac-lib.js \
lib/libtestu01.0.0.1.bc \
-o test-isaac.js \
-s TOTAL_MEMORY=536870912 \
test-isaac.bc
</code></pre></div></div>
<h2 id="running-smallcrush">Running SmallCrush</h2>
<p>That’s about it. If all has gone well, you should have seen no compilation or
linking errors when running emcc, and you should have a pair of files,
‘test-isaac.js’ and ‘test-isaac.wasm’, kicking around in your ‘basedir’.</p>
<p>To (finally) run the SmallCrush suite, execute ‘test-isaac.js’ with node:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ node test-isaac.js
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
Starting SmallCrush
Version: TestU01 1.2.3
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
***********************************************************
HOST = emscripten, Emscripten
ISAAC
smarsa_BirthdaySpacings test:
-----------------------------------------------
N = 1, n = 5000000, r = 0, d = 1073741824, t = 2, p = 1
Number of cells = d^t = 1152921504606846976
Lambda = Poisson mean = 27.1051
----------------------------------------------------
Total expected number = N*Lambda : 27.11
Total observed number : 27
p-value of test : 0.53
-----------------------------------------------
CPU time used : 00:00:00.00
Generator state:
<more output omitted>
========= Summary results of SmallCrush =========
Version: TestU01 1.2.3
Generator: ISAAC
Number of statistics: 15
Total CPU time: 00:00:00.00
All tests were passed
</code></pre></div></div>
<p>Et voilà, your SmallCrush battery output is dumped to stdout. You need only to
tweak the C code shim and recompile if you want to run the more intensive Crush
or BigCrush suites. Similarly, you can dump generator output to stdout with
<code class="language-plaintext highlighter-rouge">console.log()</code> if you want to reassure yourself that the generator is running
correctly.</p>
<h2 id="fin">Fin</h2>
<p>So: the Javascript ISAAC PRNG indeed passes TestU01. Nice! It was satisfying
enough to get this hacky sequence of steps to actually <em>run</em>, but it was even
better to see the tests actually pass, as I’d both hoped and expected.</p>
<p>I did a few extra things to convince myself that ISAAC was really passing the
tests as it seemed to be doing. I ran TestU01 on a cheap little xorshift
generator (which failed several tests) and also did some ad-hoc analysis of
values that I had ISAAC log to stdout. And, I even looked at the code, and
compared it with a reference implementation by eye. At this point, I’m
convinced it operates as advertised, and I feel very comfortable using it in my
application.</p>
Transforming to CPS2018-08-04T00:00:00+04:00https://jtobin.io/transforming-to-cps<p>I recently picked up Appel’s classic <a href="https://www.amazon.ca/Compiling-Continuations-Andrew-W-Appel/dp/052103311X">Compiling with Continuations</a> and
have been refreshing my continuation-fu more generally.</p>
<p><a href="https://en.wikipedia.org/wiki/Continuation-passing_style">Continuation-passing style</a> (CPS) itself is nothing uncommon to the
functional programmer; it simply involves writing in a manner such that
functions never return, instead passing control over to something else (a
<em>continuation</em>) to finish the job. The simplest example is just the identity
function, which in CPS looks like this:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">id</span> <span class="o">::</span> <span class="n">a</span> <span class="o">-></span> <span class="p">(</span><span class="n">a</span> <span class="o">-></span> <span class="n">b</span><span class="p">)</span> <span class="o">-></span> <span class="n">b</span>
<span class="n">id</span> <span class="n">x</span> <span class="n">k</span> <span class="o">=</span> <span class="n">k</span> <span class="n">x</span>
</code></pre></div></div>
<p>The first argument is the conventional identity function argument – the second
is the continuation. I wrote a little about continuations <a href="/giry-monad-implementation">in the context of
the Giry monad</a>, which is a somewhat unfamiliar setting, but one that
follows the same principles as anything else.</p>
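<p>As a quick illustration, here is a minimal sketch of how CPS functions chain
together. The names ‘idK’, ‘addK’, and ‘example’ are made up for this example
(‘idK’ is just the function above, renamed so that it doesn’t clash with the
Prelude’s ‘id’):</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- The CPS identity function, renamed to avoid clashing with Prelude's 'id'.
idK :: a -> (a -> b) -> b
idK x k = k x

-- A CPS version of addition (a hypothetical helper, for illustration only).
addK :: Int -> Int -> (Int -> b) -> b
addK x y k = k (x + y)

-- Neither function "returns"; each hands its result to the next continuation.
-- Here the sum is passed to 'idK', which passes it along to 'show'.
example :: String
example = addK 1 2 $ \s -> idK s show
</code></pre></div></div>
<p>Evaluating <code class="language-plaintext highlighter-rouge">example</code> yields the string <code class="language-plaintext highlighter-rouge">"3"</code>: the final continuation,
‘show’, is the only thing that ever produces a value for the caller.</p>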
<p>In this post I just want to summarise a few useful CPS transforms and related
techniques in one place.</p>
<h2 id="manual-cps-transformation">Manual CPS Transformation</h2>
<p>Consider a binary tree type. We’ll keep things simple here:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">Tree</span> <span class="n">a</span> <span class="o">=</span>
<span class="kt">Leaf</span> <span class="n">a</span>
<span class="o">|</span> <span class="kt">Branch</span> <span class="n">a</span> <span class="p">(</span><span class="kt">Tree</span> <span class="n">a</span><span class="p">)</span> <span class="p">(</span><span class="kt">Tree</span> <span class="n">a</span><span class="p">)</span>
</code></pre></div></div>
<p>Calculating the depth of a tree is done very easily:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">depth</span> <span class="o">::</span> <span class="kt">Tree</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">Int</span>
<span class="n">depth</span> <span class="o">=</span> <span class="n">loop</span> <span class="kr">where</span>
<span class="n">loop</span> <span class="n">tree</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">tree</span> <span class="kr">of</span>
<span class="kt">Leaf</span> <span class="kr">_</span> <span class="o">-></span> <span class="mi">1</span>
<span class="kt">Branch</span> <span class="kr">_</span> <span class="n">l</span> <span class="n">r</span> <span class="o">-></span>
<span class="kr">let</span> <span class="n">dl</span> <span class="o">=</span> <span class="n">loop</span> <span class="n">l</span>
<span class="n">dr</span> <span class="o">=</span> <span class="n">loop</span> <span class="n">r</span>
<span class="kr">in</span> <span class="n">succ</span> <span class="p">(</span><span class="n">max</span> <span class="n">dl</span> <span class="n">dr</span><span class="p">)</span>
</code></pre></div></div>
<p>Note however that this is not a tail-recursive function – that is, it does not
end with a call to itself (instead it ends with a call to something like ‘succ
. uncurry max’). This isn’t necessarily a big deal – the function is easy to
read and write and everything, and certainly has fine performance
characteristics in Haskell – but it is less easy to deal with for, say, an
optimising compiler that may want to handle evaluation in this or that
alternative way (primarily related to memory management).</p>
<p>One can construct a tail-recursive (depth-first) version of ‘depth’ via a
manual CPS transformation. The looping function is simply augmented to take an
additional continuation argument, like so:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">depth</span> <span class="o">::</span> <span class="kt">Tree</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">Int</span>
<span class="n">depth</span> <span class="n">tree</span> <span class="o">=</span> <span class="n">loop</span> <span class="n">tree</span> <span class="n">id</span> <span class="kr">where</span>
<span class="n">loop</span> <span class="n">cons</span> <span class="n">k</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">cons</span> <span class="kr">of</span>
<span class="kt">Leaf</span> <span class="kr">_</span> <span class="o">-></span> <span class="n">k</span> <span class="mi">1</span>
<span class="kt">Branch</span> <span class="kr">_</span> <span class="n">l</span> <span class="n">r</span> <span class="o">-></span>
<span class="n">loop</span> <span class="n">l</span> <span class="o">$</span> <span class="nf">\</span><span class="n">dl</span> <span class="o">-></span>
<span class="n">loop</span> <span class="n">r</span> <span class="o">$</span> <span class="nf">\</span><span class="n">dr</span> <span class="o">-></span>
<span class="n">k</span> <span class="p">(</span><span class="n">succ</span> <span class="p">(</span><span class="n">max</span> <span class="n">dl</span> <span class="n">dr</span><span class="p">))</span>
</code></pre></div></div>
<p>Notice now that the ‘loop’ function terminates with a call to itself (or just
passes control to a supplied continuation), and is thus tail-recursive.</p>
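<p>To see the continuation at work, here is a small self-contained sketch. It
assumes the ‘loop’ function above is lifted to the top level under the made-up
name ‘depthK’, so that the caller can choose the final continuation; ‘sample’
is an arbitrary tree used only for illustration:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data Tree a =
    Leaf a
  | Branch a (Tree a) (Tree a)

-- The CPS loop from above, with the final continuation left to the caller.
depthK :: Tree a -> (Int -> r) -> r
depthK tree k = case tree of
  Leaf _ -> k 1
  Branch _ l r ->
    depthK l $ \dl ->
    depthK r $ \dr ->
      k (succ (max dl dr))

-- An arbitrary three-level tree.
sample :: Tree Char
sample = Branch 'a' (Leaf 'b') (Branch 'c' (Leaf 'd') (Leaf 'e'))

-- Passing 'id' recovers the ordinary depth.
sampleDepth :: Int
sampleDepth = depthK sample id

-- Any other continuation can consume the result instead.
described :: String
described = depthK sample (\d -> "depth " ++ show d)
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">sampleDepth</code> is 3 and <code class="language-plaintext highlighter-rouge">described</code> is <code class="language-plaintext highlighter-rouge">"depth 3"</code>. Passing ‘id’ is
exactly what ‘depth’ does on its initial call to ‘loop’.</p>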
<p>Due to the presence of the continuation argument, ‘loop’ is a higher-order
function. This is fine and dandy in Haskell, but there is a neat technique
called <a href="https://en.wikipedia.org/wiki/Defunctionalization">defunctionalisation</a> that allows us to avoid the jump to
higher-order and makes sure things stay KILO (“keep it lower order”), which can
be simpler to deal with more generally.</p>
<p>The idea is just to reify the continuations as abstract syntax, and then
evaluate them as one would any embedded language. Note the continuation <code class="language-plaintext highlighter-rouge">\dl
-> ..</code>, for example – the free variables ‘r’ and ‘k’ occurring in the function
body correspond to a tree (the right subtree) and another continuation,
respectively. And in <code class="language-plaintext highlighter-rouge">\dr -> ..</code> one has the free variables ‘dl’ and ‘k’ –
now the depth of the left subtree, and the other continuation again. We also
have ‘id’ used on the initial call to ‘loop’. These can all be reified via the
following data type:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">DCont</span> <span class="n">a</span> <span class="o">=</span>
<span class="kt">DContL</span> <span class="p">(</span><span class="kt">Tree</span> <span class="n">a</span><span class="p">)</span> <span class="p">(</span><span class="kt">DCont</span> <span class="n">a</span><span class="p">)</span>
<span class="o">|</span> <span class="kt">DContR</span> <span class="kt">Int</span> <span class="p">(</span><span class="kt">DCont</span> <span class="n">a</span><span class="p">)</span>
<span class="o">|</span> <span class="kt">DContId</span>
</code></pre></div></div>
<p>Note that this is a very simple recursive type – it has a simple list-like
pattern of recursion, in which each ‘level’ of a value is either a constructor,
carrying both a field of some type and a recursive point, or is the ‘DContId’
constructor, which simply terminates the recursion. The reified continuations
are, on a suitable level of abstraction, more or less the sequential operations
to be performed in the computation. In other words: by reifying the
continuations, we also reify the stack of the computation.</p>
<p>Now ‘depth’ can be rewritten such that its looping function is not
higher-order; the cost is that another function is needed, one that lets us
evaluate items (again, reified continuations) on the stack:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">depth</span> <span class="o">::</span> <span class="kt">Tree</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">Int</span>
<span class="n">depth</span> <span class="n">tree</span> <span class="o">=</span> <span class="n">loop</span> <span class="n">tree</span> <span class="kt">DContId</span> <span class="kr">where</span>
<span class="n">loop</span> <span class="n">cons</span> <span class="n">k</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">cons</span> <span class="kr">of</span>
<span class="kt">Leaf</span> <span class="kr">_</span> <span class="o">-></span> <span class="n">eval</span> <span class="n">k</span> <span class="mi">1</span>
<span class="kt">Branch</span> <span class="kr">_</span> <span class="n">l</span> <span class="n">r</span> <span class="o">-></span> <span class="n">loop</span> <span class="n">l</span> <span class="p">(</span><span class="kt">DContL</span> <span class="n">r</span> <span class="n">k</span><span class="p">)</span>
<span class="n">eval</span> <span class="n">cons</span> <span class="n">d</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">cons</span> <span class="kr">of</span>
<span class="kt">DContL</span> <span class="n">r</span> <span class="n">k</span> <span class="o">-></span> <span class="n">loop</span> <span class="n">r</span> <span class="p">(</span><span class="kt">DContR</span> <span class="n">d</span> <span class="n">k</span><span class="p">)</span>
<span class="kt">DContR</span> <span class="n">dl</span> <span class="n">k</span> <span class="o">-></span> <span class="n">eval</span> <span class="n">k</span> <span class="p">(</span><span class="n">succ</span> <span class="p">(</span><span class="n">max</span> <span class="n">dl</span> <span class="n">d</span><span class="p">))</span>
<span class="kt">DContId</span> <span class="o">-></span> <span class="n">d</span>
</code></pre></div></div>
<p>The resulting function is <em>mutually</em> tail-recursive in terms of both ‘loop’
and ‘eval’, neither of which is higher-order.</p>
<p>One can do a little better in this particular case and reify the stack using an
actual Haskell list, which simplifies evaluation somewhat – it just requires
that the list elements have a type along the lines of ‘(Tree a, Int)’ rather
than something like ‘Either (Tree a) Int’, which is effectively what we get
from ‘DCont a’. You can see an example of this <a href="https://stackoverflow.com/questions/21205213/haskell-tail-recursion-version-of-depth-of-binary-tree">in this StackOverflow
answer</a> by Chris Taylor.</p>
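<p>A minimal sketch of the list-based idea follows. This is not a reproduction
of the linked answer, just one way to do it under the assumption that each
stack entry pairs a pending subtree with its distance from the root; the names
‘depthList’ and ‘go’ are made up for illustration:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data Tree a =
    Leaf a
  | Branch a (Tree a) (Tree a)

-- The stack is an ordinary list of (subtree, level) pairs, where 'level' is
-- the depth at which that subtree's root sits (the overall root is at 1).
depthList :: Tree a -> Int
depthList tree = go [(tree, 1)] 0 where
  -- Empty stack: nothing left to visit, so the running maximum is the answer.
  go [] best = best
  go ((t, d) : rest) best = case t of
    -- A leaf ends a path; fold its level into the running maximum.
    Leaf _       -> go rest (max best d)
    -- A branch pushes both children, one level further down.
    Branch _ l r -> go ((l, succ d) : (r, succ d) : rest) best

-- An arbitrary three-level tree, for illustration.
sample :: Tree Char
sample = Branch 'a' (Leaf 'b') (Branch 'c' (Leaf 'd') (Leaf 'e'))
</code></pre></div></div>
<p>With these definitions <code class="language-plaintext highlighter-rouge">depthList sample</code> is 3, and ‘go’ only ever calls itself
in tail position, so no second ‘eval’ function is needed.</p>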
<h2 id="mechanical-cps-transformation">Mechanical CPS Transformation</h2>
<p>“Mechanical CPS transformation” might be translated as simply “compiling with
continuations.” <a href="http://matt.might.net/">Matt Might</a> has quite a few posts on this topic; in
particular he has one <a href="http://matt.might.net/articles/cps-conversion/">very nice post</a> on mechanical CPS conversion that
summarises various transformations described in Appel, etc.</p>
<p>Matt describes three transformations that I think illustrate the general
mechanical CPS business very well (he describes more, but they are more
specialised). The first is a “naive” transformation, which is simple, but
produces a lot of noisy “administrative redexes” that must be cleaned up in
another pass. The second is a higher-order transformation, which makes use of
the host language’s facilities for function definition and application – it
produces simpler code, but some unnecessary noise still leaks through. The
last is a “hybrid” transformation, which makes use of both the naive and
higher-order transformations, depending on which is more appropriate.</p>
<p>Let’s take a look at these in Haskell. First let’s get some imports out of the
way:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">{-# LANGUAGE OverloadedStrings #-}</span>
<span class="kr">import</span> <span class="nn">Data.Monoid</span>
<span class="kr">import</span> <span class="nn">Data.Text</span> <span class="p">(</span><span class="kt">Text</span><span class="p">)</span>
<span class="kr">import</span> <span class="k">qualified</span> <span class="nn">Data.Text</span> <span class="k">as</span> <span class="n">T</span>
<span class="kr">import</span> <span class="nn">Data.Unique</span>
<span class="kr">import</span> <span class="k">qualified</span> <span class="nn">Text.PrettyPrint.Leijen.Text</span> <span class="k">as</span> <span class="n">PP</span>
</code></pre></div></div>
<p>I’ll also make use of a simple, Racket-like ‘gensym’ function:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">gensym</span> <span class="o">::</span> <span class="kt">IO</span> <span class="kt">Text</span>
<span class="n">gensym</span> <span class="o">=</span> <span class="n">fmap</span> <span class="n">render</span> <span class="n">newUnique</span> <span class="kr">where</span>
<span class="n">render</span> <span class="n">u</span> <span class="o">=</span>
<span class="kr">let</span> <span class="n">hu</span> <span class="o">=</span> <span class="n">hashUnique</span> <span class="n">u</span>
<span class="kr">in</span> <span class="kt">T</span><span class="o">.</span><span class="n">pack</span> <span class="p">(</span><span class="s">"$v"</span> <span class="o"><></span> <span class="n">show</span> <span class="n">hu</span><span class="p">)</span>
</code></pre></div></div>
<p>We’ll use a bare-bones lambda calculus as our input language. Many examples –
Appel’s especially – use significantly more complex languages when
illustrating CPS transforms, but I think this distracts from the meat of the
topic. Lambda does just fine:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">Expr</span> <span class="o">=</span>
<span class="kt">Lam</span> <span class="kt">Text</span> <span class="kt">Expr</span>
<span class="o">|</span> <span class="kt">Var</span> <span class="kt">Text</span>
<span class="o">|</span> <span class="kt">App</span> <span class="kt">Expr</span> <span class="kt">Expr</span>
</code></pre></div></div>
<p>I want to render expressions in my input and output languages in a Lisp-like
manner. This is very easy to do using a good pretty-printing library;
here I’m using the excellent <em>wl-pprint-text</em>, and will omit the ‘Pretty’
instances in the body of my post. But I’ll link to a gist including them at
the bottom.</p>
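<p>For a rough idea of what the rendering amounts to, here is a dependency-free
stand-in. It is not the ‘Pretty’ instance mentioned above: it assumes plain
‘String’ names rather than ‘Text’, does no layout, and spells the binder as
‘lambda’; the name ‘render’ is made up for illustration:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- The input language, with String names to stay within 'base'.
data Expr =
    Lam String Expr
  | Var String
  | App Expr Expr

-- Render an expression as a Lisp-like s-expression.
render :: Expr -> String
render expr = case expr of
  Var v   -> v
  Lam v b -> "(lambda (" ++ v ++ ") " ++ render b ++ ")"
  App f e -> "(" ++ render f ++ " " ++ render e ++ ")"

-- A small example expression.
example :: String
example = render (Lam "x" (App (Var "f") (Var "x")))
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">example</code> is the string <code class="language-plaintext highlighter-rouge">(lambda (x) (f x))</code>. A real pretty-printing
library earns its keep once expressions get large enough to need line breaks
and indentation.</p>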
<p>When performing a mechanical CPS transform, one targets both “atomic”
expressions – i.e., variables and lambda abstractions – and “complex”
expressions, i.e. function application. The target language is thus a
combination of the ‘AExpr’ and ‘CExpr’ types:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">AExpr</span> <span class="o">=</span>
<span class="kt">AVar</span> <span class="kt">Text</span>
<span class="o">|</span> <span class="kt">ALam</span> <span class="p">[</span><span class="kt">Text</span><span class="p">]</span> <span class="kt">CExpr</span>
<span class="kr">data</span> <span class="kt">CExpr</span> <span class="o">=</span>
<span class="kt">CApp</span> <span class="kt">AExpr</span> <span class="p">[</span><span class="kt">AExpr</span><span class="p">]</span>
</code></pre></div></div>
<p>All the mechanical CPS transformations use variants on two functions going by
the cryptic names <strong>m</strong> and <strong>t</strong>. <strong>m</strong> is responsible for converting
atomic expressions in the input language (i.e., variables and lambda
abstractions) into atomic expressions in the target language (an atomic CPS
expression). <strong>t</strong> is the actual CPS transformation; it converts an expression
in the input language into CPS, invoking a specified continuation (already in
the target language) on the result.</p>
<p>Let’s look at the naive transform. Here are <strong>m</strong> and <strong>t</strong>, prefixed by ‘n’
to indicate that they are naive. First, <strong>m</strong>:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">nm</span> <span class="o">::</span> <span class="kt">Expr</span> <span class="o">-></span> <span class="kt">IO</span> <span class="kt">AExpr</span>
<span class="n">nm</span> <span class="n">expr</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">expr</span> <span class="kr">of</span>
<span class="kt">Lam</span> <span class="n">var</span> <span class="n">cexpr0</span> <span class="o">-></span> <span class="kr">do</span>
<span class="n">k</span> <span class="o"><-</span> <span class="n">gensym</span>
<span class="n">cexpr1</span> <span class="o"><-</span> <span class="n">nt</span> <span class="n">cexpr0</span> <span class="p">(</span><span class="kt">AVar</span> <span class="n">k</span><span class="p">)</span>
<span class="n">return</span> <span class="p">(</span><span class="kt">ALam</span> <span class="p">[</span><span class="n">var</span><span class="p">,</span> <span class="n">k</span><span class="p">]</span> <span class="n">cexpr1</span><span class="p">)</span>
<span class="kt">Var</span> <span class="n">var</span> <span class="o">-></span> <span class="n">return</span> <span class="p">(</span><span class="kt">AVar</span> <span class="n">var</span><span class="p">)</span>
<span class="kt">App</span> <span class="p">{}</span> <span class="o">-></span> <span class="n">error</span> <span class="s">"non-atomic expression"</span>
</code></pre></div></div>
<p>(N.b. you almost never want to use ‘error’ in a production implementation of
<em>anything</em>. It’s trivial to wrap e.g. ‘MaybeT’ around the appropriate
functions to handle the bogus pattern match on ‘App’ totally, but I just want
to keep the types super simple here.)</p>
<p>The only noteworthy thing that <strong>m</strong> does here is in the case of a lambda
abstraction: a new abstract continuation is generated, and the body of the
abstraction is converted to CPS via <strong>t</strong>, such that the freshly-generated
continuation is called on the result. Remember, <strong>m</strong> is really just mapping
atomic expressions in the input language to atomic expressions in the target
language.</p>
<p>Here’s <strong>t</strong> for the naive transform. Remember, <strong>t</strong> is responsible for
converting expressions to continuation-passing style:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">nt</span> <span class="o">::</span> <span class="kt">Expr</span> <span class="o">-></span> <span class="kt">AExpr</span> <span class="o">-></span> <span class="kt">IO</span> <span class="kt">CExpr</span>
<span class="n">nt</span> <span class="n">expr</span> <span class="n">cont</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">expr</span> <span class="kr">of</span>
<span class="kt">Lam</span> <span class="p">{}</span> <span class="o">-></span> <span class="kr">do</span>
<span class="n">aexpr</span> <span class="o"><-</span> <span class="n">nm</span> <span class="n">expr</span>
<span class="n">return</span> <span class="p">(</span><span class="kt">CApp</span> <span class="n">cont</span> <span class="p">[</span><span class="n">aexpr</span><span class="p">])</span>
<span class="kt">Var</span> <span class="kr">_</span> <span class="o">-></span> <span class="kr">do</span>
<span class="n">aexpr</span> <span class="o"><-</span> <span class="n">nm</span> <span class="n">expr</span>
<span class="n">return</span> <span class="p">(</span><span class="kt">CApp</span> <span class="n">cont</span> <span class="p">[</span><span class="n">aexpr</span><span class="p">])</span>
<span class="kt">App</span> <span class="n">f</span> <span class="n">e</span> <span class="o">-></span> <span class="kr">do</span>
<span class="n">fs</span> <span class="o"><-</span> <span class="n">gensym</span>
<span class="n">es</span> <span class="o"><-</span> <span class="n">gensym</span>
<span class="kr">let</span> <span class="n">aexpr0</span> <span class="o">=</span> <span class="kt">ALam</span> <span class="p">[</span><span class="n">es</span><span class="p">]</span> <span class="p">(</span><span class="kt">CApp</span> <span class="p">(</span><span class="kt">AVar</span> <span class="n">fs</span><span class="p">)</span> <span class="p">[</span><span class="kt">AVar</span> <span class="n">es</span><span class="p">,</span> <span class="n">cont</span><span class="p">])</span>
<span class="n">cexpr</span> <span class="o"><-</span> <span class="n">nt</span> <span class="n">e</span> <span class="n">aexpr0</span>
<span class="kr">let</span> <span class="n">aexpr1</span> <span class="o">=</span> <span class="kt">ALam</span> <span class="p">[</span><span class="n">fs</span><span class="p">]</span> <span class="n">cexpr</span>
<span class="n">nt</span> <span class="n">f</span> <span class="n">aexpr1</span>
</code></pre></div></div>
<p>For both kinds of atomic expressions (lambda and variable), the expression is
converted to the target language via <strong>m</strong>, and then the supplied continuation
is applied to it. Very simple.</p>
<p>In the case of function application (a “complex”, or non-atomic expression),
both the function to be applied, and the argument it is to be applied to, must
be converted to CPS. This is done by generating two fresh variables (one to
name the function’s value, one to name the argument’s value), wrapping each in
a continuation, transforming the argument, and then transforming the function. The control
flow here is always handled by stitching continuations together; notice when
transforming the function ‘f’ that the continuation to be applied has already
handled its argument.</p>
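<p>To make the output of the naive transform concrete, here is a self-contained
sketch that can be run deterministically. It departs from the code above in
two labelled ways: names are plain ‘String’s rather than ‘Text’, and fresh
names come from an integer counter threaded by hand (the made-up ‘fresh’
helper) instead of ‘gensym’ in ‘IO’. The logic of <strong>m</strong> and <strong>t</strong> is otherwise the
same:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data Expr = Lam String Expr | Var String | App Expr Expr

data AExpr = AVar String | ALam [String] CExpr deriving (Eq, Show)
data CExpr = CApp AExpr [AExpr] deriving (Eq, Show)

-- Hypothetical stand-in for 'gensym': a counter threaded through by hand.
fresh :: Int -> (String, Int)
fresh n = ("$v" ++ show n, n + 1)

-- Naive m: atomic input expressions to atomic target expressions.
nm :: Int -> Expr -> (AExpr, Int)
nm n expr = case expr of
  Lam var body ->
    let (k, n1)     = fresh n
        (cbody, n2) = nt n1 body (AVar k)
    in  (ALam [var, k] cbody, n2)
  Var var -> (AVar var, n)
  App {}  -> error "non-atomic expression"

-- Naive t: convert to CPS, applying the continuation 'cont' to the result.
nt :: Int -> Expr -> AExpr -> (CExpr, Int)
nt n expr cont = case expr of
  App f e ->
    let (fs, n1)    = fresh n
        (es, n2)    = fresh n1
        aexpr0      = ALam [es] (CApp (AVar fs) [AVar es, cont])
        (cexpr, n3) = nt n2 e aexpr0
    in  nt n3 f (ALam [fs] cexpr)
  -- Remaining cases (Lam and Var) are atomic: convert, then apply 'cont'.
  _ ->
    let (aexpr, n1) = nm n expr
    in  (CApp cont [aexpr], n1)

-- Transform the application (g a) with a free continuation named "halt".
example :: CExpr
example = fst (nt 0 (App (Var "g") (Var "a")) (AVar "halt"))
</code></pre></div></div>
<p>Written in Lisp-like notation, <code class="language-plaintext highlighter-rouge">example</code> is
<code class="language-plaintext highlighter-rouge">((lambda ($v0) ((lambda ($v1) ($v0 $v1 halt)) a)) g)</code>. The two lambdas that
are applied immediately to ‘g’ and ‘a’ do nothing but rename them, which is
the kind of noise that a later clean-up pass has to remove.</p>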
<p>Next, the higher-order transform. Here are <strong>m</strong> and <strong>t</strong>:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">hom</span> <span class="o">::</span> <span class="kt">Expr</span> <span class="o">-></span> <span class="kt">IO</span> <span class="kt">AExpr</span>
<span class="n">hom</span> <span class="n">expr</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">expr</span> <span class="kr">of</span>
  <span class="kt">Lam</span> <span class="n">var</span> <span class="n">e</span> <span class="o">-></span> <span class="kr">do</span>
    <span class="n">k</span> <span class="o"><-</span> <span class="n">gensym</span>
    <span class="n">ce</span> <span class="o"><-</span> <span class="n">hot</span> <span class="n">e</span> <span class="p">(</span><span class="nf">\</span><span class="n">rv</span> <span class="o">-></span> <span class="n">return</span> <span class="p">(</span><span class="kt">CApp</span> <span class="p">(</span><span class="kt">AVar</span> <span class="n">k</span><span class="p">)</span> <span class="p">[</span><span class="n">rv</span><span class="p">]))</span>
    <span class="n">return</span> <span class="p">(</span><span class="kt">ALam</span> <span class="p">[</span><span class="n">var</span><span class="p">,</span> <span class="n">k</span><span class="p">]</span> <span class="n">ce</span><span class="p">)</span>
  <span class="kt">Var</span> <span class="n">n</span> <span class="o">-></span> <span class="n">return</span> <span class="p">(</span><span class="kt">AVar</span> <span class="n">n</span><span class="p">)</span>
  <span class="kt">App</span> <span class="p">{}</span> <span class="o">-></span> <span class="n">error</span> <span class="s">"non-atomic expression"</span>

<span class="n">hot</span> <span class="o">::</span> <span class="kt">Expr</span> <span class="o">-></span> <span class="p">(</span><span class="kt">AExpr</span> <span class="o">-></span> <span class="kt">IO</span> <span class="kt">CExpr</span><span class="p">)</span> <span class="o">-></span> <span class="kt">IO</span> <span class="kt">CExpr</span>
<span class="n">hot</span> <span class="n">expr</span> <span class="n">k</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">expr</span> <span class="kr">of</span>
  <span class="kt">Lam</span> <span class="p">{}</span> <span class="o">-></span> <span class="kr">do</span>
    <span class="n">aexpr</span> <span class="o"><-</span> <span class="n">hom</span> <span class="n">expr</span>
    <span class="n">k</span> <span class="n">aexpr</span>
  <span class="kt">Var</span> <span class="p">{}</span> <span class="o">-></span> <span class="kr">do</span>
    <span class="n">aexpr</span> <span class="o"><-</span> <span class="n">hom</span> <span class="n">expr</span>
    <span class="n">k</span> <span class="n">aexpr</span>
  <span class="kt">App</span> <span class="n">f</span> <span class="n">e</span> <span class="o">-></span> <span class="kr">do</span>
    <span class="n">rv</span> <span class="o"><-</span> <span class="n">gensym</span>
    <span class="n">xformed</span> <span class="o"><-</span> <span class="n">k</span> <span class="p">(</span><span class="kt">AVar</span> <span class="n">rv</span><span class="p">)</span>
    <span class="kr">let</span> <span class="n">cont</span> <span class="o">=</span> <span class="kt">ALam</span> <span class="p">[</span><span class="n">rv</span><span class="p">]</span> <span class="n">xformed</span>
        <span class="n">cexpr</span> <span class="n">fs</span> <span class="o">=</span> <span class="n">hot</span> <span class="n">e</span> <span class="p">(</span><span class="nf">\</span><span class="n">es</span> <span class="o">-></span> <span class="n">return</span> <span class="p">(</span><span class="kt">CApp</span> <span class="n">fs</span> <span class="p">[</span><span class="n">es</span><span class="p">,</span> <span class="n">cont</span><span class="p">]))</span>
    <span class="n">hot</span> <span class="n">f</span> <span class="n">cexpr</span>
</code></pre></div></div>
<p>Both of these have the same form as they do in the naive transform – the
difference here is simply that the continuation to be applied to a transformed
expression is expressed in the host language – i.e., here, Haskell. Thus the
transform is “higher-order,” in exactly the same sense that <a href="/sharing-in-haskell-edsls">higher-order
abstract syntax</a> is higher-order.</p>
<p>The final transformation I’ll illustrate here, the hybrid transform, applies
the naive transformation to lambda and variable expressions, and applies the
higher-order transformation to function applications. Here <strong>t</strong> is split up
into <strong>tc</strong> and <strong>tk</strong> to handle these cases accordingly:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">m</span> <span class="o">::</span> <span class="kt">Expr</span> <span class="o">-></span> <span class="kt">IO</span> <span class="kt">AExpr</span>
<span class="n">m</span> <span class="n">expr</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">expr</span> <span class="kr">of</span>
  <span class="kt">Lam</span> <span class="n">var</span> <span class="n">cexpr</span> <span class="o">-></span> <span class="kr">do</span>
    <span class="n">k</span> <span class="o"><-</span> <span class="n">gensym</span>
    <span class="n">xformed</span> <span class="o"><-</span> <span class="n">tc</span> <span class="n">cexpr</span> <span class="p">(</span><span class="kt">AVar</span> <span class="n">k</span><span class="p">)</span>
    <span class="n">return</span> <span class="p">(</span><span class="kt">ALam</span> <span class="p">[</span><span class="n">var</span><span class="p">,</span> <span class="n">k</span><span class="p">]</span> <span class="n">xformed</span><span class="p">)</span>
  <span class="kt">Var</span> <span class="n">n</span> <span class="o">-></span> <span class="n">return</span> <span class="p">(</span><span class="kt">AVar</span> <span class="n">n</span><span class="p">)</span>
  <span class="kt">App</span> <span class="p">{}</span> <span class="o">-></span> <span class="n">error</span> <span class="s">"non-atomic expression"</span>

<span class="n">tc</span> <span class="o">::</span> <span class="kt">Expr</span> <span class="o">-></span> <span class="kt">AExpr</span> <span class="o">-></span> <span class="kt">IO</span> <span class="kt">CExpr</span>
<span class="n">tc</span> <span class="n">expr</span> <span class="n">c</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">expr</span> <span class="kr">of</span>
  <span class="kt">Lam</span> <span class="p">{}</span> <span class="o">-></span> <span class="kr">do</span>
    <span class="n">aexpr</span> <span class="o"><-</span> <span class="n">m</span> <span class="n">expr</span>
    <span class="n">return</span> <span class="p">(</span><span class="kt">CApp</span> <span class="n">c</span> <span class="p">[</span><span class="n">aexpr</span><span class="p">])</span>
  <span class="kt">Var</span> <span class="kr">_</span> <span class="o">-></span> <span class="kr">do</span>
    <span class="n">aexpr</span> <span class="o"><-</span> <span class="n">m</span> <span class="n">expr</span>
    <span class="n">return</span> <span class="p">(</span><span class="kt">CApp</span> <span class="n">c</span> <span class="p">[</span><span class="n">aexpr</span><span class="p">])</span>
  <span class="kt">App</span> <span class="n">f</span> <span class="n">e</span> <span class="o">-></span> <span class="kr">do</span>
    <span class="kr">let</span> <span class="n">cexpr</span> <span class="n">fs</span> <span class="o">=</span> <span class="n">tk</span> <span class="n">e</span> <span class="p">(</span><span class="nf">\</span><span class="n">es</span> <span class="o">-></span> <span class="n">return</span> <span class="p">(</span><span class="kt">CApp</span> <span class="n">fs</span> <span class="p">[</span><span class="n">es</span><span class="p">,</span> <span class="n">c</span><span class="p">]))</span>
    <span class="n">tk</span> <span class="n">f</span> <span class="n">cexpr</span>

<span class="n">tk</span> <span class="o">::</span> <span class="kt">Expr</span> <span class="o">-></span> <span class="p">(</span><span class="kt">AExpr</span> <span class="o">-></span> <span class="kt">IO</span> <span class="kt">CExpr</span><span class="p">)</span> <span class="o">-></span> <span class="kt">IO</span> <span class="kt">CExpr</span>
<span class="n">tk</span> <span class="n">expr</span> <span class="n">k</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">expr</span> <span class="kr">of</span>
  <span class="kt">Lam</span> <span class="p">{}</span> <span class="o">-></span> <span class="kr">do</span>
    <span class="n">aexpr</span> <span class="o"><-</span> <span class="n">m</span> <span class="n">expr</span>
    <span class="n">k</span> <span class="n">aexpr</span>
  <span class="kt">Var</span> <span class="p">{}</span> <span class="o">-></span> <span class="kr">do</span>
    <span class="n">aexpr</span> <span class="o"><-</span> <span class="n">m</span> <span class="n">expr</span>
    <span class="n">k</span> <span class="n">aexpr</span>
  <span class="kt">App</span> <span class="n">f</span> <span class="n">e</span> <span class="o">-></span> <span class="kr">do</span>
    <span class="n">rv</span> <span class="o"><-</span> <span class="n">gensym</span>
    <span class="n">xformed</span> <span class="o"><-</span> <span class="n">k</span> <span class="p">(</span><span class="kt">AVar</span> <span class="n">rv</span><span class="p">)</span>
    <span class="kr">let</span> <span class="n">cont</span> <span class="o">=</span> <span class="kt">ALam</span> <span class="p">[</span><span class="n">rv</span><span class="p">]</span> <span class="n">xformed</span>
        <span class="n">cexpr</span> <span class="n">fs</span> <span class="o">=</span> <span class="n">tk</span> <span class="n">e</span> <span class="p">(</span><span class="nf">\</span><span class="n">es</span> <span class="o">-></span> <span class="n">return</span> <span class="p">(</span><span class="kt">CApp</span> <span class="n">fs</span> <span class="p">[</span><span class="n">es</span><span class="p">,</span> <span class="n">cont</span><span class="p">]))</span>
    <span class="n">tk</span> <span class="n">f</span> <span class="n">cexpr</span>
</code></pre></div></div>
<p>Matt illustrates these transformations on a simple expression: <code class="language-plaintext highlighter-rouge">(g a)</code>. We can
do the same:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">test</span> <span class="o">::</span> <span class="kt">Expr</span>
<span class="n">test</span> <span class="o">=</span> <span class="kt">App</span> <span class="p">(</span><span class="kt">Var</span> <span class="s">"g"</span><span class="p">)</span> <span class="p">(</span><span class="kt">Var</span> <span class="s">"a"</span><span class="p">)</span>
</code></pre></div></div>
<p>First, the naive transform. Note all the noisy administrative redexes that
come along with it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> cexpr <- nt test (AVar "halt")
> PP.pretty cexpr
((λ ($v1).
((λ ($v2).
($v1 $v2 halt)) a)) g)
</code></pre></div></div>
<p>The higher-order transform does better, containing only one such redex (an
eta-expansion). Note that since the supplied continuation must be expressed
in terms of a Haskell function, we need to write it in a more HOAS-y style:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> cexpr <- hot test (\ans -> return (CApp (AVar "halt") [ans]))
> PP.pretty cexpr
(g a (λ ($v3).
(halt $v3)))
</code></pre></div></div>
<p>Finally the hybrid transform, which, here, is literally perfect. We don’t even
need to deal with the minor annoyance of the HOAS-style continuation when
calling it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> cexpr <- tc test (AVar "halt")
> PP.pretty cexpr
(g a halt)
</code></pre></div></div>
<p>Matt goes on to describe a “partitioned CPS transform” that can be used to
recover a stack, in (seemingly) much the same manner that the defunctionalised
manual CPS transform worked in the previous section. Very neat, but something
I’ll have to look at in another post.</p>
<h2 id="fin">Fin</h2>
<p>CPS is pretty gnarly. My experience in <em>compiling</em> with continuations is not
substantial, but I dig learning it. Appel’s book, in particular, is meaty –
expect more posts on the subject here eventually, probably.</p>
<p>‘Til next time! I’ve dumped the code from the latter section into a
<a href="https://gist.github.com/jtobin/5df30cf14a57579af76ef0c05211d0e0">gist</a>.</p>
Embedded DSLs for Bayesian Modelling and Inference: a Retrospective2018-07-02T00:00:00+04:00https://jtobin.io/embedded-dsls-bayes<p>Why does my blog often feature its typical motley mix of probability,
functional programming, and computer science anyway?</p>
<p>From 2011 through 2017 I slogged through a Ph.D. in statistics, working on it
full time in 2012, and part-time in every other year. It was an interesting
experience. Although everything worked out for me in the end – I managed to
do a lot of good and interesting work in industry while still picking up a
Ph.D. on the side – it’s not something I’d necessarily recommend to others.
The smart strategy is surely to choose one thing and give it one’s maximum
effort; by splitting my time between work and academia, both obviously suffered
to some degree.</p>
<p>That said, at the end of the day I was pretty happy with the results on both
fronts. On the academic side of things, the main product was a dissertation,
<a href="https://jtobin.io/assets/jtobin-dissertation.pdf"><em>Embedded Domain-Specific Languages for Bayesian Modelling and
Inference</em></a>, supporting my thesis: that novel and useful DSLs for solving
problems in Bayesian statistics can be embedded in statically-typed, purely
functional programming languages.</p>
<p>It helps to remember that in this day and age, one can still typically graduate
by, uh, “merely” submitting and defending a dissertation. Publishing in
academic venues certainly helps focus one’s work, and is obviously necessary
for a career in academia (or, increasingly, industrial research). But it’s
optional when it comes to <em>getting your degree</em>, so if it doesn’t help you
achieve your goals, you may want to reconsider it, as I did.</p>
<p>The problem with the dissertation-first approach, of course, is that nobody
reads your work. To some extent I think I’ve mitigated that; most of the
content in my dissertation is merely a fleshed-out version of various ideas
I’ve written about on this blog. Here I’ll continue that tradition and write a
brief, informal summary of my dissertation and Ph.D. more broadly – what I
did, how I approached it, and what my thoughts are on everything after the
fact.</p>
<h2 id="the-idea">The Idea</h2>
<p>Following the advice of <a href="http://www.ccs.neu.edu/home/shivers/diss-advice.html">Olin Shivers</a> (by way of <a href="http://matt.might.net/">Matt Might</a>), I
oriented my work around a concrete <strong>thesis</strong>, which wound up more or less
being that embedding DSLs in a Haskell-like language can be a useful technique
for solving statistical problems. This thesis wasn’t born into the world
fully-formed, of course – it began as quite a vague (or misguided) thing, but
matured naturally over time. Using the tools of programming languages and
compilers to do statistics and machine learning is the motivation behind
probabilistic programming in general; what I was interested in was exploring
the problem in the setting of languages <em>embedded</em> in a purely functional host.
Haskell was the obvious choice of host for all of my implementations.</p>
<p>It may sound obvious that putting together a <em>thesis</em> is a good strategy for a
Ph.D. But here I’m talking about a thesis in the original (Greek) sense
of <em>a proposition</em>, i.e. a falsifiable idea or claim (in contrast to a
<em>dissertation</em>, from the Latin <em>disserere</em>, i.e. to examine or to discuss).
Having a central idea to orient your work around can be immensely useful in
terms of focus. When you read a dissertation with a clear thesis, it’s easy to
know what the writer is generally on about – without one it can (increasingly)
be tricky.</p>
<p>My thesis is pretty easy to defend in the abstract. A DSL really exposes the
structure of one’s problem while also constraining it appropriately, and
<em>embedding</em> one in a host language means that one doesn’t have to implement an
entire compiler toolchain to support it. I reckoned that simply pointing the
artillery of “language engineering” at the statistical domain would lead to
some interesting insight on structure, and maybe even produce some useful
tools. And it did!</p>
<h2 id="the-contributions">The Contributions</h2>
<p>Of course, one needs to do a little more defending than that to satisfy his or
her examination committee. Doctoral research is supposed to be substantial and
novel. In my experience, reviewers are concerned with your answers to the
following questions:</p>
<ul>
<li>What, specifically, are your claims?</li>
<li>Are they novel contributions to your field?</li>
<li>Have you backed them up sufficiently?</li>
</ul>
<p>At the end of the day, I claimed the following advances from my work.</p>
<ul>
<li>
<p>Novel <strong>probabilistic interpretations</strong> of the Giry monad’s algebraic
structure. The Giry monad (<a href="https://github.com/mattearnshaw/lawvere/blob/master/pdfs/1962-the-category-of-probabilistic-mappings.pdf">Lawvere, 1962</a>; <a href="https://www.chrisstucchio.com/blog_media/2016/probability_the_monad/categorical_probability_giry.pdf">Giry, 1981</a>) is the
“canonical” probability monad, in a meaningful sense, and I demonstrated that
one can characterise the measure-theoretic notion of <strong>image measure</strong> by its
functorial structure, as well as the notion of <strong>product measure</strong> by its
monoidal structure. Having the former around makes it easy to transform the
support of a probability distribution while leaving its density structure
invariant, and the latter lets one encode probabilistic independence, enabling
things like measure convolution and the like. What’s more, the analogous
semantics carry over to other probability monads – for example the well-known
sampling monad, or more abstract variants.</p>
</li>
<li>
<p>A novel <strong>characterisation of the Giry monad as a restricted continuation
monad</strong>. <a href="https://www.cs.tufts.edu/~nr/pubs/pmonad.pdf">Ramsey & Pfeffer (2002)</a> discussed an “expectation monad,”
and I had independently come up with my own “measure monad” based on
continuations. But I showed both reduce to a restricted form of the
continuation monad of <a href="https://pdfs.semanticscholar.org/3d22/31608c7ba19935c610afd60f13bbe89d6b55.pdf">Wadler (1994)</a> – and that indeed, when the
return type of Wadler’s continuation monad is restricted to the reals, it <em>is</em>
the Giry monad.</p>
<p>To be precise it’s actually somewhat more general – it permits integration
with respect to <em>any</em> measure, not only a probability measure – but that
definition strictly subsumes the Giry monad. I also showed that product
measure, via the applicative instance, yields <strong>measure convolution</strong> and
associated operations.</p>
</li>
<li>
<p>A novel technique for <strong>embedding a statically-typed probabilistic
programming language in a purely functional language</strong>. The general idea
itself is well-known to those who have worked with DSLs in Haskell: one
constructs a base functor and wraps it in the free monad. But the reason
that technique is appropriate in the probabilistic programming domain is that
probabilistic models are fundamentally <em>monadic</em> constructs – merely recall
the existence of the Giry monad for proof!</p>
<p>To construct the requisite base functor, one maps some core set of concrete
probability distributions denoted by the Giry monad to a collection of
<em>abstract</em> probability distributions represented only by unique names. These
constitute the branches of one’s base functor, which is then wrapped in the
familiar ‘Free’ machinery that gives one access to the functorial,
applicative, and monadic structure that I talked about above. This abstract
representation of a probabilistic model allows one to implement other
probability monads, such as the well-known sampling monad (<a href="https://www.cs.tufts.edu/~nr/pubs/pmonad.pdf">Ramsey & Pfeffer,
2002</a>; <a href="https://www.cs.cmu.edu/~fp/papers/toplas08.pdf">Park et al., 2008</a>) or the Giry monad, by way of interpreters.</p>
<p>(N.b. <a href="https://www.repository.cam.ac.uk/bitstream/handle/1810/249132/Scibior%20et%20al%202015%20Haskell%20Symposium%202015.pdf?sequence=1">Ścibior et al. (2015)</a> did some very similar work to this,
although the monad they used was arguably more operational in its flavour.)</p>
</li>
<li>
<p>A novel <strong>characterisation of execution traces as cofree comonads</strong>. The
idea of an “execution trace” is that one runs a probabilistic program
(typically generating a sample) and then records how it executed – what
randomness was used, the execution path of the program, etc. To do Bayesian
inference, one then runs a Markov chain <em>over the space of possible execution
traces</em>, calculating statistics about the resulting distribution in trace
space (<a href="http://proceedings.mlr.press/v15/wingate11a/wingate11a.pdf">Wingate et al., 2011</a>).</p>
<p>Remarkably, a cofree comonad over the same abstract probabilistic base
functor described above allows us to <strong>represent an execution trace at the
embedded language level itself</strong>. In practical terms, that means one can
denote a probabilistic model, and then run a Markov chain over the space of
possible ways it could have executed, <em>without leaving GHCi</em>. You can
alternatively examine and perturb the way the program executes, stepping
through it piece by piece, as I believe was originally a feature in Venture
(<a href="https://arxiv.org/abs/1404.0099">Mansinghka et al., 2014</a>).</p>
<p>(N.b. this really blew my mind when I first started toying with it,
converting programs into execution traces and then manipulating them as
first-class values, defining <em>other</em> probabilistic programs over spaces of
execution traces, etc. <em>Meta</em>.)</p>
</li>
<li>
<p>A novel technique for <strong>statically encoding conditional independence</strong> of
terms in this kind of embedded probabilistic programming language. If you
recall that I previously demonstrated the monoidal (i.e. applicative)
structure of the Giry monad encodes the notion of product measure, it will
not be too surprising to hear that I used the free applicative functor
(<a href="https://arxiv.org/pdf/1403.0749">Capriotti & Kaposi, 2014</a>) (again, over the same kind of abstract
probabilistic base functor) to reify applicative expressions such that they can
be identified statically.</p>
</li>
<li>
<p>A novel shallowly-embedded language for <strong>building custom transition
operators for use in Markov chain Monte Carlo</strong>. MCMC is the de facto
standard way to perform inference on Bayesian models (although it is not
limited to Bayesian models in particular). By wrapping a simple state monad
transformer around a probability monad, one can denote Markov transition
operators, combine them, and transform them in a few ways that are useful for
doing MCMC.</p>
<p>The framework here was inspired by the old parallel “strategies” idea of
<a href="https://dl.acm.org/citation.cfm?id=969618">Trinder et al. (1998)</a>. The idea is that you want to “evaluate” a
posterior via MCMC, and want to choose a strategy by which to do so – e.g.
Metropolis (<a href="https://pdfs.semanticscholar.org/5abf/e3209b1699fd92c66678d8ec286194c6f40c.pdf">Metropolis et al., 1953</a>), slice sampling (<a href="https://projecteuclid.org/download/pdf_1/euclid.aos/1056562461">Neal, 2003</a>),
Hamiltonian (<a href="https://arxiv.org/pdf/1206.1901">Neal, 2011</a>), etc. Since Markov transition operators
are closed under composition and convex combinations, it is easy to write a
little shallowly-embedded combinator language for working with them –
effectively building evaluation strategies in a manner familiar to those
who’ve worked with Haskell’s <em>parallel</em> library.</p>
<p>(N.b. although this was the most trivial part of my research, theoretical or
implementation-wise, it remains the most useful for day-to-day practical
work.)</p>
</li>
</ul>
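<p>To make the first two claims a bit more concrete, here is a minimal sketch of
a measure type written as a restricted continuation monad. It is not the code
from the dissertation or from <em>measurable</em>; the names (<code class="language-plaintext highlighter-rouge">Measure</code>, <code class="language-plaintext highlighter-rouge">integrate</code>,
<code class="language-plaintext highlighter-rouge">fromMassFunction</code>, and so on) are assumptions chosen for illustration, and it
only handles finite discrete supports:</p>

```haskell
-- A measure is something that can integrate a real-valued function: a
-- continuation monad whose answer type is fixed to Double.
-- (All names in this block are illustrative assumptions.)
newtype Measure a = Measure { integrate :: (a -> Double) -> Double }

instance Functor Measure where
  -- Image (pushforward) measure: integrating f against the image of nu
  -- under g is integrating (f . g) against nu.
  fmap g nu = Measure (\f -> integrate nu (f . g))

instance Applicative Measure where
  -- Dirac measure at x.
  pure x = Measure (\f -> f x)
  -- Product measure: the two components are integrated independently.
  mg <*> mx = Measure (\f -> integrate mg (\g -> integrate mx (f . g)))

instance Monad Measure where
  -- Marginalisation: integrate out the intermediate value x.
  nu >>= k = Measure (\f -> integrate nu (\x -> integrate (k x) f))

-- A discrete measure built from a mass function over a finite support.
fromMassFunction :: (a -> Double) -> [a] -> Measure a
fromMassFunction p support = Measure (\f -> sum [f x * p x | x <- support])

-- Expectation of a real-valued measure.
expectation :: Measure Double -> Double
expectation nu = integrate nu id

-- Measure of an event, via its indicator function.
probability :: (a -> Bool) -> Measure a -> Double
probability event nu = integrate nu (\x -> if event x then 1 else 0)

-- A fair six-sided die.
die :: Measure Double
die = fromMassFunction (const (1 / 6)) [1 .. 6]

-- Convolution: the distribution of the sum of two independent dice.
twoDice :: Measure Double
twoDice = (+) <$> die <*> die

-- Marginalisation: roll again only if the first roll was a one.
reroll :: Measure Double
reroll = die >>= \x -> if x == 1 then die else pure x
```

<p>Under these assumptions, <code class="language-plaintext highlighter-rouge">fmap (* 2) die</code> is the image of the die under
doubling, <code class="language-plaintext highlighter-rouge">twoDice</code> is a measure convolution obtained purely from the
applicative structure, and <code class="language-plaintext highlighter-rouge">probability (== 7) twoDice</code> integrates an indicator
function against it.</p>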
<h2 id="the-execution">The Execution</h2>
<p>One needs to stitch his or her contributions together in some kind of
over-arching narrative that supports the underlying thesis. Mine went
something like this:</p>
<p>The Giry monad is appropriate for denoting probabilistic semantics in
languages with purely-functional hosts. Its functorial, applicative, and
monadic structure denote probability distributions, independence, and
marginalisation, respectively, and these are necessary and sufficient for
encoding probabilistic models. An embedded language based on the Giry monad is
type-safe and composable.</p>
<p>Probabilistic models in an embedded language, semantically denoted in terms of
the Giry monad, can be made abstract and interpretation-independent by defining
them in terms of a probabilistic base functor and a free monad instead. They
can be forward-interpreted using standard free monad recursion schemes in order
to compute probabilities (via a measure interpretation) or samples (via a
sampling interpretation); the latter interpretation is useful for performing
limited forms of Bayesian inference, in particular. These free-encoded models
can also be transformed into cofree-encoded models, under which they represent
execution traces that can be perturbed arbitrarily by standard comonadic
machinery. This representation is amenable to more elaborate forms of Bayesian
inference. To accurately denote conditional independence in the embedded
language, the free applicative functor can also be used.</p>
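<p>As a rough illustration of the free-monad encoding, here is a small sketch,
assuming a two-distribution base functor and a hand-rolled <code class="language-plaintext highlighter-rouge">Free</code> type so that
it depends only on ‘base’. The names (<code class="language-plaintext highlighter-rouge">ModelF</code>, <code class="language-plaintext highlighter-rouge">expect</code>, <code class="language-plaintext highlighter-rouge">sample</code>) are
hypothetical, not those used in <em>deanie</em>, and the interpreters use plain
recursion rather than recursion schemes. The sampling interpreter takes its
randomness as an explicit list of uniform draws on [0, 1):</p>

```haskell
-- Hand-rolled free monad, to keep the sketch dependent on base only.
data Free f a = Pure a | Free (f (Free f a))

instance Functor f => Functor (Free f) where
  fmap g (Pure a)  = Pure (g a)
  fmap g (Free fa) = Free (fmap (fmap g) fa)

instance Functor f => Applicative (Free f) where
  pure = Pure
  Pure g  <*> x = fmap g x
  Free fg <*> x = Free (fmap (<*> x) fg)

instance Functor f => Monad (Free f) where
  Pure a  >>= k = k a
  Free fa >>= k = Free (fmap (>>= k) fa)

-- Abstract base functor: distributions are named here, not implemented.
data ModelF r
  = BernoulliF Double (Bool -> r)     -- success probability
  | DiscreteUniformF Int (Int -> r)   -- uniform on 1 .. n

instance Functor ModelF where
  fmap g (BernoulliF p k)       = BernoulliF p (g . k)
  fmap g (DiscreteUniformF n k) = DiscreteUniformF n (g . k)

type Model = Free ModelF

bernoulli :: Double -> Model Bool
bernoulli p = Free (BernoulliF p Pure)

discreteUniform :: Int -> Model Int
discreteUniform n = Free (DiscreteUniformF n Pure)

-- Measure interpretation: the exact expectation of f under the model.
expect :: Model a -> (a -> Double) -> Double
expect (Pure a) f = f a
expect (Free (BernoulliF p k)) f =
  p * expect (k True) f + (1 - p) * expect (k False) f
expect (Free (DiscreteUniformF n k)) f =
  sum [expect (k i) f | i <- [1 .. n]] / fromIntegral n

-- Sampling interpretation: consumes one uniform draw per random choice.
sample :: Model a -> [Double] -> (a, [Double])
sample (Pure a) us = (a, us)
sample (Free (BernoulliF p k)) (u : us) = sample (k (u < p)) us
sample (Free (DiscreteUniformF n k)) (u : us) =
  sample (k (1 + floor (u * fromIntegral n))) us
sample _ [] = error "sample: ran out of randomness"

-- Flip a fair coin, then roll a six-sided or a two-sided die accordingly.
model :: Model Int
model = do
  heads <- bernoulli 0.5
  if heads then discreteUniform 6 else discreteUniform 2
```

<p>The same <code class="language-plaintext highlighter-rouge">model</code> value is interpreted twice here: <code class="language-plaintext highlighter-rouge">expect model fromIntegral</code>
computes its mean exactly, while <code class="language-plaintext highlighter-rouge">sample model us</code> produces a single draw from a
supplied stream of uniforms.</p>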
<p>One can easily construct a shallowly-embedded language for building custom
Markov transitions. Markov chains that use these compound transitions can
outperform those that use only “primitive” transitions in certain settings.
The shallowly embedded language guarantees that transitions can only be
composed in well-defined, type-safe ways that preserve the properties desirable
for MCMC. What’s more, one can implement “transition transformers” for
implementing still more complex inference techniques, e.g. annealing or
tempering, over existing transitions.</p>
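<p>A toy version of the combinator idea can be sketched as follows. To stay
within ‘base’ this assumes a transition is just a Kleisli arrow into a tiny
seed-threading probability monad, rather than a state monad transformer, and
the names (<code class="language-plaintext highlighter-rouge">Prob</code>, <code class="language-plaintext highlighter-rouge">andThen</code>, <code class="language-plaintext highlighter-rouge">mixture</code>, <code class="language-plaintext highlighter-rouge">metropolis</code>) are hypothetical rather
than the API of <em>declarative</em>. The generator is a crude linear congruential
one, good enough for illustration only:</p>

```haskell
import Control.Monad ((>=>))

-- Toy probability monad: threads the state of a linear congruential
-- generator. (Illustrative only; not a serious source of randomness.)
newtype Prob a = Prob { runProb :: Int -> (a, Int) }

instance Functor Prob where
  fmap f (Prob g) = Prob (\s -> let (a, s') = g s in (f a, s'))

instance Applicative Prob where
  pure a = Prob (\s -> (a, s))
  Prob f <*> Prob g = Prob (\s ->
    let (h, s')  = f s
        (a, s'') = g s'
    in  (h a, s''))

instance Monad Prob where
  Prob g >>= k = Prob (\s -> let (a, s') = g s in runProb (k a) s')

-- A uniform draw on [0, 1).
uniform :: Prob Double
uniform = Prob (\s ->
  let s' = (1103515245 * s + 12345) `mod` 2147483648
  in  (fromIntegral s' / 2147483648, s'))

-- A Markov transition maps the current state to a random next state.
type Transition s = s -> Prob s

-- Sequential composition: run one transition, then the other.
andThen :: Transition s -> Transition s -> Transition s
andThen = (>=>)

-- Convex combination: use the first transition with probability p.
mixture :: Double -> Transition s -> Transition s -> Transition s
mixture p t0 t1 s = do
  u <- uniform
  if u < p then t0 s else t1 s

-- Random-walk Metropolis with a symmetric uniform proposal.
metropolis :: Double -> (Double -> Double) -> Transition Double
metropolis radius logTarget x = do
  u <- uniform
  let proposal = x + radius * (2 * u - 1)
  v <- uniform
  let accept = log v < logTarget proposal - logTarget x
  pure (if accept then proposal else x)

-- Unnormalised log-density of a standard normal.
logNormal :: Double -> Double
logNormal x = negate (x * x) / 2

-- A compound kernel: mix a small-step and a large-step Metropolis move.
kernel :: Transition Double
kernel = mixture 0.5 (metropolis 0.5 logNormal) (metropolis 2 logNormal)

-- Run a transition n times, collecting the visited states.
chain :: Int -> Transition s -> s -> Prob [s]
chain n t s
  | n <= 0    = pure []
  | otherwise = do
      s'   <- t s
      rest <- chain (n - 1) t s'
      pure (s' : rest)
```

<p>With these definitions, something like <code class="language-plaintext highlighter-rouge">fst (runProb (chain 1000 kernel 0) 42)</code>
yields a deterministic trace for the seed 42. Only the types are guaranteed to
line up here; whether a given compound kernel actually mixes well is a
separate, empirical question.</p>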
<p><strong>Thus</strong>: novel and useful domain-specific languages for solving problems in
Bayesian statistics can be embedded in statically-typed, purely-functional
programming languages.</p>
<p>I used the twenty-minute talk period of my <a href="https://jtobin.io/assets/jtobin-defence.pdf">defence</a> to go through this
narrative and point out my claims, after which I was grilled on them for an
hour or two. The defence was probably the funnest part of my whole Ph.D.</p>
<h2 id="the-product">The Product</h2>
<p>In the end, I mainly produced a <a href="https://jtobin.io/assets/jtobin-dissertation.pdf">dissertation</a>, a few blog posts, and
some code. By my count, the following repos came out of the work:</p>
<ul>
<li>
<p><em>deanie</em>: An embedded probabilistic programming language. <br />
<a href="http://github.com/jtobin/deanie">http://github.com/jtobin/deanie</a></p>
</li>
<li>
<p><em>declarative</em>: DIY Markov Chains. <br />
<a href="http://github.com/jtobin/declarative">http://github.com/jtobin/declarative</a></p>
</li>
<li>
<p><em>flat-mcmc</em>: Painless general-purpose sampling. <br />
<a href="http://github.com/jtobin/flat-mcmc">http://github.com/jtobin/flat-mcmc</a></p>
</li>
<li>
<p><em>hasty-hamiltonian</em>: Speedy traversal through parameter space. <br />
<a href="http://github.com/jtobin/hasty-hamiltonian">http://github.com/jtobin/hasty-hamiltonian</a></p>
</li>
<li>
<p><em>hnuts</em>: Automatic gradient-based sampling. <br />
<a href="http://github.com/jtobin/hnuts">http://github.com/jtobin/hnuts</a></p>
</li>
<li>
<p><em>lazy-langevin</em>: Gradient-based diffusion. <br />
<a href="http://github.com/jtobin/lazy-langevin">http://github.com/jtobin/lazy-langevin</a></p>
</li>
<li>
<p><em>mcmc-types</em>: Common types for implementing MCMC algorithms. <br />
<a href="https://github.com/jtobin/mcmc-types">https://github.com/jtobin/mcmc-types</a></p>
</li>
<li>
<p><em>measurable</em>: A shallowly-embedded DSL for basic measure wrangling. <br />
<a href="http://github.com/jtobin/measurable">http://github.com/jtobin/measurable</a></p>
</li>
<li>
<p><em>mighty-metropolis</em>: The Metropolis sampling algorithm. <br />
<a href="http://github.com/jtobin/mighty-metropolis">http://github.com/jtobin/mighty-metropolis</a></p>
</li>
<li>
<p><em>mwc-probability</em>: Sampling function-based probability distributions. <br />
<a href="http://github.com/jtobin/mwc-probability">http://github.com/jtobin/mwc-probability</a></p>
</li>
<li>
<p><em>sampling</em>: Tools for sampling from collections. <br />
<a href="https://github.com/jtobin/sampling">https://github.com/jtobin/sampling</a></p>
</li>
<li>
<p><em>speedy-slice</em>: Speedy slice sampling. <br />
<a href="http://github.com/jtobin/speedy-slice">http://github.com/jtobin/speedy-slice</a></p>
</li>
</ul>
<p>If any of this stuff is or was useful to you, that’s great! I still use the
<em>declarative</em> libraries, <em>flat-mcmc</em>, <em>mwc-probability</em>, and <em>sampling</em> pretty
regularly. They’re fast and convenient for practical work.</p>
<p>Some of the other stuff, e.g. <em>measurable</em>, is useful for building intuition,
but not so much in practice, and <em>deanie</em>, for example, is a work-in-progress
that will probably not see much more progress (from me, at least). Continuing
from where I left off might be a good idea for someone who wants to explore
problems in this kind of setting in the future.</p>
<h2 id="general-thoughts">General Thoughts</h2>
<p>When I first read about probabilistic (functional) programming in <a href="http://danroy.org/papers/Roy-PHD-2011.pdf">Dan Roy’s
2011 dissertation</a> I was absolutely blown away by the idea. It seemed
that, since there was such an obvious connection between the structure of
Bayesian models and programming languages (via the underlying semantic graph
structure, something that has been exploited to some degree as far back as
<a href="https://www.mrc-bsu.cam.ac.uk/software/bugs/the-bugs-project-winbugs/">BUGS</a>), it was only a matter of time until someone was able to <em>really</em>
create a tool that would revolutionize the practice of Bayesian statistics.</p>
<p>Now I’m much more skeptical. It’s true that probabilistic programming tends to
expose some beautiful structure in statistical models, and that a probabilistic
programming language that was easy to use and “just worked” for inference would
be a very useful tool. But putting something expressive and usable together
that also “just works” for that inference step is very, very difficult. Very
difficult indeed.</p>
<p>Almost every probabilistic programming framework of the past ten years, from
Church down to my own stuff, has more or less wound up as “thesisware,” or
remains the exclusive publication-generating mechanism of a single research
group. The exceptions are almost unexceptional in and of themselves: JAGS and Stan
are probably the most-used such frameworks, certainly in statistics (I will
mention the very honourable PyMC here as well), but they innovate little, if at
all, over the original BUGS in terms of expressiveness. Similarly it’s very
questionable whether the fancy MCMC algo <em>du jour</em> is <em>really</em> any better than
some combination of Metropolis-Hastings (even plain Metropolis), Gibbs (or its
approximate variant, slice sampling), or <a href="https://github.com/eggplantbren/NestedSampling.hs">nested sampling</a> in anything
outside of favourably-engineered examples (I will note that Hamiltonian Monte
Carlo could probably be counted in there too, but it can still be quite a pain
to use, its variants are probably overrated, and it is comparatively
expensive).</p>
<p>Don’t get me wrong. I am a militant Bayesian. Bayesian statistics, i.e., as
far as I’m concerned, <em>probability theory</em>, describes the world accurately.
And there’s nothing wrong with thesisware, either. Research is research, and
this is a very thorny problem area. I hope to see <em>more</em> abandoned, innovative
software that moves the ball up the field, or kicks it into another stadium
entirely. Not less. The more ingenious implementations and sampling schemes
out there, the better.</p>
<p>But more broadly, I often find myself in the camp of <a href="http://www2.math.uu.se/~thulin/mm/breiman.pdf">Leo Breiman</a>, who
in 2001 characterised the two predominant cultures in statistics as those of
<em>data modelling</em> and <em>algorithmic modelling</em> respectively, the latter now known
as <em>machine learning</em>, of course. The crux of the data modelling argument,
which is of course predominant in probabilistic programming research and
Bayesian statistics more generally, is that a practitioner, by means of his or
her ingenuity, is able to suss out the essence of a problem and distill it into
a useful equation or program. Certainly there is something to this: science is
a matter of creating hypotheses, testing them against the world, and iterating
on that, and the “data modelling” procedure is absolutely scientific in
principle. Moreover, with a hat tip to <a href="https://www.quantopian.com/posts/max-dama-on-automated-trading-pdf">Max Dama</a>, one often wants to
<em>impose</em> a lot of structure on a problem, especially if the problem is in a
domain where there is a tremendous amount of noise. There are many areas where
this approach is just the thing one is looking for.</p>
<p>That said, it seems to me that a lot of the data modelling-flavoured side of
probabilistic programming, Bayesian nonparametrics, etc., is to some degree
geared more towards being, uh, “research paper friendly” than anything else.
These are extremely seductive areas for curious folks who like to play at the
intersection of math, statistics, and computer science (raises hand), and one
can easily spend a lifetime chasing this or that exquisite theoretical
construct into any number of rabbit holes. But at the end of the day, the data
modelling culture, per Breiman:</p>
<blockquote>
<p>.. has at its heart the belief that a statistician, by imagination
and by looking at the data, can invent a reasonably good parametric class of
models for a complex mechanism devised by nature.</p>
</blockquote>
<p>Certainly the traditional statistics that Breiman wrote about in 2001 was very
different from probabilistic programming and similar fields in 2018. But I
think there is the same element of hubris in them, and to some extent, a
similar dissociation from reality. I have cultivated some of the applied bent
of a Breiman, or a Dama, or a <a href="https://scottlocklin.wordpress.com/">Locklin</a>, so perhaps this should not be
too surprising.</p>
<p>I feel that the 2012-ish resurgence of neural networks jolted the machine
learning community out of a large-scale descent into some rather dubious
Bayesian nonparametrics research, which, much as I enjoy that subject area,
seemed more geared towards generating fun machine learning summer school
lectures and NIPS papers than actually getting much practical work done. I
can’t help but feel that probabilistic programming may share a few of those
same characteristics. When all is said and done, answering the question <em>is
this stuff useful?</em> often feels like a stretch.</p>
<p>So: onward & upward and all that, but my enthusiasm has been tempered somewhat,
is all.</p>
<h2 id="fini">Fini</h2>
<p>Administrative headaches and the existential questions associated with grad
school aside, I had a great time working in this area for a few years, if in my
own aloof and eccentric way.</p>
<p>If you ever interacted with this area of my work, I hope you got some utility
out of it: research ideas, use of my code, or just some blog post that you
thought was interesting during a slow day at the office. If you’re working in
the area, or are considering it, I wish you success, whether your goal is to
build practical tools, or to publish sexy papers. :-)</p>
Fubini and Applicatives2018-06-27T00:00:00+04:00https://jtobin.io/fubini<p>Take an iterated integral, e.g. \(\int_X \int_Y f(x, y) dy dx\). <a href="https://en.wikipedia.org/wiki/Fubini%27s_theorem#For_integrable_functions">Fubini’s
Theorem</a> describes the conditions under which the order of integration can
be swapped on this kind of thing while leaving its value invariant. If
Fubini’s conditions are met, you can convert your integral into \(\int_Y \int_X
f(x, y) dx dy\) and be guaranteed to obtain the same result you would have
gotten by going the other way.</p>
<p>What are these conditions? Just that you can glue your individual measures
together as a product measure, and that \(f\) is integrable with respect to it.
I.e.,</p>
\[\int_{X \times Y} | f(x, y) | d(x \times y) < \infty.\]
<p>Say you have a <a href="/giry-monad-foundations">Giry monad</a> <a href="/giry-monad-implementation">implementation</a> kicking around and you
want to see how Fubini’s Theorem works in terms of applicative functors,
monads, continuations, and all that. It’s pretty easy. You could start with
my old <a href="https://github.com/jtobin/measurable">measurable library</a> that sits on GitHub and attracts curious stars
from time to time and cook up the following example:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">import</span> <span class="nn">Control.Applicative</span> <span class="p">((</span><span class="o"><$></span><span class="p">),</span> <span class="p">(</span><span class="o"><*></span><span class="p">))</span>
<span class="kr">import</span> <span class="nn">Measurable</span>
<span class="n">dx</span> <span class="o">::</span> <span class="kt">Measure</span> <span class="kt">Int</span>
<span class="n">dx</span> <span class="o">=</span> <span class="n">bernoulli</span> <span class="mf">0.5</span>
<span class="n">dy</span> <span class="o">::</span> <span class="kt">Measure</span> <span class="kt">Double</span>
<span class="n">dy</span> <span class="o">=</span> <span class="n">beta</span> <span class="mi">1</span> <span class="mi">1</span>
<span class="n">dprod</span> <span class="o">::</span> <span class="kt">Measure</span> <span class="p">(</span><span class="kt">Int</span><span class="p">,</span> <span class="kt">Double</span><span class="p">)</span>
<span class="n">dprod</span> <span class="o">=</span> <span class="p">(,)</span> <span class="o"><$></span> <span class="n">dx</span> <span class="o"><*></span> <span class="n">dy</span>
</code></pre></div></div>
<p>Note that ‘dprod’ is clearly a product measure (I’ve constructed it using the
Applicative instance for the Giry monad, so it <a href="/giry-monad-applicative">must</a> be a product
measure) and take a simple, obviously integrable function:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">add</span> <span class="o">::</span> <span class="p">(</span><span class="kt">Int</span><span class="p">,</span> <span class="kt">Double</span><span class="p">)</span> <span class="o">-></span> <span class="kt">Double</span>
<span class="n">add</span> <span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="o">=</span> <span class="n">fromIntegral</span> <span class="n">m</span> <span class="o">+</span> <span class="n">x</span>
</code></pre></div></div>
<p>Since ‘dprod’ is a product measure, Fubini’s Theorem guarantees that the
following are equivalent:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">i0</span> <span class="o">::</span> <span class="kt">Double</span>
<span class="n">i0</span> <span class="o">=</span> <span class="n">integrate</span> <span class="n">add</span> <span class="n">dprod</span>
<span class="n">i1</span> <span class="o">::</span> <span class="kt">Double</span>
<span class="n">i1</span> <span class="o">=</span> <span class="n">integrate</span> <span class="p">(</span><span class="nf">\</span><span class="n">x</span> <span class="o">-></span> <span class="n">integrate</span> <span class="p">(</span><span class="n">curry</span> <span class="n">add</span> <span class="n">x</span><span class="p">)</span> <span class="n">dy</span><span class="p">)</span> <span class="n">dx</span>
<span class="n">i2</span> <span class="o">::</span> <span class="kt">Double</span>
<span class="n">i2</span> <span class="o">=</span> <span class="n">integrate</span> <span class="p">(</span><span class="nf">\</span><span class="n">y</span> <span class="o">-></span> <span class="n">integrate</span> <span class="p">(</span><span class="nf">\</span><span class="n">x</span> <span class="o">-></span> <span class="n">curry</span> <span class="n">add</span> <span class="n">x</span> <span class="n">y</span><span class="p">)</span> <span class="n">dx</span><span class="p">)</span> <span class="n">dy</span>
</code></pre></div></div>
<p>And indeed they are – you can verify them yourself if you don’t believe me (or
our boy Fubini).</p>
<p>For an example where interchanging the order of integration would be
impossible, we can construct some other measure:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">dpair</span> <span class="o">::</span> <span class="kt">Measure</span> <span class="p">(</span><span class="kt">Int</span><span class="p">,</span> <span class="kt">Double</span><span class="p">)</span>
<span class="n">dpair</span> <span class="o">=</span> <span class="kr">do</span>
  <span class="n">x</span> <span class="o"><-</span> <span class="n">dx</span>
  <span class="n">y</span> <span class="o"><-</span> <span class="n">fmap</span> <span class="p">(</span><span class="o">*</span> <span class="n">fromIntegral</span> <span class="n">x</span><span class="p">)</span> <span class="n">dy</span>
  <span class="n">return</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">y</span><span class="p">)</span>
</code></pre></div></div>
<p>It can be integrated as follows:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">i3</span> <span class="o">::</span> <span class="kt">Double</span>
<span class="n">i3</span> <span class="o">=</span> <span class="n">integrate</span> <span class="p">(</span><span class="nf">\</span><span class="n">x</span> <span class="o">-></span> <span class="n">integrate</span> <span class="p">(</span><span class="n">curry</span> <span class="n">add</span> <span class="n">x</span><span class="p">)</span> <span class="p">(</span><span class="n">fmap</span> <span class="p">(</span><span class="o">*</span> <span class="n">fromIntegral</span> <span class="n">x</span><span class="p">)</span> <span class="n">dy</span><span class="p">))</span> <span class="n">dx</span>
</code></pre></div></div>
<p>But notice how ‘dpair’ is constructed: it is strictly <em>monadic</em>, not
applicative, so the order of the expressions matters (the inner measure depends
on the value drawn from ‘dx’).  Since ‘dpair’ can’t be expressed as a product
measure (i.e. by an applicative expression), Fubini’s Theorem doesn’t apply to
it, and swapping the order of integration is a no-no.</p>
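<p>If you don’t have the measurable library handy, the whole comparison can be
sketched with base alone.  What follows is a minimal stand-in, not the library’s
actual implementation: the ‘Measure’ newtype, its instances, and the ‘discrete’
helper are all assumptions made for illustration, and ‘dy’ is replaced by a
two-point measure with the same mean as a beta(1, 1) so that every integral is
a finite sum.</p>

```haskell
-- Hypothetical stand-in for 'Measure': a measure is identified with its
-- integration operator, i.e. a continuation over Double.
newtype Measure a = Measure { runMeasure :: (a -> Double) -> Double }

instance Functor Measure where
  fmap f (Measure m) = Measure (\k -> m (k . f))

instance Applicative Measure where
  pure x = Measure (\k -> k x)
  Measure mf <*> Measure mx = Measure (\k -> mf (\f -> mx (k . f)))

instance Monad Measure where
  Measure m >>= f = Measure (\k -> m (\x -> runMeasure (f x) k))

integrate :: (a -> Double) -> Measure a -> Double
integrate f m = runMeasure m f

-- Finitely-supported measure built from (point, weight) pairs.
discrete :: [(a, Double)] -> Measure a
discrete pts = Measure (\k -> sum [w * k x | (x, w) <- pts])

dx :: Measure Int
dx = discrete [(0, 0.5), (1, 0.5)]        -- a fair coin on {0, 1}

dy :: Measure Double
dy = discrete [(0.25, 0.5), (0.75, 0.5)]  -- assumed two-point stand-in for beta 1 1

add :: (Int, Double) -> Double
add (m, x) = fromIntegral m + x

i0, i1, i2, i3 :: Double
-- product measure, integrated jointly and in both iterated orders
i0 = integrate add ((,) <$> dx <*> dy)
i1 = integrate (\x -> integrate (curry add x) dy) dx
i2 = integrate (\y -> integrate (\x -> curry add x y) dx) dy
-- dependent (monadic) measure: the second draw is scaled by the first
i3 = integrate add $ do
  x <- dx
  y <- fmap (* fromIntegral x) dy
  return (x, y)
```

<p>With these particular stand-ins, ‘i0’, ‘i1’, and ‘i2’ all evaluate to 1.0,
while ‘i3’ evaluates to 0.75: the dependent measure is a different object from
the product of ‘dx’ and ‘dy’, and there is no second iterated order to swap
into.</p>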
<p>Note that if you were to just look at the types of ‘dprod’ and ‘dpair’ – both
‘Measure (Int, Double)’ – you wouldn’t be able to tell immediately that one
represents a product measure while the other one does not. If being able to
tell these things apart statically is important to you (say, you want to
statically apply order-of-integration optimisations to integral expressions or
what have you), you need look no further than the <a href="/encoding-independence-statically">free applicative
functor</a> to help you out.</p>
<p>Fun fact: there is a well-known variant of Fubini’s Theorem, called Tonelli’s
Theorem, that was developed by another Italian guy at around the same time.
I’m not sure how early-20th century Italy became so strong in
order-of-integration research, exactly.</p>
Byzantine Generals and Nakamoto Consensus2018-01-22T00:00:00+04:00https://jtobin.io/byzantine-generals-nakamoto-consensus<blockquote>
<p>You can recognize truth by its beauty and simplicity.</p>
<p>– Richard Feynman (attributed)</p>
</blockquote>
<p>In one of his <a href="http://satoshi.nakamotoinstitute.org/emails/cryptography/11/">early emails</a> on the Cryptography mailing list, Satoshi
claimed that the proof-of-work chain is a solution to the <a href="http://research.cs.wisc.edu/areas/os/Qual/papers/byzantine-generals.pdf">Byzantine Generals
Problem</a> (BGP). He describes this via an example where a bunch of
generals – Byzantine ones, of course – collude to break a king’s wifi.</p>
<p>It’s interesting to look at this a little closer in the language of the
originally-stated BGP itself. One doesn’t need to be too formal to glean
useful intuition here.</p>
<p>What, more precisely, did Satoshi claim?</p>
<h2 id="the-decentralized-timestamp-server">The Decentralized Timestamp Server</h2>
<p><a href="https://bitcoin.org/bitcoin.pdf">Satoshi’s problem</a> is that of a <em>decentralized timestamp server</em> (DTS).
Namely, he posits that any number of nodes, following some protocol, can
together act as a timestamping server – producing some consistent ordering on
what we’ll consider to be abstract ‘blocks’.</p>
<p>The decentralized timestamp server reduces to an instance of the Byzantine
Generals Problem as follows. There are a bunch of nodes, who could each be
honest or dishonest. All honest nodes want to agree on some ordering – a
<em>history</em> – of blocks, and a small number of dishonest nodes should not easily
be able to compromise that history – say, by convincing the honest nodes to
adopt some alternate one of their choosing.</p>
<p>(N.b. it’s unimportant here to be concerned about the <em>contents</em> of blocks.
Since the decentralized timestamp server problem is only concerned about
block orderings, we don’t need to consider the case of invalid transactions
<em>within</em> blocks or what have you, and can safely assume that any history must
be internally consistent. We only need to assume that child blocks depend
utterly on their parents, so that rewriting a history by altering some parent
block also necessitates rewriting its children, and that honest nodes are
constantly trying to append blocks.)</p>
<p>As demonstrated in the introduction to the original paper, the Byzantine
Generals Problem can be reduced to the problem of how any given node
communicates its information to others. In our context, it reduces to the
following:</p>
<blockquote>
<p><strong>Byzantine Generals Problem</strong> (DTS)</p>
<p>A node must broadcast a history of blocks to its peers, such that:</p>
<ul>
<li>(IC1) All honest peers agree on the history.</li>
<li>(IC2) If the node is honest, then all honest peers agree with the history
it broadcasts.</li>
</ul>
</blockquote>
<p>To produce consensus, every node will communicate its history to others by
<em>using a solution to the Byzantine Generals Problem</em>.</p>
<h2 id="longest-proof-of-work-chain">Longest Proof-of-Work Chain</h2>
<p>Satoshi’s proposed solution to the BGP has since come to be known as ‘Nakamoto
Consensus’. It is the following protocol:</p>
<blockquote>
<p><strong>Nakamoto Consensus</strong></p>
<ul>
<li>Always use the longest history.</li>
<li>Appending a block to any history requires a proof that a certain amount of
work – proportional in expectation to the total ‘capability’ of the
network – has been completed.</li>
</ul>
</blockquote>
<p>To examine how it works, consider an abstract network and communication medium.
We can assume that messages are communicated instantly (it suffices that
communication is dwarfed in time by actually producing a proof of work) and
that the network is static and fixed, so that only active or ‘live’ nodes
actually contribute to consensus.</p>
<p>The crux of Nakamoto consensus is that nodes must always use the longest
available history – the one that provably has the largest amount of work
invested in it – and appending to any history requires a nontrivial amount of
work in and of itself.  Consider a set of nodes, each having some (not necessarily
shared) history. Whenever <em>any</em> node broadcasts a one-block longer history,
<em>all</em> honest nodes will immediately agree on it, and conditions (IC1) and (IC2)
are thus automatically satisfied whether or not the broadcasting node is
honest. Nakamoto Consensus trivially solves the BGP in this most important
case; we can examine other cases by examining how they reduce to this one.</p>
<p>If two or more nodes broadcast longer histories at approximately the same time,
then honest nodes may not agree on a single history for as long as it takes a
<em>longer</em> history to be produced and broadcast. As soon as this occurs (which,
in all probability, is only a matter of time), we reduce to the previous case
in which all honest nodes agree with each other again, and the BGP is resolved.</p>
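<p>The adoption rule in these two cases can be sketched in a few lines.  This
is only an illustration under simplifying assumptions: blocks are abstract
integers, proof of work is taken as given, and the names ‘Block’, ‘History’,
‘adopt’, and ‘broadcast’ are hypothetical rather than taken from any
implementation.</p>

```haskell
-- Hypothetical abstract types: a block is just an identifier, and a history
-- is an ordered list of blocks.  Proof of work is assumed, not modelled.
type Block = Int
type History = [Block]

-- An honest node switches only to a strictly longer history; on a tie it
-- keeps what it already has.
adopt :: History -> History -> History
adopt current candidate
  | length candidate > length current = candidate
  | otherwise                         = current

-- Broadcasting a candidate history: every honest node applies the same rule.
broadcast :: History -> [History] -> [History]
broadcast candidate = map (`adopt` candidate)
```

<p>For example, ‘broadcast [1, 2, 3] [[1, 2], [1, 5], [1]]’ leaves every node
holding ‘[1, 2, 3]’, whereas broadcasting the equal-length ‘[1, 5]’ to a node
holding ‘[1, 2]’ changes nothing, which is the temporary disagreement described
above.</p>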
<p>The ‘bad’ outcome we’re primarily concerned about is that of dishonest nodes
<em>rewriting history in their favour</em>, i.e. by replacing some history \(\{\ldots,
B_1, B_2, B_3, \ldots\}\) by another one \(\{\ldots, B_1, B_2', B_3',
\ldots\}\) that somehow benefits them. The idea here is that some dishonest
node (or nodes) intends to use block \(B_2\) as some sort of commitment, but
later wants to renege. To do so, the node needs to rewrite not only \(B_2\),
but all other blocks that depend on \(B_2\) (here \(B_3\), etc.), ultimately
producing a longer history than is currently agreed upon by honest peers.</p>
<p>Moreover, it needs to do this faster than honest nodes are able to produce
longer histories on their own. Catching up to and exceeding the honest nodes
becomes exponentially unlikely in the number of blocks to be rewritten, and so
a measure of confidence can be ascribed to agreement on the state of any
sub-history that has been ‘buried’ by a certain number of blocks (see the
penultimate section of Satoshi’s paper for details).</p>
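<p>The simplest form of that calculation treats the race as a gambler’s ruin
problem: if a dishonest node finds the next block with probability \(q\) and the
honest nodes do so with probability \(p = 1 - q\), then the probability of ever
catching up from \(z\) blocks behind is \((q / p)^z\) when \(p > q\), and 1
otherwise.  A small sketch of just that term follows; the function name
‘catchUp’ is mine, and the paper’s full calculation additionally accounts for
the progress the attacker makes while the honest blocks are being produced.</p>

```haskell
-- Probability that an attacker controlling a fraction q of the network's
-- capability ever catches up from z blocks behind (gambler's ruin term only).
catchUp :: Double -> Int -> Double
catchUp q z
  | z <= 0    = 1                    -- nothing to catch up on
  | q >= 0.5  = 1                    -- a majority attacker always catches up
  | otherwise = (q / (1 - q)) ^ z    -- decays exponentially in z
```

<p>With \(q = 0.25\), for instance, each additional block of depth multiplies the
attacker’s chances by a further factor of one third.</p>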
<p>Dishonest nodes that seek to replace some well-established, agreed-upon history
with another will thus find it effectively impossible (i.e. the probability is
<em>negligible</em>) unless they control a majority of the network’s capability – at
which point they no longer constitute a small number of peers.</p>
<h2 id="summary">Summary</h2>
<p>So in the language of the originally-stated BGP: Satoshi claimed that the
decentralized timestamp server is an instance of the Byzantine Generals
Problem, and that Nakamoto Consensus (as it came to be known) is a solution to
the Byzantine Generals Problem. Because Nakamoto Consensus solves the BGP,
honest nodes that always use the longest proof-of-work history in the
decentralized timestamp network will eventually come to consensus on the
ordering of blocks.</p>
Recursive Stochastic Processes2017-03-01T00:00:00+04:00https://jtobin.io/recursive-stochastic-processes<p>Last week Dan Peebles asked me on Twitter if I knew of any writing on the use
of recursion schemes for expressing stochastic processes or other probability
distributions. And I don’t! So I’ll write some of what I do know myself.</p>
<p>There are a number of popular statistical models or stochastic processes that
have an overtly recursive structure, and when one has some recursive structure
lying around, the elegant way to represent it is by way of a recursion scheme.
In the case of stochastic processes, this typically boils down to using an
anamorphism to drive things. Or, if you actually want to be able to observe
the thing (note: you do), an apomorphism.</p>
<p>By representing a stochastic process in this way one can really isolate the
probabilistic phenomena involved in it. One bundles up the essence of a
process in a coalgebra, and then drives it via some appropriate recursion
scheme.</p>
<p>Let’s take a look at three stochastic processes and examine their probabilistic
and recursive structures.</p>
<h2 id="foundations">Foundations</h2>
<p>To start, I’m going to construct a simple embedded language in the spirit of
the ones used in my <a href="/simple-probabilistic-programming">simple probabilistic programming</a> and <a href="/comonadic-mcmc">comonadic
inference</a> posts. Check those posts out if this stuff looks too
unfamiliar. Here’s a preamble that constitutes the skeleton of the code we’ll
be working with.</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">{-# LANGUAGE DeriveFunctor #-}</span>
<span class="cp">{-# LANGUAGE FlexibleContexts #-}</span>
<span class="cp">{-# LANGUAGE LambdaCase #-}</span>
<span class="cp">{-# LANGUAGE RankNTypes #-}</span>
<span class="cp">{-# LANGUAGE TypeFamilies #-}</span>
<span class="kr">import</span> <span class="nn">Control.Monad</span>
<span class="kr">import</span> <span class="nn">Control.Monad.Free</span>
<span class="kr">import</span> <span class="k">qualified</span> <span class="nn">Control.Monad.Trans.Free</span> <span class="k">as</span> <span class="n">TF</span>
<span class="kr">import</span> <span class="nn">Data.Functor.Foldable</span>
<span class="kr">import</span> <span class="nn">Data.Random</span> <span class="p">(</span><span class="kt">RVar</span><span class="p">,</span> <span class="nf">sample</span><span class="p">)</span>
<span class="kr">import</span> <span class="k">qualified</span> <span class="nn">Data.Random.Distribution.Bernoulli</span> <span class="k">as</span> <span class="n">RF</span>
<span class="kr">import</span> <span class="k">qualified</span> <span class="nn">Data.Random.Distribution.Beta</span> <span class="k">as</span> <span class="n">RF</span>
<span class="kr">import</span> <span class="k">qualified</span> <span class="nn">Data.Random.Distribution.Normal</span> <span class="k">as</span> <span class="n">RF</span>
<span class="c1">-- probabilistic instruction set, program definitions</span>
<span class="kr">data</span> <span class="kt">ModelF</span> <span class="n">a</span> <span class="n">r</span> <span class="o">=</span>
    <span class="kt">BernoulliF</span> <span class="kt">Double</span> <span class="p">(</span><span class="kt">Bool</span> <span class="o">-></span> <span class="n">r</span><span class="p">)</span>
  <span class="o">|</span> <span class="kt">GaussianF</span> <span class="kt">Double</span> <span class="kt">Double</span> <span class="p">(</span><span class="kt">Double</span> <span class="o">-></span> <span class="n">r</span><span class="p">)</span>
  <span class="o">|</span> <span class="kt">BetaF</span> <span class="kt">Double</span> <span class="kt">Double</span> <span class="p">(</span><span class="kt">Double</span> <span class="o">-></span> <span class="n">r</span><span class="p">)</span>
  <span class="o">|</span> <span class="kt">DiracF</span> <span class="n">a</span>
  <span class="kr">deriving</span> <span class="kt">Functor</span>
<span class="kr">type</span> <span class="kt">Program</span> <span class="n">a</span> <span class="o">=</span> <span class="kt">Free</span> <span class="p">(</span><span class="kt">ModelF</span> <span class="n">a</span><span class="p">)</span>
<span class="kr">type</span> <span class="kt">Model</span> <span class="n">b</span> <span class="o">=</span> <span class="n">forall</span> <span class="n">a</span><span class="o">.</span> <span class="kt">Program</span> <span class="n">a</span> <span class="n">b</span>
<span class="kr">type</span> <span class="kt">Terminating</span> <span class="n">a</span> <span class="o">=</span> <span class="kt">Program</span> <span class="n">a</span> <span class="n">a</span>
<span class="c1">-- core language terms</span>
<span class="n">bernoulli</span> <span class="o">::</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Model</span> <span class="kt">Bool</span>
<span class="n">bernoulli</span> <span class="n">p</span> <span class="o">=</span> <span class="n">liftF</span> <span class="p">(</span><span class="kt">BernoulliF</span> <span class="n">vp</span> <span class="n">id</span><span class="p">)</span> <span class="kr">where</span>
  <span class="n">vp</span>
    <span class="o">|</span> <span class="n">p</span> <span class="o"><</span> <span class="mi">0</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="o">|</span> <span class="n">p</span> <span class="o">></span> <span class="mi">1</span> <span class="o">=</span> <span class="mi">1</span>
    <span class="o">|</span> <span class="n">otherwise</span> <span class="o">=</span> <span class="n">p</span>
<span class="n">gaussian</span> <span class="o">::</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Model</span> <span class="kt">Double</span>
<span class="n">gaussian</span> <span class="n">m</span> <span class="n">s</span>
  <span class="o">|</span> <span class="n">s</span> <span class="o"><=</span> <span class="mi">0</span> <span class="o">=</span> <span class="n">error</span> <span class="s">"gaussian: variance out of bounds"</span>
  <span class="o">|</span> <span class="n">otherwise</span> <span class="o">=</span> <span class="n">liftF</span> <span class="p">(</span><span class="kt">GaussianF</span> <span class="n">m</span> <span class="n">s</span> <span class="n">id</span><span class="p">)</span>
<span class="n">beta</span> <span class="o">::</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Model</span> <span class="kt">Double</span>
<span class="n">beta</span> <span class="n">a</span> <span class="n">b</span>
  <span class="o">|</span> <span class="n">a</span> <span class="o"><=</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">b</span> <span class="o"><=</span> <span class="mi">0</span> <span class="o">=</span> <span class="n">error</span> <span class="s">"beta: parameter out of bounds"</span>
  <span class="o">|</span> <span class="n">otherwise</span> <span class="o">=</span> <span class="n">liftF</span> <span class="p">(</span><span class="kt">BetaF</span> <span class="n">a</span> <span class="n">b</span> <span class="n">id</span><span class="p">)</span>
<span class="n">dirac</span> <span class="o">::</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">Program</span> <span class="n">a</span> <span class="n">b</span>
<span class="n">dirac</span> <span class="n">x</span> <span class="o">=</span> <span class="n">liftF</span> <span class="p">(</span><span class="kt">DiracF</span> <span class="n">x</span><span class="p">)</span>
<span class="c1">-- interpreter</span>
<span class="n">rvar</span> <span class="o">::</span> <span class="kt">Program</span> <span class="n">a</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">RVar</span> <span class="n">a</span>
<span class="n">rvar</span> <span class="o">=</span> <span class="n">iterM</span> <span class="o">$</span> <span class="nf">\</span><span class="kr">case</span>
  <span class="kt">BernoulliF</span> <span class="n">p</span> <span class="n">f</span> <span class="o">-></span> <span class="kt">RF</span><span class="o">.</span><span class="n">bernoulli</span> <span class="n">p</span> <span class="o">>>=</span> <span class="n">f</span>
  <span class="kt">GaussianF</span> <span class="n">m</span> <span class="n">s</span> <span class="n">f</span> <span class="o">-></span> <span class="kt">RF</span><span class="o">.</span><span class="n">normal</span> <span class="n">m</span> <span class="n">s</span> <span class="o">>>=</span> <span class="n">f</span>
  <span class="kt">BetaF</span> <span class="n">a</span> <span class="n">b</span> <span class="n">f</span> <span class="o">-></span> <span class="kt">RF</span><span class="o">.</span><span class="n">beta</span> <span class="n">a</span> <span class="n">b</span> <span class="o">>>=</span> <span class="n">f</span>
  <span class="kt">DiracF</span> <span class="n">x</span> <span class="o">-></span> <span class="n">return</span> <span class="n">x</span>
<span class="c1">-- utilities</span>
<span class="n">free</span> <span class="o">::</span> <span class="kt">Functor</span> <span class="n">f</span> <span class="o">=></span> <span class="kt">Fix</span> <span class="n">f</span> <span class="o">-></span> <span class="kt">Free</span> <span class="n">f</span> <span class="n">a</span>
<span class="n">free</span> <span class="o">=</span> <span class="n">cata</span> <span class="kt">Free</span>
<span class="n">affine</span> <span class="o">::</span> <span class="kt">Num</span> <span class="n">a</span> <span class="o">=></span> <span class="n">a</span> <span class="o">-></span> <span class="n">a</span> <span class="o">-></span> <span class="n">a</span> <span class="o">-></span> <span class="n">a</span>
<span class="n">affine</span> <span class="n">translation</span> <span class="n">scale</span> <span class="o">=</span> <span class="p">(</span><span class="o">+</span> <span class="n">translation</span><span class="p">)</span> <span class="o">.</span> <span class="p">(</span><span class="o">*</span> <span class="n">scale</span><span class="p">)</span>
</code></pre></div></div>
<p>Just as a quick review, we’ve got:</p>
<ul>
<li>A probabilistic instruction set defined by ‘ModelF’. Each constructor
represents a foundational probability distribution that we can use in our
embedded programs.</li>
<li>Three types corresponding to probabilistic programs. The ‘Program’ type
simply wraps our instruction set up in a naïve free monad. The ‘Model’
type denotes probabilistic programs that may not necessarily
terminate (in some weak sense), while the ‘Terminating’ type denotes
probabilistic programs that terminate (ditto).</li>
<li>A bunch of embedded language terms. These are just probability
distributions; here we’ll manage with the Bernoulli, Gaussian, and beta
distributions. We also have a ‘dirac’ term for constructing a Dirac
distribution at a point.</li>
<li>A single interpreter ‘rvar’ that interprets a probabilistic program into a
random variable (where the ‘RVar’ type is provided by <a href="https://hackage.haskell.org/package/random-fu">random-fu</a>).
Typically I use <a href="https://hackage.haskell.org/package/mwc-probability">mwc-probability</a> for this but <em>random-fu</em> is quite
nice. When a program has been interpreted into a random variable we can use
‘sample’ to sample from it.</li>
</ul>
<p>So: we can write simple probabilistic programs in standard monadic fashion,
like so:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">betaBernoulli</span> <span class="o">::</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Model</span> <span class="kt">Bool</span>
<span class="n">betaBernoulli</span> <span class="n">a</span> <span class="n">b</span> <span class="o">=</span> <span class="kr">do</span>
  <span class="n">p</span> <span class="o"><-</span> <span class="n">beta</span> <span class="n">a</span> <span class="n">b</span>
  <span class="n">bernoulli</span> <span class="n">p</span>
</code></pre></div></div>
<p>and then interpret them as needed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> replicateM 10 (sample (rvar (betaBernoulli 1 8)))
[False,False,False,False,False,False,False,True,True,False]
</code></pre></div></div>
<h2 id="the-geometric-distribution">The Geometric Distribution</h2>
<p>The <a href="https://en.wikipedia.org/wiki/Geometric_distribution">geometric distribution</a> is not a stochastic process <em>per se</em>, but it
can be represented by one. If we repeatedly flip a coin and then count the
number of flips until the first head, and then consider the probability
distribution over that count, voilà. That’s the geometric distribution. You
might see a head right away, or you might be infinitely unlucky and <em>never</em> see
a head. So the distribution is supported over the entirety of the natural
numbers.</p>
<p>For illustration, we can encode the coin flipping process in a straightforward
recursive manner:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">simpleGeometric</span> <span class="o">::</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Terminating</span> <span class="kt">Int</span>
<span class="n">simpleGeometric</span> <span class="n">p</span> <span class="o">=</span> <span class="n">loop</span> <span class="mi">1</span> <span class="kr">where</span>
  <span class="n">loop</span> <span class="n">n</span> <span class="o">=</span> <span class="kr">do</span>
    <span class="n">accept</span> <span class="o"><-</span> <span class="n">bernoulli</span> <span class="n">p</span>
    <span class="kr">if</span> <span class="n">accept</span>
      <span class="kr">then</span> <span class="n">dirac</span> <span class="n">n</span>
      <span class="kr">else</span> <span class="n">loop</span> <span class="p">(</span><span class="n">n</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>We start flipping Bernoulli-distributed coins, and if we observe a head we stop
and return the number of coins flipped thus far. Otherwise we keep flipping.</p>
<p>The underlying probabilistic phenomena here are the Bernoulli draw, which
determines if we’ll terminate, and the dependent Dirac return, which will wrap
a terminating value in a point mass. The recursive procedure itself has the
pattern of:</p>
<ul>
<li>If some condition is met, abort the recursion and return a value.</li>
<li>Otherwise, keep recursing.</li>
</ul>
<p>This pattern describes an <em>apomorphism</em>, and the <em>recursion-schemes</em> type
signature of ‘apo’ is:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">apo</span> <span class="o">::</span> <span class="kt">Corecursive</span> <span class="n">t</span> <span class="o">=></span> <span class="p">(</span><span class="n">a</span> <span class="o">-></span> <span class="kt">Base</span> <span class="n">t</span> <span class="p">(</span><span class="kt">Either</span> <span class="n">t</span> <span class="n">a</span><span class="p">))</span> <span class="o">-></span> <span class="n">a</span> <span class="o">-></span> <span class="n">t</span>
</code></pre></div></div>
<p>It takes a coalgebra that returns an ‘Either’ value wrapped up in a base
functor, and uses that coalgebra to drive the recursion. A ‘Left’-returned
value halts the recursion, while a ‘Right’-returned value keeps it going.</p>
<p>Don’t be put off by the type of the coalgebra if you’re unfamiliar with
apomorphisms - its bark is worse than its bite. Check out <a href="/sorting-slower-with-style">my older post on
apomorphisms</a> for a brief introduction to them.</p>
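<p>For a quick feel for the ‘Left’-halts, ‘Right’-continues behaviour without any library machinery, here is a minimal sketch of an apomorphism specialised to plain lists, using only ‘base’. The names ‘apoList’ and ‘insertSorted’ are hypothetical helpers rather than anything from <em>recursion-schemes</em>, and ‘Maybe (a, r)’ is standing in for the list base functor:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- A list-only apomorphism (hypothetical helper, not the library's 'apo').
-- 'Maybe (a, r)' plays the role of the list base functor:
--   Nothing               -> the list ends here
--   Just (x, Left rest)   -> emit x, then splice in 'rest' and stop
--   Just (x, Right seed)  -> emit x, then keep unfolding from 'seed'
apoList :: (b -> Maybe (a, Either [a] b)) -> b -> [a]
apoList coalg = go where
  go seed = case coalg seed of
    Nothing             -> []
    Just (x, Left rest) -> x : rest
    Just (x, Right s)   -> x : go s

-- Insert into a sorted list, stopping as soon as the new element is placed.
insertSorted :: Ord a => a -> [a] -> [a]
insertSorted y = apoList coalg where
  coalg []       = Just (y, Left [])
  coalg (x : xs)
    | y <= x     = Just (y, Left (x : xs))
    | otherwise  = Just (x, Right xs)
</code></pre></div></div>
<p>Here ‘insertSorted 3 [1, 2, 4, 5]’ walks past 1 and 2 via ‘Right’, then emits 3 and hands back the untouched remainder [4, 5] via ‘Left’, giving [1,2,3,4,5]. The early exit is the whole point: nothing after the insertion point is ever inspected.</p>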
<p>With reference to the ‘apo’ type signature, the main thing to choose here is
the <a href="/tour-of-some-recursive-types">recursive type</a> that we’ll use to wrap up the ‘ModelF’ base functor.
‘Fix’ is arguably the simplest place to start, so I’ll begin with that. The
coalgebra defining the model looks like this:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">geoCoalg</span> <span class="n">p</span> <span class="n">n</span> <span class="o">=</span> <span class="kt">BernoulliF</span> <span class="n">p</span> <span class="p">(</span><span class="nf">\</span><span class="n">accept</span> <span class="o">-></span>
  <span class="kr">if</span> <span class="n">accept</span>
    <span class="kr">then</span> <span class="kt">Left</span> <span class="p">(</span><span class="kt">Fix</span> <span class="p">(</span><span class="kt">DiracF</span> <span class="n">n</span><span class="p">))</span>
    <span class="kr">else</span> <span class="kt">Right</span> <span class="p">(</span><span class="n">n</span> <span class="o">+</span> <span class="mi">1</span><span class="p">))</span>
</code></pre></div></div>
<p>Then given the coalgebra, we can just wrap it up in ‘apo’ to represent the
geometric distribution.</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">geometric</span> <span class="o">::</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Terminating</span> <span class="kt">Int</span>
<span class="n">geometric</span> <span class="n">p</span> <span class="o">=</span> <span class="n">free</span> <span class="p">(</span><span class="n">apo</span> <span class="p">(</span><span class="n">geoCoalg</span> <span class="n">p</span><span class="p">)</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>Since the geometric distribution (weakly) terminates, the program has return
type ‘Terminating Int’.</p>
<p>Since we’ve encoded the coalgebra using ‘Fix’, we have to explicitly convert
to ‘Free’ via the ‘free’ utility function I defined in the preamble. Recent
versions of <em>recursion-schemes</em> have added a ‘Corecursive’ instance for ‘Free’,
though, so the superior alternative is to just use that:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">geometric</span> <span class="o">::</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Terminating</span> <span class="kt">Int</span>
<span class="n">geometric</span> <span class="n">p</span> <span class="o">=</span> <span class="n">apo</span> <span class="n">coalg</span> <span class="mi">1</span> <span class="kr">where</span>
  <span class="n">coalg</span> <span class="n">n</span> <span class="o">=</span> <span class="kt">TF</span><span class="o">.</span><span class="kt">Free</span> <span class="p">(</span><span class="kt">BernoulliF</span> <span class="n">p</span> <span class="p">(</span><span class="nf">\</span><span class="n">accept</span> <span class="o">-></span>
    <span class="kr">if</span> <span class="n">accept</span>
      <span class="kr">then</span> <span class="kt">Left</span> <span class="p">(</span><span class="n">dirac</span> <span class="n">n</span><span class="p">)</span>
      <span class="kr">else</span> <span class="kt">Right</span> <span class="p">(</span><span class="n">n</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)))</span>
</code></pre></div></div>
<p>The point of all this is that we can <em>isolate the core probabilistic phenomena
of the recursive process</em> by factoring it out into a coalgebra. The recursion
itself takes the form of an apomorphism, which knows nothing about probability
or flipping coins or what have you - it just knows how to recurse, or stop.</p>
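<p>To see that separation in a form you can load into GHCi with nothing but ‘base’, here is a cut-down sketch. It assumes a two-constructor stand-in for ‘ModelF’, a hand-rolled ‘Fix’ and ‘apo’ in place of the <em>recursion-schemes</em> versions, and a hypothetical interpreter ‘runWith’ that replays a fixed list of coin flips instead of sampling:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}

-- Cut-down stand-ins for the post's types; only two constructors are kept.
newtype Fix f = Fix (f (Fix f))

data ModelF a r
  = BernoulliF Double (Bool -> r)  -- flip a coin, continue with the result
  | DiracF a                       -- point mass: stop with a value
  deriving Functor

-- Apomorphism producing a 'Fix': 'Left' splices in a finished structure,
-- 'Right' carries a seed to keep unfolding from.
apo :: Functor f => (a -> f (Either (Fix f) a)) -> a -> Fix f
apo coalg = Fix . fmap (either id (apo coalg)) . coalg

-- The geometric coalgebra from above, with an explicit type.
geoCoalg :: Double -> Int -> ModelF Int (Either (Fix (ModelF Int)) Int)
geoCoalg p n = BernoulliF p (\accept ->
  if accept
    then Left (Fix (DiracF n))
    else Right (n + 1))

-- Hypothetical interpreter: replay a fixed list of coin flips, ignoring 'p'.
-- Returns 'Nothing' if the flips run out before a 'DiracF' is reached.
runWith :: [Bool] -> Fix (ModelF a) -> Maybe a
runWith _        (Fix (DiracF x))       = Just x
runWith (b : bs) (Fix (BernoulliF _ k)) = runWith bs (k b)
runWith []       _                      = Nothing
</code></pre></div></div>
<p>With this, ‘runWith [False, False, True] (apo (geoCoalg 0.2) 1)’ evaluates to ‘Just 3’: two tails, then a head on the third flip. The coalgebra only ever mentions a Bernoulli draw and a Dirac return, ‘apo’ only knows how to stop or continue, and the interpreter decides what a ‘flip’ actually is.</p>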
<p>For illustration, here’s a histogram of samples drawn from the geometric via:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> replicateM 100 (sample (rvar (geometric 0.2)))
</code></pre></div></div>
<p><img src="images/geo_hist.png" alt="" class="center-image" /></p>
<h2 id="an-autoregressive-process">An Autoregressive Process</h2>
<p>Autoregressive (AR) processes simply use a previous epoch’s output as the
current epoch’s input; the number of previous epochs used as input on any given
epoch is called the <em>order</em> of the process. An AR(1) process looks like this,
for example:</p>
\[y_t = \alpha + \beta y_{t - 1} + \epsilon_t\]
<p>Here \(\epsilon_t\) are independent and identically-distributed random
variables that follow some error distribution. In other words, in this model
the value \(y_t\) follows some probability distribution given the last
epoch’s output \(y_{t - 1}\) and some parameters \(\alpha\) and
\(\beta\).</p>
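<p>The recurrence above can be unrolled directly if we pretend the errors are handed to us up front. The following is only a sketch under that assumption: ‘arTrace’ is a hypothetical helper, it uses nothing beyond ‘base’, and it involves no probability at all, just the deterministic skeleton of the process:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Unroll y_t = a + b * y_{t-1} + e_t for a caller-supplied list of errors.
-- Arguments: a, b, the origin y_0, then the errors e_1, e_2, ...
-- The origin is included as the first element of the result.
arTrace :: Double -> Double -> Double -> [Double] -> [Double]
arTrace a b = scanl step where
  step y e = a + b * y + e
</code></pre></div></div>
<p>For example, ‘arTrace 0 1 0 [1, -2, 0.5]’ gives [0.0,1.0,-1.0,-0.5]: with \(\alpha = 0\) and \(\beta = 1\) the process is just a running sum of the errors, i.e. a random walk once the errors are actually random.</p>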
<p>An autoregressive process doesn’t have any notion of termination built into it,
so the purest way to represent one is via an anamorphism. We’ll focus on AR(1)
processes in this example:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ar1</span> <span class="o">::</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Model</span> <span class="kt">Double</span>
<span class="n">ar1</span> <span class="n">a</span> <span class="n">b</span> <span class="n">s</span> <span class="o">=</span> <span class="n">ana</span> <span class="n">coalg</span> <span class="kr">where</span>
  <span class="n">coalg</span> <span class="n">x</span> <span class="o">=</span> <span class="kt">TF</span><span class="o">.</span><span class="kt">Free</span> <span class="p">(</span><span class="kt">GaussianF</span> <span class="p">(</span><span class="n">affine</span> <span class="n">a</span> <span class="n">b</span> <span class="n">x</span><span class="p">)</span> <span class="n">s</span> <span class="p">(</span><span class="n">affine</span> <span class="n">a</span> <span class="n">b</span><span class="p">))</span>
</code></pre></div></div>
<p>Each epoch is just a Gaussian-distributed affine transformation of the previous
epoch’s output. But the problem with using an anamorphism here is that it will
just shoot off to infinity, recursing endlessly. This doesn’t do us a ton of
good if we want to actually <em>observe</em> the process, so if we want to do that
we’ll need to bake in our own conditions for termination. Again we’ll rely on
an apomorphism for this; we can just specify how many periods we want to
observe the process for, and stop recursing as soon as we exceed that.</p>
<p>There are two ways to do this. We can either get a view of the process <em>at</em>
\(n\) periods in the future, or we can get a view of the process <em>over</em> \(n\)
periods in the future. I’ll write both, for illustration. The coalgebra for
the first is simpler, and looks like:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">arCoalg</span> <span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">x</span><span class="p">)</span> <span class="o">=</span> <span class="kt">TF</span><span class="o">.</span><span class="kt">Free</span> <span class="p">(</span><span class="kt">GaussianF</span> <span class="p">(</span><span class="n">affine</span> <span class="n">a</span> <span class="n">b</span> <span class="n">x</span><span class="p">)</span> <span class="n">s</span> <span class="p">(</span><span class="nf">\</span><span class="n">y</span> <span class="o">-></span>
  <span class="kr">if</span> <span class="n">n</span> <span class="o"><=</span> <span class="mi">0</span>
    <span class="kr">then</span> <span class="kt">Left</span> <span class="p">(</span><span class="n">dirac</span> <span class="n">x</span><span class="p">)</span>
    <span class="kr">else</span> <span class="kt">Right</span> <span class="p">(</span><span class="n">pred</span> <span class="n">n</span><span class="p">,</span> <span class="n">y</span><span class="p">)))</span>
</code></pre></div></div>
<p>The coalgebra is saying:</p>
<ul>
<li>Given \(x\), let \(y\) have a Gaussian distribution with mean \(\alpha +
\beta x\) and standard deviation \(s\).</li>
<li>If we’re on the last epoch, return \(x\) as a Dirac point mass.</li>
<li>Otherwise, continue recursing with \(y\) as input to the next epoch.</li>
</ul>
<p>Now, to observe the process <em>over</em> the next \(n\) periods we can just collect
the observations we’ve seen so far in a list. An implementation of the
process, apomorphism and all, looks like this:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ar</span> <span class="o">::</span> <span class="kt">Int</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Terminating</span> <span class="p">[</span><span class="kt">Double</span><span class="p">]</span>
<span class="n">ar</span> <span class="n">n</span> <span class="n">a</span> <span class="n">b</span> <span class="n">s</span> <span class="n">origin</span> <span class="o">=</span> <span class="n">apo</span> <span class="n">coalg</span> <span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="p">[</span><span class="n">origin</span><span class="p">])</span> <span class="kr">where</span>
  <span class="n">coalg</span> <span class="p">(</span><span class="n">epochs</span><span class="p">,</span> <span class="n">history</span><span class="o">@</span><span class="p">(</span><span class="n">x</span><span class="o">:</span><span class="kr">_</span><span class="p">))</span> <span class="o">=</span>
    <span class="kt">TF</span><span class="o">.</span><span class="kt">Free</span> <span class="p">(</span><span class="kt">GaussianF</span> <span class="p">(</span><span class="n">affine</span> <span class="n">a</span> <span class="n">b</span> <span class="n">x</span><span class="p">)</span> <span class="n">s</span> <span class="p">(</span><span class="nf">\</span><span class="n">y</span> <span class="o">-></span>
      <span class="kr">if</span> <span class="n">epochs</span> <span class="o"><=</span> <span class="mi">0</span>
        <span class="kr">then</span> <span class="kt">Left</span> <span class="p">(</span><span class="n">dirac</span> <span class="p">(</span><span class="n">reverse</span> <span class="n">history</span><span class="p">))</span>
        <span class="kr">else</span> <span class="kt">Right</span> <span class="p">(</span><span class="n">pred</span> <span class="n">epochs</span><span class="p">,</span> <span class="n">y</span><span class="o">:</span><span class="n">history</span><span class="p">)))</span>
</code></pre></div></div>
<p>(Note that I’m deliberately not handling the error condition here, namely an
empty history, which the ‘history@(x:_)’ pattern doesn’t match, so as to
focus on the essence of the coalgebra.)</p>
<p>We can generate some traces for it in the standard way. Here’s how we’d sample
a 100-long trace from an AR(1) process originating at 0 with \(\alpha = 0\),
\(\beta = 1\), and \(s = 1\):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> sample (rvar (ar 100 0 1 1 0))
</code></pre></div></div>
<p>and here’s a visualization of 10 of those traces:</p>
<p><img src="/images/ar_traces.png" alt="" class="center-image" /></p>
<h2 id="the-stick-breaking-process">The Stick-Breaking Process</h2>
<p>The stick breaking process is one of any number of whimsical stochastic
processes used as prior distributions in nonparametric Bayesian models. The
idea here is that we want to take a stick and endlessly break it into smaller
and smaller pieces. Every time we break a stick, we recursively take the rest
of the stick and break it again, ad infinitum.</p>
<p>Again, if we wanted to represent this endless process very faithfully, we’d use
an anamorphism to drive it. But in practice we’re going to only want to break
a stick some finite number of times, so we’ll follow the same pattern as the AR
process and use an apomorphism to do that:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sbp</span> <span class="o">::</span> <span class="kt">Int</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Terminating</span> <span class="p">[</span><span class="kt">Double</span><span class="p">]</span>
<span class="n">sbp</span> <span class="n">n</span> <span class="n">a</span> <span class="o">=</span> <span class="n">apo</span> <span class="n">coalg</span> <span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="mi">1</span><span class="p">,</span> <span class="kt">[]</span><span class="p">)</span> <span class="kr">where</span>
  <span class="n">coalg</span> <span class="p">(</span><span class="n">epochs</span><span class="p">,</span> <span class="n">stick</span><span class="p">,</span> <span class="n">sticks</span><span class="p">)</span> <span class="o">=</span> <span class="kt">TF</span><span class="o">.</span><span class="kt">Free</span> <span class="p">(</span><span class="kt">BetaF</span> <span class="mi">1</span> <span class="n">a</span> <span class="p">(</span><span class="nf">\</span><span class="n">p</span> <span class="o">-></span>
    <span class="kr">if</span> <span class="n">epochs</span> <span class="o"><=</span> <span class="mi">0</span>
      <span class="kr">then</span> <span class="kt">Left</span> <span class="p">(</span><span class="n">dirac</span> <span class="p">(</span><span class="n">reverse</span> <span class="p">(</span><span class="n">stick</span> <span class="o">:</span> <span class="n">sticks</span><span class="p">)))</span>
      <span class="kr">else</span> <span class="kt">Right</span> <span class="p">(</span><span class="n">pred</span> <span class="n">epochs</span><span class="p">,</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">p</span><span class="p">)</span> <span class="o">*</span> <span class="n">stick</span><span class="p">,</span> <span class="p">(</span><span class="n">p</span> <span class="o">*</span> <span class="n">stick</span><span class="p">)</span><span class="o">:</span><span class="n">sticks</span><span class="p">)))</span>
</code></pre></div></div>
<p>The coalgebra that defines the process says the following:</p>
<ul>
<li>Let the location \(p\) of the break on the next (normalized) stick be
beta\((1, \alpha)\)-distributed.</li>
<li>If we’re on the last epoch, return all the pieces of the stick that we broke
as a Dirac point mass.</li>
<li>Otherwise, break the stick again and recurse.</li>
</ul>
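<p>The bookkeeping in those steps is easy to check in isolation if we assume the break proportions are given rather than beta-distributed. A minimal sketch, with ‘stickBreak’ as a hypothetical helper that depends only on ‘base’:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Break a unit stick at the given proportions, each taken from what remains.
-- Returns the broken-off pieces in order, followed by the leftover stick,
-- which matches the order produced by 'reverse (stick : sticks)' above.
stickBreak :: [Double] -> [Double]
stickBreak = go 1 where
  go stick []       = [stick]
  go stick (p : ps) = p * stick : go ((1 - p) * stick) ps
</code></pre></div></div>
<p>So ‘stickBreak [0.5, 0.5]’ gives [0.5,0.25,0.25]: two breaks yield three pieces, and the pieces always account for the whole stick, which is why each draw can be read as a set of categorical weights.</p>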
<p>Here’s a plot of five separate draws from a stick breaking process with
\(\alpha = 0.2\), each one observed for five breaks. Note that each draw
encodes a categorical distribution over the set \(\{1, \ldots, 6\}\); the stick
breaking process is a ‘distribution over distributions’ in that sense:</p>
<p><img src="images/sbp_plots.png" alt="" class="center-image" /></p>
<p>The stick breaking process is useful for developing mixture models with an
unknown number of components, for example. The \(\alpha\) parameter can be
tweaked to concentrate or disperse probability mass as needed.</p>
<h2 id="conclusion">Conclusion</h2>
<p>This seems like enough for now. I’d be interested in exploring other models
generated by recursive processes just to see how they can be encoded, exactly.
Basically all of Bayesian nonparametrics is based on using recursive processes
as prior distributions, so the Dirichlet process, Chinese Restaurant Process,
Indian Buffet Process, etc. should work beautifully in this setting.</p>
<p>Fun fact: back in 2011 before <del>neural networks</del> deep learning had taken over
machine learning, Bayesian nonparametrics was probably the hottest research
area in town. I used to joke that I’d create a new prior called the Malaysian
Takeaway Process for some esoteric nonparametric model and thus achieve machine
learning fame, but never did get around to that.</p>
<h2 id="addendum">Addendum</h2>
<p>I got a question about how I produce these plots. And the answer is the only
sane way when it comes to visualization in Haskell: dump the output to disk and
plot it with something else. I use R for most of my interactive/exploratory
data science-fiddling, as well as for visualization. Python with matplotlib is
obviously a good choice too.</p>
<p>Here’s how I made the autoregressive process plot, for example. First, I just
produced the actual samples in GHCi:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> samples <- replicateM 10 (sample (rvar (ar 100 0 1 1 0)))
</code></pre></div></div>
<p>Then I wrote them to disk:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> import System.IO
> let render handle = hPutStrLn handle . filter (`notElem` "[]") . show
> withFile "trace.dat" WriteMode (\handle -> mapM_ (render handle) samples)
</code></pre></div></div>
<p>The following R script will then get you the plot:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">require</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="n">require</span><span class="p">(</span><span class="n">reshape2</span><span class="p">)</span><span class="w">
</span><span class="n">raw</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="s1">'trace.dat'</span><span class="p">,</span><span class="w"> </span><span class="n">header</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">
</span><span class="n">d</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">data.frame</span><span class="p">(</span><span class="n">t</span><span class="p">(</span><span class="n">raw</span><span class="p">),</span><span class="w"> </span><span class="n">x</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">seq_along</span><span class="p">(</span><span class="n">raw</span><span class="p">))</span><span class="w">
</span><span class="n">m</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">melt</span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="n">id.vars</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'x'</span><span class="p">)</span><span class="w">
</span><span class="n">ggplot</span><span class="p">(</span><span class="n">m</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">value</span><span class="p">,</span><span class="w"> </span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">variable</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_line</span><span class="p">()</span><span class="w">
</span></code></pre></div></div>
<p>I used <a href="http://docs.ggplot2.org/">ggplot2</a> for the other plots as well; check out the ggplot2
functions <code class="language-plaintext highlighter-rouge">geom_histogram</code>, <code class="language-plaintext highlighter-rouge">geom_bar</code>, and <code class="language-plaintext highlighter-rouge">facet_grid</code> in particular.</p>
The Applicative Structure of the Giry Monad2017-02-26T00:00:00+04:00https://jtobin.io/giry-monad-applicative<p>In my <a href="/giry-monad-foundations">last</a> <a href="/giry-monad-implementation">two</a> posts about the Giry monad I derived the thing
from its categorical and measure-theoretic foundations. I kind of thought that
those posts wouldn’t be of much interest to people but they turned out to be a
hit. I clearly can’t tell what the internet likes.</p>
<p>Anyway, something I left out was the theoretical foundation of the Giry monad’s
Applicative instance, which seemed a bit odd. I also pointed out that
applicativeness in the context of probability implies independence between
probability measures.</p>
<p>In this article I’m going to look at each of these issues. After playing
around with the foundations, it looks like the applicative instance for the
Giry monad can be put on a sound footing in terms of the standard
measure-theoretic concept of <em>product measure</em>. Also, it turns out that the
claim I made of applicativeness \(\implies\) independence is somewhat
ill-posed. But, using the shiny new intuition we’ll glean from a better
understanding of the applicative structure, we can put that on a solid footing
too.</p>
<p>So let’s look at both of these things and see what’s up.</p>
<h2 id="monoidal-functors">Monoidal Functors</h2>
<p>The foundational categorical concept behind applicative functors is the
<em>monoidal functor</em>, which is a functor between monoidal categories that
preserves monoidal structure.</p>
<p>Formally: for monoidal categories \((C, \otimes, I)\) and \((D, \oplus, J)\),
a monoidal functor \(F : C \to D\) is a functor and associated natural
transformations \(\phi : F(A) \oplus F(B) \to F(A \otimes B)\) and \(i : J \to
F(I)\) that satisfy some coherence conditions that I won’t mention here.
Notably, if \(\phi\) and \(i\) are isomorphisms (i.e. are invertible) then
\(F\) is called a <em>strong</em> monoidal functor. Otherwise it’s called <em>lax</em>.
Applicative functors in particular are lax monoidal functors.</p>
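<p>In Haskell terms, with pairs as the tensor and the unit type as the monoidal identity, this can be sketched as a type class. The class ‘Monoidal’ and the names ‘unit’, ‘>*<’, ‘unitA’, ‘pairA’, ‘pureM’ and ‘apM’ below are hypothetical, chosen just for illustration; nothing here is part of ‘base’ beyond ‘Functor’ and ‘Applicative’ themselves:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- A sketch of lax monoidal functors on Haskell types, with (,) as the tensor
-- and () as the unit object. The class and its method names are hypothetical.
class Functor f => Monoidal f where
  unit  :: f ()
  (>*<) :: f a -> f b -> f (a, b)

-- Any Applicative gives rise to this structure ...
unitA :: Applicative f => f ()
unitA = pure ()

pairA :: Applicative f => f a -> f b -> f (a, b)
pairA x y = (,) <$> x <*> y

-- ... and, going the other way, the monoidal structure recovers 'pure' and '<*>'.
pureM :: Monoidal f => a -> f a
pureM x = fmap (const x) unit

apM :: Monoidal f => f (a -> b) -> f a -> f b
apM f x = fmap (\(g, z) -> g z) (f >*< x)

-- A small instance to experiment with.
instance Monoidal Maybe where
  unit = Just ()
  Just a >*< Just b = Just (a, b)
  _      >*< _      = Nothing
</code></pre></div></div>
<p>For instance, ‘apM (Just (+ 1)) (Just 2)’ evaluates to ‘Just 3’, the same answer the usual applicative operator gives for ‘Maybe’. The pairing operation here has no inverse in general, which is the ‘lax’ part.</p>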
<p>This can be made much clearer for endofunctors on a monoidal category \((C,
\otimes, I)\). Then you only have \(F : C \to C\) and \(\phi : F(A) \otimes
F(B) \to F(A \otimes B)\) to worry about. If we sub in the Giry monad
\(\mathcal{P}\) from the last couple of posts, we’d want \(\mathcal{P} :
\textbf{Meas} \to \textbf{Meas}\) and \(\phi : \mathcal{P}(M) \otimes
\mathcal{P}(N) \to \mathcal{P}(M \otimes N)\).</p>
<p>Does the category of measurable spaces \(\textbf{Meas}\) have a monoidal
structure? Yup. Take measurable spaces \(M = (X, \mathcal{X})\) and \(N = (Y,
\mathcal{Y})\). From the Giry monad derivation we already have that the
monoidal identity \(i : M \to \mathcal{P}(M)\) corresponds to a Dirac measure
at a point, so that’s well and good. And we can define the tensor product
\(\otimes\) between \(M\) and \(N\) as follows: let \(X \times Y\) be the
standard Cartesian product on \(X\) and \(Y\) and let \(\mathcal{X} \otimes
\mathcal{Y}\) be the smallest \(\sigma\)-algebra generated by the Cartesian
product \(A \times B\) of measurable sets \(A \in \mathcal{X}\) and \(B \in
\mathcal{Y}\). Then \((X \times Y, \mathcal{X} \otimes \mathcal{Y})\) is a
measurable space, and so \((\textbf{Meas}, \otimes, i)\) is monoidal.</p>
<p>Recall that \(\mathcal{P}(M)\) and \(\mathcal{P}(N)\) - the space of measures
over \(M\) and \(N\) respectively - are themselves objects in
\(\textbf{Meas}\). So, clearly \(\mathcal{P}(M) \otimes \mathcal{P}(N)\) is a
measurable space, and if \(\mathcal{P}\) is monoidal then there must exist a
natural transformation that can take us from there to \(\mathcal{P}(M \otimes
N)\). This is the space of measures over the product \(M \otimes N\).</p>
<p>So the question is: does \(\mathcal{P}\) have the required monoidal structure?</p>
<p>Yes. It must, since \(\mathcal{P}\) is a monad, and any monad can generate the
required natural transformation. Let \(\mu\) be the monadic ‘join’ operator
\(\mathcal{P}^2 \to \mathcal{P}\) and \(\eta\) be the monadic identity
\(I \to \mathcal{P}\). We have, evaluating right-to-left:</p>
\[\phi_{\nu \times \rho} =
\mu \mathcal{P} \left\{ \lambda m .
\mu \mathcal{P}\left(\lambda n. \eta_{m \times n}\right)\mathcal{P}(\rho) \right\}
\mathcal{P}(\nu).\]
<p>Using \(\gg\!\!=\) makes this much easier to read:</p>
\[\phi_{\nu \times \rho} =
\nu \gg\!\!= \lambda m. \rho \gg\!\!= \lambda n. \eta_{m \times n}\]
<p>or in code, just:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">phi</span> <span class="o">::</span> <span class="kt">Monad</span> <span class="n">m</span> <span class="o">=></span> <span class="p">(</span><span class="n">m</span> <span class="n">a</span><span class="p">,</span> <span class="n">m</span> <span class="n">b</span><span class="p">)</span> <span class="o">-></span> <span class="n">m</span> <span class="p">(</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">)</span>
<span class="n">phi</span> <span class="p">(</span><span class="n">m</span><span class="p">,</span> <span class="n">n</span><span class="p">)</span> <span class="o">=</span> <span class="n">liftM2</span> <span class="p">(,)</span> <span class="n">m</span> <span class="n">n</span>
</code></pre></div></div>
<p>So with that we have that \((\mathcal{P}, \phi, i)\) is a (lax) monoidal
functor. And you can glean a monad-generated applicative operator from
that immediately (this leads to the function called ‘ap’ in <code class="language-plaintext highlighter-rouge">Control.Monad</code>):</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ap</span> <span class="o">::</span> <span class="kt">Monad</span> <span class="n">m</span> <span class="o">=></span> <span class="n">m</span> <span class="p">(</span><span class="n">a</span> <span class="o">-></span> <span class="n">b</span><span class="p">)</span> <span class="o">-></span> <span class="n">m</span> <span class="n">a</span> <span class="o">-></span> <span class="n">m</span> <span class="n">b</span>
<span class="n">ap</span> <span class="n">f</span> <span class="n">x</span> <span class="o">=</span> <span class="n">fmap</span> <span class="p">(</span><span class="nf">\</span><span class="p">(</span><span class="n">g</span><span class="p">,</span> <span class="n">z</span><span class="p">)</span> <span class="o">-></span> <span class="n">g</span> <span class="n">z</span><span class="p">)</span> <span class="p">(</span><span class="n">phi</span> <span class="p">(</span><span class="n">f</span><span class="p">,</span> <span class="n">x</span><span class="p">))</span>
</code></pre></div></div>
<p>(Note: I won’t refer to \(\mu\) as the join operator from this point on, in
order to free it up for denoting measures.)</p>
<h2 id="probabilistic-interpretations">Probabilistic Interpretations</h2>
<h3 id="product-measure">Product Measure</h3>
<p>The correct probabilistic interpretation here is that \(\phi\) takes a pair of
probability measures to the <strong>product measure</strong> over the appropriate product
space. For probability measures \(\mu\) and \(\nu\) on measurable spaces \(M\)
and \(N\) respectively, the product measure is the (unique) measure \(\mu
\times \nu\) on \(M \otimes N\) such that:</p>
\[(\mu \times \nu)(A \times B) = \mu(A) \nu(B)\]
<p>for \(A \times B\) a measurable set in \(M \otimes N\).</p>
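<p>The defining property is easy to check mechanically in a toy setting. The sketch below assumes finite discrete distributions with exact rational weights as a stand-in for the Giry monad; ‘Dist’, ‘measure’, ‘coin’, ‘die’ and ‘productRuleHolds’ are hypothetical names, while ‘phi’ is the same monad-generated pairing as above:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import Control.Monad (ap, liftM, liftM2)
import Data.Ratio ((%))

-- Finite discrete distributions: outcomes paired with exact weights.
newtype Dist a = Dist { runDist :: [(a, Rational)] }

instance Functor Dist where
  fmap = liftM

instance Applicative Dist where
  pure x = Dist [(x, 1)]
  (<*>)  = ap

instance Monad Dist where
  Dist xs >>= k = Dist [ (y, p * q) | (x, p) <- xs, (y, q) <- runDist (k x) ]

-- The monad-generated pairing.
phi :: Monad m => (m a, m b) -> m (a, b)
phi (m, n) = liftM2 (,) m n

-- Total weight assigned to an event, given as a predicate on outcomes.
measure :: (a -> Bool) -> Dist a -> Rational
measure event (Dist xs) = sum [ p | (x, p) <- xs, event x ]

-- A biased coin and a fair die.
coin :: Dist Bool
coin = Dist [(True, 1 % 3), (False, 2 % 3)]

die :: Dist Int
die = Dist [ (k, 1 % 6) | k <- [1 .. 6] ]

-- (mu x nu)(A x B) == mu(A) * nu(B), for A = {True} and B = the even faces.
productRuleHolds :: Bool
productRuleHolds =
  measure (\(c, d) -> c && even d) (phi (coin, die))
    == measure id coin * measure even die
</code></pre></div></div>
<p>Here ‘productRuleHolds’ evaluates to ‘True’, with both sides equal to 1/6. Using ‘Rational’ rather than ‘Double’ keeps the comparison exact, so the equality is not at the mercy of floating-point rounding.</p>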
<p>Going through the monoidal functor route seems to put the notion of the Giry
applicative instance on a more firm measure-theoretic foundation. Instead of
considering the following from the Giry monad <a href="/giry-monad-foundations">foundations article</a>:</p>
\[(\rho \, \langle \ast \rangle \, \nu)(f) = \int_{\mathcal{P}(M \to N)} \left\{\lambda T . \int_{M \to N} (f \circ T) d\nu \right\} d \rho\]
<p>which is defined in terms of the dubious space of measures over measurable
functions \(M \to N\), we can better view things using the monoidal
structure-preserving natural transformation \(\phi\). For measures \(\mu\) and
\(\nu\) on \((X, \mathcal{X})\) and \((Y, \mathcal{Y})\) respectively, we have:</p>
\[\phi(\mu, \nu)(f) = \int_{X \times Y}f d(\mu \times \nu)\]
<p>and then for measurable \(g : X \times Y \to Z\) we can use the functor structure of
\(\mathcal{P}\) to do:</p>
\[(\text{fmap} \, g \, \phi(\mu, \nu))(f) = \int_{X \times Y} (f \circ g) \, d(\mu \times \nu) = \int_{Z} f \, d((\mu \times \nu) \circ g^{-1})\]
<p>which corresponds to a standard applicative expression <code class="language-plaintext highlighter-rouge">g <$> mu <*> nu</code>. I
suspect there’s then some sort of Yoneda argument or something that makes
currying and partial function application acceptable.</p>
<h3 id="independence">Independence</h3>
<p>Now. What does this have to say about independence?</p>
<p>For a start, it’s too fast and loose to claim measures can be ‘independent’
at all. Independence is a property of measurable sets, measurable functions,
and \(\sigma\)-algebras. Not of measures! But there <em>is</em> a really useful
connection, so let’s illustrate that.</p>
<p>First, let’s define independence formally as follows. Take a probability space
\((X, \mathcal{X}, \mathbb{P})\). Then any measurable sets \(A\) and \(B\)
in \(\mathcal{X}\) are independent if</p>
\[\mathbb{P}(A \cap B) = \mathbb{P}(A)\mathbb{P}(B).\]
<p>That’s the simplest notion.</p>
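<p>As a quick worked example (an illustration, not part of the original argument),
take two tosses of a fair coin, so that \(X = \{HH, HT, TH, TT\}\) with
\(\mathbb{P}\) uniform. Let \(A = \{HH, HT\}\) (first toss heads) and \(B =
\{HH, TH\}\) (second toss heads). Then:</p>
\[\mathbb{P}(A \cap B) = \mathbb{P}(\{HH\}) = \tfrac{1}{4} = \tfrac{1}{2} \cdot \tfrac{1}{2} = \mathbb{P}(A)\mathbb{P}(B),\]
<p>so \(A\) and \(B\) are independent. By contrast, \(A\) and \(C = \{HH\}\) are
not, since \(\mathbb{P}(A \cap C) = \tfrac{1}{4}\) while
\(\mathbb{P}(A)\mathbb{P}(C) = \tfrac{1}{8}\).</p>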
<p>Next, consider two sub-\(\sigma\)-algebras \(\mathcal{A}\) and \(\mathcal{B}\)
of \(\mathcal{X}\) (a sub-\(\sigma\)-algebra is just a subset of a
\(\sigma\)-algebra that itself happens to be a \(\sigma\)-algebra). Then
\(\mathcal{A}\) and \(\mathcal{B}\) are independent if, for <em>any</em> \(A\) in
\(\mathcal{A}\) and <em>any</em> \(B\) in \(\mathcal{B}\), we have that \(A\) and
\(B\) are independent.</p>
<p>The final example is independence of measurable functions. Take measurable
functions \(f\) and \(g\) both from \(X\) to the real numbers equipped with some
appropriate \(\sigma\)-algebra \(\mathcal{B}\). Then each of these functions
<em>generates</em> a sub-\(\sigma\)-algebra of \(\mathcal{X}\) as follows:</p>
\[\begin{align*}
\mathcal{X}_{f} & = \{ f^{-1}(B) : B \in \mathcal{B} \} \\
\mathcal{X}_{g} & = \{ g^{-1}(B) : B \in \mathcal{B} \}.
\end{align*}\]
<p>Then \(f\) and \(g\) are independent if the generated \(\sigma\)-algebras
\(\mathcal{X}_{f}\) and \(\mathcal{X}_{g}\) are independent.</p>
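<p>On a finite probability space this definition can be checked mechanically. The
following is a minimal sketch under that finiteness assumption; the names
<code class="language-plaintext highlighter-rouge">Space</code>, <code class="language-plaintext highlighter-rouge">prob</code>, and <code class="language-plaintext highlighter-rouge">independent</code> are hypothetical. It only
compares preimages of single values, which suffices here because those
preimages are the atoms of the generated \(\sigma\)-algebras.</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import Data.List (nub)

-- A finite probability space: outcomes paired with their probabilities.
type Space x = [(x, Double)]

-- Probability of the event picked out by a predicate.
prob :: Space x -> (x -> Bool) -> Double
prob space event = sum [p | (x, p) <- space, event x]

-- f and g are independent if P(f = a, g = b) = P(f = a) P(g = b) for every
-- pair of attained values (compared up to a small floating-point tolerance).
independent :: (Eq a, Eq b) => Space x -> (x -> a) -> (x -> b) -> Bool
independent space f g = and
  [ abs (joint - pa * pb) < 1e-12
  | a <- nub (map (f . fst) space)
  , b <- nub (map (g . fst) space)
  , let joint = prob space (\x -> f x == a && g x == b)
        pa    = prob space (\x -> f x == a)
        pb    = prob space (\x -> g x == b)
  ]

-- Two tosses of a fair coin, encoded as pairs of Bools.
twoTosses :: Space (Bool, Bool)
twoTosses = [((a, b), 0.25) | a <- [False, True], b <- [False, True]]

main :: IO ()
main = do
  print (independent twoTosses fst snd)            -- first toss vs second toss
  print (independent twoTosses fst (uncurry (&&))) -- first toss vs "both heads"
</code></pre></div></div>
<p>The first check should succeed and the second should fail, since knowing that
both tosses came up heads determines the first toss.</p>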
<p>Note that in every case independence is defined in terms of a <em>single
measure</em>, \(\mathbb{P}\). We can’t talk about different measures being
independent. To <a href="https://terrytao.wordpress.com/2015/10/12/275a-notes-2-product-measures-and-independence/">paraphrase Terry Tao</a> here:</p>
<blockquote>
<p>The notion of independence between [measurable functions] does not make sense
if the [measurable functions] are being modeled by separate probability
spaces; they have to be coupled together into a single probability space
before independence becomes a meaningful notion.</p>
</blockquote>
<p>To be pedantic and explicitly specify the measure by which some things are
independent, some authors state that measurable functions \(f\) and \(g\) are
\(\mathbb{P}\)-independent, for example.</p>
<p>We can see a connection to independence when we look at convolution and
associated operators. Recall that for measures \(\mu\) and \(\nu\) on the same
measurable space \(M = (X, \mathcal{X})\) that supports some notion of
addition, their convolution looks like:</p>
\[(\mu + \nu)(f) = \int_{X}\int_{X} f(x + y) d\mu(x) d\nu(y).\]
<p>The probabilistic interpretation here (<a href="https://terrytao.wordpress.com/2013/07/26/computing-convolutions-of-measures/">see Terry Tao</a> on this too) is
that \(\mu + \nu\) is the measure corresponding to the sum of independent
measurable functions \(g\) and \(h\) with corresponding measures \(\mu\) and
\(\nu\) respectively.</p>
<p>That looks weird though, since we clearly defined independence between
measurable functions using a single probability measure. How is it we can talk
about independent measurable functions \(g\) and \(h\) having different
corresponding measures?</p>
<p>We first need to couple everything together into a single probability space as
per Terry’s quote. Complete \(M\) with some abstract probability measure
\(\mathbb{P}\) to form the probability space \((X, \mathcal{X}, \mathbb{P})\).
Now we have \(g\) and \(h\) measurable functions from \(X\) to \(\mathbb{R}\).</p>
<p>To say that \(g\) and \(h\) are independent is to say that their generated
\(\sigma\)-algebras are \(\mathbb{P}\)-independent. And the measures that they
correspond to are the pushforwards of \(\mathbb{P}\) under \(g\) and \(h\)
respectively. So, \(\mu = \mathbb{P} \circ g^{-1}\) and \(\nu = \mathbb{P}
\circ h^{-1}\). The result is that the measurable functions correspond to
different (pushforward) measures \(\mu\) and \(\nu\), but are independent with
respect to the same underlying probability measure \(\mathbb{P}\).</p>
<p>The monoidal structure of \(\mathcal{P}\) then gets us to convolution. Given a
product of measures \(\mu\) and \(\nu\) each on \((X, \mathcal{X})\) we can
immediately retrieve their product measure \(\mu \times \nu\) via
\(\phi\). And from there we can get to \(\mu + \nu\) via the functor structure
of \(\mathcal{P}\) - we just find the pushforward of \(\mu \times \nu\) with
respect to a function \(\rho\) that collapses a product via addition. So
\(\rho : X \times X \to \mathbb{R}\) is defined as:</p>
\[\rho(a \times b) = a + b\]
<p>and then the convolution \(\mu + \nu\) is thus:</p>
\[\mu + \nu = (\mu \times \nu) \circ \rho^{-1}.\]
<p>Other operations can be defined similarly, e.g. for \(\sigma(a \times b) = a -
b\) we get:</p>
\[\mu - \nu = (\mu \times \nu) \circ \sigma^{-1}.\]
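<p>In code, these pushforwards of a product measure are just applicative
expressions. A minimal sketch, assuming the <code class="language-plaintext highlighter-rouge">Measure</code> type from the
implementation article; <code class="language-plaintext highlighter-rouge">convolve</code>, <code class="language-plaintext highlighter-rouge">difference</code>, <code class="language-plaintext highlighter-rouge">probability</code>, and
<code class="language-plaintext highlighter-rouge">die</code> are hypothetical names used only for this example:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>newtype Measure a = Measure ((a -> Double) -> Double)

integrate :: (a -> Double) -> Measure a -> Double
integrate f (Measure nu) = nu f

instance Functor Measure where
  fmap g nu = Measure $ \f -> integrate (f . g) nu

instance Applicative Measure where
  pure x = Measure (\f -> f x)
  Measure g <*> Measure h = Measure $ \f -> g (\k -> h (f . k))

fromMassFunction :: (a -> Double) -> [a] -> Measure a
fromMassFunction f support = Measure $ \g -> sum [f x * g x | x <- support]

-- Pushforwards of the product measure under addition and subtraction.
convolve, difference :: Num a => Measure a -> Measure a -> Measure a
convolve mu nu   = (+) <$> mu <*> nu
difference mu nu = (-) <$> mu <*> nu

-- Measure of a set, computed by integrating its indicator function.
probability :: (a -> Bool) -> Measure a -> Double
probability p = integrate (\x -> if p x then 1 else 0)

-- A fair six-sided die.
die :: Measure Int
die = fromMassFunction (const (1 / 6)) [1 .. 6]

main :: IO ()
main = do
  print (probability (== 7) (convolve die die))   -- close to 6/36
  print (probability (== 0) (difference die die)) -- close to 6/36
</code></pre></div></div>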
<p>The crux of all this is that whenever we apply a measurable function to a
product measure, we can <em>always</em> extract notions of independent measurable
functions from the result. And the measures corresponding to those independent
measurable functions will be the respective components of the product
measure.</p>
<p>This is super useful and lets one claim something stronger than what the
monadic structure gives you. In an expression like <code class="language-plaintext highlighter-rouge">g <$> mu <*> nu <*> rho</code>,
you are <strong>guaranteed</strong> that the corresponding random variables \(g_\mu\),
\(g_\nu\), \(g_\rho\) (for suitable projections) are independent. The same
cannot be said if you use the monadic structure to do something like <code class="language-plaintext highlighter-rouge">g mu nu
rho</code> where the product structure is not enforced - in that case you’re not
guaranteed anything of the sort. This is why the applicative structure is
useful for <a href="/encoding-independence-statically">encoding</a> independence in a way that the monadic structure is
not.</p>
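<p>A small sketch of the difference, assuming the <code class="language-plaintext highlighter-rouge">Measure</code> instances from the
implementation article (the names <code class="language-plaintext highlighter-rouge">coin</code>, <code class="language-plaintext highlighter-rouge">viaApplicative</code>, <code class="language-plaintext highlighter-rouge">viaBind</code>, and
<code class="language-plaintext highlighter-rouge">probability</code> are hypothetical). Both pairs below have fair-coin marginals,
but only the applicative one is forced to factor as a product:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>newtype Measure a = Measure ((a -> Double) -> Double)

integrate :: (a -> Double) -> Measure a -> Double
integrate f (Measure nu) = nu f

instance Functor Measure where
  fmap g nu = Measure $ \f -> integrate (f . g) nu

instance Applicative Measure where
  pure x = Measure (\f -> f x)
  Measure g <*> Measure h = Measure $ \f -> g (\k -> h (f . k))

instance Monad Measure where
  rho >>= g = Measure $ \f -> integrate (\m -> integrate f (g m)) rho

-- A fair coin.
coin :: Measure Bool
coin = Measure $ \f -> 0.5 * f True + 0.5 * f False

-- Built from the product structure: the components are independent.
viaApplicative :: Measure (Bool, Bool)
viaApplicative = (,) <$> coin <*> coin

-- Built with bind: the second component may depend on the first.
viaBind :: Measure (Bool, Bool)
viaBind = coin >>= \x -> pure (x, x)

probability :: (a -> Bool) -> Measure a -> Double
probability p = integrate (\x -> if p x then 1 else 0)

main :: IO ()
main = do
  let bothHeads (a, b) = a && b
  print (probability bothHeads viaApplicative) -- 0.25, the product of the marginals
  print (probability bothHeads viaBind)        -- 0.5, so the components are dependent
</code></pre></div></div>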
<h2 id="conclusion">Conclusion</h2>
<p>So there you have it. Applicativeness can seemingly be put on a
straightforward measure-theoretic grounding and has some useful implications
for independence.</p>
<p>It’s worth noting that, in the case of the Giry monad, we don’t <em>need</em> to go
through its monadic structure in order to recover an applicative instance. We
can do so entirely by hacking together continuations without using a single
monadic bind. This is actually how I defined the applicative instance in the
Giry monad <a href="/giry-monad-implementation">implementation article</a> previously:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">instance</span> <span class="kt">Applicative</span> <span class="kt">Measure</span> <span class="kr">where</span>
<span class="n">pure</span> <span class="n">x</span> <span class="o">=</span> <span class="kt">Measure</span> <span class="p">(</span><span class="nf">\</span><span class="n">f</span> <span class="o">-></span> <span class="n">f</span> <span class="n">x</span><span class="p">)</span>
<span class="kt">Measure</span> <span class="n">g</span> <span class="o"><*></span> <span class="kt">Measure</span> <span class="n">h</span> <span class="o">=</span> <span class="kt">Measure</span> <span class="o">$</span> <span class="nf">\</span><span class="n">f</span> <span class="o">-></span>
<span class="n">g</span> <span class="p">(</span><span class="nf">\</span><span class="n">k</span> <span class="o">-></span> <span class="n">h</span> <span class="p">(</span><span class="n">f</span> <span class="o">.</span> <span class="n">k</span><span class="p">))</span>
</code></pre></div></div>
<p>Teasing out the exact structure of this and its relation to the codensity monad
is again something I’ll leave <a href="https://arxiv.org/pdf/1410.4432.pdf">to others</a>.</p>
Implementing the Giry Monad2017-02-13T00:00:00+04:00https://jtobin.io/giry-monad-implementation<p>In my <a href="/giry-monad-foundations">last post</a> I went over the categorical and measure-theoretic
foundations of the Giry monad, the ‘canonical’ probability monad that operates
on the level of probability measures.</p>
<p>In this post I’ll pick up from where I left off and talk about a neat and
faithful (if impractical) implementation of the Giry monad that one can put
together in Haskell.</p>
<h2 id="measure-integral-and-continuation">Measure, Integral, and Continuation</h2>
<p>So. For a quick review, we’ve established the Giry monad as a triple
\((\mathcal{P}, \mu, \eta)\), where \(\mathcal{P}\) is an endofunctor on the
category of measurable spaces \(\textbf{Meas}\), \(\mu\) is a marginalizing
integration operation defined by:</p>
\[\mu(\rho)(A) = \int_{\mathcal{P}(M)} \left\{\lambda \nu . \int_M \chi_A d \nu \right\} d \rho\]
<p>and \(\eta\) is a monoidal identity, defined by the Dirac measure at a point:</p>
\[\eta(x)(A) = \chi_A(x).\]
<p>How do we actually implement this beast? If we’re looking to be suitably
general then it is unlikely that we’re going to be able to easily represent
something like a \(\sigma\)-algebra over some space of measures on a computer,
so that route is sort of a non-starter.</p>
<p>But it can be done. The key to implementing a general-purpose Giry monad is to
notice that the fundamental operation involved in it is <em>integration</em>, and that
we can avoid working with \(\sigma\)-algebras and measurable spaces directly if
we focus on dealing with measurable <em>functions</em> instead of measurable <em>sets</em>.</p>
<p>Consider the integration map on measurable functions \(\tau_f\) that we’ve been
using this whole time. For some measurable function \(f\), \(\tau_f\) takes a
measure on some measurable space \(M = (X, \mathcal{X})\) and uses it to
integrate \(f\) over \(X\). In other words:</p>
\[\tau_f(\nu) = \int_X f d\nu.\]
<p>A measure in \(\mathcal{P}(M)\) has type \(X \to \mathbb{R}\), so \(\tau_f\)
has corresponding type \((X \to \mathbb{R}) \to \mathbb{R}\).</p>
<p>This might look familiar to you; it’s very similar to the type signature for a
<em>continuation</em>:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">newtype</span> <span class="kt">Cont</span> <span class="n">a</span> <span class="n">r</span> <span class="o">=</span> <span class="kt">Cont</span> <span class="p">((</span><span class="n">a</span> <span class="o">-></span> <span class="n">r</span><span class="p">)</span> <span class="o">-></span> <span class="n">r</span><span class="p">)</span>
</code></pre></div></div>
<p>Indeed, if we restrict the carrier type of ‘Cont’ to the reals, we can be
really faithful to the type:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">newtype</span> <span class="kt">Integral</span> <span class="n">a</span> <span class="o">=</span> <span class="kt">Integral</span> <span class="p">((</span><span class="n">a</span> <span class="o">-></span> <span class="kt">Double</span><span class="p">)</span> <span class="o">-></span> <span class="kt">Double</span><span class="p">)</span>
</code></pre></div></div>
<p>Now, let’s overload notation and call the integration map \(\tau_f\) <em>itself</em> a
measure. That is, \(\tau_f\) is a mapping \(\nu \mapsto \int_{X}fd\nu\), so
we’ll just interpret the notation \(\nu(f)\) to mean the same thing -
\(\int_{X}fd\nu\). This is convenient because we can dispense with \(\tau\)
and just pretend measures can be applied directly to measurable functions.
There’s no way we can get confused here; measures operate on <em>sets</em>, not
functions, so notation like \(\nu(f)\) is not currently in use. We just set
\(\nu(f) = \tau_f(\nu)\) and that’s that. Let’s rename the ‘Integral’ type
to match:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">newtype</span> <span class="kt">Measure</span> <span class="n">a</span> <span class="o">=</span> <span class="kt">Measure</span> <span class="p">((</span><span class="n">a</span> <span class="o">-></span> <span class="kt">Double</span><span class="p">)</span> <span class="o">-></span> <span class="kt">Double</span><span class="p">)</span>
</code></pre></div></div>
<p>We can extract a very nice shallowly-embedded language for integration here,
the core of which is a single term:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">integrate</span> <span class="o">::</span> <span class="p">(</span><span class="n">a</span> <span class="o">-></span> <span class="kt">Double</span><span class="p">)</span> <span class="o">-></span> <span class="kt">Measure</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">Double</span>
<span class="n">integrate</span> <span class="n">f</span> <span class="p">(</span><span class="kt">Measure</span> <span class="n">nu</span><span class="p">)</span> <span class="o">=</span> <span class="n">nu</span> <span class="n">f</span>
</code></pre></div></div>
<p>Note that this is the same way we’d express integration mathematically; we
specify that we want to integrate a measurable function \(f\) with respect to
some measure \(\nu\):</p>
\[\int f d\nu = \texttt{integrate f nu}.\]
<p>The only subtle difference here is that we don’t specify the space we’re
integrating over in the integral expression - instead, we’ll bake that into the
definition of the measures we create themselves. Details in a bit.</p>
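<p>As a tiny usage sketch, here are two hand-written measures and their
integrals. The names <code class="language-plaintext highlighter-rouge">dirac</code> and <code class="language-plaintext highlighter-rouge">bernoulli</code> are assumptions made for this
example; each one bakes its own space into the integration procedure.</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>newtype Measure a = Measure ((a -> Double) -> Double)

integrate :: (a -> Double) -> Measure a -> Double
integrate f (Measure nu) = nu f

-- Dirac measure at a point: integrating f just evaluates f there.
dirac :: a -> Measure a
dirac x = Measure (\f -> f x)

-- A two-point measure on Bool that puts mass p on True.
bernoulli :: Double -> Measure Bool
bernoulli p = Measure $ \f -> p * f True + (1 - p) * f False

main :: IO ()
main = do
  print (integrate (\x -> x * x) (dirac 3))                    -- 9.0
  print (integrate (\b -> if b then 1 else 0) (bernoulli 0.3)) -- mass on True
</code></pre></div></div>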
<p>What’s interesting here is that the Giry monad <em>is</em> the continuation monad with
the carrier type restricted to the reals. This isn’t surprising when you think
about what’s going on here - we’re representing measures as <em>integration
procedures</em>, that is, <strong>programs</strong> that take a measurable function as input and
then compute its integral in some particular way. A measure, as we’ve
implemented it here, is just a ‘program with a missing piece’. And this is
<a href="http://www.haskellforall.com/2012/12/the-continuation-monad.html">exactly the essence</a> of the continuation monad in Haskell.</p>
<h2 id="typeclass-instances">Typeclass Instances</h2>
<p>We can fill out the functor, applicative, and monad instances mechanically by
reference to a standard continuation monad implementation, and each
instance gives us some familiar conceptual structure or operation on
probability measures. Let’s take a look.</p>
<p>The functor instance lets us <em>transform the support</em> of a measure
while keeping its density structure invariant. If we have:</p>
\[\nu(f) = \int_X f d\nu\]
<p>then mapping a measurable function over the measure corresponds to:</p>
\[(\texttt{fmap} \, g \, \nu)(f) = \int_{X} (f \circ g) d\nu.\]
<p>The functor structure allows us to precisely express a pushforward measure or
distribution of \(\nu\) under \(g\). It lets us ‘adapt’ a measure to other
measurable spaces, <a href="http://www.haskellforall.com/2012/09/the-functor-design-pattern.html">just like a good functor should</a>.</p>
<p>In Haskell, the functor instance corresponds exactly to the math:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">instance</span> <span class="kt">Functor</span> <span class="kt">Measure</span> <span class="kr">where</span>
<span class="n">fmap</span> <span class="n">g</span> <span class="n">nu</span> <span class="o">=</span> <span class="kt">Measure</span> <span class="o">$</span> <span class="nf">\</span><span class="n">f</span> <span class="o">-></span>
<span class="n">integrate</span> <span class="p">(</span><span class="n">f</span> <span class="o">.</span> <span class="n">g</span><span class="p">)</span> <span class="n">nu</span>
</code></pre></div></div>
<p>The monad instance is exactly the Giry monad structure that we developed
previously, and it allows us to sequence probability measures together by
<em>marginalizing</em> one into another. We’ll write it in terms of bind, of course,
which went like:</p>
\[(\rho \gg\!\!= g)(f) = \int_M \left\{\lambda m . \int_N f dg(m) \right\} d \rho.\]
<p>The Haskell translation is verbatim:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">instance</span> <span class="kt">Monad</span> <span class="kt">Measure</span> <span class="kr">where</span>
<span class="n">return</span> <span class="n">x</span> <span class="o">=</span> <span class="kt">Measure</span> <span class="p">(</span><span class="nf">\</span><span class="n">f</span> <span class="o">-></span> <span class="n">f</span> <span class="n">x</span><span class="p">)</span>
<span class="n">rho</span> <span class="o">>>=</span> <span class="n">g</span> <span class="o">=</span> <span class="kt">Measure</span> <span class="o">$</span> <span class="nf">\</span><span class="n">f</span> <span class="o">-></span>
<span class="n">integrate</span> <span class="p">(</span><span class="nf">\</span><span class="n">m</span> <span class="o">-></span> <span class="n">integrate</span> <span class="n">f</span> <span class="p">(</span><span class="n">g</span> <span class="n">m</span><span class="p">))</span> <span class="n">rho</span>
</code></pre></div></div>
<p>Finally there’s the Applicative instance, which as I mentioned in the <a href="/giry-monad-foundations">last
post</a> is sort of conceptually weird here. So in the spirit of that
comment, I’m going to dodge any formal justification for now and just use the
following instance which works in practice:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">instance</span> <span class="kt">Applicative</span> <span class="kt">Measure</span> <span class="kr">where</span>
<span class="n">pure</span> <span class="n">x</span> <span class="o">=</span> <span class="kt">Measure</span> <span class="p">(</span><span class="nf">\</span><span class="n">f</span> <span class="o">-></span> <span class="n">f</span> <span class="n">x</span><span class="p">)</span>
<span class="kt">Measure</span> <span class="n">g</span> <span class="o"><*></span> <span class="kt">Measure</span> <span class="n">h</span> <span class="o">=</span> <span class="kt">Measure</span> <span class="o">$</span> <span class="nf">\</span><span class="n">f</span> <span class="o">-></span>
<span class="n">g</span> <span class="p">(</span><span class="nf">\</span><span class="n">k</span> <span class="o">-></span> <span class="n">h</span> <span class="p">(</span><span class="n">f</span> <span class="o">.</span> <span class="n">k</span><span class="p">))</span>
</code></pre></div></div>
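<p>One informal sanity check on this instance is that it should agree with the
applicative derived from the monad via <code class="language-plaintext highlighter-rouge">ap</code> from <code class="language-plaintext highlighter-rouge">Control.Monad</code>. A minimal
sketch, where <code class="language-plaintext highlighter-rouge">uniform</code> is a hypothetical helper and the test function is
arbitrary:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import Control.Monad (ap)

newtype Measure a = Measure ((a -> Double) -> Double)

integrate :: (a -> Double) -> Measure a -> Double
integrate f (Measure nu) = nu f

instance Functor Measure where
  fmap g nu = Measure $ \f -> integrate (f . g) nu

instance Applicative Measure where
  pure x = Measure (\f -> f x)
  Measure g <*> Measure h = Measure $ \f -> g (\k -> h (f . k))

instance Monad Measure where
  rho >>= g = Measure $ \f -> integrate (\m -> integrate f (g m)) rho

-- Hypothetical helper: the uniform measure on a finite list of points.
uniform :: [a] -> Measure a
uniform xs = Measure $ \f -> sum (map f xs) / fromIntegral (length xs)

main :: IO ()
main = do
  let mu = uniform [1, 2, 3 :: Double]
      nu = uniform [10, 20 :: Double]
      f x = x * x
  -- The continuation-style (<*>) and the bind-derived `ap` should give
  -- the same integral for any test function f.
  print (integrate f ((+) <$> mu <*> nu))
  print (integrate f (((+) <$> mu) `ap` nu))
</code></pre></div></div>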
<h2 id="conceptual-example">Conceptual Example</h2>
<p>It’s worth taking a look at an example of how things should conceivably work
here. Consider the following probabilistic model:</p>
\[\begin{align*}
\pi & \sim \text{beta}(\alpha, \beta) \\
\mu \, | \, \pi & \sim \text{binomial}(n, \pi)
\end{align*}\]
<p>It’s a standard hierarchical presentation. A ‘compound’ measure can be
obtained here by marginalizing over the beta measure \(\pi\), and that’s called
the <em>beta-binomial</em> measure. Let’s find it.</p>
<p>The beta distribution has support on the \([0, 1]\) subset of the reals, and
the binomial distribution with argument \(n\) has support on the \(\{0, \ldots,
n\}\) subset of the integers, so we know that things should proceed like so:</p>
\[\begin{align*}
\psi(f)
& = (\pi \gg\!\!= \mu)(f) \\
& = \int_{\mathbb{R}} \left\{\lambda p . \int_{\mathbb{Z}} f d\mu(p) \right\} d \pi.
\end{align*}\]
<p>Eliding some theory of integration, I can tell you that \(\pi\) is <a href="https://en.wikipedia.org/wiki/Absolute_continuity">absolutely
continuous</a> with respect to <a href="https://en.wikipedia.org/wiki/Lebesgue_measure">Lebesgue measure</a> and that \(\mu(p)\)
is absolutely continuous w/respect to <a href="https://en.wikipedia.org/wiki/Counting_measure">counting measure</a> for appropriate
\(p\). So, \(\pi\) <a href="https://en.wikipedia.org/wiki/Radon%E2%80%93Nikodym_theorem">admits a density</a> \(d\pi/dx = g_\pi\) and \(\mu(p)\)
admits a density \(d\mu(p)/d\# = g_{\mu(p)}\), defined as:</p>
\[g_\pi(p \, | \, \alpha, \beta) = \frac{1}{B(\alpha, \beta)} p^{\alpha - 1} (1 - p)^{\beta - 1}\]
<p>and</p>
\[g_{\mu(p)}(x \, | \, n, p) = \binom{n}{x} p^x (1 - p)^{n - x}\]
<p>respectively, for \(B\) the <a href="https://en.wikipedia.org/wiki/Beta_function">beta function</a> and \(\binom{n}{x}\) a
<a href="https://en.wikipedia.org/wiki/Binomial_coefficient">binomial coefficient</a>. Again, we can reduce the integral as follows,
transforming the outermost integral into a standard <a href="https://en.wikipedia.org/wiki/Riemann_integral">Riemann integral</a>
and the innermost integral into a simple sum of products:</p>
\[\psi(f) =
\int_{0}^{1}
\lambda p. \left\{ \lambda \alpha. \lambda \beta. g_{\pi}(p \, | \alpha, \beta)
\sum_{z \in \{0, \ldots, n\}}
f(z) \left( \lambda n. g_{\mu(p)}(z \, | \, n, p) \right)
\right\} dx.\]
<p>where \(dx\) denotes Lebesgue measure. I could expand this further or simplify
things a little more (the beta and binomial are <a href="https://en.wikipedia.org/wiki/Conjugate_prior">conjugates</a>) but you get
the point, which is that we have a way to evaluate the integral.</p>
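<p>To make the conjugacy remark concrete, here is the simplification worked out
for an indicator function, a standard calculation that is not spelled out
above. Taking \(f = \chi_{\{k\}}\) for some \(k \in \{0, \ldots, n\}\)
collapses the inner sum to a single term, and what remains is an unnormalised
beta density:</p>
\[\begin{align*}
\psi(\chi_{\{k\}})
& = \int_{0}^{1} g_\pi(p \, | \, \alpha, \beta) \binom{n}{k} p^k (1 - p)^{n - k} dp \\
& = \binom{n}{k} \frac{1}{B(\alpha, \beta)} \int_{0}^{1} p^{k + \alpha - 1} (1 - p)^{n - k + \beta - 1} dp \\
& = \binom{n}{k} \frac{B(k + \alpha, n - k + \beta)}{B(\alpha, \beta)}.
\end{align*}\]
<p>This is the usual beta-binomial mass function evaluated at \(k\).</p>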
<p>What is really required here then is to be able to encode into the
definitions of measures like \(\pi\) and \(\mu(p)\) the method of integration
to use when evaluating them. For measures absolutely continuous w/respect to
Lebesgue measure, we can use the Riemann integral over the reals. For measures
absolutely continuous w/respect to counting measure, we can use a sum of
products. In both cases, we’ll also need to supply the density or mass
function by which the integral should be evaluated.</p>
<h2 id="creating-measures">Creating Measures</h2>
<p>Recall that we are representing measures as <em>integration procedures</em>. So to
create one is to define a program by which we’ll perform integration.</p>
<p>Let’s start with the conceptually simpler case of a probability measure that’s
absolutely continuous with respect to counting measure. We need to provide
a support (the region for which probability is greater than 0) and a
probability mass function (so that we can weight every point appropriately).
Then we just want to integrate a function by evaluating it at every point in
the support, multiplying the result by that point’s probability mass, and
summing everything up. In code, this translates to:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fromMassFunction</span> <span class="o">::</span> <span class="p">(</span><span class="n">a</span> <span class="o">-></span> <span class="kt">Double</span><span class="p">)</span> <span class="o">-></span> <span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">-></span> <span class="kt">Measure</span> <span class="n">a</span>
<span class="n">fromMassFunction</span> <span class="n">f</span> <span class="n">support</span> <span class="o">=</span> <span class="kt">Measure</span> <span class="o">$</span> <span class="nf">\</span><span class="n">g</span> <span class="o">-></span>
<span class="n">foldl'</span> <span class="p">(</span><span class="nf">\</span><span class="n">acc</span> <span class="n">x</span> <span class="o">-></span> <span class="n">acc</span> <span class="o">+</span> <span class="n">f</span> <span class="n">x</span> <span class="o">*</span> <span class="n">g</span> <span class="n">x</span><span class="p">)</span> <span class="mi">0</span> <span class="n">support</span>
</code></pre></div></div>
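<p>As a quick usage sketch, here is <code class="language-plaintext highlighter-rouge">fromMassFunction</code> applied to a fair die. The
<code class="language-plaintext highlighter-rouge">die</code> measure is a hypothetical example; note that <code class="language-plaintext highlighter-rouge">foldl'</code> is imported
from <code class="language-plaintext highlighter-rouge">Data.List</code>.</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import Data.List (foldl')

newtype Measure a = Measure ((a -> Double) -> Double)

integrate :: (a -> Double) -> Measure a -> Double
integrate f (Measure nu) = nu f

fromMassFunction :: (a -> Double) -> [a] -> Measure a
fromMassFunction f support = Measure $ \g ->
  foldl' (\acc x -> acc + f x * g x) 0 support

-- A fair six-sided die: uniform mass on the support [1 .. 6].
die :: Measure Int
die = fromMassFunction (const (1 / 6)) [1 .. 6]

main :: IO ()
main = do
  print (integrate (const 1) die)    -- total mass, close to 1
  print (integrate fromIntegral die) -- expected value, close to 3.5
</code></pre></div></div>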
<p>So if we want to construct a binomial measure, we can do that like so (where
<code class="language-plaintext highlighter-rouge">choose</code> comes from <code class="language-plaintext highlighter-rouge">Numeric.SpecFunctions</code>):</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">binomial</span> <span class="o">::</span> <span class="kt">Int</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Measure</span> <span class="kt">Int</span>
<span class="n">binomial</span> <span class="n">n</span> <span class="n">p</span> <span class="o">=</span> <span class="n">fromMassFunction</span> <span class="p">(</span><span class="n">pmf</span> <span class="n">n</span> <span class="n">p</span><span class="p">)</span> <span class="p">[</span><span class="mi">0</span><span class="o">..</span><span class="n">n</span><span class="p">]</span> <span class="kr">where</span>
<span class="n">pmf</span> <span class="n">n</span> <span class="n">p</span> <span class="n">x</span>
<span class="o">|</span> <span class="n">x</span> <span class="o"><</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">n</span> <span class="o"><</span> <span class="n">x</span> <span class="o">=</span> <span class="mi">0</span>
<span class="o">|</span> <span class="n">otherwise</span> <span class="o">=</span> <span class="n">choose</span> <span class="n">n</span> <span class="n">x</span> <span class="o">*</span> <span class="n">p</span> <span class="o">^^</span> <span class="n">x</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">p</span><span class="p">)</span> <span class="o">^^</span> <span class="p">(</span><span class="n">n</span> <span class="o">-</span> <span class="n">x</span><span class="p">)</span>
</code></pre></div></div>
<p>The second example involves measures over the real line that are absolutely
continuous with respect to Lebesgue measure. In this case we want to evaluate
a Riemann integral over the entire real line, which is going to necessitate
approximation on our part. There are a bunch of methods out there for
approximating integrals, but a simple one for one-dimensional problems like
this is <a href="https://en.wikipedia.org/wiki/Numerical_integration">quadrature</a>, an implementation for which Ed Kmett has handily
packaged up in his <a href="https://hackage.haskell.org/package/integration">integration</a> package:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fromDensityFunction</span> <span class="o">::</span> <span class="p">(</span><span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span><span class="p">)</span> <span class="o">-></span> <span class="kt">Measure</span> <span class="kt">Double</span>
<span class="n">fromDensityFunction</span> <span class="n">d</span> <span class="o">=</span> <span class="kt">Measure</span> <span class="o">$</span> <span class="nf">\</span><span class="n">f</span> <span class="o">-></span>
<span class="n">quadratureTanhSinh</span> <span class="p">(</span><span class="nf">\</span><span class="n">x</span> <span class="o">-></span> <span class="n">f</span> <span class="n">x</span> <span class="o">*</span> <span class="n">d</span> <span class="n">x</span><span class="p">)</span>
<span class="kr">where</span>
<span class="n">quadratureTanhSinh</span> <span class="o">=</span> <span class="n">result</span> <span class="o">.</span> <span class="n">last</span> <span class="o">.</span> <span class="n">everywhere</span> <span class="n">trap</span>
</code></pre></div></div>
<p>Here we’re using quadrature to approximate the integral, but otherwise it has
a similar form to ‘fromMassFunction’. The difference here is that we’re
integrating over the entire real line, and so don’t have to supply a support
explicitly.</p>
<p>We can use this to create a beta measure (where <code class="language-plaintext highlighter-rouge">logBeta</code> again comes from
<code class="language-plaintext highlighter-rouge">Numeric.SpecFunctions</code>):</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">beta</span> <span class="o">::</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Measure</span> <span class="kt">Double</span>
<span class="n">beta</span> <span class="n">a</span> <span class="n">b</span> <span class="o">=</span> <span class="n">fromDensityFunction</span> <span class="p">(</span><span class="n">density</span> <span class="n">a</span> <span class="n">b</span><span class="p">)</span> <span class="kr">where</span>
<span class="n">density</span> <span class="n">a</span> <span class="n">b</span> <span class="n">p</span>
<span class="o">|</span> <span class="n">p</span> <span class="o"><</span> <span class="mi">0</span> <span class="o">||</span> <span class="n">p</span> <span class="o">></span> <span class="mi">1</span> <span class="o">=</span> <span class="mi">0</span>
<span class="o">|</span> <span class="n">otherwise</span> <span class="o">=</span> <span class="mi">1</span> <span class="o">/</span> <span class="n">exp</span> <span class="p">(</span><span class="n">logBeta</span> <span class="n">a</span> <span class="n">b</span><span class="p">)</span> <span class="o">*</span> <span class="n">p</span> <span class="o">**</span> <span class="p">(</span><span class="n">a</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">p</span><span class="p">)</span> <span class="o">**</span> <span class="p">(</span><span class="n">b</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>Note that since we’re going to be integrating over the entire real line and the
beta distribution has support only over \([0, 1]\), we need to implicitly
define the support here by specifying which regions of the domain will lead to
a density of 0.</p>
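<p>The tanh-sinh quadrature above relies on an external package. As a rough,
dependency-free stand-in (a different and much cruder technique), a composite
trapezoid rule over a bounded interval yields the same kind of measure. The
names <code class="language-plaintext highlighter-rouge">trapezoid</code> and <code class="language-plaintext highlighter-rouge">fromDensityOn</code> are assumptions of this sketch, and
the interval has to be chosen wide enough to hold essentially all of the mass.</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>newtype Measure a = Measure ((a -> Double) -> Double)

integrate :: (a -> Double) -> Measure a -> Double
integrate f (Measure nu) = nu f

-- Composite trapezoid rule on [lo, hi] with n subintervals.
trapezoid :: Int -> Double -> Double -> (Double -> Double) -> Double
trapezoid n lo hi f = h * (0.5 * f lo + sum (map f interior) + 0.5 * f hi)
  where
    h        = (hi - lo) / fromIntegral n
    interior = [lo + fromIntegral i * h | i <- [1 .. n - 1]]

-- Measure with density d, integrated over the bounded interval [lo, hi].
fromDensityOn :: Double -> Double -> (Double -> Double) -> Measure Double
fromDensityOn lo hi d = Measure $ \f ->
  trapezoid 1000 lo hi (\x -> f x * d x)

-- Standard normal density, truncated to [-8, 8] for integration purposes.
standardNormal :: Measure Double
standardNormal = fromDensityOn (-8) 8 density
  where
    density x = exp (negate (x * x) / 2) / sqrt (2 * pi)

main :: IO ()
main = do
  print (integrate (const 1) standardNormal)     -- total mass, close to 1
  print (integrate (\x -> x * x) standardNormal) -- second moment, close to 1
</code></pre></div></div>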
<p>In any case, now that we’ve constructed those things we can just use
a monadic bind to create the beta-binomial measure we described before. It
masks a lot of under-the-hood complexity.</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">betaBinomial</span> <span class="o">::</span> <span class="kt">Int</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Measure</span> <span class="kt">Int</span>
<span class="n">betaBinomial</span> <span class="n">n</span> <span class="n">a</span> <span class="n">b</span> <span class="o">=</span> <span class="n">beta</span> <span class="n">a</span> <span class="n">b</span> <span class="o">>>=</span> <span class="n">binomial</span> <span class="n">n</span>
</code></pre></div></div>
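<p>To see the compound measure in action without the external packages, here is a
self-contained sketch. It swaps in a plain midpoint rule on \([0, 1]\) for the
tanh-sinh quadrature, normalises the beta kernel numerically instead of calling
<code class="language-plaintext highlighter-rouge">logBeta</code>, and uses a local <code class="language-plaintext highlighter-rouge">choose</code>; all of these are assumptions of the
example rather than the definitions used above.</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>newtype Measure a = Measure ((a -> Double) -> Double)

integrate :: (a -> Double) -> Measure a -> Double
integrate f (Measure nu) = nu f

instance Functor Measure where
  fmap g nu = Measure $ \f -> integrate (f . g) nu

instance Applicative Measure where
  pure x = Measure (\f -> f x)
  Measure g <*> Measure h = Measure $ \f -> g (\k -> h (f . k))

instance Monad Measure where
  rho >>= g = Measure $ \f -> integrate (\m -> integrate f (g m)) rho

-- Stand-in quadrature: midpoint rule on [0, 1] with 2000 cells.
midpoint :: (Double -> Double) -> Double
midpoint f = sum [f ((fromIntegral i + 0.5) / n) | i <- [0 .. 1999 :: Int]] / n
  where
    n = 2000

-- Beta measure, with the normalising constant computed numerically.
beta :: Double -> Double -> Measure Double
beta a b = Measure $ \f -> midpoint (\p -> f p * kernel p) / midpoint kernel
  where
    kernel p = p ** (a - 1) * (1 - p) ** (b - 1)

binomial :: Int -> Double -> Measure Int
binomial n p = Measure $ \f -> sum [pmf k * f k | k <- [0 .. n]]
  where
    pmf k      = choose n k * p ^ k * (1 - p) ^ (n - k)
    -- Binomial coefficient as a product of ratios.
    choose m k = product [fromIntegral (m - k + i) / fromIntegral i | i <- [1 .. k]]

betaBinomial :: Int -> Double -> Double -> Measure Int
betaBinomial n a b = beta a b >>= binomial n

main :: IO ()
main = do
  let psi = betaBinomial 10 2 3
  print (integrate (const 1) psi)    -- total mass, close to 1
  print (integrate fromIntegral psi) -- mean, close to n * a / (a + b) = 4
</code></pre></div></div>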
<p>There are a couple of other useful ways to create measures, but the most
notable is to use a sample in order to create an <a href="https://en.wikipedia.org/wiki/Empirical_measure">empirical measure</a>.
This is equivalent to passing in a specific support for which the mass function
assigns equal probability to every element; I’ll use Gabriel Gonzalez’s
<a href="https://hackage.haskell.org/package/foldl">foldl</a> package here as it’s pretty elegant:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fromSample</span> <span class="o">::</span> <span class="kt">Foldable</span> <span class="n">f</span> <span class="o">=></span> <span class="n">f</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">Measure</span> <span class="n">a</span>
<span class="n">fromSample</span> <span class="o">=</span> <span class="kt">Measure</span> <span class="o">.</span> <span class="n">flip</span> <span class="n">weightedAverage</span>
<span class="n">weightedAverage</span> <span class="o">::</span> <span class="p">(</span><span class="kt">Foldable</span> <span class="n">f</span><span class="p">,</span> <span class="kt">Fractional</span> <span class="n">r</span><span class="p">)</span> <span class="o">=></span> <span class="p">(</span><span class="n">a</span> <span class="o">-></span> <span class="n">r</span><span class="p">)</span> <span class="o">-></span> <span class="n">f</span> <span class="n">a</span> <span class="o">-></span> <span class="n">r</span>
<span class="n">weightedAverage</span> <span class="n">f</span> <span class="o">=</span> <span class="kt">Foldl</span><span class="o">.</span><span class="n">fold</span> <span class="p">(</span><span class="n">weightedAverageFold</span> <span class="n">f</span><span class="p">)</span> <span class="kr">where</span>
  <span class="n">weightedAverageFold</span> <span class="o">::</span> <span class="kt">Fractional</span> <span class="n">r</span> <span class="o">=></span> <span class="p">(</span><span class="n">a</span> <span class="o">-></span> <span class="n">r</span><span class="p">)</span> <span class="o">-></span> <span class="kt">Fold</span> <span class="n">a</span> <span class="n">r</span>
  <span class="n">weightedAverageFold</span> <span class="n">f</span> <span class="o">=</span> <span class="kt">Foldl</span><span class="o">.</span><span class="n">premap</span> <span class="n">f</span> <span class="n">averageFold</span>
  <span class="n">averageFold</span> <span class="o">::</span> <span class="kt">Fractional</span> <span class="n">a</span> <span class="o">=></span> <span class="kt">Fold</span> <span class="n">a</span> <span class="n">a</span>
  <span class="n">averageFold</span> <span class="o">=</span> <span class="p">(</span><span class="o">/</span><span class="p">)</span> <span class="o"><$></span> <span class="kt">Foldl</span><span class="o">.</span><span class="n">sum</span> <span class="o"><*></span> <span class="kt">Foldl</span><span class="o">.</span><span class="n">genericLength</span>
</code></pre></div></div>
<p>Using ‘fromSample’ you can create an empirical measure using just about
anything you’d like:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">Foo</span> <span class="o">=</span> <span class="kt">Foo</span> <span class="o">|</span> <span class="kt">Bar</span> <span class="o">|</span> <span class="kt">Baz</span>
<span class="n">foos</span> <span class="o">::</span> <span class="p">[</span><span class="kt">Foo</span><span class="p">]</span>
<span class="n">foos</span> <span class="o">=</span> <span class="p">[</span><span class="kt">Foo</span><span class="p">,</span> <span class="kt">Foo</span><span class="p">,</span> <span class="kt">Bar</span><span class="p">,</span> <span class="kt">Foo</span><span class="p">,</span> <span class="kt">Baz</span><span class="p">,</span> <span class="kt">Foo</span><span class="p">,</span> <span class="kt">Bar</span><span class="p">,</span> <span class="kt">Foo</span><span class="p">,</span> <span class="kt">Foo</span><span class="p">,</span> <span class="kt">Foo</span><span class="p">,</span> <span class="kt">Bar</span><span class="p">]</span>
<span class="n">nu</span> <span class="o">::</span> <span class="kt">Measure</span> <span class="kt">Foo</span>
<span class="n">nu</span> <span class="o">=</span> <span class="n">fromSample</span> <span class="n">foos</span>
</code></pre></div></div>
<p>Though I won’t demonstrate it here, you can use this approach to also create
measures from sampling functions or random variables that use a source of
randomness - just draw a sample from the function and pipe the result into
‘fromSample’.</p>
<h2 id="querying-measures">Querying Measures</h2>
<p>To <em>query</em> a measure is to simply get some result out of it, and we do that by
integrating some measurable function against it. The easiest thing to do is to
just take a straightforward expectation by integrating the identity function;
for example, here’s the expected value of a beta(10, 10) measure:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> integrate id (beta 10 10)
0.49999999999501316
</code></pre></div></div>
<p>The expected value of a beta(\(\alpha\), \(\beta\)) distribution is \(\alpha /
(\alpha + \beta)\), so we can verify analytically that the result should be
0.5. We observe a bit of numerical imprecision here because, if you’ll recall,
we’re just <em>approximating</em> the integral via quadrature. For measures created
via ‘fromMassFunction’ we don’t need to use quadrature, so we won’t observe the
same kind of approximation error. Here’s the expected value of a binomial(10,
0.5) measure, for example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> integrate fromIntegral (binomial 10 0.5)
5.0
</code></pre></div></div>
<p>Note here that we’re integrating the ‘fromIntegral’ function against the
binomial measure. This is because the binomial measure is defined over the
integers, rather than the reals, and we <em>always</em> need to evaluate to a real
when we integrate. That’s part of the definition of a measure!</p>
<p>Let’s calculate the expectation of the beta-binomial distribution with \(n =
10\), \(\alpha = 1\), and \(\beta = 8\):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> integrate fromIntegral (betaBinomial 10 1 8)
1.108635884924813
</code></pre></div></div>
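<p>As a rough analytic cross-check (using the standard closed form for the mean
of a beta-binomial distribution, which isn’t derived in this post), the
expectation should be:</p>
\[\mathbb{E}X = \frac{n\alpha}{\alpha + \beta} = \frac{10 \cdot 1}{1 + 8} = \frac{10}{9} \approx 1.111\]
<p>so the computed value is in the right neighbourhood, with the gap presumably
down to the quadrature used for the beta measure.</p>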
<p>Neato. And since we can integrate like this, we can really compute any of the
<a href="https://en.wikipedia.org/wiki/Moment_(mathematics)">moments</a> of a measure. The first raw moment is what we’ve been doing
here, and is called the expectation:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">expectation</span> <span class="o">::</span> <span class="kt">Measure</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span>
<span class="n">expectation</span> <span class="o">=</span> <span class="n">integrate</span> <span class="n">id</span>
</code></pre></div></div>
<p>The second (central) moment is the <em>variance</em>. Here I mean variance in the
moment-based sense, rather than as the possibly better-known sample variance:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">variance</span> <span class="o">::</span> <span class="kt">Measure</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span>
<span class="n">variance</span> <span class="n">nu</span> <span class="o">=</span> <span class="n">integrate</span> <span class="p">(</span><span class="o">^</span> <span class="mi">2</span><span class="p">)</span> <span class="n">nu</span> <span class="o">-</span> <span class="n">expectation</span> <span class="n">nu</span> <span class="o">^</span> <span class="mi">2</span>
</code></pre></div></div>
<p>The variance of a binomial(\(n\), \(p\)) distribution is known to be
\(np(1-p)\), so for \(n = 10\) and \(p = 0.5\) we should get 2.5:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> variance (binomial 10 0.5)
<interactive>:87:11: error:
• Couldn't match type ‘Int’ with ‘Double’
Expected type: Measure Double
Actual type: Measure Int
• In the first argument of ‘variance’, namely ‘(binomial 10 0.5)’
In the expression: variance (binomial 10 0.5)
In an equation for ‘it’: it = variance (binomial 10 0.5)
</code></pre></div></div>
<p>Ahhh, but remember: the binomial measure is defined over the <em>integers</em>, while
‘variance’ expects a measure over the reals, so we can’t pass it in directly. No
matter - the functorial structure allows us to adapt it to any other measurable
space via a measurable function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> variance (fmap fromIntegral (binomial 10 0.5))
2.5
</code></pre></div></div>
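<p>If you’d like to reproduce that without the rest of the code from this post,
here’s a minimal self-contained sketch. The definitions of ‘Measure’,
‘integrate’, ‘fromMassFunction’, and ‘binomial’ below are assumptions standing in
for the ones used above (in particular, this ‘fromMassFunction’ takes a plain
list as its support, and ‘choose’ is a hypothetical helper):</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Assumed representation: a measure is an integration continuation.
newtype Measure a = Measure ((a -> Double) -> Double)

integrate :: (a -> Double) -> Measure a -> Double
integrate f (Measure nu) = nu f

-- 'fmap' pushes a measure forward along a function.
instance Functor Measure where
  fmap g nu = Measure (\f -> integrate (f . g) nu)

-- Hypothetical list-based stand-in for 'fromMassFunction': a weighted sum
-- of the integrand over the support.
fromMassFunction :: (a -> Double) -> [a] -> Measure a
fromMassFunction p support = Measure (\f -> sum [f x * p x | x <- support])

-- Binomial coefficient, computed over Integer to avoid overflow.
choose :: Int -> Int -> Integer
choose n k =
  product [toInteger (n - k + 1) .. toInteger n] `div` product [1 .. toInteger k]

binomial :: Int -> Double -> Measure Int
binomial n p = fromMassFunction mass [0 .. n] where
  mass x = fromIntegral (choose n x) * p ^ x * (1 - p) ^ (n - x)

expectation :: Measure Double -> Double
expectation = integrate id

variance :: Measure Double -> Double
variance nu = integrate (^ 2) nu - expectation nu ^ 2
</code></pre></div></div>
<p>With these definitions, ‘variance (fmap fromIntegral (binomial 10 0.5))’
typechecks and should agree with the analytic value \(np(1-p) = 2.5\), whereas
‘variance (binomial 10 0.5)’ is rejected just like above.</p>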
<p>Expectation and variance (and other moments) are pretty well-known, but you can
do more exotic things as well. You can calculate the <a href="https://en.wikipedia.org/wiki/Moment-generating_function">moment generating
function</a> for a measure, for example:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">momentGeneratingFunction</span> <span class="o">::</span> <span class="kt">Measure</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span>
<span class="n">momentGeneratingFunction</span> <span class="n">nu</span> <span class="n">t</span> <span class="o">=</span> <span class="n">integrate</span> <span class="p">(</span><span class="nf">\</span><span class="n">x</span> <span class="o">-></span> <span class="n">exp</span> <span class="p">(</span><span class="n">t</span> <span class="o">*</span> <span class="n">x</span><span class="p">))</span> <span class="n">nu</span>
</code></pre></div></div>
<p>and the <a href="https://en.wikipedia.org/wiki/Cumulant">cumulant generating function</a> follows naturally:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cumulantGeneratingFunction</span> <span class="o">::</span> <span class="kt">Measure</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span>
<span class="n">cumulantGeneratingFunction</span> <span class="n">nu</span> <span class="o">=</span> <span class="n">log</span> <span class="o">.</span> <span class="n">momentGeneratingFunction</span> <span class="n">nu</span>
</code></pre></div></div>
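<p>Spelled out in the usual notation (these are the standard definitions, not
anything specific to this implementation):</p>
\[M_\nu(t) = \int e^{tx} d\nu(x), \qquad K_\nu(t) = \log M_\nu(t)\]
<p>The derivatives of \(K_\nu\) at zero recover the cumulants, so \(K_\nu'(0)\) is
the expectation and \(K_\nu''(0)\) is the variance.</p>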
<p>A particularly useful construct is the <a href="https://en.wikipedia.org/wiki/Cumulative_distribution_function">cumulative distribution function</a>
for a measure, which calculates the probability of a region less than or equal
to some number:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cdf</span> <span class="o">::</span> <span class="kt">Measure</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span>
<span class="n">cdf</span> <span class="n">nu</span> <span class="n">x</span> <span class="o">=</span> <span class="n">integrate</span> <span class="p">(</span><span class="n">negativeInfinity</span> <span class="p">`</span><span class="n">to</span><span class="p">`</span> <span class="n">x</span><span class="p">)</span> <span class="n">nu</span>
<span class="n">negativeInfinity</span> <span class="o">::</span> <span class="kt">Double</span>
<span class="n">negativeInfinity</span> <span class="o">=</span> <span class="n">negate</span> <span class="p">(</span><span class="mi">1</span> <span class="o">/</span> <span class="mi">0</span><span class="p">)</span>
<span class="n">to</span> <span class="o">::</span> <span class="p">(</span><span class="kt">Num</span> <span class="n">a</span><span class="p">,</span> <span class="kt">Ord</span> <span class="n">a</span><span class="p">)</span> <span class="o">=></span> <span class="n">a</span> <span class="o">-></span> <span class="n">a</span> <span class="o">-></span> <span class="n">a</span> <span class="o">-></span> <span class="n">a</span>
<span class="n">to</span> <span class="n">a</span> <span class="n">b</span> <span class="n">x</span>
  <span class="o">|</span> <span class="n">x</span> <span class="o">>=</span> <span class="n">a</span> <span class="o">&&</span> <span class="n">x</span> <span class="o"><=</span> <span class="n">b</span> <span class="o">=</span> <span class="mi">1</span>
  <span class="o">|</span> <span class="n">otherwise</span> <span class="o">=</span> <span class="mi">0</span>
</code></pre></div></div>
<p>The beta(2, 2) distribution is symmetric around its mean 0.5, so the
probability of the region \([0, 0.5]\) should itself be 0.5. This checks out
as expected, modulo approximation error due to quadrature:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> cdf (beta 2 2) 0.5
0.4951814897381374
</code></pre></div></div>
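<p>For reference, the exact value is easy to get by hand here, since the
beta(2, 2) density is \(6x(1 - x)\) on \([0, 1]\):</p>
\[\int_0^{0.5} 6x(1 - x) dx = 6\left[\frac{x^2}{2} - \frac{x^3}{3}\right]_0^{0.5} = 6\left(\frac{1}{8} - \frac{1}{24}\right) = \frac{1}{2}\]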
<p>Similarly for measurable spaces without any notion of order, there’s a simple
CDF analogue that calculates the probability of a region that contains the
given points:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">containing</span> <span class="o">::</span> <span class="p">(</span><span class="kt">Num</span> <span class="n">a</span><span class="p">,</span> <span class="kt">Eq</span> <span class="n">b</span><span class="p">)</span> <span class="o">=></span> <span class="p">[</span><span class="n">b</span><span class="p">]</span> <span class="o">-></span> <span class="n">b</span> <span class="o">-></span> <span class="n">a</span>
<span class="n">containing</span> <span class="n">xs</span> <span class="n">x</span>
  <span class="o">|</span> <span class="n">x</span> <span class="p">`</span><span class="n">elem</span><span class="p">`</span> <span class="n">xs</span> <span class="o">=</span> <span class="mi">1</span>
  <span class="o">|</span> <span class="n">otherwise</span> <span class="o">=</span> <span class="mi">0</span>
</code></pre></div></div>
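<p>As a small illustration, here’s how ‘containing’ might be used to query the
empirical measure over ‘Foo’ from earlier. This is a self-contained sketch under
a few assumptions: the ‘Measure’ newtype and ‘integrate’ are my stand-ins for the
ones used in this post, ‘fromSample’ is a simplified list-only version that
doesn’t use foldl, ‘Foo’ gains a ‘deriving Eq’ so that ‘elem’ works, and ‘probOf’
is a hypothetical helper:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Assumed representation: a measure is an integration continuation.
newtype Measure a = Measure ((a -> Double) -> Double)

integrate :: (a -> Double) -> Measure a -> Double
integrate f (Measure nu) = nu f

-- Simplified, list-only stand-in for 'fromSample': integrate by averaging
-- the integrand over the sample.
fromSample :: [a] -> Measure a
fromSample xs = Measure (\f -> sum (map f xs) / fromIntegral (length xs))

containing :: (Num a, Eq b) => [b] -> b -> a
containing xs x
  | x `elem` xs = 1
  | otherwise   = 0

data Foo = Foo | Bar | Baz deriving Eq

foos :: [Foo]
foos = [Foo, Foo, Bar, Foo, Baz, Foo, Bar, Foo, Foo, Foo, Bar]

-- Hypothetical helper: the mass the empirical measure assigns to a set.
probOf :: [Foo] -> Double
probOf xs = integrate (containing xs) (fromSample foos)
</code></pre></div></div>
<p>The sample contains seven ‘Foo’s, three ‘Bar’s, and a single ‘Baz’, so
‘probOf [Foo]’ should come out at 7/11 and ‘probOf [Bar, Baz]’ at 4/11.</p>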
<p>And probably the least interesting query of all is the simple ‘volume’, which
calculates the total measure of a space. For any probability measure this must
obviously be one, so it can at least be used as a quick sanity check:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">volume</span> <span class="o">::</span> <span class="kt">Measure</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span>
<span class="n">volume</span> <span class="o">=</span> <span class="n">integrate</span> <span class="p">(</span><span class="n">const</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<h2 id="convolution-and-friends">Convolution and Friends</h2>
<p>I mentioned in the <a href="/giry-monad-foundations">last post</a> that applicativeness corresponds to
independence in some sense, and that independent measures over the same
measurable space can be <a href="https://en.wikipedia.org/wiki/Convolution#Convolution_of_measures">convolved</a> together, à la:</p>
\[(\nu + \zeta)(f) = \int_{M}\int_{M}f(x + y)d\nu(x)d\zeta(y)\]
<p>for measures \(\nu\) and \(\zeta\) on \(M\). In Haskell-land it’s well-known
that any applicative instance gives you a free ‘Num’ instance, and the story is
no different here:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">instance</span> <span class="kt">Num</span> <span class="n">a</span> <span class="o">=></span> <span class="kt">Num</span> <span class="p">(</span><span class="kt">Measure</span> <span class="n">a</span><span class="p">)</span> <span class="kr">where</span>
  <span class="p">(</span><span class="o">+</span><span class="p">)</span> <span class="o">=</span> <span class="n">liftA2</span> <span class="p">(</span><span class="o">+</span><span class="p">)</span>
  <span class="p">(</span><span class="o">-</span><span class="p">)</span> <span class="o">=</span> <span class="n">liftA2</span> <span class="p">(</span><span class="o">-</span><span class="p">)</span>
  <span class="p">(</span><span class="o">*</span><span class="p">)</span> <span class="o">=</span> <span class="n">liftA2</span> <span class="p">(</span><span class="o">*</span><span class="p">)</span>
  <span class="n">abs</span> <span class="o">=</span> <span class="n">fmap</span> <span class="n">abs</span>
  <span class="n">signum</span> <span class="o">=</span> <span class="n">fmap</span> <span class="n">signum</span>
  <span class="n">fromInteger</span> <span class="o">=</span> <span class="n">pure</span> <span class="o">.</span> <span class="n">fromInteger</span>
</code></pre></div></div>
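<p>To see the instance doing convolution on something small, here’s a
self-contained sketch that adds two independent fair dice. The ‘Measure’
newtype, its ‘Functor’ and ‘Applicative’ instances, and the ‘uniform’ helper are
assumptions standing in for the definitions used in this post; only the ‘Num’
instance is copied from above:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import Control.Applicative (liftA2)

-- Assumed representation: a measure is an integration continuation.
newtype Measure a = Measure ((a -> Double) -> Double)

integrate :: (a -> Double) -> Measure a -> Double
integrate f (Measure nu) = nu f

instance Functor Measure where
  fmap g nu = Measure (\f -> integrate (f . g) nu)

instance Applicative Measure where
  -- Dirac measure at a point.
  pure x = Measure (\f -> f x)
  -- Product measure: integrate over the function, then over its argument.
  Measure h <*> Measure g = Measure (\f -> h (\k -> g (f . k)))

instance Num a => Num (Measure a) where
  (+)         = liftA2 (+)
  (-)         = liftA2 (-)
  (*)         = liftA2 (*)
  abs         = fmap abs
  signum      = fmap signum
  fromInteger = pure . fromInteger

-- Hypothetical helper: the uniform measure on a finite list.
uniform :: [a] -> Measure a
uniform xs = Measure (\f -> sum (map f xs) / fromIntegral (length xs))

die :: Measure Double
die = uniform [1 .. 6]

-- The convolution of two independent dice.
twoDice :: Measure Double
twoDice = die + die
</code></pre></div></div>
<p>Here ‘integrate id twoDice’ should be 7, the sum of the two individual
expectations. Note that ‘die + die’ is not the same measure as ‘2 * die’: the
former convolves two independent dice, while the latter just scales the support
of a single one.</p>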
<p>There are a few neat ways to demonstrate this kind of thing. Let’s use a
Gaussian measure here as a running example:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">gaussian</span> <span class="o">::</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Measure</span> <span class="kt">Double</span>
<span class="n">gaussian</span> <span class="n">m</span> <span class="n">s</span> <span class="o">=</span> <span class="n">fromDensityFunction</span> <span class="p">(</span><span class="n">density</span> <span class="n">m</span> <span class="n">s</span><span class="p">)</span> <span class="kr">where</span>
  <span class="n">density</span> <span class="n">m</span> <span class="n">s</span> <span class="n">x</span>
    <span class="o">|</span> <span class="n">s</span> <span class="o"><=</span> <span class="mi">0</span> <span class="o">=</span> <span class="mi">0</span>
    <span class="o">|</span> <span class="n">otherwise</span> <span class="o">=</span>
        <span class="mi">1</span> <span class="o">/</span> <span class="p">(</span><span class="n">s</span> <span class="o">*</span> <span class="n">sqrt</span> <span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="n">pi</span><span class="p">))</span> <span class="o">*</span> <span class="n">exp</span> <span class="p">(</span><span class="n">negate</span> <span class="p">((</span><span class="n">x</span> <span class="o">-</span> <span class="n">m</span><span class="p">)</span> <span class="o">^^</span> <span class="mi">2</span><span class="p">)</span> <span class="o">/</span> <span class="p">(</span><span class="mi">2</span> <span class="o">*</span> <span class="p">(</span><span class="n">s</span> <span class="o">^^</span> <span class="mi">2</span><span class="p">)))</span>
</code></pre></div></div>
<p>First, consider a <a href="https://en.wikipedia.org/wiki/Chi-squared_distribution">chi-squared</a> measure with \(k\) degrees of freedom.
We could create this directly using a density function, but instead we can
represent it by summing up independent squared Gaussian measures:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">chisq</span> <span class="o">::</span> <span class="kt">Int</span> <span class="o">-></span> <span class="kt">Measure</span> <span class="kt">Double</span>
<span class="n">chisq</span> <span class="n">k</span> <span class="o">=</span> <span class="n">sum</span> <span class="p">(</span><span class="n">replicate</span> <span class="n">k</span> <span class="n">normal</span><span class="p">)</span> <span class="kr">where</span>
  <span class="n">normal</span> <span class="o">=</span> <span class="n">fmap</span> <span class="p">(</span><span class="o">^</span> <span class="mi">2</span><span class="p">)</span> <span class="p">(</span><span class="n">gaussian</span> <span class="mi">0</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>To sanity check the result, we can compute the mean and variance of a
\(\chi^2(2)\) measure, which should be \(k\) and \(2k\) respectively for \(k =
2\):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> expectation (chisq 2)
2.0000000000000004
> variance (chisq 2)
4.0
</code></pre></div></div>
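<p>The analytic targets follow from the moments of a standard Gaussian \(Z\), for
which \(\mathbb{E}Z^2 = 1\) and \(\mathbb{E}Z^4 = 3\):</p>
\[\mathbb{E}(Z^2) = 1, \qquad \text{var}(Z^2) = \mathbb{E}Z^4 - (\mathbb{E}Z^2)^2 = 3 - 1 = 2\]
<p>Summing \(k\) independent copies adds both the means and the variances, giving
\(k\) and \(2k\).</p>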
<p>As a second example, consider a product of independent Gaussian measures. This
is a trickier distribution to deal with analytically (see <a href="https://math.stackexchange.com/questions/161757/what-is-the-distribution-of-a-random-variable-that-is-the-product-of-the-two-nor">here</a>), but we
can use some well-known identities for general independent measures in order to
verify our results. For any independent measures \(\mu\) and \(\nu\), we have:</p>
\[\mathbb{E}(\mu\nu) = \mathbb{E}\mu \mathbb{E}\nu\]
<p>and</p>
\[\text{var}(\mu\nu) = \text{var}(\mu)\text{var}(\nu) + \text{var}(\mu)(\mathbb{E}\nu)^2 + \text{var}(\nu)(\mathbb{E}\mu)^2\]
<p>for the expectation and variance of their product. So for a product of
independent Gaussians with (mean, standard deviation) parameters (1, 2) and
(2, 3) respectively, we expect to see 2 for its expectation and 61 for its
variance:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> expectation (gaussian 1 2 * gaussian 2 3)
2.0000000000000001
> variance (gaussian 1 2 * gaussian 2 3)
61.00000000000003
</code></pre></div></div>
<p>Woop!</p>
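<p>For the record, here is the arithmetic behind those two targets, reading the
second parameter of ‘gaussian’ as a standard deviation (so the variances are 4
and 9):</p>
\[\mathbb{E}(\mu\nu) = 1 \cdot 2 = 2, \qquad \text{var}(\mu\nu) = 4 \cdot 9 + 4 \cdot 2^2 + 9 \cdot 1^2 = 36 + 16 + 9 = 61\]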
<h2 id="wrapping-up">Wrapping Up</h2>
<p>And there you have it, a continuation-based implementation of the Giry monad.
You can find a bunch of code with similar functionality to this packaged up in
my old <a href="http://github.com/jtobin/measurable">measurable</a> library on GitHub if you’d like to play around with
the concepts.</p>
<p>That library has accumulated a few stars since I first pushed it up in 2013. I
think a lot of people are curious about these weird measure things, and this
framework at least gives you the ability to play around with a representation
for measures directly. I found it particularly useful for really grokking,
say, that integrating some function \(f\) against a probability measure \(\nu\)
is identical to integrating the identity function against the probability
measure \(\texttt{fmap} \, f \, \nu\). And there are a few similar concepts
there that I find really pop out when you play with measures directly, rather
than when one just works with them on paper.</p>
<p>But let me now tell you why the Giry monad <strong>sucks</strong> in practice.</p>
<p>Take a look at this integral expression, which is brought about due to a
monadic bind:</p>
\[(\nu \gg\!\!= \mu)(f)
= \int_{M} \left\{\lambda m . \int_{M} f d\mu(m) \right\} d \nu.\]
<p>For simplicity, let’s assume that \(M\) is discrete and has cardinality
\(|M|\). This means that the integral reduces to</p>
\[(\nu \gg\!\!= \mu)(f)
= \underbrace{\sum_{m \in M} d\nu(m) \underbrace{ \sum_{n \in M} f(n) d\mu(m)(n) }_{O(|M|)}}_{O(|M|)}\]
<p>for \(d\mu(m)\) and \(d\nu\) the appropriate Radon-Nikodym derivatives. You
can see that the total number of operations involved in the integral is
\(O(|M|^2)\), and indeed, for \(p\) monadic binds the cost of evaluating all
of the nested integrals is exponential in \(p\), on the order of \(|M|^p\).
It was no coincidence that I demonstrated a variance calculation
for a \(\chi^2(2)\) distribution instead of for a \(\chi^2(10)\).</p>
<p>This isn’t really much of a surprise - the cottage industry of approximating
integrals exists <em>because</em> integration is hard in practice, and integration is
surely best avoided whenever one can get away with doing so. Vikash
Mansinghka’s quote on this topic is fitting: “don’t calculate probabilities -
sample good guesses.” I’ll also add: relegate the measures to measure theory,
where they seem to belong.</p>
<p>The Giry monad is a lovely abstract construction for formalizing the monadic
structure of probability, and as canonical probabilistic objects, measures and
integrals are tremendously useful when working theoretically. But they’re a
complete non-starter when it comes to getting anything nontrivial done in
practice. For that, there are far more useful representations for probability
distributions in Haskell - notably, the sampling function or random variable
representation found in things like
<a href="https://github.com/jtobin/mwc-probability">mwc-probability</a>/<a href="https://hackage.haskell.org/package/mwc-random-monad">mwc-random-monad</a> and <a href="https://hackage.haskell.org/package/random-fu">random-fu</a>, or even
better, the structural representation based on free or operational monads like
I’ve <a href="/simple-probabilistic-programming">written about before</a>, or that you can find in something like
<a href="https://github.com/adscib/monad-bayes">monad-bayes</a>.</p>
<p>The intuitions gleaned from playing with the Giry monad carry over precisely to
other representations for the probability monad. In all cases, ‘return’ will
correspond, semantically, to constructing a Dirac distribution at a point,
while ‘bind’ will correspond to a marginalizing operator. The same is true for
the underlying (applicative) functor structure: ‘fmap’ always corresponds to a
density-preserving transformation of the support, while applicativeness
corresponds to independence (yielding convolution, etc.). And you have to
admit, the connection to continuations is pretty cool.</p>
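<p>To make the ‘return is Dirac, bind is marginalization’ reading concrete,
here’s a minimal sketch of the instances for a continuation-style measure. The
‘Measure’ newtype and ‘integrate’ are assumptions standing in for the
definitions used in this post, and ‘coin’ and ‘mixture’ are hypothetical
examples:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Assumed representation: a measure is an integration continuation.
newtype Measure a = Measure ((a -> Double) -> Double)

integrate :: (a -> Double) -> Measure a -> Double
integrate f (Measure nu) = nu f

-- 'fmap' transforms the support, leaving the mass where it was.
instance Functor Measure where
  fmap g nu = Measure (\f -> integrate (f . g) nu)

instance Applicative Measure where
  -- Dirac measure: all of the mass sits on the point x.
  pure x = Measure (\f -> f x)
  -- Independent combination: a double integral over the product.
  mf <*> mx = Measure (\f -> integrate (\g -> integrate (f . g) mx) mf)

instance Monad Measure where
  -- Marginalization: integrate out the intermediate value m.
  nu >>= k = Measure (\f -> integrate (\m -> integrate f (k m)) nu)

-- A fair coin.
coin :: Measure Bool
coin = Measure (\f -> 0.5 * f True + 0.5 * f False)

-- A two-point mixture: a Dirac measure at 1 or at 3, chosen by the coin.
mixture :: Measure Double
mixture = coin >>= \b -> if b then pure 1 else pure 3
</code></pre></div></div>
<p>Integrating the identity against ‘mixture’ marginalizes over the coin and
gives \(0.5 \cdot 1 + 0.5 \cdot 3 = 2\), while integrating any \(f\) against
‘pure x’ just evaluates \(f\) at \(x\).</p>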
<p>There is clearly some connection to the <a href="https://hackage.haskell.org/package/kan-extensions-5.0.1/docs/Control-Monad-Codensity.html">codensity monad</a> as well, but I
think I’ll let someone else figure out the specifics of that one. Something
something right-Kan extension..</p>
Foundations of the Giry Monad
2017-02-10T00:00:00+04:00
https://jtobin.io/giry-monad-foundations
<p>The Giry monad is the canonical probability monad that operates on the level of
measures, which are the abstract constructs that canonically represent
probability distributions. It’s sort of the baseline by which all other
probability monads can be judged.</p>
<p>In this article I’m going to go through the categorical and measure-theoretic
foundations of the Giry monad. In another article, I’ll describe how you can
implement it in a very faithful sense in Haskell.</p>
<p>I was putting some notes together for another project and wound up writing
things up in a somewhat blog-friendly style, but this isn’t intended to be a
tutorial <em>per se</em>. Really this isn’t the kind of content I’d usually post
here, but since I’ve jotted everything down, I figured I may as well. If you
like extremely dry mathematics and computer science, you’re in the right place.</p>
<p>I won’t define everything under the sun here - for properties or coherence
conditions or other things that I’ve elided details on, check out something
like Mac Lane or Aliprantis & Border. I’ll include some references at the end.</p>
<p>This is the game plan we’re working with:</p>
<ul>
<li>Define monads and their supporting machinery in a categorical sense.</li>
<li>Define probability measures and some required background around that.</li>
<li>Construct the functor that maps a measurable space to the collection of all
probability measures on that space.</li>
<li>Demonstrate that it’s a monad.</li>
</ul>
<p>Let’s get started.</p>
<h2 id="categorical-foundations">Categorical Foundations</h2>
<p>A <em>category</em> \(C\) is a collection of <em>objects</em> and <em>morphisms</em> between them.
So if \(W\), \(X\), \(Y\), and \(Z\) are objects in \(C\), then \(f : W \to X\),
\(g : X \to Y\), and \(h : Y \to Z\) are examples of morphisms. These
morphisms can be composed in the obvious associative way, i.e.</p>
\[h \circ (g \circ f) = (h \circ g) \circ f\]
<p>and there exist identity morphisms that simply map objects to themselves (each
one is a trivial example of an <em>automorphism</em>).</p>
<p>A <em>functor</em> is a mapping between categories (equivalently, it’s a morphism in
the category of so-called ‘small’ categories). The functor \(F : C \to D\)
takes every object in \(C\) to some object in \(D\), and every morphism in
\(C\) to some morphism in \(D\), such that the structure of morphism
composition is preserved. An <em>endofunctor</em> is a functor from a category to
itself, and a <em>bifunctor</em> is a functor from a pair of categories to another
category, i.e. \(F : A \times B \to C\).</p>
<p>A <em>natural transformation</em> is a mapping between functors. So for two functors
\(F, G : C \to D\), a natural transformation \(\epsilon : F \to G\) associates
to every object \(c\) in \(C\) a morphism \(\epsilon_c : F(c) \to G(c)\) in
\(D\).</p>
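<p>The condition that makes such a family of morphisms <em>natural</em> (this is the
standard definition, as found in Mac Lane) is that it commutes with the action
of the functors on morphisms. For every \(f : c \to c'\) in \(C\):</p>
\[G(f) \circ \epsilon_c = \epsilon_{c'} \circ F(f)\]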
<p>A <em>monoidal category</em> \(C\) is a category with some additional monoidal
structure, namely an identity object \(I\) and a bifunctor \(\otimes : C \times
C \to C\) called the <em>tensor product</em>, plus several natural isomorphisms that
provide the associativity of the tensor product and its right and left identity
with the identity object \(I\).</p>
<p>A <em>monoid</em> \((M, \mu, \eta)\) in a monoidal category \(C\) is an object \(M\)
in \(C\) together with two morphisms (obeying the standard associativity and
identity properties) that make use of the category’s monoidal structure: the
associative binary operator \(\mu : M \otimes M \to M\), and the identity
\(\eta : I \to M\).</p>
<p>A <em>monad</em> is (infamously) a ‘monoid in the category of endofunctors’. So take
the category of endofunctors \(\mathcal{F}\) on some fixed category, whose
objects are endofunctors and whose morphisms are natural transformations
between them. This is a monoidal category; there exists an identity endofunctor
\(1_\mathcal{F}\), plus a tensor product \(\otimes : \mathcal{F} \times
\mathcal{F} \to \mathcal{F}\) defined by functor composition, such that
\(1_\mathcal{F} \otimes F = F = F \otimes 1_\mathcal{F}\) for all \(F\) in
\(\mathcal{F}\) and the required associativity properties hold. Any specific
monoid \((F, \mu, \eta)\) we construct on \(\mathcal{F}\) is a specific monad.</p>
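<p>Unpacking that a little (this is the standard presentation rather than
anything specific to this article, writing \(\text{id}_F\) for the identity
natural transformation on \(F\)): the two morphisms are natural transformations
\(\mu : F \otimes F \to F\) and \(\eta : 1_\mathcal{F} \to F\), and the
associativity and identity properties read:</p>
\[\mu \circ F\mu = \mu \circ \mu F, \qquad \mu \circ F\eta = \mu \circ \eta F = \text{id}_F\]
<p>In Haskell terms, \(\mu\) plays the role of ‘join’ and \(\eta\) the role of
‘return’.</p>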
<h2 id="probabilistic-foundations">Probabilistic Foundations</h2>
<p>A <em>measurable space</em> \((X, \mathcal{X})\) is a set \(X\) equipped with a
topology-like structure called a \(\sigma\)-algebra \(\mathcal{X}\) that
essentially contains every well-behaved subset of \(X\) in some sense. A
<em>measure</em> \(\nu : \mathcal{X} \to \mathbb{R}\) is a particular kind of set
function from the \(\sigma\)-algebra to the nonnegative real line. A measure
just assigns a generalized notion of area or volume to well-behaved subsets of
\(X\). In particular, if the total possible area or volume of the underlying
set is 1 then we’re dealing with a <em>probability measure</em>. A measurable space
completed with a measure, e.g. \((X, \mathcal{X}, \nu)\) is called a <em>measure
space</em>, and a measurable space completed with a probability measure is called a
<em>probability space</em>.</p>
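<p>Concretely, the ‘particular kind’ of set function meant here is (in the
standard definition) one that assigns zero to the empty set and is countably
additive over pairwise disjoint measurable sets \(A_1, A_2, \ldots\):</p>
\[\nu(\emptyset) = 0, \qquad \nu\left(\bigcup_{i=1}^{\infty} A_i\right) = \sum_{i=1}^{\infty} \nu(A_i)\]
<p>with the extra condition \(\nu(X) = 1\) singling out the probability
measures.</p>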
<p>There is a lot of <a href="/on-measurability">overloaded lingo</a> around the word ‘measurable’. A
‘measurable set’ is an element of a \(\sigma\)-algebra in a measurable space.
A <em>measurable mapping</em> is a mapping between measurable spaces. Given a
‘source’ measurable space \((X, \mathcal{X})\) and ‘target’ measurable space
\((Y, \mathcal{Y})\), a measurable mapping \((X, \mathcal{X}) \to (Y,
\mathcal{Y})\) is a map \(T : X \to Y\) with the property that, for any
measurable set in the target, the inverse image is measurable in the source.
Or, formally, for any \(B\) in \(\mathcal{Y}\), you have that \(T^{-1}(B)\) is
in \(\mathcal{X}\).</p>
<h2 id="the-space-of-probability-measures-on-a-measurable-space">The Space of Probability Measures on a Measurable Space</h2>
<p>If you consider the collection of all measurable spaces and measurable mappings
between them, you get a category. Define \(\textbf{Meas}\) to be the category
of measurable spaces. So, objects are measurable spaces and morphisms are the
measurable mappings between them.</p>
<p>For any specific measurable space \(M\) in \(\textbf{Meas}\), we can consider
the space of all possible probability measures that could be placed on it and
denote that \(\mathcal{P}(M)\). To be clear, \(\mathcal{P}(M)\) is a <em>space of
measures</em> - that is, a space in which the points themselves are probability
measures.</p>
<p>What’s remarkable about \(\mathcal{P}(M)\) is that it is <em>itself</em> a measurable
space. Let me explain.</p>
<p>As a probability measure, any element of \(\mathcal{P}(M)\) is a function from
measurable subsets of \(M\) to the interval \([0, 1]\) in \(\mathbb{R}\). That
is: if \(M\) is the measurable space \((X, \mathcal{X})\), then a point \(\nu\)
in \(\mathcal{P}(M)\) is a function \(\mathcal{X} \to \mathbb{R}\). For any
measurable \(A\) in \(M\), there just naturally exists a sort of ‘evaluation’
mapping I’ll call \(\tau_A: \mathcal{P}(M) \to \mathbb{R}\) that takes a
measure on \(M\) and evaluates it on the set \(A\). To be explicit: if \(\nu\)
is a measure in \(\mathcal{P}(M)\), then \(\tau_A\) simply evaluates
\(\nu(A)\). It ‘runs’ the measure in a sense; in Haskell, \(\tau_A\) would be
analogous to a function like <code class="language-plaintext highlighter-rouge">\f -> f a</code> for some <code class="language-plaintext highlighter-rouge">a</code>.</p>
<p>This evaluation map \(\tau_A\) corresponds to an <em>integral</em>. If you have a
measurable space \((X, \mathcal{X})\), then for any set \(A\) in
\(\mathcal{X}\), \(\tau_A(\nu) = \nu(A) = \int_{X}\chi_A d\nu\) for \(\chi_A\)
the characteristic or indicator function of \(A\) (where \(\chi_A(x)\) is \(1\)
if \(x\) is in \(A\), and is \(0\) otherwise). And we can actually extend
\(\tau\) to operate over measurable mappings from \((X, \mathcal{X})\) to
\((\mathbb{R}, \mathcal{B}(\mathbb{R}))\), where \(\mathcal{B}(\mathbb{R})\) is
a suitable \(\sigma\)-algebra on \(\mathbb{R}\). Here we typically use what’s
called the <em>Borel</em> \(\sigma\)-algebra, which takes a topology on the set and
then generates a \(\sigma\)-algebra from the open sets in the topology (for
\(\mathbb{R}\) we can just use the ‘usual’ topology generated by the Euclidean
metric). For \(f : X \to \mathbb{R}\) a measurable function, we can define the
evaluation mapping \(\tau_f : \mathcal{P}(M) \to \mathbb{R}\) as \(\tau_f(\nu)
= \int_X f d\nu\).</p>
<p>We can abuse notation here a bit and just use \(\tau\) to refer to ‘duck typed’
mappings that evaluate measures over measurable sets or measurable functions
depending on context. If we treat \(\tau_A(\nu)\) as a function
\(\tau(\nu)(A)\), then \(\tau(\nu)\) has type \(\mathcal{X} \to \mathbb{R}\).
If we treat \(\tau_f(\nu)\) as a function \(\tau(\nu)(f)\), then \(\tau(\nu)\)
has type \((X \to \mathbb{R}) \to \mathbb{R}\). I’ll say \(\tau_{\{A, f\}}\)
to refer to the mappings that accept either measurable sets or functions.</p>
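<p>The type \((X \to \mathbb{R}) \to \mathbb{R}\) suggests a concrete representation. Here’s a minimal sketch, assuming we identify a probability measure with its integration operator and let <code class="language-plaintext highlighter-rouge">Double</code> stand in for \(\mathbb{R}\). The names <code class="language-plaintext highlighter-rouge">Measure</code>, <code class="language-plaintext highlighter-rouge">fromWeights</code>, <code class="language-plaintext highlighter-rouge">probability</code>, and <code class="language-plaintext highlighter-rouge">die</code> are hypothetical, chosen only for illustration:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- A probability measure, represented by its integration operator.
-- 'integrate nu f' plays the role of tau_f(nu), the integral of f against nu.
newtype Measure a = Measure { integrate :: (a -> Double) -> Double }

-- Hypothetical helper: a finitely supported measure built from weighted
-- points. The weights are assumed to be nonnegative and to sum to one.
fromWeights :: [(a, Double)] -> Measure a
fromWeights ws = Measure (\f -> sum (map (\(x, w) -> w * f x) ws))

-- tau_A(nu): evaluate a measure on a set, given as a membership predicate,
-- by integrating the indicator function of that set.
probability :: Measure a -> (a -> Bool) -> Double
probability nu inA = integrate nu (\x -> if inA x then 1 else 0)

-- A fair six-sided die as an example measure.
die :: Measure Int
die = fromWeights (map (\k -> (k, 1 / 6)) [1 .. 6])
</code></pre></div></div>
<p>With this in hand, something like ‘probability die even’ evaluates the measure on a set, while ‘integrate die fromIntegral’ evaluates it on a function, which is the ‘duck typing’ described above.</p>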
<p>In any case. For a measurable space \(M\), there exists a topology on
\(\mathcal{P}(M)\) called the <em>weak-* topology</em> that makes all the evaluation
mappings \(\tau_{\{A, f\}}\) continuous for any measurable set \(A\) or
measurable function \(f\). From there, we can generate the Borel
\(\sigma\)-algebra \(\mathcal{B}(\mathcal{P}(M))\) that makes the evaluation
functions \(\tau_{\{A, f\}}\) measurable. The result is that
\((\mathcal{P}(M), \mathcal{B}(\mathcal{P}(M)))\) is itself a measurable space,
and thus an object in \(\textbf{Meas}\).</p>
<p>The space \(\mathcal{P}(M)\) actually has all sorts of insane properties that
one wouldn’t expect - there are implications on convexity, completeness,
compactness and such that carry over from \(M\). But I digress.</p>
<h2 id="mathcalp-is-a-functor">\(\mathcal{P}\) is a Functor</h2>
<p>So: for any \(M\) an object in \(\textbf{Meas}\), we have that
\(\mathcal{P}(M)\) is also an object in \(\textbf{Meas}\). And if you look at
\(\mathcal{P}\) like a functor, you notice that it takes objects of
\(\textbf{Meas}\) to objects of \(\textbf{Meas}\). Indeed, you can define an
analogous procedure on morphisms in \(\textbf{Meas}\) as follows. Take \(N\)
to be another object (read: measurable space) in \(\textbf{Meas}\) and \(T : M
\to N\) to be a morphism (read: measurable mapping) between them. Now, for any
measure \(\nu\) in \(\mathcal{P}(M)\) we can define \(\mathcal{P}(T)(\nu) = \nu
\circ T^{-1}\) (this is called the image, distribution, or pushforward of
\(\nu\) under \(T\)). For some \(T\) and \(\nu\), \(\mathcal{P}(T)(\nu)\)
thus takes measurable sets in \(N\) to a value in the interval \([0, 1]\) -
that is, it is a probability measure on \(N\), and so a point in \(\mathcal{P}(N)\). So we have that:</p>
\[\mathcal{P}(T) : \mathcal{P}(M) \to \mathcal{P}(N)\]
<p>and so \(\mathcal{P}\) is an endofunctor on \(\textbf{Meas}\).</p>
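<p>As a rough illustration of the pushforward, here’s a sketch that assumes a probability measure is represented by its integration operator (the <code class="language-plaintext highlighter-rouge">Measure</code> newtype below is an assumption of mine, as are the example names <code class="language-plaintext highlighter-rouge">coin</code> and <code class="language-plaintext highlighter-rouge">payout</code>). Integrating \(f\) against \(\nu \circ T^{-1}\) is the same as integrating \(f \circ T\) against \(\nu\), and that is all ‘fmap’ needs to do:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Assumed representation: a measure is its integration operator.
newtype Measure a = Measure { integrate :: (a -> Double) -> Double }

-- P(T): push a measure on M forward along T : M -> N by precomposing
-- the integrand with T.
instance Functor Measure where
  fmap t nu = Measure (\f -> integrate nu (f . t))

-- Hypothetical example measure: a fair coin on Bool.
coin :: Measure Bool
coin = Measure (\f -> 0.5 * f True + 0.5 * f False)

-- The pushforward of the coin along a map into the integers.
payout :: Measure Int
payout = fmap (\heads -> if heads then 10 else 0) coin
</code></pre></div></div>
<p>Here ‘payout’ is a measure on the integers even though we only ever wrote down a measure on booleans.</p>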
<h2 id="mathcalp-is-a-monad">\(\mathcal{P}\) is a Monad</h2>
<p>See where we’re going here? If we can define natural transformations \(\mu\)
and \(\eta\) such that \((\mathcal{P}, \mu, \eta)\) is a monoid in the category
of endofunctors, we’ll have defined a monad. We thus need to come up with a
suitable monoidal structure, et voilà.</p>
<p>First the identity. We want a natural transformation \(\eta\) between the
identity functor \(1_{\mathcal{F}}\) and the functor \(\mathcal{P}\) such
that \(\eta_M : 1_{\mathcal{F}}(M) \to \mathcal{P}(M)\) for any measurable
space \(M\) in \(\textbf{Meas}\). Evaluating the identity functor simplifies
things to \(\eta_M : M \to \mathcal{P}(M)\).</p>
<p>We can define this concretely as follows. Grab a measurable space \(M\) in
\(\textbf{Meas}\) and define \(\eta(x)(A) = \chi_A(x)\) for any point \(x \in
M\) and any measurable set \(A \subseteq M\). \(\eta(x)\) is thus a
probability measure on \(M\) - we assign \(1\) to measurable sets that contain
\(x\), and 0 to those that don’t. If we peel away another argument, we have
that \(\eta : M \to \mathcal{P}(M)\), as required.</p>
<p>So \(\eta\) takes points in measurable spaces to probability measures on those
spaces. In technical parlance, it takes a point \(x\) to the <em>Dirac
measure</em> at \(x\) - the probability measure that places the entirety of its
mass at \(x\).</p>
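<p>In code, \(\eta\) is about as simple as it gets. This sketch again assumes the hypothetical integration-operator representation of a measure; <code class="language-plaintext highlighter-rouge">dirac</code> and <code class="language-plaintext highlighter-rouge">diracOn</code> are illustrative names:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Assumed representation: a measure is its integration operator.
newtype Measure a = Measure { integrate :: (a -> Double) -> Double }

-- eta: the Dirac measure at x. Integrating f against it evaluates f at x.
dirac :: a -> Measure a
dirac x = Measure (\f -> f x)

-- eta(x)(A) = chi_A(x): mass one if x lies in A, zero otherwise.
-- The set A is given as a membership predicate.
diracOn :: a -> (a -> Bool) -> Double
diracOn x inA = integrate (dirac x) (\y -> if inA y then 1 else 0)
</code></pre></div></div>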
<p>Now for the other part of the monoidal structure, \(\mu\). I initially found
this next part to be a bit of a trip, but let me see what I can do about that.</p>
<p>Recall that the category of endofunctors, \(\mathcal{F}\), is monoidal, so
there exists a tensor product \(\otimes : \mathcal{F} \times \mathcal{F} \to
\mathcal{F}\) that we can deal with, which here just corresponds to functor
composition. We’re looking for a natural transformation:</p>
\[\mu : \mathcal{P} \circ \mathcal{P} \to \mathcal{P}\]
<p>which is often written as:</p>
\[\mu : \mathcal{P}^2 \to \mathcal{P}.\]
<p>Take \(M = (X, \mathcal{X})\) a measurable space in \(\textbf{Meas}\) and then
consider the space of probability measures over it, \(\mathcal{P}(M)\). Then
take the space of probability measures <em>over the space of probability measures</em>
on \(M\), \(\mathcal{P}(\mathcal{P}(M))\). Since \(\mathcal{P}\) is an
endofunctor, this is again a measurable space, and for any measurable subset
\(A\) of \(M\) we still have the family of mappings \(\tau_A\) that take a
probability measure in \(\mathcal{P}(M)\) and evaluate it on
\(A\). We want \(\mu\) to be the thing that turns a measure over measures
\(\rho\) into a plain old probability measure on \(M\), i.e. a point in \(\mathcal{P}(M)\).</p>
<p>In the context of probability theory, this kind of semigroup action is a
<em>marginalizing</em> operator. We’re taking the ‘uncertainty’ captured in
\(\mathcal{P}(\mathcal{P}(M))\) via the probability measure \(\rho\) and
smearing it into the probability measures in \(\mathcal{P}(M)\).</p>
<p>Take \(\rho\) in \(\mathcal{P}(\mathcal{P}(M))\) and some \(A\) a measurable
subset of \(M\). We can define \(\mu\) as follows:</p>
\[\mu(\rho)(A) = \int_{\mathcal{P}(M)} \tau_A d\rho.\]
<p>Using some lambda calculus notation to see the argument for \(\tau_A\), we can
expand the integrals to get the following gnarly expression:</p>
\[\mu(\rho)(A) = \int_{\mathcal{P}(M)} \left\{\lambda \nu . \int_M \chi_A d \nu \right\} d \rho.\]
<p>Notice what’s happening here. For \(M\) a measurable space, we’re integrating
over \(\mathcal{P}(M)\) the space of probability measures on \(M\), with
respect to the probability measure \(\rho\), which itself is a point in the
space of probability measures over probability measures on \(M\),
\(\mathcal{P}(\mathcal{P}(M))\). Whew.</p>
<p>The spaces we’re integrating over here are unusual, but \(\rho\) is still a
probability measure, so when applied to a measurable set in
\(\mathcal{B}(\mathcal{P}(M))\) it results in a probability in \([0, 1]\).
So, peeling back an argument, we have that \(\mu(\rho)\) has type \(\mathcal{X}
\to \mathbb{R}\). In other words, it’s a probability measure on \(M\), and
thus is in \(\mathcal{P}(M)\). And if we peel back <em>another</em> argument, we find
that:</p>
\[\mu_M : \mathcal{P}(\mathcal{P}(M)) \to \mathcal{P}(M)\]
<p>so, as required, that</p>
\[\mu : \mathcal{P}^{2} \to \mathcal{P}.\]
<p>It’s also worth noting that we can overload the notation for \(\mu\) in the
same way we did for \(\tau\), i.e. to supply measurable functions in addition
to measurable sets:</p>
\[\mu(\rho)(f) = \int_{\mathcal{P}(M)} \left\{\lambda \nu . \int_M f d \nu \right\} d \rho.\]
<p>Combining the three components, we get \((\mathcal{P}, \mu, \eta)\), the
canonical Giry monad.</p>
<p>In Haskell, when we’re dealing with monads we typically use the bind operator
\(\gg\!\!=\) instead of manually dealing with the functorial structure and
\(\mu\) (called ‘join’). Bind has the type:</p>
\[\gg\!\!= : \mathcal{P}(M) \to (M \to \mathcal{P}(N)) \to \mathcal{P}(N)\]
<p>and for illustration, we can define \(\gg\!\!=\) for the Giry monad like so:</p>
\[(\rho \gg\!\!= g)(f) = \int_{M} \left\{ \lambda m . \int_N f d g(m) \right\} d\rho.\]
<p>Here \(\rho\) is in \(\mathcal{P}(M)\), \(g\) is in \(M \to \mathcal{P}(N)\),
and \(f\) is in \(N \to \mathbb{R}\), so note that we potentially simplify the
outermost integral enormously. It now operates over a <em>general</em> measurable
space, rather than a space of measures in particular, and this will come in
handy when we get to implementation details in the next post.</p>
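<p>To make the formulas for \(\gg\!\!=\) and \(\mu\) a little more tangible, here’s a minimal sketch under the assumption that a measure is represented by its integration operator. The <code class="language-plaintext highlighter-rouge">Measure</code> newtype and the names <code class="language-plaintext highlighter-rouge">bind</code>, <code class="language-plaintext highlighter-rouge">join'</code>, <code class="language-plaintext highlighter-rouge">biased</code>, and <code class="language-plaintext highlighter-rouge">mixture</code> are hypothetical, and this is not necessarily how a real implementation should look:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Assumed representation: a measure is its integration operator.
newtype Measure a = Measure { integrate :: (a -> Double) -> Double }

-- (rho >>= g)(f): integrate, over m drawn from rho, the integral of f
-- against the measure g(m).
bind :: Measure a -> (a -> Measure b) -> Measure b
bind rho g = Measure (\f -> integrate rho (\m -> integrate (g m) f))

-- mu ('join'): marginalize a measure over measures by binding with 'id'.
join' :: Measure (Measure a) -> Measure a
join' rho = bind rho id

-- Hypothetical example: a coin that lands True with probability p.
biased :: Double -> Measure Bool
biased p = Measure (\f -> p * f True + (1 - p) * f False)

-- A measure over measures: pick one of two biased coins with equal
-- probability, then marginalize the choice away.
mixture :: Measure Bool
mixture = join' (Measure (\f -> 0.5 * f (biased 0.25) + 0.5 * f (biased 0.75)))
</code></pre></div></div>
<p>The outer integral in ‘bind’ runs over whatever space \(\rho\) lives on, while ‘join'’ is the special case in which that space is itself a space of measures.</p>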
<h2 id="wrapping-up">Wrapping Up</h2>
<p>That’s about it for now. It’s worth noting as a kind of footnote here that the
existence of the Giry monad also obviously implies the existence of a Giry
applicative functor. But the official situation for applicative functors seems
kind of weird in this context, and I’m not yet up to the task of dealing with
it formally.</p>
<p>Intuitively, one should be able to define the binary applicative operator
characteristic of its lax monoidal structure as follows:</p>
\[(\rho \, \langle \ast \rangle \, \nu)(f) = \int_{M \to N} \left\{\lambda T . \int_{M} (f \circ T) d\nu \right\} d \rho.\]
<p>But this has some really weird measure-theoretic implications - namely, that it
assumes the existence of a space of probability measures over the space of
all measurable functions \(M \to N\), which is <a href="https://mathoverflow.net/questions/1388/is-there-a-natural-measures-on-the-space-of-measurable-functions">not trivial to define</a>
and indeed may not even exist. It seems like some people are looking into this
problem as I just happened to stumble on <a href="https://arxiv.org/pdf/1701.02547.pdf">this paper</a> on the arXiv while
doing some googling. I notice that some people on e.g. nLab require categories
with additional structure beyond \(\textbf{Meas}\) for the development of the
Giry monad as well, for example the category of Polish (separable, completely
metrizable) spaces \(\textbf{Pol}\), so maybe the extra structure there takes
care of the quirks.</p>
<p>Anyway. Applicatives are neat here because applicative probability measures
<a href="/encoding-independence-statically">are independent</a> probability measures. And the existence of
applicativeness means you can do all the things with independent probability
measures that you might be used to. Measure convolution and friends are good
examples. Given a measurable space \(M\) that supports some notion of addition
and two probability measures \(\nu\) and \(\zeta\) in \(\mathcal{P}(M)\), we
can add measures together via:</p>
\[(\nu + \zeta)(f) = \int_{M}\int_{M}f(x + y)d\nu(x)d\zeta(y)\]
<p>where \(x\) and \(y\) are both points in \(M\). Subtraction and multiplication
translate trivially as well.</p>
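<p>The convolution formula can be sketched directly, again assuming the hypothetical representation of a measure as an integration operator. The names <code class="language-plaintext highlighter-rouge">convolve</code>, <code class="language-plaintext highlighter-rouge">die</code>, and <code class="language-plaintext highlighter-rouge">twoDice</code> are mine, and ‘Num’ stands in for ‘some notion of addition’:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Assumed representation: a measure is its integration operator.
newtype Measure a = Measure { integrate :: (a -> Double) -> Double }

-- (nu + zeta)(f): integrate f (x + y), with x drawn from nu and y drawn
-- independently from zeta.
convolve :: Num a => Measure a -> Measure a -> Measure a
convolve nu zeta =
  Measure (\f -> integrate zeta (\y -> integrate nu (\x -> f (x + y))))

-- Hypothetical example: a fair six-sided die.
die :: Measure Int
die = Measure (\f -> sum (map (\k -> f k / 6) [1 .. 6]))

-- The distribution of the sum of two independent dice.
twoDice :: Measure Int
twoDice = convolve die die
</code></pre></div></div>
<p>Swapping ‘x + y’ for ‘x - y’ or ‘x * y’ gives the analogous subtraction and multiplication.</p>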
<p>In another article I’ll detail how the Giry monad can be implemented in Haskell
and point out some neat extensions. There are some cool connections to
continuations and codensity monads, and seemingly de Finetti’s theorem and
exchangeability. That kind of thing. It’d also be worth trying to justify
independence of probability measures from a categorical perspective, which
seems easier than resolving the nitty-gritty measurability qualms I mentioned
above.</p>
<p>‘Til then! Thanks to <a href="https://www.jforbes.io">Jason Forbes</a> for sifting through this stuff and
providing some great comments.</p>
<h2 id="references">References:</h2>
<ul>
<li><a href="https://ncatlab.org/nlab/show/Giry+monad">The Giry Monad</a></li>
<li><a href="https://golem.ph.utexas.edu/category/2014/10/where_do_probability_measures.html">Where Do Probability Measures Come From?</a></li>
<li><a href="https://www.amazon.com/Infinite-Dimensional-Analysis-Hitchhikers-Guide/dp/3540326960">Infinite Dimensional Analysis</a> (esp. chapter 15)</li>
<li><a href="https://www.amazon.com/Categories-Working-Mathematician-Graduate-Mathematics/dp/0387984038">Categories for the Working Mathematician</a></li>
<li><a href="https://terrytao.wordpress.com/2013/07/26/computing-convolutions-of-measures/">Computing Convolutions of Measures</a></li>
<li><a href="https://www.cs.tufts.edu/~nr/pubs/pmonad.pdf">Stochastic Lambda Calculus and Monads of Probability Distributions</a></li>
</ul>
Rotating Squares2017-01-04T00:00:00+04:00https://jtobin.io/rotating-squares<p>Here’s a short one.</p>
<p>I use Colin Percival’s <a href="http://www.daemonology.net/hn-daily/">Hacker News Daily</a> to catch the top ten articles
of the day on Hacker News. Today an article called <a href="http://raganwald.com/2016/12/27/recursive-data-structures.html">Why Recursive Data
Structures?</a> popped up, which illustrates that recursive algorithms can
become both intuitive and borderline trivial when a suitable data structure is
used to implement them. This is exactly the motivation for using recursion
schemes.</p>
<p>In the above article, Reginald rotates squares by representing them via a
<a href="https://en.wikipedia.org/wiki/Quadtree">quadtree</a>. If we have a square of bits, something like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.x..
..x.
xxx.
....
</code></pre></div></div>
<p>then we want to be able to easily rotate it 90 degrees clockwise, for example.
So let’s define a quadtree in Haskell:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">{-# LANGUAGE DeriveFunctor #-}</span>
<span class="cp">{-# LANGUAGE LambdaCase #-}</span>
<span class="kr">import</span> <span class="nn">Data.Functor.Foldable</span>
<span class="kr">import</span> <span class="nn">Data.List.Split</span>
<span class="kr">data</span> <span class="kt">QuadTreeF</span> <span class="n">a</span> <span class="n">r</span> <span class="o">=</span>
<span class="kt">NodeF</span> <span class="n">r</span> <span class="n">r</span> <span class="n">r</span> <span class="n">r</span>
<span class="o">|</span> <span class="kt">LeafF</span> <span class="n">a</span>
<span class="o">|</span> <span class="kt">EmptyF</span>
<span class="kr">deriving</span> <span class="p">(</span><span class="kt">Show</span><span class="p">,</span> <span class="kt">Functor</span><span class="p">)</span>
<span class="kr">type</span> <span class="kt">QuadTree</span> <span class="n">a</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="p">(</span><span class="kt">QuadTreeF</span> <span class="n">a</span><span class="p">)</span>
</code></pre></div></div>
<p>The four fields of the ‘NodeF’ constructor correspond to the upper left, upper
right, lower right, and lower left quadrants of the tree respectively.</p>
<p>Gimme some embedded language terms:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">node</span> <span class="o">::</span> <span class="kt">QuadTree</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">QuadTree</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">QuadTree</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">QuadTree</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">QuadTree</span> <span class="n">a</span>
<span class="n">node</span> <span class="n">ul</span> <span class="n">ur</span> <span class="n">lr</span> <span class="n">ll</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="p">(</span><span class="kt">NodeF</span> <span class="n">ul</span> <span class="n">ur</span> <span class="n">lr</span> <span class="n">ll</span><span class="p">)</span>
<span class="n">leaf</span> <span class="o">::</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">QuadTree</span> <span class="n">a</span>
<span class="n">leaf</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="o">.</span> <span class="kt">LeafF</span>
<span class="n">empty</span> <span class="o">::</span> <span class="kt">QuadTree</span> <span class="n">a</span>
<span class="n">empty</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="kt">EmptyF</span>
</code></pre></div></div>
<p>That lets us define quadtrees easily. Here’s the tree that the previous
diagram corresponds to:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">tree</span> <span class="o">::</span> <span class="kt">QuadTree</span> <span class="kt">Bool</span>
<span class="n">tree</span> <span class="o">=</span> <span class="n">node</span> <span class="n">ul</span> <span class="n">ur</span> <span class="n">lr</span> <span class="n">ll</span> <span class="kr">where</span>
<span class="n">ul</span> <span class="o">=</span> <span class="n">node</span> <span class="p">(</span><span class="n">leaf</span> <span class="kt">False</span><span class="p">)</span> <span class="p">(</span><span class="n">leaf</span> <span class="kt">True</span><span class="p">)</span> <span class="p">(</span><span class="n">leaf</span> <span class="kt">False</span><span class="p">)</span> <span class="p">(</span><span class="n">leaf</span> <span class="kt">False</span><span class="p">)</span>
<span class="n">ur</span> <span class="o">=</span> <span class="n">node</span> <span class="p">(</span><span class="n">leaf</span> <span class="kt">False</span><span class="p">)</span> <span class="p">(</span><span class="n">leaf</span> <span class="kt">False</span><span class="p">)</span> <span class="p">(</span><span class="n">leaf</span> <span class="kt">False</span><span class="p">)</span> <span class="p">(</span><span class="n">leaf</span> <span class="kt">True</span><span class="p">)</span>
<span class="n">lr</span> <span class="o">=</span> <span class="n">node</span> <span class="p">(</span><span class="n">leaf</span> <span class="kt">True</span><span class="p">)</span> <span class="p">(</span><span class="n">leaf</span> <span class="kt">False</span><span class="p">)</span> <span class="p">(</span><span class="n">leaf</span> <span class="kt">False</span><span class="p">)</span> <span class="p">(</span><span class="n">leaf</span> <span class="kt">False</span><span class="p">)</span>
<span class="n">ll</span> <span class="o">=</span> <span class="n">node</span> <span class="p">(</span><span class="n">leaf</span> <span class="kt">True</span><span class="p">)</span> <span class="p">(</span><span class="n">leaf</span> <span class="kt">True</span><span class="p">)</span> <span class="p">(</span><span class="n">leaf</span> <span class="kt">False</span><span class="p">)</span> <span class="p">(</span><span class="n">leaf</span> <span class="kt">False</span><span class="p">)</span>
</code></pre></div></div>
<p>Rotating is then really easy - we rotate each quadrant recursively. Just reach
for a catamorphism:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rotate</span> <span class="o">::</span> <span class="kt">QuadTree</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">QuadTree</span> <span class="n">a</span>
<span class="n">rotate</span> <span class="o">=</span> <span class="n">cata</span> <span class="o">$</span> <span class="nf">\</span><span class="kr">case</span>
<span class="kt">NodeF</span> <span class="n">ul</span> <span class="n">ur</span> <span class="n">lr</span> <span class="n">ll</span> <span class="o">-></span> <span class="n">node</span> <span class="n">ll</span> <span class="n">ul</span> <span class="n">ur</span> <span class="n">lr</span>
<span class="kt">LeafF</span> <span class="n">a</span> <span class="o">-></span> <span class="n">leaf</span> <span class="n">a</span>
<span class="kt">EmptyF</span> <span class="o">-></span> <span class="n">empty</span>
</code></pre></div></div>
<p>Notice that you just have to shift each field of ‘NodeF’ rightward, with
wraparound. Then if you rotate and render the original tree you get:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>.x..
.x.x
.xx.
....
</code></pre></div></div>
<p>Rotating things more times yields predictable results.</p>
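<p>The rendering step isn’t shown above, so here’s one way it might look. This is a sketch that depends only on ‘base’: <code class="language-plaintext highlighter-rouge">Fix</code> and <code class="language-plaintext highlighter-rouge">cata</code> are hand-rolled stand-ins for the recursion-schemes versions, and <code class="language-plaintext highlighter-rouge">render</code> is a hypothetical helper of my own. It assumes every quadrant of a node has the same dimensions:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}

-- Hand-rolled stand-ins for the recursion-schemes machinery.
newtype Fix f = Fix (f (Fix f))

cata :: Functor f => (f a -> a) -> Fix f -> a
cata alg (Fix x) = alg (fmap (cata alg) x)

data QuadTreeF a r = NodeF r r r r | LeafF a | EmptyF
  deriving Functor

type QuadTree a = Fix (QuadTreeF a)

node :: QuadTree a -> QuadTree a -> QuadTree a -> QuadTree a -> QuadTree a
node ul ur lr ll = Fix (NodeF ul ur lr ll)

leaf :: a -> QuadTree a
leaf = Fix . LeafF

-- Same rotation as above: shift each quadrant one position clockwise.
rotate :: QuadTree a -> QuadTree a
rotate = cata alg where
  alg (NodeF ul ur lr ll) = node ll ul ur lr
  alg (LeafF a)           = leaf a
  alg EmptyF              = Fix EmptyF

-- Render a tree of bits as rows of text. The upper quadrants are glued
-- side by side, likewise the lower ones, and the two halves are stacked.
render :: QuadTree Bool -> [String]
render = cata alg where
  alg (NodeF ul ur lr ll) = zipWith (++) ul ur ++ zipWith (++) ll lr
  alg (LeafF b)           = [if b then "x" else "."]
  alg EmptyF              = []

-- The example tree from above.
tree :: QuadTree Bool
tree = node ul ur lr ll where
  ul = node (leaf False) (leaf True)  (leaf False) (leaf False)
  ur = node (leaf False) (leaf False) (leaf False) (leaf True)
  lr = node (leaf True)  (leaf False) (leaf False) (leaf False)
  ll = node (leaf True)  (leaf True)  (leaf False) (leaf False)
</code></pre></div></div>
<p>Printing ‘render tree’ line by line gives back the first diagram, and ‘render (rotate tree)’ gives the rotated one.</p>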
<p>If you want to rotate another structure - say, a flat list - you can go through
a quadtree as an intermediate representation using the same pattern I described
in <a href="/sorting-with-style">Sorting with Style</a>. Build yourself a coalgebra and algebra pair:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">builder</span> <span class="o">::</span> <span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">-></span> <span class="kt">QuadTreeF</span> <span class="n">a</span> <span class="p">[</span><span class="n">a</span><span class="p">]</span>
<span class="n">builder</span> <span class="o">=</span> <span class="nf">\</span><span class="kr">case</span>
<span class="kt">[]</span> <span class="o">-></span> <span class="kt">EmptyF</span>
<span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">-></span> <span class="kt">LeafF</span> <span class="n">x</span>
<span class="n">xs</span> <span class="o">-></span> <span class="kt">NodeF</span> <span class="n">a</span> <span class="n">b</span> <span class="n">c</span> <span class="n">d</span> <span class="kr">where</span>
<span class="p">[</span><span class="n">a</span><span class="p">,</span> <span class="n">b</span><span class="p">,</span> <span class="n">c</span><span class="p">,</span> <span class="n">d</span><span class="p">]</span> <span class="o">=</span> <span class="n">chunksOf</span> <span class="p">(</span><span class="n">length</span> <span class="n">xs</span> <span class="p">`</span><span class="n">div</span><span class="p">`</span> <span class="mi">4</span><span class="p">)</span> <span class="n">xs</span>
<span class="n">consumer</span> <span class="o">::</span> <span class="kt">QuadTreeF</span> <span class="n">a</span> <span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">-></span> <span class="p">[</span><span class="n">a</span><span class="p">]</span>
<span class="n">consumer</span> <span class="o">=</span> <span class="nf">\</span><span class="kr">case</span>
<span class="kt">EmptyF</span> <span class="o">-></span> <span class="kt">[]</span>
<span class="kt">LeafF</span> <span class="n">a</span> <span class="o">-></span> <span class="p">[</span><span class="n">a</span><span class="p">]</span>
<span class="kt">NodeF</span> <span class="n">ul</span> <span class="n">ur</span> <span class="n">lr</span> <span class="n">ll</span> <span class="o">-></span> <span class="n">concat</span> <span class="p">[</span><span class="n">ll</span><span class="p">,</span> <span class="n">ul</span><span class="p">,</span> <span class="n">ur</span><span class="p">,</span> <span class="n">lr</span><span class="p">]</span>
</code></pre></div></div>
<p>and then glue them together with a hylomorphism:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>rotateList :: [a] -> [a]
rotateList = hylo consumer builder
</code></pre></div></div>
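<p>Note that ‘builder’ only makes sense when the list length is a power of four, since otherwise the list can’t be split into four equal chunks all the way down. Here’s a self-contained sketch of the same pipeline that depends only on ‘base’ and fails loudly in that case. The hand-rolled <code class="language-plaintext highlighter-rouge">hylo</code> and the use of <code class="language-plaintext highlighter-rouge">splitAt</code> in place of <code class="language-plaintext highlighter-rouge">chunksOf</code> are my own substitutions:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}

data QuadTreeF a r = NodeF r r r r | LeafF a | EmptyF
  deriving Functor

-- A hand-rolled hylomorphism: unfold one layer with the coalgebra, recurse
-- into the children, then fold the layer with the algebra.
hylo :: Functor f => (f b -> b) -> (a -> f a) -> a -> b
hylo alg coalg = alg . fmap (hylo alg coalg) . coalg

-- Split a list into four equal quarters, raising an error when the
-- length is not divisible by four.
builder :: [a] -> QuadTreeF a [a]
builder []  = EmptyF
builder [x] = LeafF x
builder xs
  | r /= 0    = error "builder: length is not a power of four"
  | otherwise = NodeF a b c d
  where
    (q, r)  = length xs `divMod` 4
    (a, r0) = splitAt q xs
    (b, r1) = splitAt q r0
    (c, d)  = splitAt q r1

-- Reassemble the quarters, shifted one position with wraparound.
consumer :: QuadTreeF a [a] -> [a]
consumer EmptyF              = []
consumer (LeafF a)           = [a]
consumer (NodeF ul ur lr ll) = concat [ll, ul, ur, lr]

rotateList :: [a] -> [a]
rotateList = hylo consumer builder
</code></pre></div></div>
<p>For example, ‘rotateList [1, 2, 3, 4]’ yields ‘[4, 1, 2, 3]’. No intermediate quadtree is ever fully built, because each layer is consumed as soon as its children have been processed.</p>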
<p>Neato.</p>
<p>For a recent recursion scheme resource I’ve spotted on the Twitters, check out
Pascal Hartig’s <a href="https://github.com/passy/awesome-recursion-schemes">compendium in progress</a>.</p>
Promorphisms, Pre and Post2016-11-26T00:00:00+04:00https://jtobin.io/promorphisms-pre-post<p>To the.. uh, ‘layperson’, pre- and postpromorphisms are probably well into the
WTF category of recursion schemes. This is a mistake - they’re simple and
useful, and I’m going to try and convince you of this in short order.</p>
<p>Preliminaries:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">{-# LANGUAGE DeriveFunctor #-}</span>
<span class="cp">{-# LANGUAGE LambdaCase #-}</span>
<span class="kr">import</span> <span class="nn">Data.Functor.Foldable</span>
<span class="kr">import</span> <span class="nn">Prelude</span> <span class="k">hiding</span> <span class="p">(</span><span class="nf">sum</span><span class="p">)</span>
</code></pre></div></div>
<p>For simplicity, let’s take a couple of standard interpreters on lists. We’ll
define ‘sumAlg’ as an interpreter for adding up list contents and ‘lenAlg’ for
just counting the number of elements present:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sumAlg</span> <span class="o">::</span> <span class="kt">Num</span> <span class="n">a</span> <span class="o">=></span> <span class="kt">ListF</span> <span class="n">a</span> <span class="n">a</span> <span class="o">-></span> <span class="n">a</span>
<span class="n">sumAlg</span> <span class="o">=</span> <span class="nf">\</span><span class="kr">case</span>
<span class="kt">Cons</span> <span class="n">h</span> <span class="n">t</span> <span class="o">-></span> <span class="n">h</span> <span class="o">+</span> <span class="n">t</span>
<span class="kt">Nil</span> <span class="o">-></span> <span class="mi">0</span>
<span class="n">lenAlg</span> <span class="o">::</span> <span class="kt">ListF</span> <span class="n">a</span> <span class="kt">Int</span> <span class="o">-></span> <span class="kt">Int</span>
<span class="n">lenAlg</span> <span class="o">=</span> <span class="nf">\</span><span class="kr">case</span>
<span class="kt">Cons</span> <span class="n">h</span> <span class="n">t</span> <span class="o">-></span> <span class="mi">1</span> <span class="o">+</span> <span class="n">t</span>
<span class="kt">Nil</span> <span class="o">-></span> <span class="mi">0</span>
</code></pre></div></div>
<p>Easy-peasy. We can use <a href="/practical-recursion-schemes">cata</a> to make these
things useful:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">sum</span> <span class="o">::</span> <span class="kt">Num</span> <span class="n">a</span> <span class="o">=></span> <span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">-></span> <span class="n">a</span>
<span class="n">sum</span> <span class="o">=</span> <span class="n">cata</span> <span class="n">sumAlg</span>
<span class="n">len</span> <span class="o">::</span> <span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">-></span> <span class="kt">Int</span>
<span class="n">len</span> <span class="o">=</span> <span class="n">cata</span> <span class="n">lenAlg</span>
</code></pre></div></div>
<p>Nothing new there; ‘sum [1..10]’ will give you 55 and ‘len [1..10]’ will give
you 10.</p>
<p>An interesting twist is to consider only <em>small</em> elements in some sense; say,
we only want to add or count the leading elements that are less than or equal
to 10, and ignore everything from the first larger element onward.</p>
<p>We could rewrite the previous interpreters, manually checking for the condition
we’re interested in and handling it accordingly:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">smallSumAlg</span> <span class="o">::</span> <span class="p">(</span><span class="kt">Ord</span> <span class="n">a</span><span class="p">,</span> <span class="kt">Num</span> <span class="n">a</span><span class="p">)</span> <span class="o">=></span> <span class="kt">ListF</span> <span class="n">a</span> <span class="n">a</span> <span class="o">-></span> <span class="n">a</span>
<span class="n">smallSumAlg</span> <span class="o">=</span> <span class="nf">\</span><span class="kr">case</span>
<span class="kt">Cons</span> <span class="n">h</span> <span class="n">t</span> <span class="o">-></span>
<span class="kr">if</span> <span class="n">h</span> <span class="o"><=</span> <span class="mi">10</span>
<span class="kr">then</span> <span class="n">h</span> <span class="o">+</span> <span class="n">t</span>
<span class="kr">else</span> <span class="mi">0</span>
<span class="kt">Nil</span> <span class="o">-></span> <span class="mi">0</span>
<span class="n">smallLenAlg</span> <span class="o">::</span> <span class="p">(</span><span class="kt">Ord</span> <span class="n">a</span><span class="p">,</span> <span class="kt">Num</span> <span class="n">a</span><span class="p">)</span> <span class="o">=></span> <span class="kt">ListF</span> <span class="n">a</span> <span class="kt">Int</span> <span class="o">-></span> <span class="kt">Int</span>
<span class="n">smallLenAlg</span> <span class="o">=</span> <span class="nf">\</span><span class="kr">case</span>
<span class="kt">Cons</span> <span class="n">h</span> <span class="n">t</span> <span class="o">-></span>
<span class="kr">if</span> <span class="n">h</span> <span class="o"><=</span> <span class="mi">10</span>
<span class="kr">then</span> <span class="mi">1</span> <span class="o">+</span> <span class="n">t</span>
<span class="kr">else</span> <span class="mi">0</span>
<span class="kt">Nil</span> <span class="o">-></span> <span class="mi">0</span>
</code></pre></div></div>
<p>And you get ‘smallSum’ and ‘smallLen’ by using ‘cata’ on them respectively.
They work like you’d expect - ‘smallLen [1, 5, 20]’ ignores the 20 and just
returns 2, for example.</p>
<p>You can do better though. Enter the prepromorphism.</p>
<p>Instead of writing additional special-case interpreters for the ‘small’ case,
consider the following <em>natural transformation</em> on the list base functor. It
maps the list base functor to itself, without needing to inspect the carrier
type:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">small</span> <span class="o">::</span> <span class="p">(</span><span class="kt">Ord</span> <span class="n">a</span><span class="p">,</span> <span class="kt">Num</span> <span class="n">a</span><span class="p">)</span> <span class="o">=></span> <span class="kt">ListF</span> <span class="n">a</span> <span class="n">b</span> <span class="o">-></span> <span class="kt">ListF</span> <span class="n">a</span> <span class="n">b</span>
<span class="n">small</span> <span class="kt">Nil</span> <span class="o">=</span> <span class="kt">Nil</span>
<span class="n">small</span> <span class="n">term</span><span class="o">@</span><span class="p">(</span><span class="kt">Cons</span> <span class="n">h</span> <span class="n">t</span><span class="p">)</span>
<span class="o">|</span> <span class="n">h</span> <span class="o"><=</span> <span class="mi">10</span> <span class="o">=</span> <span class="n">term</span>
<span class="o">|</span> <span class="n">otherwise</span> <span class="o">=</span> <span class="kt">Nil</span>
</code></pre></div></div>
<p>A <em>prepromorphism</em> is a ‘cata’-like recursion scheme that proceeds by first
applying a natural transformation before interpreting via a supplied algebra.
That’s.. surprisingly simple. Here are ‘smallSum’ and ‘smallLen’, defined
without needing to clumsily create new special-case algebras:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">smallSum</span> <span class="o">::</span> <span class="p">(</span><span class="kt">Ord</span> <span class="n">a</span><span class="p">,</span> <span class="kt">Num</span> <span class="n">a</span><span class="p">)</span> <span class="o">=></span> <span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">-></span> <span class="n">a</span>
<span class="n">smallSum</span> <span class="o">=</span> <span class="n">prepro</span> <span class="n">small</span> <span class="n">sumAlg</span>
<span class="n">smallLen</span> <span class="o">::</span> <span class="p">(</span><span class="kt">Ord</span> <span class="n">a</span><span class="p">,</span> <span class="kt">Num</span> <span class="n">a</span><span class="p">)</span> <span class="o">=></span> <span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">-></span> <span class="kt">Int</span>
<span class="n">smallLen</span> <span class="o">=</span> <span class="n">prepro</span> <span class="n">small</span> <span class="n">lenAlg</span>
</code></pre></div></div>
<p>They work great:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> smallSum [1..100]
55
> smallLen [1..100]
10
</code></pre></div></div>
<p>In pseudo category-theoretic notation you can visualize how a prepromorphism
works via the following commutative diagram:</p>
<p><img src="/images/prepro.png" alt="" class="center-image" /></p>
<p>The only difference, when compared to a <a href="/monadic-recursion-schemes">standard
catamorphism</a>, is the presence of the natural
transformation applied via the looping arrow in the top left. The natural
transformation ‘h’ has type ‘forall r. Base t r -> Base t r’, and ‘embed’ has
type ‘Base t t -> t’, so their composition gets you exactly the type you need
for an algebra, which is then the input to ‘cata’ there. Mapping the
catamorphism over the type ‘Base t t’ brings it right back to ‘Base t t’.</p>
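<p>To make the diagram concrete, here’s a minimal sketch of how ‘prepro’ could be
written, specialised to lists so that it needs nothing outside of ‘base’. The
‘project’, ‘embed’, and ‘cata’ functions are hand-rolled stand-ins for their
recursion-schemes counterparts, and ‘sumAlg’ is assumed to be the obvious
summing algebra, so treat this as an illustration rather than the library’s
actual source:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE RankNTypes #-}

-- The list base functor, plus list-specialised stand-ins for the
-- recursion-schemes 'project', 'embed', and 'cata'.
data ListF a r = Nil | Cons a r
  deriving Functor

project :: [a] -> ListF a [a]
project []      = Nil
project (h : t) = Cons h t

embed :: ListF a [a] -> [a]
embed Nil        = []
embed (Cons h t) = h : t

cata :: (ListF a b -> b) -> [a] -> b
cata alg = alg . fmap (cata alg) . project

-- Unroll one layer, push the natural transformation through the rest of
-- the structure with 'cata (embed . h)', recurse, then interpret.
prepro
  :: (forall r. ListF a r -> ListF a r)
  -> (ListF a b -> b)
  -> [a]
  -> b
prepro h alg = go where
  go = alg . fmap (go . cata (embed . h)) . project

small :: (Ord a, Num a) => ListF a b -> ListF a b
small Nil = Nil
small term@(Cons h _)
  | h <= 10   = term
  | otherwise = Nil

-- Assumed to match the summing algebra used above.
sumAlg :: Num a => ListF a a -> a
sumAlg Nil        = 0
sumAlg (Cons h t) = h + t

smallSum :: (Ord a, Num a) => [a] -> a
smallSum = prepro small sumAlg
</code></pre></div></div>
<p>The ‘go . cata (embed . h)’ composition is the looping arrow followed by the
recursive call: the rest of the structure is rewritten by ‘h’ before the
prepromorphism descends into it.</p>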
<p>A <em>postpromorphism</em> is dual to a prepromorphism. It’s ‘ana’-like; proceed with
your corecursive production, applying natural transformations as you go.</p>
<p>Here’s a streaming coalgebra:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">streamCoalg</span> <span class="o">::</span> <span class="kt">Enum</span> <span class="n">a</span> <span class="o">=></span> <span class="n">a</span> <span class="o">-></span> <span class="kt">ListF</span> <span class="n">a</span> <span class="n">a</span>
<span class="n">streamCoalg</span> <span class="n">n</span> <span class="o">=</span> <span class="kt">Cons</span> <span class="n">n</span> <span class="p">(</span><span class="n">succ</span> <span class="n">n</span><span class="p">)</span>
</code></pre></div></div>
<p>A normal anamorphism would just send this thing shooting off into infinity, but
we can use the existing ‘small’ natural transformation to cap it at 10:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">smallStream</span> <span class="o">::</span> <span class="p">(</span><span class="kt">Ord</span> <span class="n">a</span><span class="p">,</span> <span class="kt">Num</span> <span class="n">a</span><span class="p">,</span> <span class="kt">Enum</span> <span class="n">a</span><span class="p">)</span> <span class="o">=></span> <span class="n">a</span> <span class="o">-></span> <span class="p">[</span><span class="n">a</span><span class="p">]</span>
<span class="n">smallStream</span> <span class="o">=</span> <span class="n">postpro</span> <span class="n">small</span> <span class="n">streamCoalg</span>
</code></pre></div></div>
<p>You get what you might expect:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> smallStream 3
[3,4,5,6,7,8,9,10]
</code></pre></div></div>
<p>And similarly, you can visualize a postpromorphism like so:</p>
<p><img src="/images/postpro.png" alt="" class="center-image" /></p>
<p>In this case the natural transformation is applied <em>after</em> mapping the
postpromorphism over the base functor (hence the ‘post’ in the name).</p>
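<p>Dually, here’s a rough sketch of how ‘postpro’ could be written for lists
using only ‘base’. As before, ‘project’, ‘embed’, and ‘ana’ are hand-rolled
stand-ins for the recursion-schemes versions, so this is an illustration under
those assumptions and not the library’s own definition:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE RankNTypes #-}

data ListF a r = Nil | Cons a r
  deriving Functor

project :: [a] -> ListF a [a]
project []      = Nil
project (h : t) = Cons h t

embed :: ListF a [a] -> [a]
embed Nil        = []
embed (Cons h t) = h : t

ana :: (b -> ListF a b) -> b -> [a]
ana coalg = embed . fmap (ana coalg) . coalg

-- Produce one layer with the coalgebra, recurse, and then push the
-- natural transformation through whatever the recursion produced.
postpro
  :: (forall r. ListF a r -> ListF a r)
  -> (b -> ListF a b)
  -> b
  -> [a]
postpro h coalg = go where
  go = embed . fmap (ana (h . project) . go) . coalg

small :: (Ord a, Num a) => ListF a b -> ListF a b
small Nil = Nil
small term@(Cons h _)
  | h <= 10   = term
  | otherwise = Nil

streamCoalg :: Enum a => a -> ListF a a
streamCoalg n = Cons n (succ n)

smallStream :: (Ord a, Num a, Enum a) => a -> [a]
smallStream = postpro small streamCoalg
</code></pre></div></div>
<p>Laziness is doing real work here: the unbounded stream is only ever forced as
far as the first element that ‘small’ rejects.</p>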
Comonadic Markov Chain Monte Carlo 2016-10-26T00:00:00+04:00 https://jtobin.io/comonadic-mcmc
<p>Some time ago I came across a way to perform inference, at least in principle,
on certain probabilistic programs using comonadic structures and operations.</p>
<p>I decided to dig it up and try to use it to extend the <a href="/simple-probabilistic-programming">simple probabilistic
programming language</a> I talked about a few days ago with a stateful,
experimental inference backend. In this post we’ll</p>
<ul>
<li>Represent probabilistic programs as recursive types parameterized by
a terminating instruction set.</li>
<li>Represent execution traces of probabilistic programs via a simple
transformation of our program representation.</li>
<li>Implement the Metropolis-Hastings algorithm over this space of execution
traces and thus do some inference.</li>
</ul>
<p>Let’s get started!</p>
<h2 id="representing-programs-that-terminate">Representing Programs That Terminate</h2>
<p>I like thinking of embedded languages in terms of <em>instruction sets</em>. That is:
I want to be able to construct my embedded language by first defining a
collection of abstract instructions and then using some appropriate <a href="/tour-of-some-recursive-types">recursive
structure</a> to represent programs over that set.</p>
<p>In the case of probabilistic programs, our instructions are <em>probability
distributions</em>. Last time we used the following simple instruction set to
define our embedded language:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">ModelF</span> <span class="n">r</span> <span class="o">=</span>
    <span class="kt">BernoulliF</span> <span class="kt">Double</span> <span class="p">(</span><span class="kt">Bool</span> <span class="o">-></span> <span class="n">r</span><span class="p">)</span>
  <span class="o">|</span> <span class="kt">BetaF</span> <span class="kt">Double</span> <span class="kt">Double</span> <span class="p">(</span><span class="kt">Double</span> <span class="o">-></span> <span class="n">r</span><span class="p">)</span>
  <span class="kr">deriving</span> <span class="kt">Functor</span>
</code></pre></div></div>
<p>We then created an embedded language by just wrapping it up in the
higher-kinded <code class="language-plaintext highlighter-rouge">Free</code> type to denote programs of type <code class="language-plaintext highlighter-rouge">Model</code>.</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">Free</span> <span class="n">f</span> <span class="n">a</span> <span class="o">=</span>
    <span class="kt">Pure</span> <span class="n">a</span>
  <span class="o">|</span> <span class="kt">Free</span> <span class="p">(</span><span class="n">f</span> <span class="p">(</span><span class="kt">Free</span> <span class="n">f</span> <span class="n">a</span><span class="p">))</span>
<span class="kr">type</span> <span class="kt">Model</span> <span class="o">=</span> <span class="kt">Free</span> <span class="kt">ModelF</span>
</code></pre></div></div>
<p>Recall that <code class="language-plaintext highlighter-rouge">Free</code> represents programs that can <em>terminate</em>, either by some
instruction in the underlying instruction set, or via the <code class="language-plaintext highlighter-rouge">Pure</code> constructor of
the <code class="language-plaintext highlighter-rouge">Free</code> type itself. The language defined by <code class="language-plaintext highlighter-rouge">Free ModelF</code> is expressive
enough to easily construct a ‘forward-sampling’ interpreter, as well as a
simple rejection sampler for performing inference.</p>
<p>Notice that we don’t have a terminating <em>instruction</em> in <code class="language-plaintext highlighter-rouge">ModelF</code> itself - if
we’re using it, then we need to rely on the <code class="language-plaintext highlighter-rouge">Pure</code> constructor of <code class="language-plaintext highlighter-rouge">Free</code> to
terminate programs. Otherwise they’d just have to recurse forever. This can
be a bit limiting if we want to transform a program of type <code class="language-plaintext highlighter-rouge">Free ModelF</code> to
something else that doesn’t have a notion of termination baked-in (<code class="language-plaintext highlighter-rouge">Fix</code>, for
example).</p>
<p>Let’s tweak the <code class="language-plaintext highlighter-rouge">ModelF</code> type to get the following:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">ModelF</span> <span class="n">a</span> <span class="n">r</span> <span class="o">=</span>
    <span class="kt">BernoulliF</span> <span class="kt">Double</span> <span class="p">(</span><span class="kt">Bool</span> <span class="o">-></span> <span class="n">r</span><span class="p">)</span>
  <span class="o">|</span> <span class="kt">BetaF</span> <span class="kt">Double</span> <span class="kt">Double</span> <span class="p">(</span><span class="kt">Double</span> <span class="o">-></span> <span class="n">r</span><span class="p">)</span>
  <span class="o">|</span> <span class="kt">NormalF</span> <span class="kt">Double</span> <span class="kt">Double</span> <span class="p">(</span><span class="kt">Double</span> <span class="o">-></span> <span class="n">r</span><span class="p">)</span>
  <span class="o">|</span> <span class="kt">DiracF</span> <span class="n">a</span>
  <span class="kr">deriving</span> <span class="kt">Functor</span>
</code></pre></div></div>
<p>Aside from adding another foundational distribution - <code class="language-plaintext highlighter-rouge">NormalF</code> - we’ve also
added a new constructor, <code class="language-plaintext highlighter-rouge">DiracF</code>, which carries a parameter with type <code class="language-plaintext highlighter-rouge">a</code>. We
need to incorporate this carrier type in the overall type of <code class="language-plaintext highlighter-rouge">ModelF</code> as well,
so <code class="language-plaintext highlighter-rouge">ModelF</code> itself also gets a new type parameter to carry around.</p>
<p>The <code class="language-plaintext highlighter-rouge">DiracF</code> instruction is a <em>terminating</em> instruction; it has no recursive
point and just terminates with a value of type <code class="language-plaintext highlighter-rouge">a</code> when reached. It’s
structurally equivalent to the <code class="language-plaintext highlighter-rouge">Pure a</code> branch of <code class="language-plaintext highlighter-rouge">Free</code> that we were relying
on to terminate our programs previously - the only thing we’ve done is add it
to our instruction set proper.</p>
<p>Why <code class="language-plaintext highlighter-rouge">DiracF</code>? A <a href="https://en.wikipedia.org/wiki/Dirac_delta_function">Dirac distribution</a> places the entirety of its
probability mass on a single point, and this is the exact probabilistic
interpretation of the applicative <code class="language-plaintext highlighter-rouge">pure</code> or monadic <code class="language-plaintext highlighter-rouge">return</code> that one
encounters with an appropriate probability type. Intuitively, if I sample a
value \(x\) from a uniform distribution, then that is indistinguishable from
sampling \(x\) from said uniform distribution and then sampling from a Dirac
distribution with parameter \(x\).</p>
<p>Make sense? If not, it might be helpful to note that there is no difference
between any of the following (to which <code class="language-plaintext highlighter-rouge">uniform</code> and <code class="language-plaintext highlighter-rouge">dirac</code> are analogous):</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> action :: m a
> action >>= return :: m a
> action >>= return >>= return >>= return :: m a
</code></pre></div></div>
<p>Wrapping <code class="language-plaintext highlighter-rouge">ModelF a</code> up in <code class="language-plaintext highlighter-rouge">Free</code>, we get the following general type for our
programs:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">type</span> <span class="kt">Program</span> <span class="n">a</span> <span class="o">=</span> <span class="kt">Free</span> <span class="p">(</span><span class="kt">ModelF</span> <span class="n">a</span><span class="p">)</span>
</code></pre></div></div>
<p>And we can construct a bunch of embedded language terms in the standard way:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">beta</span> <span class="o">::</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Program</span> <span class="n">a</span> <span class="kt">Double</span>
<span class="n">beta</span> <span class="n">a</span> <span class="n">b</span> <span class="o">=</span> <span class="n">liftF</span> <span class="p">(</span><span class="kt">BetaF</span> <span class="n">a</span> <span class="n">b</span> <span class="n">id</span><span class="p">)</span>
<span class="n">bernoulli</span> <span class="o">::</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Program</span> <span class="n">a</span> <span class="kt">Bool</span>
<span class="n">bernoulli</span> <span class="n">p</span> <span class="o">=</span> <span class="n">liftF</span> <span class="p">(</span><span class="kt">BernoulliF</span> <span class="n">p</span> <span class="n">id</span><span class="p">)</span>
<span class="n">normal</span> <span class="o">::</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Program</span> <span class="n">a</span> <span class="kt">Double</span>
<span class="n">normal</span> <span class="n">m</span> <span class="n">s</span> <span class="o">=</span> <span class="n">liftF</span> <span class="p">(</span><span class="kt">NormalF</span> <span class="n">m</span> <span class="n">s</span> <span class="n">id</span><span class="p">)</span>
<span class="n">dirac</span> <span class="o">::</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">Program</span> <span class="n">a</span> <span class="n">b</span>
<span class="n">dirac</span> <span class="n">x</span> <span class="o">=</span> <span class="n">liftF</span> <span class="p">(</span><span class="kt">DiracF</span> <span class="n">x</span><span class="p">)</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">Program</code> is a general type, capturing both terminating and nonterminating
programs via its type parameters. What do I mean by this? Note that in
<code class="language-plaintext highlighter-rouge">Program a b</code>, the <code class="language-plaintext highlighter-rouge">a</code> type parameter can only be concretely instantiated via
use of the terminating <code class="language-plaintext highlighter-rouge">dirac</code> term. On the other hand, the <code class="language-plaintext highlighter-rouge">b</code> type parameter
is <em>unaffected</em> by the <code class="language-plaintext highlighter-rouge">dirac</code> term; it can only be instantiated by the other
nonterminating terms: <code class="language-plaintext highlighter-rouge">beta</code>, <code class="language-plaintext highlighter-rouge">bernoulli</code>, <code class="language-plaintext highlighter-rouge">normal</code>, or compound expressions of
these.</p>
<p>We can thus distinguish between terminating and nonterminating programs at the
type level, like so:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">type</span> <span class="kt">Terminating</span> <span class="n">a</span> <span class="o">=</span> <span class="kt">Program</span> <span class="n">a</span> <span class="kt">Void</span>
<span class="kr">type</span> <span class="kt">Model</span> <span class="n">b</span> <span class="o">=</span> <span class="n">forall</span> <span class="n">a</span><span class="o">.</span> <span class="kt">Program</span> <span class="n">a</span> <span class="n">b</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">Void</code> is the uninhabited type, brought into scope via <code class="language-plaintext highlighter-rouge">Data.Void</code> or simply
defined via the empty declaration <code class="language-plaintext highlighter-rouge">data Void</code>. Any program that ends via a <code class="language-plaintext highlighter-rouge">dirac</code>
instruction <em>must</em> be <code class="language-plaintext highlighter-rouge">Terminating</code>, and any program that <em>doesn’t</em> end with a
<code class="language-plaintext highlighter-rouge">dirac</code> instruction <em>cannot</em> be <code class="language-plaintext highlighter-rouge">Terminating</code>. We’ll just continue to call
a nonterminating program a <code class="language-plaintext highlighter-rouge">Model</code>, as before.</p>
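<p>To see what the <code class="language-plaintext highlighter-rouge">Void</code> parameter buys us, here’s a small self-contained sketch.
It hand-rolls <code class="language-plaintext highlighter-rouge">Free</code> and <code class="language-plaintext highlighter-rouge">liftF</code> so that it depends only on ‘base’, and the
<code class="language-plaintext highlighter-rouge">terminate</code> and <code class="language-plaintext highlighter-rouge">runFixed</code> helpers are hypothetical names introduced purely for
illustration. <code class="language-plaintext highlighter-rouge">runFixed</code> is not a sampler; it resolves every draw with a fixed
value, just to show that the <code class="language-plaintext highlighter-rouge">Pure</code> branch of a <code class="language-plaintext highlighter-rouge">Terminating</code> program can
never be reached:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE RankNTypes #-}

import Data.Void (Void, absurd)

-- A hand-rolled free monad, so the sketch depends only on 'base'.
data Free f a = Pure a | Free (f (Free f a))

instance Functor f => Functor (Free f) where
  fmap g (Pure a) = Pure (g a)
  fmap g (Free m) = Free (fmap (fmap g) m)

instance Functor f => Applicative (Free f) where
  pure = Pure
  Pure g <*> x = fmap g x
  Free m <*> x = Free (fmap (<*> x) m)

instance Functor f => Monad (Free f) where
  Pure a >>= k = k a
  Free m >>= k = Free (fmap (>>= k) m)

liftF :: Functor f => f a -> Free f a
liftF = Free . fmap Pure

data ModelF a r =
    BernoulliF Double (Bool -> r)
  | BetaF Double Double (Double -> r)
  | NormalF Double Double (Double -> r)
  | DiracF a
  deriving Functor

type Program a     = Free (ModelF a)
type Terminating a = Program a Void
type Model b       = forall a. Program a b

bernoulli :: Double -> Program a Bool
bernoulli p = liftF (BernoulliF p id)

dirac :: a -> Program a b
dirac x = liftF (DiracF x)

coin :: Model Bool
coin = bernoulli 0.5

-- Hypothetical helper: close off a program by ending it with 'dirac'.
terminate :: Program a a -> Terminating a
terminate = (>>= dirac)

-- Hypothetical interpreter that resolves every draw with a fixed value.
-- A 'Pure' node could only hold a 'Void', so that case is unreachable.
runFixed :: Terminating a -> a
runFixed (Pure v)     = absurd v
runFixed (Free instr) = case instr of
  BernoulliF _ k -> runFixed (k True)
  BetaF _ _ k    -> runFixed (k 0.5)
  NormalF m _ k  -> runFixed (k m)
  DiracF x       -> x
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">terminate coin</code> typechecks as a <code class="language-plaintext highlighter-rouge">Terminating Bool</code>, whereas <code class="language-plaintext highlighter-rouge">coin</code> on its
own does not, since its result type is <code class="language-plaintext highlighter-rouge">Bool</code> rather than <code class="language-plaintext highlighter-rouge">Void</code>.</p>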
<p>Good. So if it’s not clear: from a user’s perspective, nothing has changed.
We still write probabilistic programs using simple monadic language terms.
Here’s a Gaussian mixture model where the mixing parameter follows a beta
distribution, for example:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mixture</span> <span class="o">::</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Model</span> <span class="kt">Double</span>
<span class="n">mixture</span> <span class="n">a</span> <span class="n">b</span> <span class="o">=</span> <span class="kr">do</span>
  <span class="n">prob</span> <span class="o"><-</span> <span class="n">beta</span> <span class="n">a</span> <span class="n">b</span>
  <span class="n">accept</span> <span class="o"><-</span> <span class="n">bernoulli</span> <span class="n">prob</span>
  <span class="kr">if</span> <span class="n">accept</span>
    <span class="kr">then</span> <span class="n">normal</span> <span class="p">(</span><span class="n">negate</span> <span class="mi">2</span><span class="p">)</span> <span class="mf">0.5</span>
    <span class="kr">else</span> <span class="n">normal</span> <span class="mi">2</span> <span class="mf">0.5</span>
</code></pre></div></div>
<p>Meanwhile the syntax tree generated looks something like the following. It’s
more or less a traditional probabilistic graphical model description of our
program:</p>
<p><img src="/images/mixture_ast.png" alt="" class="center-image" /></p>
<p>It’s important to note that in this embedded framework, the only pieces of the
syntax tree that we can observe are those related directly to our primitive
instructions. For our purposes this is excellent - we can focus on programs
entirely at the level of their probabilistic components, and ignore the
deterministic parts that would otherwise be distractions.</p>
<p>To collect samples from <code class="language-plaintext highlighter-rouge">mixture</code>, we can first interpret it into a sampling
function, and then simulate from it. The <code class="language-plaintext highlighter-rouge">toSampler</code> function <a href="/simple-probabilistic-programming">from last
time</a> doesn’t change much:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">toSampler</span> <span class="o">::</span> <span class="kt">Program</span> <span class="n">a</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">Prob</span> <span class="kt">IO</span> <span class="n">a</span>
<span class="n">toSampler</span> <span class="o">=</span> <span class="n">iterM</span> <span class="o">$</span> <span class="nf">\</span><span class="kr">case</span>
  <span class="kt">BernoulliF</span> <span class="n">p</span> <span class="n">f</span> <span class="o">-></span> <span class="kt">Prob</span><span class="o">.</span><span class="n">bernoulli</span> <span class="n">p</span> <span class="o">>>=</span> <span class="n">f</span>
  <span class="kt">BetaF</span> <span class="n">a</span> <span class="n">b</span> <span class="n">f</span> <span class="o">-></span> <span class="kt">Prob</span><span class="o">.</span><span class="n">beta</span> <span class="n">a</span> <span class="n">b</span> <span class="o">>>=</span> <span class="n">f</span>
  <span class="kt">NormalF</span> <span class="n">m</span> <span class="n">s</span> <span class="n">f</span> <span class="o">-></span> <span class="kt">Prob</span><span class="o">.</span><span class="n">normal</span> <span class="n">m</span> <span class="n">s</span> <span class="o">>>=</span> <span class="n">f</span>
  <span class="kt">DiracF</span> <span class="n">x</span> <span class="o">-></span> <span class="n">return</span> <span class="n">x</span>
</code></pre></div></div>
<p>Sampling from <code class="language-plaintext highlighter-rouge">mixture 2 3</code> a thousand times yields the following:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> simulate (toSampler (mixture 2 3))
</code></pre></div></div>
<p><img src="/images/mixture_samples.png" alt="" class="center-image" /></p>
<p>Note that the rightmost component gets more traffic due to the hyperparameter
combination of 2 and 3 that we provided to <code class="language-plaintext highlighter-rouge">mixture</code>.</p>
<p>Also, a note - since we have general recursion in Haskell, so-called
‘terminating’ programs here can actually.. uh, fail to terminate. They must
only terminate as far as we can express the sentiment at the embedded language
level. Consider the following, for example:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">foo</span> <span class="o">::</span> <span class="kt">Terminating</span> <span class="n">a</span>
<span class="n">foo</span> <span class="o">=</span> <span class="p">(</span><span class="n">loop</span> <span class="mi">1</span><span class="p">)</span> <span class="o">>>=</span> <span class="n">dirac</span> <span class="kr">where</span>
  <span class="n">loop</span> <span class="n">a</span> <span class="o">=</span> <span class="kr">do</span>
    <span class="n">p</span> <span class="o"><-</span> <span class="n">beta</span> <span class="n">a</span> <span class="mi">1</span>
    <span class="n">loop</span> <span class="n">p</span>
</code></pre></div></div>
<p><code class="language-plaintext highlighter-rouge">foo</code> here doesn’t actually terminate. But at least this kind of weird case
can be picked up in the types:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> :t simulate (toSampler foo)
simulate (toSampler foo) :: IO Void
</code></pre></div></div>
<p>If you try to sample from a distribution over <code class="language-plaintext highlighter-rouge">Void</code> or <code class="language-plaintext highlighter-rouge">forall a. a</code> then I
can’t be held responsible for what you get up to. But there are other cases,
sadly, where we’re also out of luck:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">trollGeometric</span> <span class="o">::</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Model</span> <span class="kt">Int</span>
<span class="n">trollGeometric</span> <span class="n">p</span> <span class="o">=</span> <span class="n">loop</span> <span class="kr">where</span>
  <span class="n">loop</span> <span class="o">=</span> <span class="kr">do</span>
    <span class="n">accept</span> <span class="o"><-</span> <span class="n">return</span> <span class="kt">False</span>
    <span class="kr">if</span> <span class="n">accept</span>
      <span class="kr">then</span> <span class="n">return</span> <span class="mi">1</span>
      <span class="kr">else</span> <span class="n">fmap</span> <span class="n">succ</span> <span class="n">loop</span>
</code></pre></div></div>
<p>A geometric distribution that actually <em>used its argument</em> \(p\), for \(0 < p
\leq 1\), could be guaranteed to terminate with probability 1. This one
doesn’t, so <code class="language-plaintext highlighter-rouge">trollGeometric undefined >>= dirac</code> won’t.</p>
<p>At the end of the day we’re stuck with what our host language offers us. So,
take the termination guarantees for our embedded language with a grain of salt.</p>
<h2 id="stateful-inference">Stateful Inference</h2>
<p>In the previous post we used a simple <a href="https://en.wikipedia.org/wiki/Rejection_sampling">rejection sampler</a> to sample from
a conditional distribution. ‘Vanilla’ Monte Carlo algorithms like rejection
and importance sampling are <em>stateless</em>. This makes them nice in some ways -
they tend to be simple to implement and are embarrassingly parallel, for
example. But the <a href="https://en.wikipedia.org/wiki/Curse_of_dimensionality">curse of dimensionality</a> prevents them from scaling
well to larger problems. I won’t go into detail on that here - for a deep dive
on the topic, you probably won’t find anything better than this <a href="http://videolectures.net/mlss09uk_murray_mcmc/">phenomenal
couple of talks on MCMC</a> that Iain Murray gave at an MLSS session in
Cambridge in 2009. I think they’re unparalleled to this day.</p>
<p>The point is that in higher dimensions we tend to get a lot out of state.
Essentially, if one finds an interesting region of high-dimensional parameter
space, then it’s better to remember where that is, rather than forgetting it
exists as soon as one stumbles onto it. The manifold hypothesis conjectures
that interesting regions of space tend to be near <em>other</em> interesting regions
of space, so exploring neighbourhoods of interesting places tends to pay off.
Stateful Monte Carlo methods - namely, the family of <em>Markov chain</em> Monte Carlo
algorithms - handle exactly this, by using a Markov chain to wander over
parameter space. I’ve written on MCMC <a href="/markov-chains-a-la-carte">in</a> <a href="/flat-mcmc-update">the</a> <a href="/randomness-in-haskell">past</a> -
you can check out some of those articles if you’re interested.</p>
<p>In the stateless rejection sampler we just performed conditional inference via
the following algorithm:</p>
<ul>
<li>Sample from a parameter model.</li>
<li>Sample from a data model, using the sample from the parameter model as
input.</li>
<li>If the sample from the data model matches the provided observations, return
the sample from the parameter model.</li>
</ul>
<p>By repeating this many times we get a sample of arbitrary size from the
appropriate conditional, inverse, or posterior distribution (whatever you want
to call it).</p>
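<p>For reference, the three steps above can be sketched generically over any
monad that supports sampling. This is a rough reconstruction rather than the
exact code from the previous post; the argument names <code class="language-plaintext highlighter-rouge">prior</code> and <code class="language-plaintext highlighter-rouge">likelihood</code>
are illustrative, and I’m assuming one simulated datum per observation:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import Control.Monad (replicateM)

-- 'prior' samples a parameter; 'likelihood' samples a single datum given
-- a parameter. Keep resampling until the simulated data match exactly.
rejection :: (Monad m, Eq b) => [b] -> m a -> (a -> m b) -> m a
rejection observed prior likelihood = loop where
  n    = length observed
  loop = do
    par <- prior
    sim <- replicateM n (likelihood par)
    if sim == observed
      then return par
      else loop
</code></pre></div></div>
<p>Nothing is carried over from one attempt to the next, which is exactly what
makes the method stateless. It also means the loop can spin for a very long
time when an exact match is improbable.</p>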
<p>In a stateful inference routine - here, the good old Metropolis-Hastings
algorithm - we’re instead going to do the following repeatedly:</p>
<ul>
<li>Sample from a parameter model, recording <em>the way the program executed</em> in
order to return the sample that it did.</li>
<li>Compute the <em>cost</em>, in some sense, of generating the provided observations,
using the sample from the parameter model as input.</li>
<li>Propose a new sample from the parameter model by <em>perturbing the way the
program executed</em> and recording the new sample the program outputs.</li>
<li>Compute the cost of generating the provided observations using this new
sample from the parameter model as input.</li>
<li>Compare the costs of generating the provided observations under the
respective samples from the parameter models.</li>
<li>Flip a coin whose probability of heads depends on the ratio of the costs.
If you see a head, then move to the new, proposed execution trace of the
program. Otherwise, stay at the old execution trace.</li>
</ul>
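<p>The final compare-and-flip steps can be sketched in a few lines. I’m making
two assumptions here that the list above doesn’t spell out: that ‘cost’ is a
log-density, so that a ratio of costs becomes a difference, and that the
proposal is symmetric, so that no Hastings correction term is needed. The
names <code class="language-plaintext highlighter-rouge">acceptProb</code> and <code class="language-plaintext highlighter-rouge">decide</code> are made up for the example:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Probability of moving to the proposal, given the log-cost of the
-- current state and of the proposed state.
acceptProb :: Double -> Double -> Double
acceptProb current proposed = min 1 (exp (proposed - current))

-- One accept/reject decision. 'u' is assumed to be a draw from a uniform
-- distribution on [0, 1), supplied by the caller.
decide :: (s -> Double) -> Double -> s -> s -> s
decide logCost u current proposed
  | u < acceptProb (logCost current) (logCost proposed) = proposed
  | otherwise                                           = current
</code></pre></div></div>
<p>A proposal that is at least as good as the current state is always accepted,
while a worse one is accepted only occasionally. That occasional acceptance
is what lets the chain wander out of a local neighbourhood.</p>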
<p>This procedure generates a Markov chain over the space of possible execution
traces of the program - essentially, plausible ways that the program could have
executed in order to generate the supplied observations.</p>
<p>Implementations of <a href="https://en.wikipedia.org/wiki/Church_(programming_language)">Church</a> use variations of this method to do
inference, the most famous of which is a low-overhead transformational
compilation procedure described in <a href="http://www.jmlr.org/proceedings/papers/v15/wingate11a/wingate11a.pdf">a great and influential 2011 paper</a>
by David Wingate et al.</p>
<h2 id="representing-running-programs">Representing Running Programs</h2>
<p>To perform inference on probabilistic programs according to the aforementioned
Metropolis-Hastings algorithm, we need to represent <em>executing</em> programs
somehow, in a form that enables us to examine and modify their internal state.</p>
<p>How can we do that? We’ll pluck another useful recursive structure from our
repertoire and consider the humble <code class="language-plaintext highlighter-rouge">Cofree</code>:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">Cofree</span> <span class="n">f</span> <span class="n">a</span> <span class="o">=</span> <span class="n">a</span> <span class="o">:<</span> <span class="n">f</span> <span class="p">(</span><span class="kt">Cofree</span> <span class="n">f</span> <span class="n">a</span><span class="p">)</span>
</code></pre></div></div>
<p><a href="/tour-of-some-recursive-types">Recall</a> that <code class="language-plaintext highlighter-rouge">Cofree</code> allows one to <em>annotate</em> programs with arbitrary
information at each internal node. This is a great feature; if we can annotate
each internal node with important information about its state - its current
value, the current state of its generator, the ‘cost’ associated with it - then
we can walk through the program and examine it as required. So, it can capture
a ‘running’ program in exactly the way we need.</p>
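<p>As a toy illustration of that kind of annotation, here’s a sketch that labels
every node of a small recursive structure with the size of the subtree rooted
there. The <code class="language-plaintext highlighter-rouge">ToyF</code> functor and the <code class="language-plaintext highlighter-rouge">annotate</code> helper are made up for the example
and have nothing to do with <code class="language-plaintext highlighter-rouge">ModelF</code>; the point is only that every node ends up
carrying its own label:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}

data Cofree f a = a :< f (Cofree f a)

newtype Fix f = Fix (f (Fix f))

-- The annotation stored at the root.
extract :: Cofree f a -> a
extract (a :< _) = a

-- Annotate every node with the result of folding the subtree below it.
annotate :: Functor f => (f a -> a) -> Fix f -> Cofree f a
annotate alg (Fix layer) = alg (fmap extract kids) :< kids
  where kids = fmap (annotate alg) layer

-- A toy instruction set, purely for illustration.
data ToyF r = Stop | Step r | Branch r r
  deriving Functor

size :: ToyF Int -> Int
size Stop         = 1
size (Step n)     = 1 + n
size (Branch l r) = 1 + l + r

example :: Cofree ToyF Int
example =
  annotate size (Fix (Branch (Fix Stop) (Fix (Step (Fix Stop)))))
</code></pre></div></div>
<p>Walking <code class="language-plaintext highlighter-rouge">example</code> lets you read off an annotation at any depth without
recomputing anything, which is the property we want for inspecting a running
program.</p>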
<p>Let’s describe running programs as values having the following <code class="language-plaintext highlighter-rouge">Execution</code>
type:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">type</span> <span class="kt">Execution</span> <span class="n">a</span> <span class="o">=</span> <span class="kt">Cofree</span> <span class="p">(</span><span class="kt">ModelF</span> <span class="n">a</span><span class="p">)</span> <span class="kt">Node</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">Node</code> type is what we’ll use to describe the internal state of each node
in the program. I’ll define it like so:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">Node</span> <span class="o">=</span> <span class="kt">Node</span> <span class="p">{</span>
    <span class="n">nodeCost</span> <span class="o">::</span> <span class="kt">Double</span>
  <span class="p">,</span> <span class="n">nodeValue</span> <span class="o">::</span> <span class="kt">Dynamic</span>
  <span class="p">,</span> <span class="n">nodeSeed</span> <span class="o">::</span> <span class="kt">MWC</span><span class="o">.</span><span class="kt">Seed</span>
  <span class="p">,</span> <span class="n">nodeHistory</span> <span class="o">::</span> <span class="p">[</span><span class="kt">Dynamic</span><span class="p">]</span>
  <span class="p">}</span> <span class="kr">deriving</span> <span class="kt">Show</span>
</code></pre></div></div>
<p>I’ll elaborate on this type below, but you can see that it captures a bunch of
information about the state of each node.</p>
<p>One can mechanically transform any <code class="language-plaintext highlighter-rouge">Free</code>-encoded program into a
<code class="language-plaintext highlighter-rouge">Cofree</code>-encoded program, so long as the original <code class="language-plaintext highlighter-rouge">Free</code>-encoded program can
terminate of its own accord, i.e. on the level of its own instructions. Hence
the need for our <code class="language-plaintext highlighter-rouge">Terminating</code> type and all that.</p>
<p>In our case, setting everything up just right takes a bit of code, mainly
around handling <a href="/randomness-in-haskell">pseudo-random number generators</a> in a pure fashion. So
I won’t talk about every little detail of it right here. The general idea is
to write a function that takes instructions to the appropriate state captured
by a <code class="language-plaintext highlighter-rouge">Node</code> value, like so:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">initialize</span> <span class="o">::</span> <span class="kt">Typeable</span> <span class="n">a</span> <span class="o">=></span> <span class="kt">MWC</span><span class="o">.</span><span class="kt">Seed</span> <span class="o">-></span> <span class="kt">ModelF</span> <span class="n">a</span> <span class="n">b</span> <span class="o">-></span> <span class="kt">Node</span>
<span class="n">initialize</span> <span class="n">seed</span> <span class="o">=</span> <span class="nf">\</span><span class="kr">case</span>
  <span class="kt">BernoulliF</span> <span class="n">p</span> <span class="kr">_</span> <span class="o">-></span> <span class="n">runST</span> <span class="o">$</span> <span class="kr">do</span>
    <span class="p">(</span><span class="n">nodeValue</span><span class="p">,</span> <span class="n">nodeSeed</span><span class="p">)</span> <span class="o"><-</span> <span class="n">samplePurely</span> <span class="p">(</span><span class="kt">Prob</span><span class="o">.</span><span class="n">bernoulli</span> <span class="n">p</span><span class="p">)</span> <span class="n">seed</span>
    <span class="kr">let</span> <span class="n">nodeCost</span> <span class="o">=</span> <span class="n">logDensityBernoulli</span> <span class="n">p</span> <span class="p">(</span><span class="n">unsafeFromDyn</span> <span class="n">nodeValue</span><span class="p">)</span>
        <span class="n">nodeHistory</span> <span class="o">=</span> <span class="n">mempty</span>
    <span class="n">return</span> <span class="kt">Node</span> <span class="p">{</span><span class="o">..</span><span class="p">}</span>
  <span class="kt">BetaF</span> <span class="n">a</span> <span class="n">b</span> <span class="kr">_</span> <span class="o">-></span> <span class="n">runST</span> <span class="o">$</span> <span class="kr">do</span>
    <span class="p">(</span><span class="n">nodeValue</span><span class="p">,</span> <span class="n">nodeSeed</span><span class="p">)</span> <span class="o"><-</span> <span class="n">samplePurely</span> <span class="p">(</span><span class="kt">Prob</span><span class="o">.</span><span class="n">beta</span> <span class="n">a</span> <span class="n">b</span><span class="p">)</span> <span class="n">seed</span>
    <span class="kr">let</span> <span class="n">nodeCost</span> <span class="o">=</span> <span class="n">logDensityBeta</span> <span class="n">a</span> <span class="n">b</span> <span class="p">(</span><span class="n">unsafeFromDyn</span> <span class="n">nodeValue</span><span class="p">)</span>
        <span class="n">nodeHistory</span> <span class="o">=</span> <span class="n">mempty</span>
    <span class="n">return</span> <span class="kt">Node</span> <span class="p">{</span><span class="o">..</span><span class="p">}</span>
  <span class="o">...</span>
</code></pre></div></div>
<p>You can see that for each node, I sample from it, calculate its cost, and then
initialize its ‘history’ as an empty list.</p>
<p>Here it’s worth going into a brief aside.</p>
<p>There are two mildly annoying things we have to deal with in this situation.
First, individual nodes in the program typically sample values at <em>different
types</em>, and second, we can’t easily use effects when annotating. This means
that we have to pack heterogeneously-typed things into a homogeneously-typed
container, and also use pure random number generation facilities to sample
them.</p>
<p>A quick-and-dirty answer for the first case is to just use dynamic typing when
storing the values. It works and is easy, but of course is subject to the
standard caveats. I use a function called <code class="language-plaintext highlighter-rouge">unsafeFromDyn</code> to convert
dynamically-typed values back to a typed form, so you can gauge the safety of
all this for yourself.</p>
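<p>For illustration, here’s a minimal sketch of what such a conversion might look
like using only <code class="language-plaintext highlighter-rouge">Data.Dynamic</code> from base. The body of <code class="language-plaintext highlighter-rouge">unsafeFromDyn</code> below is an
assumption (its real definition isn’t shown in this post), and <code class="language-plaintext highlighter-rouge">packed</code> and
<code class="language-plaintext highlighter-rouge">recovered</code> are hypothetical names used only for the example:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import Data.Dynamic (Dynamic, fromDynamic, toDyn)
import Data.Maybe (fromMaybe)
import Data.Typeable (Typeable)

-- Assumed stand-in for the post's unsafeFromDyn: project a Dynamic back to a
-- concrete type, failing at runtime if the requested type is wrong.
unsafeFromDyn :: Typeable a => Dynamic -> a
unsafeFromDyn dyn =
  fromMaybe (error "unsafeFromDyn: type mismatch") (fromDynamic dyn)

-- Heterogeneously-typed samples packed into one homogeneous list.
packed :: [Dynamic]
packed = [toDyn True, toDyn (0.25 :: Double)]

-- Recover the values at the types we expect them to have.
recovered :: (Bool, Double)
recovered = (unsafeFromDyn (packed !! 0), unsafeFromDyn (packed !! 1))
</code></pre></div></div>
<p>Asking for the wrong type, say a <code class="language-plaintext highlighter-rouge">Double</code> from the first element, compiles fine
and only fails when the value is forced, which is exactly the kind of caveat
that comes with dynamic typing.</p>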
<p>For the second case, I just use the <code class="language-plaintext highlighter-rouge">ST</code> monad, along with manual state
snapshotting, to execute and iterate a random number generator. Pretty
simple.</p>
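<p>The restore/step/snapshot pattern can be sketched with nothing but base. Note
the assumptions: the <code class="language-plaintext highlighter-rouge">Seed</code> type and the linear congruential <code class="language-plaintext highlighter-rouge">step</code> function
below are toy stand-ins for <code class="language-plaintext highlighter-rouge">MWC.Seed</code> and the MWC generator, and
<code class="language-plaintext highlighter-rouge">sampleUniformPurely</code> is a hypothetical name, not the <code class="language-plaintext highlighter-rouge">samplePurely</code> used in this
post:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import Control.Monad.ST (ST, runST)
import Data.STRef (modifySTRef', newSTRef, readSTRef)
import Data.Word (Word64)

-- A toy generator state; the post uses MWC.Seed from mwc-random instead.
newtype Seed = Seed Word64 deriving (Eq, Show)

-- One step of a 64-bit linear congruential generator (Knuth's MMIX constants).
step :: Word64 -> Word64
step s = 6364136223846793005 * s + 1442695040888963407

-- Restore a mutable generator from a seed, draw one uniform variate in [0, 1),
-- then snapshot the generator state so the caller can thread it onward.
sampleUniformPurely :: Seed -> ST s (Double, Seed)
sampleUniformPurely (Seed s0) = do
  ref <- newSTRef s0
  modifySTRef' ref step
  s1 <- readSTRef ref
  -- keep the top 53 bits and scale by 2^53
  let u = fromIntegral (s1 `div` 2048) / 9007199254740992
  return (u, Seed s1)
</code></pre></div></div>
<p>Running <code class="language-plaintext highlighter-rouge">runST (sampleUniformPurely (Seed 42))</code> twice gives the same value and
the same successor seed, which is what lets a node carry its own generator
state around as plain data.</p>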
<p>Also: in terms of efficiency, keeping a node’s history on-site at each
execution falls into the ‘completely insane’ category, but let’s not worry much
about efficiency right now. Prototypes gonna prototype and all that.</p>
<p>Anyway.</p>
<p>Given this <code class="language-plaintext highlighter-rouge">initialize</code> function, we can transform a terminating program into a
running program by simple recursion. Again, we can only transform programs
with type <code class="language-plaintext highlighter-rouge">Terminating a</code> because we need to rule out the case of ever visiting
the <code class="language-plaintext highlighter-rouge">Pure</code> constructor of <code class="language-plaintext highlighter-rouge">Free</code>. We handle that by the <code class="language-plaintext highlighter-rouge">absurd</code> function
provided by <code class="language-plaintext highlighter-rouge">Data.Void</code>:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">execute</span> <span class="o">::</span> <span class="kt">Typeable</span> <span class="n">a</span> <span class="o">=></span> <span class="kt">Terminating</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">Execution</span> <span class="n">a</span>
<span class="n">execute</span> <span class="o">=</span> <span class="n">annotate</span> <span class="n">defaultSeed</span> <span class="kr">where</span>
  <span class="n">defaultSeed</span> <span class="o">=</span> <span class="p">(</span><span class="mi">42</span><span class="p">,</span> <span class="mi">108512</span><span class="p">)</span>
  <span class="n">annotate</span> <span class="n">seeds</span> <span class="n">term</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">term</span> <span class="kr">of</span>
    <span class="kt">Pure</span> <span class="n">r</span> <span class="o">-></span> <span class="n">absurd</span> <span class="n">r</span>
    <span class="kt">Free</span> <span class="n">instruction</span> <span class="o">-></span>
      <span class="kr">let</span> <span class="p">(</span><span class="n">nextSeeds</span><span class="p">,</span> <span class="n">generator</span><span class="p">)</span> <span class="o">=</span> <span class="n">xorshift</span> <span class="n">seeds</span>
          <span class="n">seed</span> <span class="o">=</span> <span class="kt">MWC</span><span class="o">.</span><span class="n">toSeed</span> <span class="p">(</span><span class="kt">V</span><span class="o">.</span><span class="n">singleton</span> <span class="n">generator</span><span class="p">)</span>
          <span class="n">node</span> <span class="o">=</span> <span class="n">initialize</span> <span class="n">seed</span> <span class="n">instruction</span>
      <span class="kr">in</span> <span class="n">node</span> <span class="o">:<</span> <span class="n">fmap</span> <span class="p">(</span><span class="n">annotate</span> <span class="n">nextSeeds</span><span class="p">)</span> <span class="n">instruction</span>
</code></pre></div></div>
<p>And there you have it - <code class="language-plaintext highlighter-rouge">execute</code> takes a terminating program as input and
returns a running program - an execution trace - as output. The syntax tree we
had previously gets turned into something like this:</p>
<p><img src="/images/mixture_ast_ann.png" alt="" class="center-image" /></p>
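<p>The same recursion can be seen in miniature without any of the sampling
machinery. The following is a self-contained sketch under stated assumptions:
<code class="language-plaintext highlighter-rouge">Free</code> and <code class="language-plaintext highlighter-rouge">Cofree</code> are redefined locally rather than imported, and the toy
<code class="language-plaintext highlighter-rouge">InstrF</code> instruction set, the <code class="language-plaintext highlighter-rouge">labels</code> helper, and the string annotations are all
hypothetical stand-ins for <code class="language-plaintext highlighter-rouge">ModelF</code> and <code class="language-plaintext highlighter-rouge">Node</code>:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}
import Data.Void (Void, absurd)

-- Minimal local versions of the types from the 'free' package.
data Free f a = Pure a | Free (f (Free f a))
data Cofree f a = a :< f (Cofree f a)

-- A toy instruction set: every instruction except Stop has a continuation.
data InstrF r = Emit Int r | Stop deriving Functor

-- No value of type Void exists, so such a program can only end via Stop.
type Terminating = Free InstrF Void

-- Annotate each instruction with a label derived from it, recursively.
annotate :: Terminating -> Cofree InstrF String
annotate term = case term of
  Pure r           -> absurd r
  Free instruction -> label instruction :< fmap annotate instruction
  where
    label (Emit n _) = "emit " ++ show n
    label Stop       = "stop"

-- Collect the annotations of this linear program in order.
labels :: Cofree InstrF String -> [String]
labels (a :< rest) = a : case rest of
  Emit _ next -> labels next
  Stop        -> []

program :: Terminating
program = Free (Emit 1 (Free (Emit 2 (Free Stop))))
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">labels (annotate program)</code> evaluates to <code class="language-plaintext highlighter-rouge">["emit 1","emit 2","stop"]</code>: the
shape of the program is untouched, and each instruction simply gains an
annotation.</p>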
<h2 id="perturbing-running-programs">Perturbing Running Programs</h2>
<p>Given an execution trace, we’re able to step through it sequentially and
investigate the program’s internal state. But to do inference we also need to
<em>modify</em> it. What’s the answer here?</p>
<p>Just as <code class="language-plaintext highlighter-rouge">Free</code> has a monadic structure that allows us to write embedded
programs using built-in monadic combinators and do-notation, <code class="language-plaintext highlighter-rouge">Cofree</code> has a
<em>comonadic</em> structure that is amenable to use with the various comonadic
combinators found in <code class="language-plaintext highlighter-rouge">Control.Comonad</code>. The most important for our purposes is
the comonadic ‘extend’ operation that’s dual to monad’s ‘bind’:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">extend</span> <span class="o">::</span> <span class="kt">Comonad</span> <span class="n">w</span> <span class="o">=></span> <span class="p">(</span><span class="n">w</span> <span class="n">a</span> <span class="o">-></span> <span class="n">b</span><span class="p">)</span> <span class="o">-></span> <span class="n">w</span> <span class="n">a</span> <span class="o">-></span> <span class="n">w</span> <span class="n">b</span>
<span class="n">extend</span> <span class="n">f</span> <span class="o">=</span> <span class="n">fmap</span> <span class="n">f</span> <span class="o">.</span> <span class="n">duplicate</span>
</code></pre></div></div>
<p>To perturb a running program, we can thus write a function that perturbs any
given annotated node, and then <code class="language-plaintext highlighter-rouge">extend</code> it over the entire execution trace.</p>
<p>The <code class="language-plaintext highlighter-rouge">perturbNode</code> function can be similar to the <code class="language-plaintext highlighter-rouge">initialize</code> function from
earlier; it describes how to perturb every node based on the instruction found
there:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">perturbNode</span> <span class="o">::</span> <span class="kt">Execution</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">Node</span>
<span class="n">perturbNode</span> <span class="p">(</span><span class="n">node</span><span class="o">@</span><span class="kt">Node</span> <span class="p">{</span><span class="o">..</span><span class="p">}</span> <span class="o">:<</span> <span class="n">cons</span><span class="p">)</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">cons</span> <span class="kr">of</span>
  <span class="kt">BernoulliF</span> <span class="n">p</span> <span class="kr">_</span> <span class="o">-></span> <span class="n">runST</span> <span class="o">$</span> <span class="kr">do</span>
    <span class="p">(</span><span class="n">nvalue</span><span class="p">,</span> <span class="n">nseed</span><span class="p">)</span> <span class="o"><-</span> <span class="n">samplePurely</span> <span class="p">(</span><span class="kt">Prob</span><span class="o">.</span><span class="n">bernoulli</span> <span class="n">p</span><span class="p">)</span> <span class="n">nodeSeed</span>
    <span class="kr">let</span> <span class="n">nscore</span> <span class="o">=</span> <span class="n">logDensityBernoulli</span> <span class="n">p</span> <span class="p">(</span><span class="n">unsafeFromDyn</span> <span class="n">nvalue</span><span class="p">)</span>
    <span class="n">return</span> <span class="o">$!</span> <span class="kt">Node</span> <span class="n">nscore</span> <span class="n">nvalue</span> <span class="n">nseed</span> <span class="n">nodeHistory</span>
  <span class="kt">BetaF</span> <span class="n">a</span> <span class="n">b</span> <span class="kr">_</span> <span class="o">-></span> <span class="n">runST</span> <span class="o">$</span> <span class="kr">do</span>
    <span class="p">(</span><span class="n">nvalue</span><span class="p">,</span> <span class="n">nseed</span><span class="p">)</span> <span class="o"><-</span> <span class="n">samplePurely</span> <span class="p">(</span><span class="kt">Prob</span><span class="o">.</span><span class="n">beta</span> <span class="n">a</span> <span class="n">b</span><span class="p">)</span> <span class="n">nodeSeed</span>
    <span class="kr">let</span> <span class="n">nscore</span> <span class="o">=</span> <span class="n">logDensityBeta</span> <span class="n">a</span> <span class="n">b</span> <span class="p">(</span><span class="n">unsafeFromDyn</span> <span class="n">nvalue</span><span class="p">)</span>
    <span class="n">return</span> <span class="o">$!</span> <span class="kt">Node</span> <span class="n">nscore</span> <span class="n">nvalue</span> <span class="n">nseed</span> <span class="n">nodeHistory</span>
  <span class="o">...</span>
</code></pre></div></div>
<p>Note that this is a very crude way to perturb nodes - we’re just sampling from
whatever distribution we find at each one. A more refined procedure would
sample from each node on a more <em>local</em> basis, sampling from its respective
domain in a neighbourhood of its current location. For example, to perturb a
<code class="language-plaintext highlighter-rouge">BetaF</code> node we might sample from a tiny Gaussian bubble around its current
location, repeating the process if we happen to ‘fall off’ the support. I’ll
leave matters like that for another post.</p>
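<p>To make the ‘Gaussian bubble’ idea a bit more concrete, here’s a rough sketch
of a local proposal for a value supported on the open unit interval. It is not
part of the implementation in this post: <code class="language-plaintext highlighter-rouge">perturbLocal</code> is a hypothetical name,
and the standard-normal noise is assumed to be supplied by the caller as a list
so that the example stays pure:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Propose a local move for a value on the open unit interval: add scaled
-- noise to the current location, and retry with the next draw whenever the
-- candidate falls off the support. Returns the accepted candidate together
-- with the unused noise, or Nothing if the noise runs out first.
perturbLocal :: Double -> Double -> [Double] -> Maybe (Double, [Double])
perturbLocal radius current = go where
  go [] = Nothing
  go (z : zs)
    | candidate > 0 && candidate < 1 = Just (candidate, zs)
    | otherwise                      = go zs
    where candidate = current + radius * z
</code></pre></div></div>
<p>With <code class="language-plaintext highlighter-rouge">perturbLocal 0.05 0.98 [1.2, -0.4]</code>, the first candidate (1.04) falls
outside the interval and is discarded, while the second (about 0.96) is kept.
One thing to keep in mind is that retrying turns the proposal into a truncated
Gaussian, which is no longer symmetric near the boundary, so an acceptance rule
would have to account for that.</p>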
<p>Perturbing an entire trace is then as easy as I claimed it to be:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">perturb</span> <span class="o">::</span> <span class="kt">Execution</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">Execution</span> <span class="n">a</span>
<span class="n">perturb</span> <span class="o">=</span> <span class="n">extend</span> <span class="n">perturbNode</span>
</code></pre></div></div>
<p>For some comonadic intuition: when we ‘extend’ a function over an execution,
the trace itself gets ‘duplicated’ in a comonadic context. Each node in the
program becomes annotated with a view of <em>the rest of the execution trace</em> from
that point forward. It can be difficult to visualize at first, but I reckon
the following image is pretty faithful:</p>
<p><img src="/images/mixture_ast_duplicate.png" alt="" class="center-image" /></p>
<p>Each annotation then has <code class="language-plaintext highlighter-rouge">perturbNode</code> applied to it, which reduces the trace
back to the standard annotated version we saw before.</p>
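<p>A tiny worked example may help here. The sketch below uses a local copy of
<code class="language-plaintext highlighter-rouge">Cofree</code> with <code class="language-plaintext highlighter-rouge">Maybe</code> as the branching functor (so a trace is just a non-empty
list), and defines <code class="language-plaintext highlighter-rouge">duplicate</code> and <code class="language-plaintext highlighter-rouge">extend</code> by hand rather than importing them;
<code class="language-plaintext highlighter-rouge">sumFrom</code>, <code class="language-plaintext highlighter-rouge">toList</code>, and <code class="language-plaintext highlighter-rouge">trace</code> are hypothetical names for the example:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}

-- Local copy of Cofree; with f = Maybe this is a non-empty list.
data Cofree f a = a :< f (Cofree f a) deriving Functor

-- Annotate every node with the sub-trace that starts at that node.
duplicate :: Functor f => Cofree f a -> Cofree f (Cofree f a)
duplicate w@(_ :< rest) = w :< fmap duplicate rest

extend :: Functor f => (Cofree f a -> b) -> Cofree f a -> Cofree f b
extend f = fmap f . duplicate

-- A function that looks at a node *and* everything after it.
sumFrom :: Cofree Maybe Int -> Int
sumFrom (a :< rest) = a + maybe 0 sumFrom rest

toList :: Cofree Maybe a -> [a]
toList (a :< rest) = a : maybe [] toList rest

trace :: Cofree Maybe Int
trace = 1 :< Just (2 :< Just (3 :< Nothing))
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">toList (extend sumFrom trace)</code> is <code class="language-plaintext highlighter-rouge">[6,5,3]</code>: each node has been replaced by
a summary of the trace from that point forward, and the shape of the trace is
the same as before.</p>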
<h2 id="iterating-the-markov-chain">Iterating the Markov Chain</h2>
<p>So: to move around in parameter space, we’ll propose state changes by
perturbing the current state, and then accept or reject proposals according to
local economic conditions.</p>
<p>If you already have no idea what I’m talking about, then the phrase ‘local
economic conditions’ probably didn’t help you much. But it’s a useful analogy
to have in one’s head. Each state in parameter space has a cost associated
with it - the cost of generating the observations that we’re conditioning on
while doing inference. If certain parameter values yield a data model that is
unlikely to generate the provided observations, then those observations will be
<em>expensive</em> to generate when measured in terms of log-likelihood. Parameter
values that yield data models more likely to generate the supplied observations
will be comparatively cheaper.</p>
<p>If a proposed execution trace is significantly cheaper than the trace we’re
currently at, then we usually want to move to it. We allow some randomness in
our decision to keep everything nice and <a href="/on-measurability">measure</a>-preserving.</p>
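<p>One common way to turn that into a number is the textbook Metropolis rule.
The sketch below is an assumption on my part rather than the code elided from
this post: it takes scores on the log scale (where a larger score means a
cheaper state), assumes a symmetric proposal, and uses the hypothetical name
<code class="language-plaintext highlighter-rouge">acceptProb</code>:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Textbook Metropolis acceptance probability for log-scale scores, where a
-- larger score means a cheaper (more plausible) state. Assumes a symmetric
-- proposal; not taken from the post's elided code.
acceptProb :: Double -> Double -> Double
acceptProb currentScore proposalScore =
  min 1 (exp (proposalScore - currentScore))
</code></pre></div></div>
<p>A move to a cheaper state is always accepted (<code class="language-plaintext highlighter-rouge">acceptProb (-10) (-8)</code> is 1),
while a move from a score of -8 to a score of -10 is accepted with probability
exp(-2), or roughly 0.135.</p>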
<p>We can thus construct the conditional distribution over execution traces using
the following <code class="language-plaintext highlighter-rouge">invert</code> function, using the same nomenclature as the rejection
sampler we used previously. To focus on the main points, I’ll elide some of
its body:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">invert</span>
  <span class="o">::</span> <span class="p">(</span><span class="kt">Eq</span> <span class="n">a</span><span class="p">,</span> <span class="kt">Typeable</span> <span class="n">a</span><span class="p">,</span> <span class="kt">Typeable</span> <span class="n">b</span><span class="p">)</span>
  <span class="o">=></span> <span class="kt">Int</span> <span class="o">-></span> <span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">-></span> <span class="kt">Model</span> <span class="n">b</span> <span class="o">-></span> <span class="p">(</span><span class="n">b</span> <span class="o">-></span> <span class="n">a</span> <span class="o">-></span> <span class="kt">Double</span><span class="p">)</span>
  <span class="o">-></span> <span class="kt">Model</span> <span class="p">(</span><span class="kt">Execution</span> <span class="n">b</span><span class="p">)</span>
<span class="n">invert</span> <span class="n">epochs</span> <span class="n">obs</span> <span class="n">prior</span> <span class="n">ll</span> <span class="o">=</span> <span class="n">loop</span> <span class="n">epochs</span> <span class="p">(</span><span class="n">execute</span> <span class="n">terminated</span><span class="p">)</span> <span class="kr">where</span>
  <span class="n">terminated</span> <span class="o">=</span> <span class="n">prior</span> <span class="o">>>=</span> <span class="n">dirac</span>
  <span class="n">loop</span> <span class="n">n</span> <span class="n">current</span>
    <span class="o">|</span> <span class="n">n</span> <span class="o">==</span> <span class="mi">0</span> <span class="o">=</span> <span class="n">return</span> <span class="n">current</span>
    <span class="o">|</span> <span class="n">otherwise</span> <span class="o">=</span> <span class="kr">do</span>
        <span class="kr">let</span> <span class="n">proposal</span> <span class="o">=</span> <span class="n">perturb</span> <span class="n">current</span>
        <span class="c1">-- calculate costs and movement probability here</span>
        <span class="n">accept</span> <span class="o"><-</span> <span class="n">bernoulli</span> <span class="n">prob</span>
        <span class="kr">let</span> <span class="n">next</span> <span class="o">=</span> <span class="kr">if</span> <span class="n">accept</span> <span class="kr">then</span> <span class="n">proposal</span> <span class="kr">else</span> <span class="n">stepGenerators</span> <span class="n">current</span>
        <span class="n">loop</span> <span class="p">(</span><span class="n">pred</span> <span class="n">n</span><span class="p">)</span> <span class="p">(</span><span class="n">snapshot</span> <span class="n">next</span><span class="p">)</span>
</code></pre></div></div>
<p>There are a few things to comment on here.</p>
<p>First, notice how the return type of <code class="language-plaintext highlighter-rouge">invert</code> is <code class="language-plaintext highlighter-rouge">Model (Execution b)</code>? Using
the semantics of our embedded language, it’s literally a standard model over
execution traces. The above function returns a first-class value that is
completely uninterpreted and abstract. Cool.</p>
<p>We’re also dealing with things a little differently from the rejection sampler
that we built previously. Here, the data model is expressed by a <em>cost
function</em>; that is, a function that takes a parameter value and observation as
input, and returns the cost of generating the observation (conditional on the
supplied parameter value) as output. This is the approach used in the
excellent <a href="https://www.repository.cam.ac.uk/bitstream/handle/1810/249132/Scibior%20et%20al%202015%20Haskell%20Symposium%202015.pdf?sequence=1&isAllowed=y">Practical Probabilistic Programming with Monads</a> paper by Adam
Scibior et al. and also mentioned by Dan Roy in <a href="https://www.youtube.com/watch?v=TFXcVlKqPlM">his recent talk</a> at the
Simons Institute. Ideally we’d just reify the cost function here from the
description of a model directly (to keep the interface similar to the one used
in the rejection sampler implementation), but I haven’t yet found a way to do
this in a type-safe fashion.</p>
<p>Regardless of whether or not we accept a proposed move, we need to snapshot the
current value of each node and add it to that node’s history. This can be done
using another comonadic extend:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">snapshotValue</span> <span class="o">::</span> <span class="kt">Cofree</span> <span class="n">f</span> <span class="kt">Node</span> <span class="o">-></span> <span class="kt">Node</span>
<span class="n">snapshotValue</span> <span class="p">(</span><span class="kt">Node</span> <span class="p">{</span><span class="o">..</span><span class="p">}</span> <span class="o">:<</span> <span class="n">cons</span><span class="p">)</span> <span class="o">=</span> <span class="kt">Node</span> <span class="p">{</span> <span class="n">nodeHistory</span> <span class="o">=</span> <span class="n">history</span><span class="p">,</span> <span class="o">..</span> <span class="p">}</span> <span class="kr">where</span>
  <span class="n">history</span> <span class="o">=</span> <span class="n">nodeValue</span> <span class="o">:</span> <span class="n">nodeHistory</span>

<span class="n">snapshot</span> <span class="o">::</span> <span class="kt">Functor</span> <span class="n">f</span> <span class="o">=></span> <span class="kt">Cofree</span> <span class="n">f</span> <span class="kt">Node</span> <span class="o">-></span> <span class="kt">Cofree</span> <span class="n">f</span> <span class="kt">Node</span>
<span class="n">snapshot</span> <span class="o">=</span> <span class="n">extend</span> <span class="n">snapshotValue</span>
</code></pre></div></div>
<p>The other point of note is minor, but an extremely easy detail to overlook.
Since we’re handling random value generation at each node purely, using on-site
PRNGs, we need to iterate the generators forward a step in the event that we
don’t accept a proposal. Otherwise we’d propose a new execution based on the
same generator states that we’d used previously! For now I’ll just iterate the
generators by forcing a sample of a uniform variate at each node, and then
throwing away the result. To do this we can use the now-standard comonadic
pattern:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">stepGenerator</span> <span class="o">::</span> <span class="kt">Cofree</span> <span class="n">f</span> <span class="kt">Node</span> <span class="o">-></span> <span class="kt">Node</span>
<span class="n">stepGenerator</span> <span class="p">(</span><span class="kt">Node</span> <span class="p">{</span><span class="o">..</span><span class="p">}</span> <span class="o">:<</span> <span class="n">cons</span><span class="p">)</span> <span class="o">=</span> <span class="n">runST</span> <span class="o">$</span> <span class="kr">do</span>
  <span class="p">(</span><span class="n">nval</span><span class="p">,</span> <span class="n">nseed</span><span class="p">)</span> <span class="o"><-</span> <span class="n">samplePurely</span> <span class="p">(</span><span class="kt">Prob</span><span class="o">.</span><span class="n">beta</span> <span class="mi">1</span> <span class="mi">1</span><span class="p">)</span> <span class="n">nodeSeed</span>
  <span class="n">return</span> <span class="kt">Node</span> <span class="p">{</span><span class="n">nodeSeed</span> <span class="o">=</span> <span class="n">nseed</span><span class="p">,</span> <span class="o">..</span><span class="p">}</span>

<span class="n">stepGenerators</span> <span class="o">::</span> <span class="kt">Functor</span> <span class="n">f</span> <span class="o">=></span> <span class="kt">Cofree</span> <span class="n">f</span> <span class="kt">Node</span> <span class="o">-></span> <span class="kt">Cofree</span> <span class="n">f</span> <span class="kt">Node</span>
<span class="n">stepGenerators</span> <span class="o">=</span> <span class="n">extend</span> <span class="n">stepGenerator</span>
</code></pre></div></div>
<h2 id="inspecting-execution-traces">Inspecting Execution Traces</h2>
<p>Alright, so let’s see how this all works. Let’s write a model, condition it
on some observations, and do inference.</p>
<p>We’ll choose our simple Gaussian mixture model from earlier, where the mixing
probability follows a beta distribution, and cluster assignment itself follows
a Bernoulli distribution. We thus choose the ‘leftmost’ component of the
mixture with the appropriate mixture probability.</p>
<p>We can break the mixture model up as follows:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">prior</span> <span class="o">::</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Model</span> <span class="kt">Bool</span>
<span class="n">prior</span> <span class="n">a</span> <span class="n">b</span> <span class="o">=</span> <span class="kr">do</span>
  <span class="n">p</span> <span class="o"><-</span> <span class="n">beta</span> <span class="n">a</span> <span class="n">b</span>
  <span class="n">bernoulli</span> <span class="n">p</span>

<span class="n">likelihood</span> <span class="o">::</span> <span class="kt">Bool</span> <span class="o">-></span> <span class="kt">Model</span> <span class="kt">Double</span>
<span class="n">likelihood</span> <span class="n">left</span>
  <span class="o">|</span> <span class="n">left</span> <span class="o">=</span> <span class="n">normal</span> <span class="p">(</span><span class="n">negate</span> <span class="mi">2</span><span class="p">)</span> <span class="mf">0.5</span>
  <span class="o">|</span> <span class="n">otherwise</span> <span class="o">=</span> <span class="n">normal</span> <span class="mi">2</span> <span class="mf">0.5</span>
</code></pre></div></div>
<p>Let’s take a look at some samples from the marginal distribution. This time
I’ll flip things and assign hyperparameters of 3 and 2 for the prior:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> simulate (toSampler (prior 3 2 >>= likelihood))
</code></pre></div></div>
<p><img src="/images/mixture_trace.png" alt="" class="center-image" /></p>
<p>It looks like we’re slightly more likely to sample from the left mixture
component than the right one. Again, this makes sense - the mean of a beta(3,
2) distribution is 0.6.</p>
<p>Now, what about inference? I’ll define the conditional model as follows:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">posterior</span> <span class="o">::</span> <span class="kt">Model</span> <span class="p">(</span><span class="kt">Execution</span> <span class="kt">Bool</span><span class="p">)</span>
<span class="n">posterior</span> <span class="o">=</span> <span class="n">invert</span> <span class="mi">1000</span> <span class="n">obs</span> <span class="p">(</span><span class="n">prior</span> <span class="mi">3</span> <span class="mi">2</span><span class="p">)</span> <span class="n">ll</span> <span class="kr">where</span>
  <span class="n">obs</span> <span class="o">=</span> <span class="p">[</span> <span class="o">-</span><span class="mf">1.7</span><span class="p">,</span> <span class="o">-</span><span class="mf">1.8</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.01</span><span class="p">,</span> <span class="o">-</span><span class="mf">2.4</span>
        <span class="p">,</span> <span class="mf">1.9</span><span class="p">,</span> <span class="mf">1.8</span>
        <span class="p">]</span>
  <span class="n">ll</span> <span class="n">left</span>
    <span class="o">|</span> <span class="n">left</span> <span class="o">=</span> <span class="n">logDensityNormal</span> <span class="p">(</span><span class="n">negate</span> <span class="mi">2</span><span class="p">)</span> <span class="mf">0.5</span>
    <span class="o">|</span> <span class="n">otherwise</span> <span class="o">=</span> <span class="n">logDensityNormal</span> <span class="mi">2</span> <span class="mf">0.5</span>
</code></pre></div></div>
<p>Here we have four observations that presumably arise from the leftmost
component, and only two that match up with the rightmost. Note also that I’ve
replaced the <code class="language-plaintext highlighter-rouge">likelihood</code> model with its appropriate cost function due to
reasons I mentioned in the last section. (It would be easy to reify <em>this</em>
model as its cost function, but doing it for general models is trickier.)</p>
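<p>To get a feel for what the cost function says about these observations, here’s
a small standalone sketch. It makes a few assumptions: <code class="language-plaintext highlighter-rouge">logDensityNormal</code> is
taken to be the usual Gaussian log-density parameterized by mean and standard
deviation (its definition isn’t shown in this post), the observations are
treated as independent so that their costs add, and <code class="language-plaintext highlighter-rouge">totalScore</code> is a
hypothetical helper:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Assumed form of a Gaussian log-density, parameterized by mean and standard
-- deviation; the post's logDensityNormal may differ in detail.
logDensityNormal :: Double -> Double -> Double -> Double
logDensityNormal mu sd x =
  negate (0.5 * log (2 * pi)) - log sd
    - (x - mu) ^ (2 :: Int) / (2 * sd ^ (2 :: Int))

-- The cost function, as in the posterior definition.
ll :: Bool -> Double -> Double
ll left
  | left      = logDensityNormal (negate 2) 0.5
  | otherwise = logDensityNormal 2 0.5

obs :: [Double]
obs = [-1.7, -1.8, -2.01, -2.4, 1.9, 1.8]

-- Total log-likelihood of all observations under one component assignment,
-- assuming the observations are independent.
totalScore :: Bool -> Double
totalScore left = sum (map (ll left) obs)
</code></pre></div></div>
<p>Under these assumptions <code class="language-plaintext highlighter-rouge">totalScore True</code> is considerably larger than
<code class="language-plaintext highlighter-rouge">totalScore False</code>; that is, the observations are much cheaper to generate when
the leftmost component is chosen.</p>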
<p>Anyway, let’s sample from the conditional distribution:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> simulate (toSampler posterior)
</code></pre></div></div>
<p>Sampling returns a running program, of course, and we can step through it to
examine its structure. We can use the supplied values recorded at each node
to ‘automatically’ step through execution, or we can supply our own values to
investigate arbitrary branches.</p>
<p>The conditional distribution we’ve found over the mixing probability is as
follows:</p>
<p><img src="/images/post_p.png" alt="" class="center-image" /></p>
<p>Looks like we’re in the right ballpark.</p>
<p>We can examine the traces of other elements of the program as well. Here’s the
recorded distribution over component assignments, for example - note that the
rightmost bar <em>here</em> corresponds to the leftmost component in the mixture:</p>
<p><img src="/images/post_b.png" alt="" class="center-image" /></p>
<p>You can see that whenever we wandered into the rightmost component, we’d
swiftly wind up jumping back out of it:</p>
<p><img src="/images/post_b_ts.png" alt="" class="center-image" /></p>
<h2 id="comments">Comments</h2>
<p>This is a fun take on probabilistic programming. In particular I find a few
aspects of the whole setup to be pretty attractive:</p>
<p>We use a primitive, limited instruction set to parameterize both programs - via
<code class="language-plaintext highlighter-rouge">Free</code> - and running programs - via <code class="language-plaintext highlighter-rouge">Cofree</code>. These off-the-shelf recursive
types are used to wrap things up and provide most of our required control flow
automatically. It’s easy to transparently add structure to embedded programs
built in this way; for example, we can statically <a href="/encoding-independence-statically">encode independence</a>
by replacing our <code class="language-plaintext highlighter-rouge">ModelF a</code> type with something like:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">type</span> <span class="kt">InstructionF</span> <span class="n">a</span> <span class="o">=</span> <span class="kt">Coproduct</span> <span class="p">(</span><span class="kt">ModelF</span> <span class="n">a</span><span class="p">)</span> <span class="p">(</span><span class="kt">Ap</span> <span class="p">(</span><span class="kt">ModelF</span> <span class="n">a</span><span class="p">))</span>
</code></pre></div></div>
<p>This can be hidden from the user so that we’re left with the same simple
monadic syntax we presently enjoy, but we also get to take independence into
account when performing inference, or any other structural interpretation for
that matter.</p>
<p>When it comes to inference, the program representation is completely separate
from whatever inference backend we choose to augment it with. We can deal with
traces as <em>first-class</em> values that can be directly stored, inspected,
manipulated, and so on. And everything is done in a typed and
purely-functional framework. I’ve used dynamic typing functionality from
<code class="language-plaintext highlighter-rouge">Data.Dynamic</code> to store values in execution traces here, but we could similarly
just define a concrete <code class="language-plaintext highlighter-rouge">Value</code> type with the appropriate constructors for
integers, doubles, bools, etc., and use <em>that</em> to store everything.</p>
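<p>As a rough sketch of that second option, a closed <code class="language-plaintext highlighter-rouge">Value</code> type might look something like the following. The names <code class="language-plaintext highlighter-rouge">Value</code>, <code class="language-plaintext highlighter-rouge">IsValue</code>, <code class="language-plaintext highlighter-rouge">inject</code>, and <code class="language-plaintext highlighter-rouge">project</code> are made up for illustration and don't appear in the code for this post:</p>

```haskell
-- Hypothetical closed universe of values that a trace node may hold.
data Value
  = VInt Int
  | VDouble Double
  | VBool Bool
  deriving (Eq, Show)

-- Hypothetical class for moving between ordinary types and 'Value'.
class IsValue a where
  inject  :: a -> Value
  project :: Value -> Maybe a

instance IsValue Int where
  inject = VInt
  project (VInt n) = Just n
  project _        = Nothing

instance IsValue Double where
  inject = VDouble
  project (VDouble x) = Just x
  project _           = Nothing

instance IsValue Bool where
  inject = VBool
  project (VBool b) = Just b
  project _         = Nothing
```

<p>Compared to <code class="language-plaintext highlighter-rouge">Data.Dynamic</code>, every new primitive type needs its own constructor and instance, but in exchange the set of things a trace can contain is fixed and can be pattern matched on directly.</p>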
<p>At the same time, this is a pretty early concept - doing inference
<em>efficiently</em> in this setting is another matter, and there are a couple of
computational and statistical issues here that need to be ironed out to make
further progress.</p>
<p>The current way I’ve organized Markov chain generation and iteration is just
woefully inefficient. Storing the history of each node on-site is needlessly
costly and I’m sure results in a ton of unnecessary allocation. On a semantic
level, it also ‘complects’ state and identity: why, after all, should a single
execution trace know anything about traces that preceded it? Clearly this
should be accumulated in another data structure. There is a lot of other
low-hanging fruit around strictness and PRNG management as well.</p>
<p>From a more statistical angle, the present implementation does a poor job when
it comes to perturbing execution traces. Some changes - such as improving the
proposal mechanism for a given instruction - are easy to implement, and
representing distributions as instructions indeed makes it easy to tailor local
proposal distributions in a context-independent way. But another problem is
that, by using a ‘blunt’ comonadic <code class="language-plaintext highlighter-rouge">extend</code>, we perturb an execution by
perturbing <em>every node</em> in it. In general it’s better to make small
perturbations rather than large ones to ensure a reasonable acceptance ratio,
but to do that we’d need to perturb single nodes (or at least subsets of nodes)
at a time.</p>
<p>There <em>may</em> be some inroads here via comonad transformers like <code class="language-plaintext highlighter-rouge">StoreT</code> or
lenses that would allow us to zoom in on a particular node and perturb it,
rather than perturbing everything at once. But my comonad-fu is not yet quite
at the required level to evaluate this, so I’ll come back to that idea some
other time.</p>
<p>I’m interested in playing with this concept some more in the future, though I’m
not yet sure how much I expect it to be a tenable way to do inference at scale.
If you’re interested in playing with it, I’ve dumped the code from this post
into <a href="https://gist.github.com/jtobin/497e688359c17d1fdf9215868a300b55">this gist</a>.</p>
<p>Thanks to Niffe Hermansson and Fredrik Olsen for reviewing a draft of this
post and providing helpful comments.</p>
A Simple Embedded Probabilistic Programming Language2016-10-17T00:00:00+04:00https://jtobin.io/simple-probabilistic-programming<p>What does a dead-simple probabilistic programming language look like? The
simplest thing I can imagine involves three components:</p>
<ul>
<li>A representation for probabilistic models.</li>
<li>A way to simulate from those models (‘forward’ sampling).</li>
<li>A way to sample from a conditional model (‘backward’ sampling).</li>
</ul>
<p>Rob Zinkov <a href="http://www.zinkov.com/posts/2015-08-25-building-a-probabilisitic-interpreter/">wrote an article</a> on this type of thing around a year ago,
and Dan Roy recently <a href="https://www.youtube.com/watch?v=TFXcVlKqPlM">gave a talk</a> on the topic as well. In the spirit
of unabashed unoriginality, I’ll give a sort of composite example of the two.
Most of the material here comes directly from Dan’s talk; definitely check it
out if you’re curious about this whole probabilistic programming mumbo jumbo.</p>
<p>Let’s whip together a highly-structured, typed, embedded probabilistic
programming language - the core of which will encompass a tiny amount of code.</p>
<p>Some preliminaries - note that you’ll need my simple little
<a href="https://hackage.haskell.org/package/mwc-probability">mwc-probability</a> library handy for when it comes time to do sampling:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">{-# LANGUAGE DeriveFunctor #-}</span>
<span class="cp">{-# LANGUAGE LambdaCase #-}</span>
<span class="kr">import</span> <span class="nn">Control.Monad</span>
<span class="kr">import</span> <span class="nn">Control.Monad.Free</span>
<span class="kr">import</span> <span class="k">qualified</span> <span class="nn">System.Random.MWC.Probability</span> <span class="k">as</span> <span class="n">MWC</span>
</code></pre></div></div>
<h2 id="representing-probabilistic-models">Representing Probabilistic Models</h2>
<p>Step one is to represent the fundamental constructs found in probabilistic
programs. These are abstract probability distributions; I like to call them
<em>models</em>:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">ModelF</span> <span class="n">r</span> <span class="o">=</span>
<span class="kt">BernoulliF</span> <span class="kt">Double</span> <span class="p">(</span><span class="kt">Bool</span> <span class="o">-></span> <span class="n">r</span><span class="p">)</span>
<span class="o">|</span> <span class="kt">BetaF</span> <span class="kt">Double</span> <span class="kt">Double</span> <span class="p">(</span><span class="kt">Double</span> <span class="o">-></span> <span class="n">r</span><span class="p">)</span>
<span class="kr">deriving</span> <span class="kt">Functor</span>
<span class="kr">type</span> <span class="kt">Model</span> <span class="o">=</span> <span class="kt">Free</span> <span class="kt">ModelF</span>
</code></pre></div></div>
<p>Each foundational probability distribution we want to consider is represented
as a constructor of the <code class="language-plaintext highlighter-rouge">ModelF</code> type. You can think of them as probabilistic
<a href="/tour-of-some-recursive-types">instructions</a>, in a sense. A <code class="language-plaintext highlighter-rouge">Model</code> itself is a program parameterized
by this probabilistic instruction set.</p>
<p>In a more sophisticated implementation you’d probably want to add more
primitives, but you can get pretty far with the beta and Bernoulli
distributions alone. Here are some embedded language terms, only two of which
correspond one-to-one with the constructors themselves:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bernoulli</span> <span class="o">::</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Model</span> <span class="kt">Bool</span>
<span class="n">bernoulli</span> <span class="n">p</span> <span class="o">=</span> <span class="n">liftF</span> <span class="p">(</span><span class="kt">BernoulliF</span> <span class="n">p</span> <span class="n">id</span><span class="p">)</span>
<span class="n">beta</span> <span class="o">::</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Model</span> <span class="kt">Double</span>
<span class="n">beta</span> <span class="n">a</span> <span class="n">b</span> <span class="o">=</span> <span class="n">liftF</span> <span class="p">(</span><span class="kt">BetaF</span> <span class="n">a</span> <span class="n">b</span> <span class="n">id</span><span class="p">)</span>
<span class="n">uniform</span> <span class="o">::</span> <span class="kt">Model</span> <span class="kt">Double</span>
<span class="n">uniform</span> <span class="o">=</span> <span class="n">beta</span> <span class="mi">1</span> <span class="mi">1</span>
<span class="n">binomial</span> <span class="o">::</span> <span class="kt">Int</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Model</span> <span class="kt">Int</span>
<span class="n">binomial</span> <span class="n">n</span> <span class="n">p</span> <span class="o">=</span> <span class="n">fmap</span> <span class="n">count</span> <span class="n">coins</span> <span class="kr">where</span>
<span class="n">count</span> <span class="o">=</span> <span class="n">length</span> <span class="o">.</span> <span class="n">filter</span> <span class="n">id</span>
<span class="n">coins</span> <span class="o">=</span> <span class="n">replicateM</span> <span class="n">n</span> <span class="p">(</span><span class="n">bernoulli</span> <span class="n">p</span><span class="p">)</span>
<span class="n">betaBinomial</span> <span class="o">::</span> <span class="kt">Int</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Model</span> <span class="kt">Int</span>
<span class="n">betaBinomial</span> <span class="n">n</span> <span class="n">a</span> <span class="n">b</span> <span class="o">=</span> <span class="kr">do</span>
<span class="n">p</span> <span class="o"><-</span> <span class="n">beta</span> <span class="n">a</span> <span class="n">b</span>
<span class="n">binomial</span> <span class="n">n</span> <span class="n">p</span>
</code></pre></div></div>
<p>You can build a lot of other useful distributions by just starting from the
beta and Bernoulli as well. And technically I guess the more foundational
distributions to use here would be the <a href="https://en.wikipedia.org/wiki/Dirichlet_distribution">Dirichlet</a> and
<a href="https://en.wikipedia.org/wiki/Categorical_distribution">categorical</a>, of which the beta and Bernoulli are special cases. But I
digress. The point is that other distributions are easy to construct from a
set of reliable primitives; you can check out the old <a href="https://www.cs.cmu.edu/~fp/papers/toplas08.pdf">lambda-naught</a>
paper by Park et al for more examples.</p>
<p>See how <code class="language-plaintext highlighter-rouge">binomial</code> and <code class="language-plaintext highlighter-rouge">betaBinomial</code> are defined? In the case of <code class="language-plaintext highlighter-rouge">binomial</code>
we’re using the property that models have a <em>functorial</em> structure by just
mapping a counting function over the result of a bunch of Bernoulli
random variables. For <code class="language-plaintext highlighter-rouge">betaBinomial</code> we’re directly making use of our monadic
structure, first describing a weight parameter via a beta distribution and then
using it as an input to a binomial distribution.</p>
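<p>To see that a model really is just a program awaiting interpretation, here is a small self-contained sketch that depends only on ‘base’. It hand-rolls a free monad instead of importing <code class="language-plaintext highlighter-rouge">Control.Monad.Free</code>, keeps only a Bernoulli-style instruction, and interprets programs by <em>replaying</em> a fixed list of coin flips rather than sampling. The names <code class="language-plaintext highlighter-rouge">CoinF</code>, <code class="language-plaintext highlighter-rouge">flipCoin</code>, and <code class="language-plaintext highlighter-rouge">replay</code> are hypothetical and exist only for this illustration:</p>

```haskell
{-# LANGUAGE DeriveFunctor #-}

import Control.Monad (replicateM)

-- A hand-rolled free monad, so that this sketch needs only 'base'.
data Free f a = Pure a | Free (f (Free f a))

instance Functor f => Functor (Free f) where
  fmap g (Pure a)  = Pure (g a)
  fmap g (Free fa) = Free (fmap (fmap g) fa)

instance Functor f => Applicative (Free f) where
  pure = Pure
  Pure g  <*> x = fmap g x
  Free fg <*> x = Free (fmap (<*> x) fg)

instance Functor f => Monad (Free f) where
  Pure a  >>= k = k a
  Free fa >>= k = Free (fmap (>>= k) fa)

-- A cut-down instruction set containing a single Bernoulli-style instruction.
data CoinF r = FlipF Double (Bool -> r)
  deriving Functor

type Coin = Free CoinF

flipCoin :: Double -> Coin Bool
flipCoin p = Free (FlipF p Pure)

-- Same shape as 'binomial' above: map a counting function over n flips.
binomial :: Int -> Double -> Coin Int
binomial n p = fmap (length . filter id) (replicateM n (flipCoin p))

-- Hypothetical interpreter: answer each flip from a fixed list of outcomes
-- instead of a random source. Returns Nothing if the list runs out early.
replay :: [Bool] -> Coin a -> Maybe a
replay _        (Pure a)           = Just a
replay []       (Free _)           = Nothing
replay (b : bs) (Free (FlipF _ k)) = replay bs (k b)
```

<p>Evaluating <code class="language-plaintext highlighter-rouge">replay [True, False, True] (binomial 3 0.5)</code> walks the program one instruction at a time and counts the two <code class="language-plaintext highlighter-rouge">True</code> outcomes it was fed. The model itself never mentions randomness; that only enters when we pick an interpreter.</p>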
<p>Note in particular that we’ve expressed <code class="language-plaintext highlighter-rouge">betaBinomial</code> by binding a <em>parameter
model</em> to a <em>data model</em>. This is a foundational pattern in Bayesian
statistics; in the more usual lingo, the parameter model corresponds to the
<em>prior distribution</em>, and the data model is the <em>likelihood</em>.</p>
<h2 id="forward-mode-sampling">Forward-Mode Sampling</h2>
<p>So we have our representation. Next up, we want to <em>simulate</em> from these
models. Thus far they’re purely abstract, and don’t encode any information
about probability or sampling or what have you. We have to ascribe that
ourselves.</p>
<p><em>mwc-probability</em> defines a monadic sampling-based probability distribution
type called <code class="language-plaintext highlighter-rouge">Prob</code>, and we can use a basic <a href="/practical-recursion-schemes">recursion scheme</a> on free
monads to adapt our own model type to that:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">toSampler</span> <span class="o">::</span> <span class="kt">Model</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">MWC</span><span class="o">.</span><span class="kt">Prob</span> <span class="kt">IO</span> <span class="n">a</span>
<span class="n">toSampler</span> <span class="o">=</span> <span class="n">iterM</span> <span class="o">$</span> <span class="nf">\</span><span class="kr">case</span>
<span class="kt">BernoulliF</span> <span class="n">p</span> <span class="n">f</span> <span class="o">-></span> <span class="kt">MWC</span><span class="o">.</span><span class="n">bernoulli</span> <span class="n">p</span> <span class="o">>>=</span> <span class="n">f</span>
<span class="kt">BetaF</span> <span class="n">a</span> <span class="n">b</span> <span class="n">f</span> <span class="o">-></span> <span class="kt">MWC</span><span class="o">.</span><span class="n">beta</span> <span class="n">a</span> <span class="n">b</span> <span class="o">>>=</span> <span class="n">f</span>
</code></pre></div></div>
<p>We can glue that around the relevant <em>mwc-probability</em> functionality to
simulate from models directly:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">simulate</span> <span class="o">::</span> <span class="kt">Model</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">IO</span> <span class="n">a</span>
<span class="n">simulate</span> <span class="n">model</span> <span class="o">=</span> <span class="kt">MWC</span><span class="o">.</span><span class="n">withSystemRandom</span> <span class="o">.</span> <span class="kt">MWC</span><span class="o">.</span><span class="n">asGenIO</span> <span class="o">$</span>
<span class="kt">MWC</span><span class="o">.</span><span class="n">sample</span> <span class="p">(</span><span class="n">toSampler</span> <span class="n">model</span><span class="p">)</span>
</code></pre></div></div>
<p>And this can be used with standard monadic combinators like <code class="language-plaintext highlighter-rouge">replicateM</code> to
collect larger samples:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">></span> <span class="n">replicateM</span> <span class="mi">10</span> <span class="o">$</span> <span class="n">simulate</span> <span class="p">(</span><span class="n">betaBinomial</span> <span class="mi">10</span> <span class="mi">1</span> <span class="mi">4</span><span class="p">)</span>
<span class="p">[</span><span class="mi">5</span><span class="p">,</span><span class="mi">7</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">1</span><span class="p">,</span><span class="mi">0</span><span class="p">,</span><span class="mi">4</span><span class="p">,</span><span class="mi">2</span><span class="p">]</span>
</code></pre></div></div>
<h2 id="reverse-mode-sampling">Reverse-Mode Sampling</h2>
<p>Now. Here we want to condition our model on some observations and then recover
the conditional distribution over its internal parameters.</p>
<p>This part - inference - is what makes probabilistic programming hard, and doing
it really well remains an unsolved problem. One of the neat theoretical
results in this space due to <a href="https://arxiv.org/abs/1005.3014">Ackerman, Freer, and Roy</a> is that in the
general case the problem is actually <em>unsolvable</em>, in that one can encode as a
probabilistic program a conditional distribution that computes the halting
problem. Similarly, in general it’s impossible to do this sort of thing
<em>efficiently</em> even for computable conditional distributions. Consider the case
of a program that returns the hash of a random n-long binary string, and then
try to infer the distribution over strings given some hashes, for example.
This is never going to be a tractable problem.</p>
<p>For now let’s use a simple <a href="https://en.wikipedia.org/wiki/Rejection_sampling">rejection sampler</a> to encode a conditional
distribution. We’ll require some observations, a proposal distribution, and
the model that we want to invert:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">invert</span> <span class="o">::</span> <span class="p">(</span><span class="kt">Monad</span> <span class="n">m</span><span class="p">,</span> <span class="kt">Eq</span> <span class="n">b</span><span class="p">)</span> <span class="o">=></span> <span class="n">m</span> <span class="n">a</span> <span class="o">-></span> <span class="p">(</span><span class="n">a</span> <span class="o">-></span> <span class="n">m</span> <span class="n">b</span><span class="p">)</span> <span class="o">-></span> <span class="p">[</span><span class="n">b</span><span class="p">]</span> <span class="o">-></span> <span class="n">m</span> <span class="n">a</span>
<span class="n">invert</span> <span class="n">proposal</span> <span class="n">model</span> <span class="n">observed</span> <span class="o">=</span> <span class="n">loop</span> <span class="kr">where</span>
<span class="n">loop</span> <span class="o">=</span> <span class="kr">do</span>
<span class="n">parameters</span> <span class="o"><-</span> <span class="n">proposal</span>
<span class="n">generated</span> <span class="o"><-</span> <span class="n">replicateM</span> <span class="p">(</span><span class="n">length</span> <span class="n">observed</span><span class="p">)</span> <span class="p">(</span><span class="n">model</span> <span class="n">parameters</span><span class="p">)</span>
<span class="kr">if</span> <span class="n">generated</span> <span class="o">==</span> <span class="n">observed</span>
<span class="kr">then</span> <span class="n">return</span> <span class="n">parameters</span>
<span class="kr">else</span> <span class="n">loop</span>
</code></pre></div></div>
<p>Let’s use it to compute the posterior or inverse model of an (apparently)
biased coin, given a few observations. We’ll just use a uniform distribution
as our proposal:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">posterior</span> <span class="o">::</span> <span class="kt">Model</span> <span class="kt">Double</span>
<span class="n">posterior</span> <span class="o">=</span> <span class="n">invert</span> <span class="n">uniform</span> <span class="n">bernoulli</span> <span class="p">[</span><span class="kt">True</span><span class="p">,</span> <span class="kt">True</span><span class="p">,</span> <span class="kt">False</span><span class="p">,</span> <span class="kt">True</span><span class="p">]</span>
</code></pre></div></div>
<p>Let’s grab some samples from the posterior distribution:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> replicateM 1000 (simulate posterior)
</code></pre></div></div>
<p><img src="/images/posterior_samples.png" alt="" /></p>
<p>The central tendency of the posterior floats about 0.75, which is what we’d
expect, given our observations. This has been inferred from only four
points; let’s try adding a few more. But before we do that, note that the
present way the rejection sampling algorithm works is:</p>
<ul>
<li>Propose a parameter value according to the supplied proposal distribution.</li>
<li>Generate a sample from the model, of equal size to the supplied observations.</li>
<li>Compare the collected sample to the supplied observations. If they’re equal,
then return the proposed parameter value. Otherwise start over.</li>
</ul>
<p>Rejection sampling isn’t exactly efficient in nontrivial settings anyway, but
it’s <em>supremely</em> inefficient for our present case. The random variables we’re
interested in are <a href="https://en.wikipedia.org/wiki/Exchangeable_random_variables">exchangeable</a>, so what we’re concerned about is the
total number of <code class="language-plaintext highlighter-rouge">True</code> or <code class="language-plaintext highlighter-rouge">False</code> values observed - not any specific order they
appear in.</p>
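<p>To get a feel for how much is wasted by insisting on an exact match, here is a back-of-the-envelope calculation. For a fixed coin weight it compares the chance of reproducing one <em>specific</em> sequence against the chance of merely reproducing its <em>count</em> of <code class="language-plaintext highlighter-rouge">True</code> values. The function names are made up for this sketch:</p>

```haskell
-- Binomial coefficient "n choose k", assuming 0 <= k <= n.
choose :: Integer -> Integer -> Integer
choose n k = product [n - k + 1 .. n] `div` product [1 .. k]

-- Probability that n flips of a coin with weight p reproduce one specific
-- sequence containing k Trues.
matchSequence :: Double -> Integer -> Integer -> Double
matchSequence p n k = p ^ k * (1 - p) ^ (n - k)

-- Probability that n flips of the same coin produce k Trues in any order.
matchCount :: Double -> Integer -> Integer -> Double
matchCount p n k = fromIntegral (choose n k) * matchSequence p n k
```

<p>The two differ by a factor of <code class="language-plaintext highlighter-rouge">choose n k</code>, whatever the coin weight happens to be. For four flips with three <code class="language-plaintext highlighter-rouge">True</code> values that factor is only 4, but for twelve flips with nine <code class="language-plaintext highlighter-rouge">True</code> values it’s already 220.</p>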
<p>We can add an ‘assistance’ function to the rejection sampler to help us out in
this case:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">invertWithAssistance</span>
<span class="o">::</span> <span class="p">(</span><span class="kt">Monad</span> <span class="n">m</span><span class="p">,</span> <span class="kt">Eq</span> <span class="n">c</span><span class="p">)</span> <span class="o">=></span> <span class="p">([</span><span class="n">a</span><span class="p">]</span> <span class="o">-></span> <span class="n">c</span><span class="p">)</span> <span class="o">-></span> <span class="n">m</span> <span class="n">b</span> <span class="o">-></span> <span class="p">(</span><span class="n">b</span> <span class="o">-></span> <span class="n">m</span> <span class="n">a</span><span class="p">)</span> <span class="o">-></span> <span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">-></span> <span class="n">m</span> <span class="n">b</span>
<span class="n">invertWithAssistance</span> <span class="n">assister</span> <span class="n">proposal</span> <span class="n">model</span> <span class="n">observed</span> <span class="o">=</span> <span class="n">loop</span> <span class="kr">where</span>
<span class="n">loop</span> <span class="o">=</span> <span class="kr">do</span>
<span class="n">parameters</span> <span class="o"><-</span> <span class="n">proposal</span>
<span class="n">generated</span> <span class="o"><-</span> <span class="n">replicateM</span> <span class="p">(</span><span class="n">length</span> <span class="n">observed</span><span class="p">)</span> <span class="p">(</span><span class="n">model</span> <span class="n">parameters</span><span class="p">)</span>
<span class="kr">if</span> <span class="n">assister</span> <span class="n">generated</span> <span class="o">==</span> <span class="n">assister</span> <span class="n">observed</span>
<span class="kr">then</span> <span class="n">return</span> <span class="n">parameters</span>
<span class="kr">else</span> <span class="n">loop</span>
</code></pre></div></div>
<p>The assister summarizes both our observations and collected sample to ensure
they’re efficiently comparable. In our situation, we can use a simple counting
function to tally up the number of <code class="language-plaintext highlighter-rouge">True</code> values we observe:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">count</span> <span class="o">::</span> <span class="p">[</span><span class="kt">Bool</span><span class="p">]</span> <span class="o">-></span> <span class="kt">Int</span>
<span class="n">count</span> <span class="o">=</span> <span class="n">length</span> <span class="o">.</span> <span class="n">filter</span> <span class="n">id</span>
</code></pre></div></div>
<p>Now let’s create another posterior by conditioning on a few more observations:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">posterior0</span> <span class="o">::</span> <span class="kt">Model</span> <span class="kt">Double</span>
<span class="n">posterior0</span> <span class="o">=</span> <span class="n">invertWithAssistance</span> <span class="n">count</span> <span class="n">uniform</span> <span class="n">bernoulli</span> <span class="n">obs</span> <span class="kr">where</span>
<span class="n">obs</span> <span class="o">=</span>
<span class="p">[</span><span class="kt">True</span><span class="p">,</span> <span class="kt">True</span><span class="p">,</span> <span class="kt">True</span><span class="p">,</span> <span class="kt">False</span><span class="p">,</span> <span class="kt">True</span><span class="p">,</span> <span class="kt">True</span><span class="p">,</span> <span class="kt">False</span><span class="p">,</span> <span class="kt">True</span><span class="p">,</span> <span class="kt">True</span><span class="p">,</span> <span class="kt">True</span><span class="p">,</span> <span class="kt">True</span><span class="p">,</span> <span class="kt">False</span><span class="p">]</span>
</code></pre></div></div>
<p>and collect another thousand samples from it. This would likely take an
annoying amount of time without the use of our <code class="language-plaintext highlighter-rouge">count</code> function for assistance
above:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> replicateM 1000 (simulate posterior0)
</code></pre></div></div>
<p><img src="/images/posterior_samples0.png" alt="" /></p>
<p>Note that with more information to condition on, we get a more informative
posterior.</p>
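<p>As a sanity check on the histograms, this particular setup has a closed-form answer: a uniform prior is a beta(1, 1) distribution, and conditioning it on Bernoulli observations yields another beta distribution. The helpers below are a small sketch of that calculation (the names are hypothetical) and are independent of the sampler:</p>

```haskell
-- Mean and variance of a beta(a, b) distribution.
betaMean :: Double -> Double -> Double
betaMean a b = a / (a + b)

betaVariance :: Double -> Double -> Double
betaVariance a b = a * b / ((a + b) * (a + b) * (a + b + 1))

-- Parameters of the exact posterior over the coin weight, starting from a
-- uniform prior: add the number of Trues to a and the number of Falses to b.
conjugate :: [Bool] -> (Double, Double)
conjugate obs = (1 + trues, 1 + falses) where
  trues  = fromIntegral (length (filter id obs))
  falses = fromIntegral (length obs) - trues
```

<p>The first set of observations gives a beta(4, 2) posterior and the second a beta(10, 4). The variance of the latter is less than half that of the former, which is the ‘more informative’ part.</p>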
<h2 id="conclusion">Conclusion</h2>
<p>This is a really basic formulation - too basic to be useful in any meaningful
way - but it illustrates some of the most important concepts in probabilistic
programming: representation, simulation, and inference.</p>
<p>I think it’s also particularly nice to do this in Haskell, rather than something
like Python (which Dan used in his talk) - it provides us with a lot of
extensible structure in a familiar framework for language hacking. It sort of
demands you’re a fan of all these higher-kinded types and structured recursions
and all that, but if you’re reading this blog then you’re probably in that camp
anyway.</p>
<p>I’ll probably write a few more little articles like this over time. There are
a ton of improvements that we can make to this basic setup - encoding
<a href="/encoding-independence-statically">independence</a>, sampling via <a href="/markov-chains-a-la-carte">MCMC</a>, etc. - and it might be fun to
grow everything out piece by piece.</p>
<p>I’ve dropped the code from this post into <a href="https://gist.github.com/jtobin/95573e26843cf5fa0295360d3b33d3f1">this gist</a>.</p>
Randomness in Haskell2016-10-01T00:00:00+04:00https://jtobin.io/randomness-in-haskell<p>Randomness is a constant nuisance point for Haskell beginners who may be coming
from a language like Python or R. In Python you can just get away with
something like:</p>
<div class="language-python highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">In</span> <span class="p">[</span><span class="mi">2</span><span class="p">]:</span> <span class="n">numpy</span><span class="p">.</span><span class="n">random</span><span class="p">.</span><span class="n">rand</span><span class="p">(</span><span class="mi">3</span><span class="p">)</span>
<span class="n">Out</span><span class="p">[</span><span class="mi">2</span><span class="p">]:</span> <span class="n">array</span><span class="p">([</span> <span class="mf">0.61426175</span><span class="p">,</span> <span class="mf">0.05309224</span><span class="p">,</span> <span class="mf">0.38861597</span><span class="p">])</span>
</code></pre></div></div>
<p>or in R:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">></span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">0.49473012</span><span class="w"> </span><span class="m">0.68436352</span><span class="w"> </span><span class="m">0.04135914</span><span class="w">
</span></code></pre></div></div>
<p>In Haskell, the situation is more complicated. It’s not too much worse when
you get the hang of things, but it’s certainly one of those things that throws
beginners for a loop - and for good reason.</p>
<p>In this article I want to provide a simple guide, with examples, for getting
started and becoming comfortable with randomness in Haskell. Hopefully it
helps!</p>
<p>I’m writing this from a hotel during my girlfriend’s birthday, so it’s being
slapped together very rapidly with a kind of get-it-done attitude. If anything
is unclear or you have any questions, feel free to shoot me a ping and I’ll try
to improve it when I get a chance.</p>
<h2 id="randomness-on-computers-in-general">Randomness on Computers in General</h2>
<p>Check out the R code I posted previously. If you just open R and type
<code class="language-plaintext highlighter-rouge">runif(3)</code> on your machine, then odds are you’ll get a different triple of
numbers than what I got above.</p>
<p>These numbers are being generated based on R’s global <em>random number generator</em>
(RNG), which, absent any fiddling by the user, is initialized as needed based
on the system time and ID of the R process. So: if you open up the R
interpreter and call <code class="language-plaintext highlighter-rouge">runif(3)</code>, then behind the scenes R will initialize the
RNG based on the time and process ID, and then use a particular algorithm to
generate random numbers based on that initialized value (called the ‘seed’).</p>
<p>These numbers aren’t truly random - they’re <em>pseudo-random</em>, which means
they’re generated by a deterministic algorithm such that the resulting values
<em>appear</em> random over time. The default algorithm used by R, for example, is
the famous <a href="https://en.wikipedia.org/wiki/Mersenne_Twister">Mersenne Twister</a>, which you can verify as follows:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> RNGkind()
[1] "Mersenne-Twister" "Inversion"
</code></pre></div></div>
<p>You can also set the seed yourself in R, using the <code class="language-plaintext highlighter-rouge">set.seed</code> function. Then
if you type something like <code class="language-plaintext highlighter-rouge">runif(3)</code>, R will use this initialized RNG rather
than coming up with its own seed based on the time and process ID. Setting
the seed allows you to reproduce operations involving pseudo-random numbers;
just re-set the seed and perform the same operations again:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">></span><span class="w"> </span><span class="n">set.seed</span><span class="p">(</span><span class="m">42</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">0.9148060</span><span class="w"> </span><span class="m">0.9370754</span><span class="w"> </span><span class="m">0.2861395</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">set.seed</span><span class="p">(</span><span class="m">42</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">0.9148060</span><span class="w"> </span><span class="m">0.9370754</span><span class="w"> </span><span class="m">0.2861395</span><span class="w">
</span></code></pre></div></div>
<p>(It’s good practice to <em>always</em> initialize the RNG using some known seed before
running an experiment, simulation, and so on.)</p>
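<p>To make the ‘deterministic algorithm’ point concrete, here is a toy generator written in Haskell using only ‘base’. It is a simple linear congruential generator, <em>not</em> the Mersenne Twister that R uses, and it’s nowhere near good enough for real work. The names <code class="language-plaintext highlighter-rouge">step</code> and <code class="language-plaintext highlighter-rouge">uniforms</code> are made up for this sketch:</p>

```haskell
import Data.Word (Word32)

-- One step of a linear congruential generator. Word32 arithmetic wraps
-- around, so this is implicitly computed modulo 2^32. The constants are the
-- commonly-cited "Numerical Recipes" ones.
step :: Word32 -> Word32
step s = 1664525 * s + 1013904223

-- The first n pseudo-random doubles in [0, 1) generated from a given seed.
uniforms :: Int -> Word32 -> [Double]
uniforms n seed = take n (map toUnit (tail (iterate step seed))) where
  toUnit w = fromIntegral w / 4294967296
```

<p>Calling <code class="language-plaintext highlighter-rouge">uniforms 3 42</code> twice gives the same three numbers both times, exactly like re-setting the seed in R, while a different seed gives a different sequence.</p>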
<p>So the big thing to notice here, in any case, is that R uses a <em>global</em> RNG.
It maintains the state of this RNG implicitly and behind the scenes. When you
type <code class="language-plaintext highlighter-rouge">runif(3)</code>, R consults this implicit RNG, gives you your pseudo-random
numbers based on its value, and <em>updates</em> the global RNG without you needing to
worry about any of this plumbing yourself. The same is generally true for
randomness in most programming languages - Python, C, Ruby, and so on.</p>
<h2 id="explicit-rng-management">Explicit RNG Management</h2>
<p>But let’s come back to Haskell. Haskell, unlike R or Python, is
<em>purely-functional</em>. State, or effects in general, are <em>never</em> implicit in the
same way that R updates its global RNG. We need to either explicitly pass
around a RNG ourselves, or at least allow some explicit monad to do it for us.</p>
<p>Passing around a RNG manually is annoying, so in practice this means everyone
uses a monad to handle RNG state. This means that <strong>one needs to be
comfortable working with monadic code in order to practically use random
numbers in Haskell</strong>, which presents a big hurdle for beginners who may have
been able to ignore monads thus far on their Haskell journey.</p>
<p>Let’s see what I mean by all of this by going through a few examples. Make
sure you have <a href="https://docs.haskellstack.org/en/stable/README/">stack</a> installed, and then grab a few libraries that we’ll
make use of in the remainder of this post:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ stack install random mwc-random primitive
</code></pre></div></div>
<h3 id="the-really-annoying-method---manual-rng-management">The Really Annoying Method - Manual RNG Management</h3>
<p>Let me demonstrate the simplest conceptual method for dealing with random
numbers: manually grabbing and passing around a RNG without involving any
monads whatsoever.</p>
<p>First, open up GHCi:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ stack ghci
</code></pre></div></div>
<p>And let’s also get some quick preliminaries out of the way:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>Prelude> :set prompt "> "
> import System.Random
> import Control.Monad
> let runif_pure = randomR (0 :: Double, 1)
> let runif n = replicateM n (randomRIO (0 :: Double, 1))
> let set_seed = setStdGen . mkStdGen
</code></pre></div></div>
<p>We’ll first use the basic <a href="https://hackage.haskell.org/package/random-1.1/docs/System-Random.html"><code class="language-plaintext highlighter-rouge">System.Random</code> module</a> for illustration. To
initialize a RNG, we can make one by providing the <a href="https://hackage.haskell.org/package/random-1.1/docs/System-Random.html#v:mkStdGen"><code class="language-plaintext highlighter-rouge">mkStdGen</code> function</a>
with an integer seed:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> let rng = mkStdGen 42
> rng
43 1
</code></pre></div></div>
<p>We can use this thing to generate random numbers. A simple function to do that
is <a href="https://hackage.haskell.org/package/random-1.1/docs/System-Random.html#v:randomR"><code class="language-plaintext highlighter-rouge">randomR</code></a>, which will generate pseudo-random values for some ordered
type in a given range. We’ll use the <code class="language-plaintext highlighter-rouge">runif_pure</code> alias for it that we defined
previously, just to make things look similar to the previous R example and also
emphasize that this one is a pure function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> runif_pure rng
(1.0663729393723398e-2,2060101257 2103410263)
</code></pre></div></div>
<p>You can see that we got back a pair of values, the first element of which is
our random number <code class="language-plaintext highlighter-rouge">1.0663729393723398e-2</code>. Cool. Let’s try to generate
another:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> runif_pure rng
(1.0663729393723398e-2,2060101257 2103410263)
</code></pre></div></div>
<p>Hmm. We generated the same number again. This is because the value of <code class="language-plaintext highlighter-rouge">rng</code>
hasn’t changed - it’s still the same value we made via <code class="language-plaintext highlighter-rouge">mkStdGen 42</code>. Since
we’re using the same random number generator to generate a pseudo-random value,
we get the same pseudo-random value.</p>
<p>If we want to make <em>new</em> random numbers, then we need to use a different
generator. And the second element of the pair returned from our call to
<code class="language-plaintext highlighter-rouge">runif_pure</code> is exactly that - an updated RNG that we can use to generate
additional random numbers.</p>
<p>Let’s try that all again, using the generator we get back from the first
function call as an input to the second:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> let (x, rng1) = runif_pure rng
> x
1.0663729393723398e-2
> let (y, rng2) = runif_pure rng1
> y
0.9827538369038856
</code></pre></div></div>
<p>Success!</p>
<p>I mean.. sort of. It works and all, and it <em>does</em> constitute a general-purpose
solution. But manually binding updated RNG states to names and swapping those
in for new values is still pretty annoying.</p>
<p>You could also generate an infinite list of random numbers using the
<a href="https://hackage.haskell.org/package/random-1.1/docs/System-Random.html#v:randomRs"><code class="language-plaintext highlighter-rouge">randomRs</code> function</a> and just take from it as needed, but you still
probably need to manage that list to make sure you don’t re-use any numbers.
You kind of trade off managing the RNG for managing an infinite list of random
numbers, which isn’t much better.</p>
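<p>To make the bookkeeping concrete, here’s a minimal sketch of what drawing several numbers looks like when you thread the generator by hand. The name <code class="language-plaintext highlighter-rouge">runifN</code> is made up for this illustration; it only relies on <code class="language-plaintext highlighter-rouge">randomR</code> and <code class="language-plaintext highlighter-rouge">mkStdGen</code> from <code class="language-plaintext highlighter-rouge">System.Random</code>:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import System.Random (StdGen, mkStdGen, randomR)

-- Hypothetical helper: draw n uniform values on [0, 1], passing each
-- updated generator on to the next draw. Returns the values together
-- with the final generator, so the caller can keep going from there.
runifN :: Int -> StdGen -> ([Double], StdGen)
runifN n gen
  | n <= 0    = ([], gen)
  | otherwise =
      let (x, gen')   = randomR (0, 1) gen      -- first draw
          (xs, gen'') = runifN (n - 1) gen'     -- remaining draws
      in  (x : xs, gen'')
</code></pre></div></div>
<p>Evaluating <code class="language-plaintext highlighter-rouge">fst (runifN 3 (mkStdGen 42))</code> twice gives the same three numbers both times, since nothing is being updated behind the scenes. Every function that wants randomness has to accept and return a generator like this, which is exactly the plumbing that gets tedious.</p>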
<h3 id="the-less-annoying-method---get-a-monad-to-do-it">The Less-Annoying Method - Get A Monad To Do It</h3>
<p>The good news is that we can offload the job of managing the RNG state to a
monad. I won’t actually explain how that works in detail here - I think most
people facing this problem are initially more concerned with getting something
working, rather than deeply grokking monads off the bat - so I’ll just claim
that we can get a monad to handle the RNG state for us, and that will hopefully
(mostly) suffice for now.</p>
<p><img src="/images/bane.gif" alt="" title=".. that comes later." /></p>
<p>Still rolling with the <code class="language-plaintext highlighter-rouge">System.Random</code> module for the time being, we’ll use the
<code class="language-plaintext highlighter-rouge">runif</code> alias for the <code class="language-plaintext highlighter-rouge">randomRIO</code> function that we defined previously to
generate some new random numbers:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> runif 3
[0.9873934690803106,0.3794382930121829,0.2285653405908732]
> runif 3
[0.7651878964537555,0.2623159001635825,0.7683468476766804]
</code></pre></div></div>
<p>Simpler! Notice we haven’t had to do anything with a generator manually - we
just ask for random numbers and then get them, just like in R. And if we want
to set the value of the RNG being used here, we can use the <code class="language-plaintext highlighter-rouge">setStdGen</code>
function with an RNG that we’ve already created. Here let’s just use the
<code class="language-plaintext highlighter-rouge">set_seed</code> alias we defined earlier, to mimic R’s <code class="language-plaintext highlighter-rouge">set.seed</code> function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> set_seed 42
> runif 3
[1.0663729393723398e-2,0.9827538369038856,0.7042944187434987]
> set_seed 42
> runif 3
[1.0663729393723398e-2,0.9827538369038856,0.7042944187434987]
</code></pre></div></div>
<p>So things are similar to how they work in R here - we have a global RNG of
sorts, and we can set its state as desired using the <code class="language-plaintext highlighter-rouge">set_seed</code> function. But
since this is Haskell, the effects of creating and updating the generator state
<em>must still be explicit</em>. And they <em>are</em> explicit - it’s just that they’re
explicit in the <strong>type</strong> of <code class="language-plaintext highlighter-rouge">runif</code>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> :t runif
runif :: Int -> IO [Double]
</code></pre></div></div>
<p>Note that <code class="language-plaintext highlighter-rouge">runif</code> returns a value that’s wrapped up in <code class="language-plaintext highlighter-rouge">IO</code>. This is how we
indicate explicitly - at the type level - that something is being done with the
generator in the background. <code class="language-plaintext highlighter-rouge">IO</code> is a monad, and it happens to be the thing
that’s dealing with the generator for us here.</p>
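<p>If you’re curious how the pure and <code class="language-plaintext highlighter-rouge">IO</code> versions relate, here’s a rough sketch. <code class="language-plaintext highlighter-rouge">System.Random</code> exports <code class="language-plaintext highlighter-rouge">getStdRandom</code>, which runs a pure generator-consuming function against the global generator and stores the updated generator back. The names <code class="language-plaintext highlighter-rouge">step</code>, <code class="language-plaintext highlighter-rouge">runifIO</code>, and <code class="language-plaintext highlighter-rouge">setSeed</code> are just for illustration:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import Control.Monad (replicateM)
import System.Random (StdGen, getStdRandom, mkStdGen, randomR, setStdGen)

-- The pure step: consume a generator, return a value and the next generator.
step :: StdGen -> (Double, StdGen)
step = randomR (0, 1)

-- Run the pure step against the global generator. The updated generator
-- is written back, so the next call sees a fresh state.
runifIO :: IO Double
runifIO = getStdRandom step

-- Same shape as the runif alias defined in GHCi earlier.
runif :: Int -> IO [Double]
runif n = replicateM n runifIO

-- Same shape as the set_seed alias defined in GHCi earlier.
setSeed :: Int -> IO ()
setSeed = setStdGen . mkStdGen
</code></pre></div></div>
<p>So the generator is still being passed along and replaced at every step; the <code class="language-plaintext highlighter-rouge">IO</code> in the type is what tells you that somebody other than you is doing it.</p>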
<p>What this means for you, the practitioner, is that you can’t just mix values of
some type <code class="language-plaintext highlighter-rouge">a</code> with values of type <code class="language-plaintext highlighter-rouge">IO a</code> willy-nilly. You may be writing a
function <code class="language-plaintext highlighter-rouge">f</code> with type <code class="language-plaintext highlighter-rouge">[Double] -> Double</code>, where the input list of doubles is
intended to be randomly-generated. But if you just go ahead and generate a
list <code class="language-plaintext highlighter-rouge">xs</code> of random numbers, they’ll have type <code class="language-plaintext highlighter-rouge">IO [Double]</code>, and you’ll stare
in confusion at some type error from GHC when you try to apply <code class="language-plaintext highlighter-rouge">f</code> to <code class="language-plaintext highlighter-rouge">xs</code>.</p>
<p>Here’s what I mean. Take the example of just generating some random numbers
and then summing them up. First, in R:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">></span><span class="w"> </span><span class="n">xs</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">runif</span><span class="p">(</span><span class="m">3</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="nf">sum</span><span class="p">(</span><span class="n">xs</span><span class="p">)</span><span class="w">
</span><span class="p">[</span><span class="m">1</span><span class="p">]</span><span class="w"> </span><span class="m">1.20353</span><span class="w">
</span></code></pre></div></div>
<p>And now in Haskell, using the same mechanism we tried earlier:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> let xs = runif 3
> :t xs
xs :: IO [Double]
> sum xs
<interactive>:16:1:
No instance for (Num [Double]) arising from a use of ‘sum’
In the expression: sum xs
In an equation for ‘it’: it = sum xs
</code></pre></div></div>
<p>This means that to deal with the numbers we generate, we have to treat them a
little differently than we would in R, or compared to the situation where we
were managing the RNG explicitly in Haskell. Concretely: if we use a monad to
manage the RNG for us, then the numbers we generate will be ‘tagged’ by the
monad. So we need to do something or other to make those tagged numbers work
with ‘untagged’ numbers, or functions designed to work with ‘untagged’ numbers.</p>
<p>This is where things get confusing for beginners. Here’s how we could add up
some random numbers in GHCi:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> xs <- runif 3
> sum xs
1.512024272587933
</code></pre></div></div>
<p>We’ve used the <code class="language-plaintext highlighter-rouge"><-</code> symbol to bind the result of <code class="language-plaintext highlighter-rouge">runif 3</code> to the name <code class="language-plaintext highlighter-rouge">xs</code>,
rather than <code class="language-plaintext highlighter-rouge">let xs = ...</code>. But this is sort of particular to running code in
GHCi; if you try to do this in a generic Haskell function, you’ll possibly wind
up with some more weird type errors. To do this in regular ol’ Haskell code,
you need to both use <code class="language-plaintext highlighter-rouge"><-</code>-style binding <strong>and</strong> also acknowledge the ‘tagged’
nature of randomly-generated values.</p>
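<p>As a small sketch of what that looks like outside of GHCi, here are two equivalent ways to sum some random numbers in an ordinary function. The function names are hypothetical; note that the result type is <code class="language-plaintext highlighter-rouge">IO Double</code> rather than <code class="language-plaintext highlighter-rouge">Double</code> in both cases:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import Control.Monad (replicateM)
import System.Random (randomRIO)

runif :: Int -> IO [Double]
runif n = replicateM n (randomRIO (0, 1))

-- Bind the generated list with '<-', then use it like any pure value.
-- The result stays wrapped in IO.
sumOfRandoms :: Int -> IO Double
sumOfRandoms n = do
  xs <- runif n
  return (sum xs)

-- The same thing, applying the pure function 'sum' underneath the IO tag.
sumOfRandoms' :: Int -> IO Double
sumOfRandoms' n = fmap sum (runif n)
</code></pre></div></div>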
<p>The crux is that, when you’re using a monad to generate random numbers in
Haskell, you need to separate <em>generating them</em> from <em>using them</em>. Rather than
try to explain what I mean here precisely, let’s rely on example, and implement
a simple Metropolis sampler for illustration.</p>
<h2 id="a-metropolis-sampler">A Metropolis Sampler</h2>
<p>The <a href="https://en.wikipedia.org/wiki/Metropolis%E2%80%93Hastings_algorithm">Metropolis algorithm</a> will help you approximate expectations over
certain probability spaces. Here’s how it works. Picture yourself strolling
around some bumpy landscape; you want to walk around it in such a fashion that
you visit regions of it with probability proportional to their altitude. To do
that, you can repeatedly:</p>
<ol>
<li>Pick a random point near your current location.</li>
<li>Compare your present altitude to the altitude of that point you picked.
Calculate a probability based on their ratio.</li>
<li>Flip a coin where the chance of observing a head is equal to that
probability. If you get a head, move to the location you picked.
Otherwise, stay put.</li>
</ol>
<p>Let’s implement it in Haskell, using a monadic random number generator to do
so. This time we’re going to use <code class="language-plaintext highlighter-rouge">mwc-random</code> - a more industrial-strength
randomness library that you can confidently use in production code.</p>
<p><code class="language-plaintext highlighter-rouge">mwc-random</code> uses <a href="https://en.wikipedia.org/wiki/Multiply-with-carry">Marsaglia’s multiply-with-carry algorithm</a> to generate
pseudo-random numbers. It requires you to explicitly create and pass a RNG to
functions that need to generate random numbers, but it uses a monad to <em>update</em>
the RNG state itself. This winds up being pretty nice; let’s dive in to see.</p>
<p>Create a module called <code class="language-plaintext highlighter-rouge">Metropolis.hs</code> and get some imports out of the way:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">module</span> <span class="nn">Metropolis</span> <span class="kr">where</span>
<span class="kr">import</span> <span class="nn">Control.Monad</span>
<span class="kr">import</span> <span class="nn">Control.Monad.Primitive</span>
<span class="kr">import</span> <span class="nn">System.Random.MWC</span> <span class="k">as</span> <span class="n">MWC</span>
<span class="kr">import</span> <span class="nn">System.Random.MWC.Distributions</span> <span class="k">as</span> <span class="n">MWC</span>
</code></pre></div></div>
<h3 id="step-one">Step One</h3>
<p>The first thing we want to do is implement point (1) from above:</p>
<blockquote>
<p>Pick a random point near your current location.</p>
</blockquote>
<p>We’ll just use a standard normal distribution of the appropriate dimension to
do this - we just want to take a location, perturb it, and return the perturbed
location.</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">propose</span> <span class="o">::</span> <span class="p">[</span><span class="kt">Double</span><span class="p">]</span> <span class="o">-></span> <span class="kt">Gen</span> <span class="kt">RealWorld</span> <span class="o">-></span> <span class="kt">IO</span> <span class="p">[</span><span class="kt">Double</span><span class="p">]</span>
<span class="n">propose</span> <span class="n">location</span> <span class="n">rng</span> <span class="o">=</span> <span class="n">traverse</span> <span class="p">(</span><span class="n">perturb</span> <span class="n">rng</span><span class="p">)</span> <span class="n">location</span> <span class="kr">where</span>
<span class="n">perturb</span> <span class="n">gen</span> <span class="n">x</span> <span class="o">=</span> <span class="kt">MWC</span><span class="o">.</span><span class="n">normal</span> <span class="n">x</span> <span class="mi">1</span> <span class="n">gen</span>
</code></pre></div></div>
<p>So at finer detail: we’re walking over the coordinates of the current location
and generating a normally-distributed value centered at each coordinate. The
<a href="https://hackage.haskell.org/package/mwc-random-0.13.4.0/docs/System-Random-MWC-Distributions.html#v:normal"><code class="language-plaintext highlighter-rouge">MWC.normal</code></a> function will do this for a given mean and standard
deviation, and we can use the <code class="language-plaintext highlighter-rouge">traverse</code> function to walk over each coordinate.</p>
<p>Note that we pass a <code class="language-plaintext highlighter-rouge">mwc-random</code> RNG - the value with type <code class="language-plaintext highlighter-rouge">Gen RealWorld</code> - to
the <code class="language-plaintext highlighter-rouge">propose</code> function. We need to supply this generator anywhere we want to
generate random numbers, but we don’t need to manually worry about tracking and
updating its state. The <code class="language-plaintext highlighter-rouge">IO</code> monad will do that for us. The resulting
randomly-generated values will be tagged with <code class="language-plaintext highlighter-rouge">IO</code>, so we’ll need to deal with
that appropriately.</p>
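<p>Here’s a quick sketch of how this differs from the <code class="language-plaintext highlighter-rouge">StdGen</code> experiments above. A <code class="language-plaintext highlighter-rouge">Gen RealWorld</code> is a mutable generator, so passing the <em>same</em> value to <code class="language-plaintext highlighter-rouge">propose</code> twice yields different draws. The helper <code class="language-plaintext highlighter-rouge">twoProposals</code> is made up for this illustration, and it uses <code class="language-plaintext highlighter-rouge">MWC.create</code>, which builds a generator from a fixed seed:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import Control.Monad.Primitive (RealWorld)
import System.Random.MWC (Gen)
import qualified System.Random.MWC as MWC
import qualified System.Random.MWC.Distributions as MWC

-- Same definition as above, repeated so the sketch stands alone.
propose :: [Double] -> Gen RealWorld -> IO [Double]
propose location rng = traverse (perturb rng) location where
  perturb gen x = MWC.normal x 1 gen

-- Hypothetical helper: two proposals from the same starting point,
-- using the same generator value both times.
twoProposals :: IO ([Double], [Double])
twoProposals = do
  rng <- MWC.create            -- fixed seed, handy for reproducible runs
  a   <- propose [0, 0] rng    -- the generator's state is updated in place
  b   <- propose [0, 0] rng    -- so this call draws fresh values
  return (a, b)
</code></pre></div></div>
<p>The two proposals will almost surely differ, even though we never bound a ‘new’ generator to a name in between.</p>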
<h3 id="step-two">Step Two</h3>
<p>Now let’s implement point (2):</p>
<blockquote>
<p>Compare your present altitude to the altitude of that point you picked.
Calculate a probability based on their ratio.</p>
</blockquote>
<p>So, we need a function that will compare the altitude of our current point to
the altitude of a proposed point and compute a probability from that. The
following will do: it takes a function that will compute a (log-scale) altitude
for us, as well as the current and proposed locations, and returns a
probability.</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">moveProbability</span> <span class="o">::</span> <span class="p">([</span><span class="kt">Double</span><span class="p">]</span> <span class="o">-></span> <span class="kt">Double</span><span class="p">)</span> <span class="o">-></span> <span class="p">[</span><span class="kt">Double</span><span class="p">]</span> <span class="o">-></span> <span class="p">[</span><span class="kt">Double</span><span class="p">]</span> <span class="o">-></span> <span class="kt">Double</span>
<span class="n">moveProbability</span> <span class="n">altitude</span> <span class="n">current</span> <span class="n">proposed</span> <span class="o">=</span>
<span class="n">whenNaN</span> <span class="mi">0</span> <span class="p">(</span><span class="n">exp</span> <span class="p">(</span><span class="n">min</span> <span class="mi">0</span> <span class="p">(</span><span class="n">altitude</span> <span class="n">proposed</span> <span class="o">-</span> <span class="n">altitude</span> <span class="n">current</span><span class="p">)))</span>
<span class="kr">where</span>
<span class="n">whenNaN</span> <span class="n">val</span> <span class="n">x</span>
<span class="o">|</span> <span class="n">isNaN</span> <span class="n">x</span> <span class="o">=</span> <span class="n">val</span>
<span class="o">|</span> <span class="n">otherwise</span> <span class="o">=</span> <span class="n">x</span>
</code></pre></div></div>
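<p>A quick aside on why the code looks the way it does, using notation introduced just for this note: write \(\ell(x)\) for the log-scale altitude at \(x\), with \(x\) the current location and \(x'\) the proposed one. The ratio of the altitudes on the natural scale, capped at one, is then:</p>
<p>\[ \min\left(1, \frac{e^{\ell(x')}}{e^{\ell(x)}}\right) = \exp\left(\min\left(0, \ell(x') - \ell(x)\right)\right) \]</p>
<p>The right-hand side is what <code class="language-plaintext highlighter-rouge">moveProbability</code> computes. Working with the difference of logs avoids dividing two potentially tiny numbers, and the <code class="language-plaintext highlighter-rouge">whenNaN</code> guard covers degenerate cases - for example, if both log-altitudes are negative infinity then their difference is NaN, and we treat that as a probability of zero.</p>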
<h3 id="step-three">Step Three</h3>
<p>Finally, the third step of the algorithm:</p>
<blockquote>
<p>Flip a coin where the chance of observing a head is equal to that
probability. If you get a head, move to the location you picked. Otherwise
stay put.</p>
</blockquote>
<p>So let’s get to it:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">decide</span> <span class="o">::</span> <span class="p">[</span><span class="kt">Double</span><span class="p">]</span> <span class="o">-></span> <span class="p">[</span><span class="kt">Double</span><span class="p">]</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Gen</span> <span class="kt">RealWorld</span> <span class="o">-></span> <span class="kt">IO</span> <span class="p">[</span><span class="kt">Double</span><span class="p">]</span>
<span class="n">decide</span> <span class="n">current</span> <span class="n">proposed</span> <span class="n">prob</span> <span class="n">rng</span> <span class="o">=</span> <span class="kr">do</span>
<span class="n">accept</span> <span class="o"><-</span> <span class="kt">MWC</span><span class="o">.</span><span class="n">bernoulli</span> <span class="n">prob</span> <span class="n">rng</span>
<span class="n">return</span> <span class="o">$</span>
<span class="kr">if</span> <span class="n">accept</span>
<span class="kr">then</span> <span class="n">proposed</span>
<span class="kr">else</span> <span class="n">current</span>
</code></pre></div></div>
<p>Here we need to flip a coin, so we require a source of randomness again. The
<code class="language-plaintext highlighter-rouge">decide</code> function thus also takes a generator of type <code class="language-plaintext highlighter-rouge">Gen RealWorld</code> that we
then supply to the <code class="language-plaintext highlighter-rouge">MWC.bernoulli</code> function, and the result - the final
location - is once again wrapped in <code class="language-plaintext highlighter-rouge">IO</code>.</p>
<p>This function clearly demonstrates the typical way that you’ll deal with random
numbers in Haskell code. <code class="language-plaintext highlighter-rouge">decide</code> is a monadic function, so it proceeds using
do-notation. When you need to generate a random value - here we generate a
random <code class="language-plaintext highlighter-rouge">True</code> or <code class="language-plaintext highlighter-rouge">False</code> value according to a Bernoulli distribution - you bind
the result to a name using the <code class="language-plaintext highlighter-rouge"><-</code> symbol. Then afterwards, in the scope of
the function, you can use the bound value as if it were pure. But the entire
function must still return a ‘wrapped-up’ value that makes the effect of
passing the generator explicit at the type level; right here, that means that
the value will be wrapped up in <code class="language-plaintext highlighter-rouge">IO</code>.</p>
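<p>If the coin flip itself feels like magic, here’s one way such a thing can be built from a uniform draw. This is only a sketch with a made-up name, <code class="language-plaintext highlighter-rouge">coinFlip</code>, and I’m not claiming it is how <code class="language-plaintext highlighter-rouge">MWC.bernoulli</code> is implemented internally:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import Control.Monad.Primitive (RealWorld)
import System.Random.MWC (Gen)
import qualified System.Random.MWC as MWC

-- Hypothetical stand-in for a coin flip: draw a uniform Double from the
-- unit interval and report a head when it falls below the given probability.
coinFlip :: Double -> Gen RealWorld -> IO Bool
coinFlip prob rng = do
  u <- MWC.uniform rng
  return (u < prob)
</code></pre></div></div>
<p>It follows the same pattern as <code class="language-plaintext highlighter-rouge">decide</code>: bind the random value with <code class="language-plaintext highlighter-rouge"><-</code>, do something pure with it, and return the result wrapped in <code class="language-plaintext highlighter-rouge">IO</code>.</p>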
<h3 id="putting-everything-together">Putting Everything Together</h3>
<p>The final Metropolis transition is a combination of steps one through three.
We can put them together like so:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">metropolis</span> <span class="o">::</span> <span class="p">([</span><span class="kt">Double</span><span class="p">]</span> <span class="o">-></span> <span class="kt">Double</span><span class="p">)</span> <span class="o">-></span> <span class="p">[</span><span class="kt">Double</span><span class="p">]</span> <span class="o">-></span> <span class="kt">Gen</span> <span class="kt">RealWorld</span> <span class="o">-></span> <span class="kt">IO</span> <span class="p">[</span><span class="kt">Double</span><span class="p">]</span>
<span class="n">metropolis</span> <span class="n">altitude</span> <span class="n">current</span> <span class="n">rng</span> <span class="o">=</span> <span class="kr">do</span>
<span class="n">proposed</span> <span class="o"><-</span> <span class="n">propose</span> <span class="n">current</span> <span class="n">rng</span>
<span class="kr">let</span> <span class="n">prob</span> <span class="o">=</span> <span class="n">moveProbability</span> <span class="n">altitude</span> <span class="n">current</span> <span class="n">proposed</span>
<span class="n">decide</span> <span class="n">current</span> <span class="n">proposed</span> <span class="n">prob</span> <span class="n">rng</span>
</code></pre></div></div>
<p>Again, <code class="language-plaintext highlighter-rouge">metropolis</code> is monadic, so we start off with a <code class="language-plaintext highlighter-rouge">do</code> to make monadic
programming easy on us. Whenever we need a random value, we bind the result of
a random number-returning function using the <code class="language-plaintext highlighter-rouge"><-</code> notation.</p>
<p>The <code class="language-plaintext highlighter-rouge">propose</code> function returns a random location, so we bind its result to the
name <code class="language-plaintext highlighter-rouge">proposed</code> using the <code class="language-plaintext highlighter-rouge"><-</code> symbol. The <code class="language-plaintext highlighter-rouge">moveProbability</code> function, on the
other hand, is pure - so we bind that using a <code class="language-plaintext highlighter-rouge">let prob = ...</code> expression. The
<code class="language-plaintext highlighter-rouge">decide</code> function returns a random value, so we can just plop it right on the
end here. The entire result of the <code class="language-plaintext highlighter-rouge">metropolis</code> function is random, so it is
wrapped up in <code class="language-plaintext highlighter-rouge">IO</code>.</p>
<p>The result of <code class="language-plaintext highlighter-rouge">metropolis</code> is just a single transition of the Metropolis
algorithm, which involves doing this kind of thing over and over. If we do
that, we observe a bunch of points that trace out a particular realization of a
<a href="https://en.wikipedia.org/wiki/Markov_chain">Markov chain</a>, which we can generate as follows:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">chain</span>
<span class="o">::</span> <span class="kt">Int</span> <span class="o">-></span> <span class="p">([</span><span class="kt">Double</span><span class="p">]</span> <span class="o">-></span> <span class="kt">Double</span><span class="p">)</span> <span class="o">-></span> <span class="p">[</span><span class="kt">Double</span><span class="p">]</span> <span class="o">-></span> <span class="kt">Gen</span> <span class="kt">RealWorld</span> <span class="o">-></span> <span class="kt">IO</span> <span class="p">[[</span><span class="kt">Double</span><span class="p">]]</span>
<span class="n">chain</span> <span class="n">epochs</span> <span class="n">altitude</span> <span class="n">origin</span> <span class="n">rng</span> <span class="o">=</span> <span class="n">loop</span> <span class="n">epochs</span> <span class="p">[</span><span class="n">origin</span><span class="p">]</span> <span class="kr">where</span>
<span class="n">loop</span> <span class="n">n</span> <span class="n">history</span><span class="o">@</span><span class="p">(</span><span class="n">current</span><span class="o">:</span><span class="kr">_</span><span class="p">)</span>
<span class="o">|</span> <span class="n">n</span> <span class="o"><=</span> <span class="mi">0</span> <span class="o">=</span> <span class="n">return</span> <span class="n">history</span>
<span class="o">|</span> <span class="n">otherwise</span> <span class="o">=</span> <span class="kr">do</span>
<span class="n">next</span> <span class="o"><-</span> <span class="n">metropolis</span> <span class="n">altitude</span> <span class="n">current</span> <span class="n">rng</span>
<span class="n">loop</span> <span class="p">(</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span> <span class="p">(</span><span class="n">next</span><span class="o">:</span><span class="n">history</span><span class="p">)</span>
</code></pre></div></div>
<h3 id="an-example">An Example</h3>
<p>Now that we have our <code class="language-plaintext highlighter-rouge">chain</code> function, we can use it to trace out a collection
of points visited on a realization of a Markov chain. Remember that we’re
supposed to be wandering over some particular abstract landscape; here, let’s
stroll over the one defined by the following function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>landscape :: [Double] -> Double
landscape [x0, x1] =
-0.5 * (x0 ^ 2 * x1 ^ 2 + x0 ^ 2 + x1 ^ 2 - 8 * x0 - 8 * x1)
</code></pre></div></div>
<p>What we’ll now do is pick an origin to start from, wander over the landscape
for some number of steps, and then print the resulting realization of the
Markov chain to stdout. We’ll do all that through the following <code class="language-plaintext highlighter-rouge">main</code>
function:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>main :: IO ()
main = do
rng <- MWC.createSystemRandom
let origin = [-0.2, 0.3]
trace <- chain 1000 landscape origin rng
mapM_ print trace
</code></pre></div></div>
<p>Running that will dump a trace to stdout. If you clean it up and plot it,
you’ll see that the visited points have traced out a rough approximation of the
landscape:</p>
<p><img src="/images/bnn.png" alt="" /></p>
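<p>Since the whole point is to approximate expectations, one simple thing you might do with a trace is average it. Here’s a minimal sketch, assuming the trace is a non-empty list of equal-length locations like the one <code class="language-plaintext highlighter-rouge">chain</code> returns; the helper names are made up for illustration:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import Data.List (transpose)

-- Average of a non-empty list of numbers.
mean :: [Double] -> Double
mean xs = sum xs / fromIntegral (length xs)

-- Hypothetical helper: per-coordinate sample means of a trace.
-- 'transpose' turns a list of locations into a list of coordinate columns.
coordinateMeans :: [[Double]] -> [Double]
coordinateMeans = map mean . transpose
</code></pre></div></div>
<p>Note that these are pure functions. You’d bind the trace with <code class="language-plaintext highlighter-rouge">trace <- chain ...</code> inside <code class="language-plaintext highlighter-rouge">main</code> and then call <code class="language-plaintext highlighter-rouge">coordinateMeans trace</code> on the bound value - generation on one side, processing on the other.</p>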
<h2 id="fini">Fini</h2>
<p>Hopefully this gives a broad idea of how to go about using random numbers in
Haskell. I’ve talked about:</p>
<ul>
<li>Why randomness in Haskell isn’t as simple as randomness in (say) Python or R.</li>
<li>How to handle randomness in Haskell, either by manual generator management or
by offloading that job to a monad.</li>
<li>How to get things done when a monad manages the generator for you - separating
random number generation from random number processing.</li>
<li>Doing all the above with an industrial-strength RNG, using a simple
Metropolis algorithm as an example.</li>
</ul>
<p>Hopefully the example gives you an idea of how to work with random numbers in
practice.</p>
<p>I’ll be the first to admit that randomness in Haskell requires more work than
randomness in a language like R, which to this day remains my go-to language
for interactive data analysis. Using randomness effectively in Haskell
requires a decent understanding of how to <em>work with</em> monadic code, even if one
doesn’t quite understand monads entirely yet.</p>
<p>What I can say is that when one has developed some intuition for monads -
acquiring a ‘feel’ for how to work with monadic functions and values - the
difficulty and awkwardness drop off a bit, and working with randomness feels no
different than working with any other effect.</p>
<p>Happy generating! I’ve dumped the code for the Metropolis example into a
<a href="https://gist.github.com/jtobin/0c097884f6340f29bd23ba52564e82f8">gist</a>.</p>
<p>For a more production-quality Metropolis sampler, you can check out my
<a href="https://github.com/jtobin/mighty-metropolis">mighty-metropolis</a> library, which is a member of the <a href="https://github.com/jtobin/declarative">declarative</a>
suite of MCMC algos.</p>
On Measurability2016-07-18T00:00:00+04:00https://jtobin.io/on-measurability<p>.. this one is pretty dry, I’ll admit. <a href="https://www.amazon.com/Probability-Martingales-Cambridge-Mathematical-Textbooks/dp/0521406056">David Williams</a> said it
best:</p>
<blockquote>
<p>.. Measure theory, that most arid of subjects when done for its own sake,
becomes amazingly more alive when used in probability, not only because it is
then applied, but also because it is immensely enriched.</p>
</blockquote>
<p>Unfortunately for you, dear reader, we won’t be talking about probability.</p>
<p>Moving on. What does it mean for something to be <em>measurable</em> in the
mathematical sense? Take some arbitrary collection \(X\) and slap an
appropriate algebraic structure \(\mathcal{X}\) on it - usually an
<a href="https://en.wikipedia.org/wiki/Algebra_of_sets">algebra</a> or <a href="https://en.wikipedia.org/wiki/Sigma-algebra">\(\sigma\)-algebra</a>, etc. Then we can
refer to a few different objects as ‘measurable’, going roughly as follows.</p>
<p>The elements of the structure \(\mathcal{X}\) are called <em>measurable sets</em>.
They’re called this because they can literally be assigned a notion of measure,
which is a kind of generalized volume. If we’re just talking about some subset
of \(X\) out of the context of its structure then we can be pedantic and call
it measurable <em>with respect to</em> \(\mathcal{X}\), say. You could also call a
set \(\mathcal{X}\)-measurable, to be similarly precise.</p>
<p>The pair formed by the original collection and its associated structure, \((X,
\mathcal{X})\), is called a <em>measurable space</em>. It’s called that because it can
be completed with a measuring function \(\mu\) - itself called a measure - that
assigns notions of measure to measurable sets.</p>
<p>Now take some other measurable space \((Y, \mathcal{Y})\) and consider a
function \(f\) from \(X\) to \(Y\). This is a <em>measurable function</em> if it
satisfies the following technical requirement: that for any
\(\mathcal{Y}\)-measurable set, its preimage under \(f\) is an element of
\(\mathcal{X}\) (so: the preimage under \(f\) is \(\mathcal{X}\)-measurable).</p>
<p>The concept of measurability for functions probably feels the least intuitive
of the three - like one of those dry taxonomical classifications that we just
have to keep on the books. The ‘make sure your function is measurable and
everything will be ok’ heuristic will get you pretty far. But there is some
good intuition available, if you want to look for it.</p>
<p>Here’s an example: define a set \(X\) that consists of the elements \(A\),
\(B\), and \(C\). To talk about measurable functions, we first need to define
our measurable sets. The de-facto default structure used for this is a
<a href="https://en.wikipedia.org/wiki/Sigma-algebra">\(\sigma\)-algebra</a>, and we can always <em>generate</em> one from some
underlying class of sets. Let’s do that from the following plain old
<em>partition</em> that splits the original collection into a couple of disjoint
‘slices’:</p>
\[H = \{\{A, B\}, \{C\}\}\]
<p>The \(\sigma\)-algebra \(\mathcal{X}\) generated from this partition will just
be the partition itself, completed with the whole set \(X\) and the empty set.
To be clear, it’s the following:</p>
\[\mathcal{X} = \left\{\{A, B, C\}, \{A, B\}, \{C\}, \emptyset\right\}\]
<p>The resulting measurable space is \((X, \mathcal{X})\). So we could assign a
notion of generalized volume to any element of \(\mathcal{X}\), though I won’t
actually worry about doing that here.</p>
<p>Now. Let’s think about some functions from \(X\) to the real numbers, which
we’ll assume to be endowed with a suitable \(\sigma\)-algebra of their own (one
typically assumes the <a href="https://en.wikipedia.org/wiki/Topological_space#Examples_of_topological_spaces">standard topology</a> on \(\mathbb{R}\) and the
associated <a href="https://en.wikipedia.org/wiki/Borel_set">Borel \(\sigma\)-algebra</a>).</p>
<p>How about this - a simple indicator function on the slice containing \(C\):</p>
\[f(x) =
\begin{cases}
0, \, x \in \{A, B\} \\
1, \, x \in \{C\}
\end{cases}\]
<p>Is it measurable? That’s easy to check. The preimage of \(\{0\}\) is \(\{A,
B\}\), the preimage of \(\{1\}\) is \(\{C\}\), and the preimage of \(\{0, 1\}\)
is \(X\) itself. Those are all in \(\mathcal{X}\), and the preimage of the
empty set is the empty set, so we’re good.</p>
<p>Ok. What about this one:</p>
\[g(x) =
\begin{cases}
0, \, x \in \{A\} \\
1, \, x \in \{B\} \\
2, \, x \in \{C\}
\end{cases}\]
<p>Check the preimage of \(\{1, 2\}\) and you’ll find it’s \(\{B, C\}\). But
that’s <em>not</em> a member of \(\mathcal{X}\), so \(g\) is not measurable!</p>
<p>What happened here? Failing to satisfy technical requirements aside: what,
intuitively, made \(f\) measurable where \(g\) wasn’t?</p>
<p>The answer is a problem of <em>resolution</em>. Look again at \(\mathcal{X}\):</p>
\[\left\{\{A, B, C\}, \{A, B\}, \{C\}, \emptyset\right\}\]
<p>The structure \(\mathcal{X}\) that we’ve endowed our collection \(X\) with is
<em>too coarse</em> to permit distinguishing between elements of the slice \(\{A,
B\}\). There is no measurable set \(\{A\}\), nor a measurable set \(\{B\}\) - just
a measurable set \(\{A, B\}\). And as a result, if we define a function that
says something about either \(A\) or \(B\) without saying the same thing about
the other, <em>that function won’t be measurable.</em> The function \(f\) assigned
the same value to both \(A\) and \(B\), so we didn’t have any problem there.</p>
<p>If we want to be able to distinguish between \(A\) and \(B\), we’ll need to
equip \(X\) with some structure that has a finer resolution. You can check
that if you make a measurable space out of \(X\) and its power set (the set of
all subsets of \(X\)) then \(g\) will be measurable there, for example.</p>
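<p>For a finite example like this one, the check can be carried out mechanically. What follows is a minimal Haskell sketch, not anything canonical: the names ‘Sets’, ‘coarse’, ‘fine’, ‘preimage’, and ‘measurable’ are made up for illustration, and it assumes that it’s enough to inspect subsets of the function’s (finite) image, since the preimage of any other set of reals coincides with the preimage of its intersection with the image.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import Data.List (nub, subsequences)

-- The underlying collection X = {A, B, C}.
data X = A | B | C deriving (Eq, Ord, Show, Enum, Bounded)

-- A finite collection of measurable sets, each stored as a list in
-- the order A, B, C.
type Sets a = [[a]]

-- The sigma-algebra generated by the partition {{A, B}, {C}}.
coarse :: Sets X
coarse = [[], [A, B], [C], [A, B, C]]

-- The power set of X, the finest structure available.
fine :: Sets X
fine = subsequences [minBound .. maxBound]

-- The preimage of a set of values under a function on X.
preimage :: Eq b => (X -> b) -> [b] -> [X]
preimage h ys = [x | x <- [minBound .. maxBound], h x `elem` ys]

-- A function is measurable if the preimage of every subset of its
-- image is one of the measurable sets.
measurable :: Eq b => Sets X -> (X -> b) -> Bool
measurable sets h =
    all (\ys -> preimage h ys `elem` sets) (subsequences image)
  where
    image = nub (map h [minBound .. maxBound])

-- The indicator function on the slice containing C.
f :: X -> Int
f x = if x == C then 1 else 0

-- The function that tells A, B, and C apart.
g :: X -> Int
g = fromEnum
</code></pre></div></div>
<p>Here ‘measurable coarse f’ evaluates to ‘True’ and ‘measurable coarse g’ to ‘False’, while ‘measurable fine g’ is ‘True’ again, which is the resolution story in miniature.</p>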
<p>So if we’re using partitions to define our measurable sets, we get a neat
little property: for any measurable function, elements in the same slice of the
partition <em>must</em> have the same value when passed through the function. So if
you have a function \(h : X \to H\) that takes an element to its respective
slice in a partition, you know that \(h(x_{0}) = h(x_{1})\) for any \(x_{0}\),
\(x_{1}\) in \(X\) implies that \(f(x_{0}) = f(x_{1})\) for any measurable
function \(f\).</p>
<h2 id="addendum">Addendum</h2>
<p>Whipping together a measurable space using a \(\sigma\)-algebra generated by a
partition of sets occurs naturally when we talk about <a href="https://en.wikipedia.org/wiki/Correlated_equilibrium">correlated
equilibrium</a>, a solution concept in non-cooperative game theory. It’s
common to say a function - in that context a <em>correlated strategy</em> - must be
measurable ‘with respect to the partition’, which sort of elides the fact that
we still need to generate a \(\sigma\)-algebra from it anyway.</p>
<p>Some oldschool authors (Halmos, at least) developed their measure theory using
<a href="https://en.wikipedia.org/wiki/Ring_of_sets">\(\sigma\)-rings</a>, but this doesn’t seem very popular nowadays.
Since a ring doesn’t require including the entire set \(X\), you need to go
through an awkward extra hoop when defining measurability on functions. But
regardless, it’s interesting to think about what happens when one uses
different structures to define measurable sets!</p>
Making a Market2016-04-20T00:00:00+04:00https://jtobin.io/making-a-market<p>Suppose you’re in the derivatives business. You are interested in making a
market on some events; say, whether or not your friend Jay will win tomorrow
night’s poker game, or that the winning pot will be at least USD 100. Let’s
examine some rules about how you should do business if you want this venture to
succeed.</p>
<p>What do I mean by ‘make a market’? I mean that you will be willing to buy and
sell units of a particular security that will be redeemable from the seller at
some particular value after tomorrow’s poker game has ended (you will be making
a simple <em>prediction market</em>, in other words). You can make bid offers to buy
securities at some price, and ask offers to sell securities at some price.</p>
<p>To keep things simple let’s say you’re doing this gratis; society rewards you
extrinsically for facilitating the market - your friends will give you free
pizza at the game, maybe - so you won’t levy any <em>transaction fees</em> for making
trades. Also scarcity isn’t a huge issue, so you’re willing to buy or sell any
finite number of securities.</p>
<p>Consider the possible outcomes of the game (one and only one of which must
occur):</p>
<ol>
<li>(A) Jay wins and the pot is at least USD 100.</li>
<li>(B) Jay wins and the pot is less than USD 100.</li>
<li>(C) Jay loses and the pot is at least USD 100.</li>
<li>(D) Jay loses and the pot is less than USD 100.</li>
</ol>
<p>The securities you are making a market on pay USD 1 if an event occurs, and USD
0 otherwise. So: if I buy 5 securities on outcome \(A\) from you, and outcome
\(A\) occurs, I’ll be able to go to you and redeem my securities for a total of
USD 5. Alternatively, if I sell you 5 securities on outcome \(A\), and outcome
\(A\) occurs, you’ll be able to come to me and redeem your securities for a
total of USD 5.</p>
<p>Consider what that implies: as a market maker, you face the prospect of making
hefty payments to customers who redeem valuable securities. For example,
imagine the situation where you charge USD 0.50 for a security on outcome
\(A\), but outcome \(A\) is almost certain to occur in some sense (Jay is a
beast when it comes to poker and a lot of high rollers are playing); if your
customers exclusively load up on 100 cheap securities on outcome \(A\), and
outcome \(A\) occurs, then you stand to owe them a total payment of USD 100
against the USD 50 that they have paid for the securities. You thus have a
heavy incentive to price your securities as accurately as possible - where
‘accurate’ means to minimize your expected loss.</p>
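<p>To put rough numbers on that, suppose (purely as an illustrative assumption, with notation that isn’t used elsewhere in this post) that outcome \(A\) occurs with probability \(q\), and that you sell \(n\) securities on it at price \(p\). You collect \(np\) up front and owe \(n\) if \(A\) occurs, so your expected profit is:</p>
\[\mathbb{E}[\text{profit}] = np - nq = n(p - q)\]
<p>With \(n = 100\), \(p = 0.50\), and \(q\) close to 1, that’s an expected loss of nearly USD 50, in line with the example above. The expected profit is nonnegative only when \(p \geq q\).</p>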
<p>It may well be the case, however, that it is difficult to price your
securities accurately. For example, if some customer has more information than
you (say, she privately knows that Jay is unusually bad at poker) then she
potentially stands to profit from holding said information, given your
ignorance on the matter (and that of your prices). Such is life for a market
maker. But there are particular prices you could offer - independent of any
participant’s private information - that are plainly stupid or ruinous for you
(a set of prices like this is sometimes called a <a href="https://en.wikipedia.org/wiki/Dutch_book">Dutch
book</a>). Consider selling securities
on outcome \(A\) for the price of USD -1; then anyone who buys one of these
securities not only stands to redeem USD 1 in the event outcome \(A\) occurs,
but also gains USD 1 simply from the act of buying the security in the first
place.</p>
<p>Setting a negative price like this is irrational on your part; customers will
realize an <em>arbitrage opportunity</em> on securities for outcome \(A\) and will
happily buy as many as they can get their hands on, to your ruin. In other
words - and to nobody’s surprise - by setting a negative price, <strong>you can be
made a sure loser</strong> in the market.</p>
<p>There are other prices you should avoid setting as well, if you want to avoid
arbitrage opportunities like these. For starters:</p>
<ul>
<li>For any outcome \(E\), you must set the price of a security on \(E\) to be at
least USD 0.</li>
<li>For any <em>certain</em> outcome \(E\), you must set the price of a security on \(E\) to
be exactly USD 1.</li>
</ul>
<p>The first condition rules out negative prices, and the second ensures that your
books balance when it comes time to settle payment for securities on a certain
event.</p>
<p>What’s more, the price that you set on any given security doesn’t exist in
isolation. Given the outcomes \(A\), \(B\), \(C\), and \(D\) listed previously, at
least one <em>must</em> occur. So as per the second rule, the price of a synthetic
derivative on the outcome “Jay wins or loses, and the pot is any value” must be
set to USD 1. This places constraints on the prices that you can set for
individual securities. It suffices that:</p>
<ul>
<li>For any countable set of mutually exclusive outcomes \(E_{1}, E_{2}, \ldots\),
you must set the price of the security on outcome “\(E_{1}\) or \(E_{2}\) or..”
to exactly the sum of the prices of the individual outcomes.</li>
</ul>
<p>This eliminates the possibility that your customers will make you a certain
loser by buying elaborate combinations of securities on different outcomes.</p>
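<p>For the four poker outcomes, the rules so far can be checked mechanically. Here’s a rough Haskell sketch under some simplifying assumptions: prices are quoted only on the four atomic outcomes, the names ‘Prices’, ‘coherent’, ‘sureProfit’, ‘fair’, and ‘cheap’ are invented for illustration, and ‘sureProfit’ only covers the arbitrage that comes from trading one security on every outcome (it ignores the separate problem of negative prices).</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import Data.Ratio ((%))

-- Prices, in USD, of securities on the four mutually exclusive and
-- exhaustive outcomes A, B, C, and D.
type Prices = [Rational]

-- The rules, specialised to this finite setting: no negative prices,
-- and the atomic prices must add up to the price of the certain
-- outcome "A or B or C or D", which is USD 1.
coherent :: Prices -> Bool
coherent ps = all (>= 0) ps && sum ps == 1

-- The profit a customer can lock in by trading one security on every
-- outcome. Exactly one of them pays USD 1, so if the prices sum to
-- less than 1 the customer buys the lot, and if they sum to more than
-- 1 the customer sells the lot.
sureProfit :: Prices -> Rational
sureProfit ps = abs (1 - sum ps)

-- A coherent set of prices, and one that leaves money on the table.
fair, cheap :: Prices
fair  = [1 % 2, 1 % 4, 1 % 8, 1 % 8]
cheap = [1 % 4, 1 % 4, 1 % 4, 1 % 8]
</code></pre></div></div>
<p>With these definitions ‘coherent fair’ holds and ‘sureProfit fair’ is zero, whereas ‘coherent cheap’ fails and a customer who buys one of each security at the ‘cheap’ prices is guaranteed USD 1/8 no matter how the game turns out.</p>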
<p>There are other rules that your prices must obey as well, but they fall out as
corollaries of these three. If you broke any of them you’d also be breaking
one of these.</p>
<p>It turns out that you <em>cannot be made a sure loser if, and only if, your prices
obey these three rules</em>. That is:</p>
<ul>
<li>If your prices follow these rules, then you will offer customers no arbitrage
opportunities.</li>
<li>Any market absent of arbitrage opportunities must have prices that conform
to these rules.</li>
</ul>
<p>These prices are called <em>coherent</em>, and absence of coherence implies the
existence of arbitrage opportunities for your customers.</p>
<h2 id="but-why-male-models">But Why Male Models</h2>
<p>The trick, of course, is that these prices correspond to <em>probabilities</em>, and
the rules for avoiding arbitrage correspond to the standard <a href="https://en.wikipedia.org/wiki/Probability_axioms">Kolmogorov
axioms</a> of probability
theory.</p>
<p>The consequence is that if your description of uncertain phenomena does not
involve probability theory, or does not behave exactly like probability theory,
then it is an <em>incoherent</em> representation of information you have about those
phenomena.</p>
<p>As a result, probability theory should be your tool of choice when it comes
to describing uncertain phenomena. Granted you may not have to worry about
market making in return for pizza, but you’d like to be assured that there are
no structural problems with your description.</p>
<h2 id="comments">Comments</h2>
<p>This is a summary of the development of probability presented in Jay Kadane’s
brilliant <a href="http://uncertainty.stat.cmu.edu/">Principles of Uncertainty</a>. The
original argument was developed by de Finetti and Savage in the mid-20th
century.</p>
<p>Kadane’s book makes for an exceptional read, and it’s free to boot. I
recommend checking it out if it has flown under your radar.</p>
<p>An interesting characteristic of this development of probability is that there
is no way to guarantee the nonexistence of arbitrage opportunities for a
countably infinite number of purchased securities. That is: if you’re a market
maker, you could be made a sure loser in the market when it came time for you
to settle a countably infinite number of redemption claims. The quirk here is
that you could also be made a sure winner as well; whether you win or lose with
certainty depends on the order in which the claims are settled! (Fortunately
this doesn’t tend to be an issue in practice.)</p>
<p>Thanks to Fredrik Olsen for reviewing a draft of this post.</p>
<h2 id="references">References</h2>
<ul>
<li><a href="http://uncertainty.stat.cmu.edu/">Principles of Uncertainty</a></li>
<li><a href="http://www.mit.edu/~mitter/publications/102_ondefinetti_elsev.pdf">On De Finetti coherence and Kolmogorov probability</a></li>
<li><a href="https://normaldeviate.wordpress.com/2013/06/30/lost-causes-in-statistics-i-finite-additivity/">Lost Causes in Statistics I: Finite Additivity</a></li>
<li><a href="http://joelvelasco.net/teaching/3865/howson%20-%20de%20finetti%20countable%20additivity.pdf">De Finetti, Countable Additivity, Consistency and Coherence</a></li>
<li><a href="http://wwwf.imperial.ac.uk/~bin06/Papers/favcarev.pdf">Finite Additivity Versus Countable Additivity: De Finetti and Savage</a></li>
</ul>
flat-mcmc Update and v1.0.0 Release2016-04-07T00:00:00+04:00https://jtobin.io/flat-mcmc-update<p>I’ve updated my old <a href="https://github.com/jtobin/flat-mcmc"><em>flat-mcmc</em></a> library
for ensemble sampling in Haskell and have pushed out a v1.0.0 release.</p>
<h2 id="history">History</h2>
<p>I wrote <em>flat-mcmc</em> in 2012, and it was the first serious-ish size project I
attempted in Haskell. It’s an implementation of Goodman & Weare’s <a href="http://msp.org/camcos/2010/5-1/camcos-v5-n1-p04-p.pdf">affine
invariant ensemble
sampler</a>, a Monte Carlo
algorithm that works by running a Markov chain over an ensemble of particles.
It’s easy to get started with (there are no tuning parameters, etc.) and
is sufficiently robust for a lot of purposes. The algorithm became somewhat
famous in the astrostatistics community, where some of its members implemented
it via the very nice and polished Python library,
<a href="http://dan.iel.fm/emcee/current/">emcee</a>.</p>
<p>The library has become my second-most starred repo on Github, with a whopping
10 stars as of this writing (the Haskell MCMC community is pretty niche, bro).
Lately someone emailed me and asked if I wouldn’t mind pushing it to Stackage,
so I figured it was due for an update and gave it a little modernizing along
the way.</p>
<p>I’m currently on sabbatical and am traveling through Vietnam; I started the
rewrite in Hanoi and finished it in Saigon, so it was a kind of nice side
project to do while sipping coffees and the like during downtime.</p>
<h2 id="what-is-it">What Is It</h2>
<p>I wrote a little summary of the library in 2012, which you can still find
<a href="http://jtobin.ca/flat-mcmc/">tucked away on my personal site</a>. Check that out
if you’d like a description of the algorithm and why you might want to use it.</p>
<p>Since I wrote the initial version my astrostatistics-inclined friends David
Huijser and Brendon Brewer <a href="http://arxiv.org/abs/1509.02230">wrote a paper</a>
about some limitations they discovered when using this algorithm in
high-dimensional settings. So caveat emptor, buyer beware and all that.</p>
<p>In general this is an extremely easy-to-use algorithm that will probably get
you decent samples from arbitrary targets without tedious tuning/fiddling.</p>
<h2 id="whats-new">What’s New</h2>
<p>I’ve updated and standardized the API in line with my other MCMC projects
huddled around the <a href="markov-chains-a-la-carte">declarative</a> library. That means
that, like the others, there are two primary ways to use the library: via an
<code class="language-plaintext highlighter-rouge">mcmc</code> function that will print a trace to stdout, or a <code class="language-plaintext highlighter-rouge">flat</code> transition
operator that can be used to work with chains in memory.</p>
<p>Regrettably you can’t use the <code class="language-plaintext highlighter-rouge">flat</code> transition operator with others in the
<code class="language-plaintext highlighter-rouge">declarative</code> ecosystem as it operates over <em>ensembles</em>, whereas the others are
single-particle algorithms.</p>
<p>The README over at the <a href="https://github.com/jtobin/flat-mcmc">Github repo</a>
contains a brief usage example. If there’s some feature you’d like to see or
documentation/examples you could stand to have added then don’t hesitate to
ping me and I’ll be happy to whip something up.</p>
<p>In the meantime I’ve pushed a new version to
<a href="https://hackage.haskell.org/package/flat-mcmc">Hackage</a> and added the library
to <a href="https://www.stackage.org/">Stackage</a>, so it should show up in an LTS
release soon enough.</p>
<p>Cheers!</p>
Encoding Statistical Independence, Statically2016-02-16T00:00:00+04:00https://jtobin.io/encoding-independence-statically<p><a href="http://strictlypositive.org/IdiomLite.pdf">Applicative functors</a> are useful
for encoding context-free effects. This typically gets put to work around
things like <a href="https://hackage.haskell.org/package/optparse-applicative">parsing</a>
or <a href="https://jaspervdj.be/posts/2015-05-19-monoidal-either.html">validation</a>,
but if you have a statistical bent then an applicative structure will be
familiar to you as an encoder of <em>independence</em>.</p>
<p>In this article I’ll give a whirlwind tour of probability monads and algebraic
freeness, and demonstrate that applicative functors can be used to represent
independence between probability distributions in a way that can be verified
statically.</p>
<p>I’ll use the following preamble for the code in the rest of this article.
You’ll need the <a href="https://hackage.haskell.org/package/free">free</a> and
<a href="https://hackage.haskell.org/package/mwc-probability">mwc-probability</a>
libraries if you’re following along at home:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">{-# LANGUAGE DeriveFunctor #-}</span>
<span class="cp">{-# LANGUAGE LambdaCase #-}</span>
<span class="kr">import</span> <span class="nn">Control.Applicative</span>
<span class="kr">import</span> <span class="nn">Control.Applicative.Free</span>
<span class="kr">import</span> <span class="nn">Control.Monad</span>
<span class="kr">import</span> <span class="nn">Control.Monad.Free</span>
<span class="kr">import</span> <span class="nn">Control.Monad.Primitive</span>
<span class="kr">import</span> <span class="nn">System.Random.MWC.Probability</span> <span class="p">(</span><span class="kt">Prob</span><span class="p">)</span>
<span class="kr">import</span> <span class="k">qualified</span> <span class="nn">System.Random.MWC.Probability</span> <span class="k">as</span> <span class="n">MWC</span>
</code></pre></div></div>
<h2 id="probability-distributions-and-algebraic-freeness">Probability Distributions and Algebraic Freeness</h2>
<p>Many functional programmers (though fewer statisticians) know that probability
has a <a href="https://www.cs.tufts.edu/~nr/pubs/pmonad.pdf">monadic structure</a>. This
can be expressed in multiple ways; the discrete probability distribution type
found in the
<a href="https://web.engr.oregonstate.edu/~erwig/papers/PFP_JFP06.pdf">PFP</a> framework,
the sampling function representation used in the
<a href="https://www.cs.cmu.edu/~fp/papers/toplas08.pdf">lambda-naught</a> paper (and
implemented <a href="https://github.com/jtobin/mwc-probability">here</a>, for example),
and even an obscure <a href="https://github.com/jtobin/measurable">measure-based</a>
representation first described by Ramsey and Pfeffer, which doesn’t have a ton
of practical use.</p>
<p>The monadic structure allows one to sequence distributions together. That is:
if some distribution ‘foo’ has a parameter which itself has the probability
distribution ‘bar’ attached to it, the compound distribution can be expressed
by the monadic expression ‘bar >>= foo’.</p>
<p>At a larger scale, monadic programs like this correspond exactly to what you’d
typically see in a run-of-the-mill visualization of a probabilistic model:</p>
<p><img src="/images/fmm.png" alt="" class="center-image" /></p>
<p>In this classical kind of visualization the nodes represent probability
distributions and the arrows describe the dependence structure. Translating it
to a monadic program is mechanical: the nodes become monadic expressions and
the arrows become binds. You’ll see a simple example in this article shortly.</p>
<p>The monadic structure of probability implies that it also has a <em>functorial</em>
structure. Mapping a function over some probability distribution will
transform its support while leaving its probability density structure invariant
in some sense. If the function ‘uniform’ defines a uniform probability
distribution over the interval (0, 1), then the function ‘fmap (+ 1) uniform’
will define a probability distribution over the interval (1, 2).</p>
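<p>To make the sequencing and mapping described above concrete, here’s a small self-contained sketch that uses a list-of-outcomes representation in the spirit of the PFP one. It is an illustration only: the ‘Dist’ type is a stand-in rather than the type from any of the libraries mentioned, and ‘bar’ and ‘foo’ are just the placeholder names from the text given arbitrary definitions.</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import Data.Ratio ((%))

-- A finite discrete distribution: outcomes paired with probabilities.
newtype Dist a = Dist { runDist :: [(a, Rational)] }

-- Mapping transforms the support and leaves the probabilities alone.
instance Functor Dist where
  fmap f (Dist xs) = Dist [(f x, p) | (x, p) <- xs]

instance Applicative Dist where
  pure x = Dist [(x, 1)]
  Dist fs <*> Dist xs = Dist [(f x, p * q) | (f, p) <- fs, (x, q) <- xs]

-- Binding feeds each outcome to a distribution that depends on it,
-- multiplying probabilities along the way.
instance Monad Dist where
  Dist xs >>= k = Dist [(y, p * q) | (x, p) <- xs, (y, q) <- runDist (k x)]

-- A uniform distribution over {0, 1, 2}.
bar :: Dist Int
bar = Dist [(n, 1 % 3) | n <- [0, 1, 2]]

-- A uniform distribution over {0, .., n}, parameterized by n.
foo :: Int -> Dist Int
foo n = Dist [(k, 1 % fromIntegral (n + 1)) | k <- [0 .. n]]

-- The compound distribution: draw the parameter from 'bar', then
-- draw from 'foo' at that parameter.
compound :: Dist Int
compound = bar >>= foo

-- The same distribution as 'bar', with its support shifted up by one.
shifted :: Dist Int
shifted = fmap (+ 1) bar
</code></pre></div></div>
<p>The support of ‘shifted’ is 1, 2, and 3, each still with probability 1/3, and the probabilities in ‘compound’ still sum to one.</p>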
<p>I’ll come back to probability shortly, but the point is that probability
distributions have a clear and well-defined algebraic structure in terms of
things like functors and monads.</p>
<p>Recently <em>free objects</em> have become fashionable in functional programming. I
won’t talk about it in detail here, but algebraic ‘freeness’ corresponds to a
certain <em>preservation of structure</em>, and exploiting this kind of preserved
structure is a useful technique for writing and interpreting programs.</p>
<p>Gabriel Gonzalez famously wrote about freeness in an <a href="http://www.haskellforall.com/2012/06/you-could-have-invented-free-monads.html">oft-cited
article</a>
about free monads, John De Goes wrote a compelling piece on the topic in the
excellent <a href="http://degoes.net/articles/modern-fp/">A Modern Architecture for Functional
Programming</a>, and just today I noticed
that Chris Stucchio had published an article on using <a href="http://engineering.wingify.com/posts/Free-objects/">Free Boolean
Algebras</a> for implementing
a kind of constraint DSL. The last article included the following quote, which
IMO sums up much of the <em>raison d’être</em> to exploit freeness in your day-to-day
work:</p>
<blockquote>
<p>.. if you find yourself re-implementing the same algebraic structure over and over, it might be worth asking yourself if a free version of that algebraic structure exists. If so, you might save yourself a lot of work by using that.</p>
</blockquote>
<p>If a free version of some structure exists, then it embodies the ‘essence’ of
that structure, and you can encode specific instances of it by just layering
the required functionality over the free object itself.</p>
<h2 id="a-type-for-probabilistic-models">A Type for Probabilistic Models</h2>
<p>Back to probability. Since probability distributions are monads, we can use a
free monad to encode them in a structure-preserving way. Here I’ll define a
simple probability base functor for which each constructor is a particular
‘named’ probability distribution:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data ProbF r =
    BetaF Double Double (Double -> r)
  | BernoulliF Double (Bool -> r)
  deriving Functor
type Model = Free ProbF
</code></pre></div></div>
<p>Here we’ll only work with two simple named distributions - the beta and the
Bernoulli - but the sky is the limit.</p>
<p>The ‘Model’ type wraps up this probability base functor in the free monad,
‘Free’. In this sense a ‘Model’ can be viewed as a program parameterized by
the underlying probabilistic instruction set defined by ‘ProbF’ (a technique I
<a href="/tour-of-some-recursive-types">described
recently</a>).</p>
<p>Expressions with the type ‘Model’ are terms in an embedded language. We can
create some user-friendly ones for our beta-bernoulli language like so:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">beta</span> <span class="o">::</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Model</span> <span class="kt">Double</span>
<span class="n">beta</span> <span class="n">a</span> <span class="n">b</span> <span class="o">=</span> <span class="n">liftF</span> <span class="p">(</span><span class="kt">BetaF</span> <span class="n">a</span> <span class="n">b</span> <span class="n">id</span><span class="p">)</span>
<span class="n">bernoulli</span> <span class="o">::</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Model</span> <span class="kt">Bool</span>
<span class="n">bernoulli</span> <span class="n">p</span> <span class="o">=</span> <span class="n">liftF</span> <span class="p">(</span><span class="kt">BernoulliF</span> <span class="n">p</span> <span class="n">id</span><span class="p">)</span>
</code></pre></div></div>
<p>Those primitive terms can then be used to construct expressions.</p>
<p>The beta and Bernoulli distributions enjoy an algebraic property called
<a href="https://en.wikipedia.org/wiki/Conjugate_prior">conjugacy</a> that ensures
(amongst other things) that the compound distribution formed by combining the
two of them is <a href="https://en.wikipedia.org/wiki/Beta-binomial_distribution">analytically
tractable</a>. Here’s a
parameterized coin constructed by doing just that:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">coin</span> <span class="o">::</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Model</span> <span class="kt">Bool</span>
<span class="n">coin</span> <span class="n">a</span> <span class="n">b</span> <span class="o">=</span> <span class="n">beta</span> <span class="n">a</span> <span class="n">b</span> <span class="o">>>=</span> <span class="n">bernoulli</span>
</code></pre></div></div>
<p>By tweaking the parameters ‘a’ and ‘b’ we can bias the coin in particular ways,
making it more or less likely to observe a head when it’s inspected.</p>
<p>A simple evaluator for the language goes like this:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>eval :: PrimMonad m => Model a -> Prob m a
eval = iterM $ \case
  BetaF a b k -> MWC.beta a b >>= k
  BernoulliF p k -> MWC.bernoulli p >>= k
</code></pre></div></div>
<p>‘iterM’ is a monadic, catamorphism-like <a href="/practical-recursion-schemes">recursion
scheme</a>
that can be used to succinctly consume a ‘Model’. Here I’m using it to
propagate uncertainty through the model by sampling from it ancestrally in a
top-down manner. The ‘MWC.beta’ and ‘MWC.bernoulli’ functions are sampling
functions from the <em>mwc-probability</em> library, and the resulting type ‘Prob m a’
is a simple probability monad type based on sampling functions.</p>
<p>To actually sample from the resulting ‘Prob’ type we can use the system’s PRNG
for randomness. Here are some simple coin tosses with various biases as an
example; you can mentally substitute ‘Head’ for ‘True’ if you’d like:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> gen <- MWC.createSystemRandom
> replicateM 10 $ MWC.sample (eval (coin 1 1)) gen
[False,True,False,False,False,False,False,True,False,False]
> replicateM 10 $ MWC.sample (eval (coin 1 8)) gen
[False,False,False,False,False,False,False,False,False,False]
> replicateM 10 $ MWC.sample (eval (coin 8 1)) gen
[True,True,True,False,True,True,True,True,True,True]
</code></pre></div></div>
<p>As a side note: encoding probability distributions in this way means that the
other ‘forms’ of probability monad described previously happen to fall out
naturally in the form of specific interpreters over the free monad itself. A
measure-based probability monad could be achieved by using a different ‘eval’
function; the important monadic structure is already preserved ‘for free’:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">measureEval</span> <span class="o">::</span> <span class="kt">Model</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">Measure</span> <span class="n">a</span>
<span class="n">measureEval</span> <span class="o">=</span> <span class="n">iterM</span> <span class="o">$</span> <span class="nf">\</span><span class="kr">case</span>
  <span class="kt">BetaF</span> <span class="n">a</span> <span class="n">b</span> <span class="n">k</span> <span class="o">-></span> <span class="kt">Measurable</span><span class="o">.</span><span class="n">beta</span> <span class="n">a</span> <span class="n">b</span> <span class="o">>>=</span> <span class="n">k</span>
  <span class="kt">BernoulliF</span> <span class="n">p</span> <span class="n">k</span> <span class="o">-></span> <span class="kt">Measurable</span><span class="o">.</span><span class="n">bernoulli</span> <span class="n">p</span> <span class="o">>>=</span> <span class="n">k</span>
</code></pre></div></div>
<h2 id="independence-and-applicativeness">Independence and Applicativeness</h2>
<p>So that’s all cool stuff. But in some cases the monadic structure is more than
what we actually require. Consider flipping two coins and then returning them
in a pair, for example:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">flips</span> <span class="o">::</span> <span class="kt">Model</span> <span class="p">(</span><span class="kt">Bool</span><span class="p">,</span> <span class="kt">Bool</span><span class="p">)</span>
<span class="n">flips</span> <span class="o">=</span> <span class="kr">do</span>
  <span class="n">c0</span> <span class="o"><-</span> <span class="n">coin</span> <span class="mi">1</span> <span class="mi">8</span>
  <span class="n">c1</span> <span class="o"><-</span> <span class="n">coin</span> <span class="mi">8</span> <span class="mi">1</span>
  <span class="n">return</span> <span class="p">(</span><span class="n">c0</span><span class="p">,</span> <span class="n">c1</span><span class="p">)</span>
</code></pre></div></div>
<p>These coins are independent - they don’t affect each other whatsoever and enjoy
the <a href="https://en.wikipedia.org/wiki/Independence_(probability_theory)">probabilistic/statistical
property</a> that
formalizes that relationship. But the monadic program above doesn’t actually
capture this independence in any sense; desugared, the program actually
proceeds like this:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">flips</span> <span class="o">=</span>
  <span class="n">coin</span> <span class="mi">1</span> <span class="mi">8</span> <span class="o">>>=</span> <span class="nf">\</span><span class="n">c0</span> <span class="o">-></span>
  <span class="n">coin</span> <span class="mi">8</span> <span class="mi">1</span> <span class="o">>>=</span> <span class="nf">\</span><span class="n">c1</span> <span class="o">-></span>
  <span class="n">return</span> <span class="p">(</span><span class="n">c0</span><span class="p">,</span> <span class="n">c1</span><span class="p">)</span>
</code></pre></div></div>
<p>On the right side of any monadic bind we just have a black box - an opaque
function that can’t be examined statically. Each monadic expression binds its
result to the rest of the program, and we - programming ‘at the surface’ -
can’t look at it without going ahead and evaluating it. In particular we can’t
guarantee that the coins are truly independent - it’s just a mental invariant
that can’t be transferred to an interpreter.</p>
<p>But this is the well-known motivation for applicative functors, so we can do a
little better here by exploiting them. Applicatives are strictly less
powerful than monads, so they let us write a probabilistic program that can
<em>guarantee</em> the independence of expressions.</p>
<p>Let’s bring in the free applicative, ‘Ap’. I’ll define a type called ‘Sample’
by layering ‘Ap’ over our existing ‘Model’ type:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>type Sample = Ap Model
</code></pre></div></div>
<p>So an expression with type ‘Sample’ is a free applicative over the ‘Model’ base
functor. I chose the name because we typically talk about samples that are
independent and identically-distributed draws from some probability
distribution, though we could use ‘Ap’ to collect samples that are
independently-but-not-identically distributed as well.</p>
<p>To use our existing embedded language terms with the free applicative, we can
create the following helper function as an alias for ‘liftAp’ from
‘Control.Applicative.Free’:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">independent</span> <span class="o">::</span> <span class="n">f</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">Ap</span> <span class="n">f</span> <span class="n">a</span>
<span class="n">independent</span> <span class="o">=</span> <span class="n">liftAp</span>
</code></pre></div></div>
<p>With that in hand, we can write programs that statically encode independence.
Here are the two coin flips from earlier (if you’re applicative-savvy, note that
I’m avoiding ‘liftA2’ here for clarity):</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">flips</span> <span class="o">::</span> <span class="kt">Sample</span> <span class="p">(</span><span class="kt">Bool</span><span class="p">,</span> <span class="kt">Bool</span><span class="p">)</span>
<span class="n">flips</span> <span class="o">=</span> <span class="p">(,)</span> <span class="o"><$></span> <span class="n">independent</span> <span class="p">(</span><span class="n">coin</span> <span class="mi">1</span> <span class="mi">8</span><span class="p">)</span> <span class="o"><*></span> <span class="n">independent</span> <span class="p">(</span><span class="n">coin</span> <span class="mi">8</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>The applicative structure enforces exactly what we want: that no part of the
effectful computation can depend on a previous part of the effectful
computation. Or in probability-speak: that the distributions involved do not
depend on each other in any way (they would be captured by the <em>plate</em> notation
in the visualization shown previously).</p>
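<p>One way to see that this structure really is available statically: a free applicative can be inspected without running any of its effects. The following is a minimal sketch, assuming ‘runAp_’ from ‘Control.Applicative.Free’; the ‘Draw’ type is a hypothetical stand-in for the ‘Model’ terms used above, and ‘draws’, ‘param’, and ‘params’ are names made up for illustration:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE GADTs #-}

import Control.Applicative.Free (Ap, liftAp, runAp_)
import Data.Monoid (Sum(..))

-- Hypothetical stand-in for the article's 'Model' terms.
data Draw a where
  Bernoulli :: Double -> Draw Bool

flips :: Ap Draw (Bool, Bool)
flips = (,) <$> liftAp (Bernoulli 0.1) <*> liftAp (Bernoulli 0.9)

-- Count the independent draws in a program without running it.
draws :: Ap f a -> Int
draws = getSum . runAp_ (const (Sum 1))

-- Read the parameter off a single draw.
param :: Draw a -> Double
param (Bernoulli p) = p

-- Collect every draw's parameter, again without sampling anything.
params :: Ap Draw a -> [Double]
params = runAp_ (\d -> [param d])
</code></pre></div></div>
<p>Neither ‘draws’ nor ‘params’ samples anything. Nothing comparable can be written for the monadic version of ‘flips’, since everything to the right of a bind is hidden inside a function.</p>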
<p>To wrap up, we can reuse our previous evaluation function to interpret a
‘Sample’ into a value with the ‘Prob’ type:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">evalIndependent</span> <span class="o">::</span> <span class="kt">PrimMonad</span> <span class="n">m</span> <span class="o">=></span> <span class="kt">Sample</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">Prob</span> <span class="n">m</span> <span class="n">a</span>
<span class="n">evalIndependent</span> <span class="o">=</span> <span class="n">runAp</span> <span class="n">eval</span>
</code></pre></div></div>
<p>And from here it can just be evaluated as before:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> MWC.sample (evalIndependent flips) gen
(False,True)
</code></pre></div></div>
<h2 id="conclusion">Conclusion</h2>
<p>That applicativeness embodies context-freeness seems to be well-known when it
comes to parsing, but its relation to independence in probability theory seems
less so.</p>
<p>Why might this be useful, you ask? Because preserving structure is <em>mandatory</em>
for performing inference on probabilistic programs, and it’s safe to bet that
the more structure you can capture, the easier that job will be.</p>
<p>In particular, algorithms for sampling from independent distributions tend to
be simpler and more efficient than those useful for sampling from dependent
distributions (where you might want something like <a href="https://github.com/jtobin/hasty-hamiltonian">Hamiltonian Monte
Carlo</a> or
<a href="https://github.com/jtobin/hnuts">NUTS</a>). Identifying independent components
of a probabilistic program statically could thus conceptually simplify the task
of sampling from some conditioned programs quite a bit - and
<a href="http://zinkov.com/posts/2012-06-27-why-prob-programming-matters/">that</a>
<a href="https://plus.google.com/u/0/107971134877020469960/posts/KpeRdJKR6Z1">matters</a>.</p>
<p>Enjoy! I’ve dumped the code from this article into a
<a href="https://gist.github.com/jtobin/f54e2173314ed7a76312">gist</a>.</p>
Time Traveling Recursion Schemes2016-02-09T00:00:00+04:00https://jtobin.io/time-traveling-recursion<p>In <a href="/practical-recursion-schemes">Practical Recursion
Schemes</a>
I talked about <em>recursion schemes</em>, describing them as elegant and useful
patterns for expressing general computation. In that article I introduced a
number of things relevant to working with the
<a href="https://hackage.haskell.org/package/recursion-schemes">recursion-schemes</a>
package in Haskell.</p>
<p>In particular, I went over:</p>
<ul>
<li>factoring the recursion out of recursive types using base functors and a
fixed-point wrapper</li>
<li>the ‘Foldable’ and ‘Unfoldable’ typeclasses corresponding to recursive and
corecursive data types, plus their ‘project’ and ‘embed’ functions
respectively</li>
<li>the ‘Base’ type family that maps recursive types to their base functors</li>
<li>some of the most common and useful recursion schemes: <em>cata</em>, <em>ana</em>, <em>para</em>,
and <em>hylo</em>.</li>
</ul>
<p>In <a href="/tour-of-some-recursive-types">A Tour of Some Useful Recursive
Types</a>
I went into further detail on ‘Fix’, ‘Free’, and ‘Cofree’ - three higher-kinded
recursive types that are useful for encoding programs defined by some
underlying instruction set.</p>
<p>I’ve also posted a couple of minor notes - I described the <em>apo</em> scheme in
<a href="/sorting-slower-with-style">Sorting Slower With Style</a> (as well as how to use
<em>recursion-schemes</em> with regular Haskell lists) and chatted about monadic
versions of the various schemes in <a href="/monadic-recursion-schemes">Monadic Recursion
Schemes</a>.</p>
<p>Here I want to clue up this whole recursion series by briefly talking about two
other recursion schemes - <em>histo</em> and <em>futu</em> - that work by looking at the past
or future of the recursion respectively.</p>
<p>Here’s a little preamble for the examples to come:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">{-# LANGUAGE LambdaCase #-}</span>
<span class="cp">{-# LANGUAGE TypeFamilies #-}</span>
<span class="kr">import</span> <span class="nn">Control.Comonad.Cofree</span>
<span class="kr">import</span> <span class="nn">Control.Monad.Free</span>
<span class="kr">import</span> <span class="nn">Data.Functor.Foldable</span>
</code></pre></div></div>
<h3 id="histomorphisms">Histomorphisms</h3>
<p>Histomorphisms are terrifically simple - they just give you access to arbitrary
previously-computed values of the recursion at any given point (its <em>history</em>,
hence the name). They’re perfectly suited to dynamic programming problems,
or anything where you might need to re-use intermediate computations later.</p>
<p><em>Histo</em> needs a data structure to store the history of the recursion in. The
natural choice there is ‘Cofree’, which allows one to annotate recursive
types with arbitrary metadata. Brian McKenna wrote <a href="http://brianmckenna.org/blog/type_annotation_cofree">a great
article</a> on making
practical use of this kind of annotation a while back.</p>
<p>But yeah, histomorphisms are very easy to use. Check out the following
function that returns all the odd-indexed elements of a list (counting from
one, so the first, third, fifth, and so on):</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">oddIndices</span> <span class="o">::</span> <span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">-></span> <span class="p">[</span><span class="n">a</span><span class="p">]</span>
<span class="n">oddIndices</span> <span class="o">=</span> <span class="n">histo</span> <span class="o">$</span> <span class="nf">\</span><span class="kr">case</span>
  <span class="kt">Nil</span> <span class="o">-></span> <span class="kt">[]</span>
  <span class="kt">Cons</span> <span class="n">h</span> <span class="p">(</span><span class="kr">_</span> <span class="o">:<</span> <span class="kt">Nil</span><span class="p">)</span> <span class="o">-></span> <span class="p">[</span><span class="n">h</span><span class="p">]</span>
  <span class="kt">Cons</span> <span class="n">h</span> <span class="p">(</span><span class="kr">_</span> <span class="o">:<</span> <span class="kt">Cons</span> <span class="kr">_</span> <span class="p">(</span><span class="n">t</span> <span class="o">:<</span> <span class="kr">_</span><span class="p">))</span> <span class="o">-></span> <span class="n">h</span><span class="o">:</span><span class="n">t</span>
</code></pre></div></div>
<p>The value to the left of a ‘:<’ constructor is an <em>annotation</em> provided by
‘Cofree’, and the value to the right is the (similarly annotated) next step of the
recursion. The annotations at any point are the previously computed values of
the recursion corresponding to that point.</p>
<p>So in the case above, we’re just grabbing some elements from the input list and
ignoring others. The algebra is saying:</p>
<ul>
<li>if the input list is empty, return an empty list</li>
<li>if the input list has only one element, return that one-element list</li>
<li>if the input list has at least two elements, return the list built by
cons-ing the first element together with the list computed two steps ago</li>
</ul>
<p>The list computed two steps ago is available as the annotation on the
constructor two steps down - I’ve matched it as ‘t’ in the last line of the
above example. Like <em>cata</em>, <em>histo</em> works from the bottom-up.</p>
<p>A function that computes even indices is similar:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">evenIndices</span> <span class="o">::</span> <span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">-></span> <span class="p">[</span><span class="n">a</span><span class="p">]</span>
<span class="n">evenIndices</span> <span class="o">=</span> <span class="n">histo</span> <span class="o">$</span> <span class="nf">\</span><span class="kr">case</span>
  <span class="kt">Nil</span> <span class="o">-></span> <span class="kt">[]</span>
  <span class="kt">Cons</span> <span class="kr">_</span> <span class="p">(</span><span class="kr">_</span> <span class="o">:<</span> <span class="kt">Nil</span><span class="p">)</span> <span class="o">-></span> <span class="kt">[]</span>
  <span class="kt">Cons</span> <span class="kr">_</span> <span class="p">(</span><span class="kr">_</span> <span class="o">:<</span> <span class="kt">Cons</span> <span class="n">h</span> <span class="p">(</span><span class="n">t</span> <span class="o">:<</span> <span class="kr">_</span><span class="p">))</span> <span class="o">-></span> <span class="n">h</span><span class="o">:</span><span class="n">t</span>
</code></pre></div></div>
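<p>The dynamic programming angle mentioned earlier can be sketched with the usual Fibonacci example. This assumes the ‘Natural’ instance that <em>recursion-schemes</em> provides (with ‘Maybe’ as its base functor); ‘fib’ is just an illustrative name. Each step reads the two previous results off the ‘Cofree’ annotations rather than recomputing them:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE LambdaCase #-}

import Control.Comonad.Cofree (Cofree(..))
import Data.Functor.Foldable (histo)
import Numeric.Natural (Natural)

fib :: Natural -> Integer
fib = histo $ \case
  Nothing                   -> 0      -- fib 0
  Just (_ :< Nothing)       -> 1      -- fib 1
  Just (a :< Just (b :< _)) -> a + b  -- a = fib (n - 1), b = fib (n - 2)
</code></pre></div></div>
<p>The shape is the same as ‘oddIndices’: the annotation one step down is the previous result, and the annotation two steps down is the one before that.</p>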
<h3 id="futumorphisms">Futumorphisms</h3>
<p>Like histomorphisms, futumorphisms are also simple. They give you access to
a particular computed part of the recursion at any given point.</p>
<p>However I’ll concede that the perceived simplicity probably comes with
experience, and there is likely some conceptual weirdness to be found here.
Just as <em>histo</em> gives you access to previously-computed values, <em>futu</em> gives
you access to values that the recursion will compute in the future.</p>
<p><img src="/images/lions-wat.gif" alt="wat" title="wat" /></p>
<p>So yeah, that sounds crazy. But the reality is more mundane, if you’re
familiar with the underlying concepts.</p>
<p>For <em>histo</em>, the recursion proceeds from the bottom up. At each point, the
part of the recursive type you’re working with is annotated with the value of
the recursion at that point (using ‘Cofree’), so you can always just reach back
and grab it for use in the present.</p>
<p>With <em>futu</em>, the recursion proceeds from the top down. At each point, you
construct an expression that can make use of a value to be supplied later.
When the value does become available, you can use it to evaluate the
expression.</p>
<p>A histomorphism makes use of ‘Cofree’ to do its annotation, so it should be no
surprise that a futumorphism uses the dual structure - ‘Free’ - to create its
expressions. The well-known ‘free monad’ is <a href="http://www.haskellforall.com/2012/06/you-could-have-invented-free-monads.html">tremendously
useful</a>
for working with small embedded languages. I also wrote about ‘Free’ in the
same article mentioned previously, so I’ll <a href="/tour-of-some-recursive-types">link it
again</a>
in case you want to refer to it.</p>
<p>As an example, we’ll use <em>futu</em> to implement the same two functions that we did
for <em>histo</em>. First, the function that grabs all odd-indexed elements:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">oddIndicesF</span> <span class="o">::</span> <span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">-></span> <span class="p">[</span><span class="n">a</span><span class="p">]</span>
<span class="n">oddIndicesF</span> <span class="o">=</span> <span class="n">futu</span> <span class="n">coalg</span> <span class="kr">where</span>
  <span class="n">coalg</span> <span class="n">list</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">project</span> <span class="n">list</span> <span class="kr">of</span>
    <span class="kt">Nil</span> <span class="o">-></span> <span class="kt">Nil</span>
    <span class="kt">Cons</span> <span class="n">x</span> <span class="n">s</span> <span class="o">-></span> <span class="kt">Cons</span> <span class="n">x</span> <span class="o">$</span> <span class="kr">do</span>
      <span class="n">return</span> <span class="o">$</span> <span class="kr">case</span> <span class="n">project</span> <span class="n">s</span> <span class="kr">of</span>
        <span class="kt">Nil</span> <span class="o">-></span> <span class="n">s</span>
        <span class="kt">Cons</span> <span class="kr">_</span> <span class="n">t</span> <span class="o">-></span> <span class="n">t</span>
</code></pre></div></div>
<p>The coalgebra is saying the following:</p>
<ul>
<li>if the input list is empty, return an empty list</li>
<li>if the input list has at least one element, construct an expression that
will use a value to be supplied later.</li>
<li>if the supplied value turns out to be empty, return the one-element list.</li>
<li>if the supplied value turns out to have at least one more element, return the
list constructed by skipping it, and using the value from another step in
the future.</li>
</ul>
<p>You can write that function more concisely, and indeed
<a href="https://hackage.haskell.org/package/hlint">HLint</a> will complain at you for
writing it as I’ve done above. But I think this one makes it clear what parts
are happening based on values to be supplied in the future. Namely, anything
that occurs after ‘do’.</p>
<p>It’s kind of cool - you Enter The Monad™ and can suddenly work with values that
don’t exist yet, while treating them as if they do.</p>
<p>Here’s <em>futu</em>-implemented ‘evenIndices’ for good measure:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">evenIndicesF</span> <span class="o">::</span> <span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">-></span> <span class="p">[</span><span class="n">a</span><span class="p">]</span>
<span class="n">evenIndicesF</span> <span class="o">=</span> <span class="n">futu</span> <span class="n">coalg</span> <span class="kr">where</span>
  <span class="n">coalg</span> <span class="n">list</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">project</span> <span class="n">list</span> <span class="kr">of</span>
    <span class="kt">Nil</span> <span class="o">-></span> <span class="kt">Nil</span>
    <span class="kt">Cons</span> <span class="kr">_</span> <span class="n">s</span> <span class="o">-></span> <span class="kr">case</span> <span class="n">project</span> <span class="n">s</span> <span class="kr">of</span>
      <span class="kt">Nil</span> <span class="o">-></span> <span class="kt">Nil</span>
      <span class="kt">Cons</span> <span class="n">h</span> <span class="n">t</span> <span class="o">-></span> <span class="kt">Cons</span> <span class="n">h</span> <span class="o">$</span> <span class="n">return</span> <span class="n">t</span>
</code></pre></div></div>
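<p>For comparison, the ‘Free’ layer can also be built with its raw constructors rather than ‘do’ and ‘return’. In the sketch below (‘stutter’ is a made-up example, not something from the library), ‘Pure’ hands a seed back to the recursion, while ‘Free’ emits an extra layer of output immediately, so each step of the coalgebra produces two list cells:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE LambdaCase #-}

import Control.Monad.Free (Free(..))
import Data.Functor.Foldable

-- Duplicate every element of the input list.
stutter :: [a] -> [a]
stutter = futu $ \case
  []     -> Nil
  x : xs -> Cons x (Free (Cons x (Pure xs)))
</code></pre></div></div>
<p>Here the seed is the remaining input list itself, so the coalgebra can pattern match on it directly instead of going through ‘project’.</p>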
<p>Sort of a neat feature is that the ‘Free’ part of the coalgebra can be written
a little cleaner if you have ‘Free’-encoded embedded language terms floating
around. Here are a couple of such terms, plus a ‘twiddle’ function that uses
them to permute elements of an input list as they’re encountered:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">nil</span> <span class="o">::</span> <span class="kt">Free</span> <span class="p">(</span><span class="kt">Prim</span> <span class="p">[</span><span class="n">a</span><span class="p">])</span> <span class="n">b</span>
<span class="n">nil</span> <span class="o">=</span> <span class="n">liftF</span> <span class="kt">Nil</span>
<span class="n">cons</span> <span class="o">::</span> <span class="n">a</span> <span class="o">-></span> <span class="n">b</span> <span class="o">-></span> <span class="kt">Free</span> <span class="p">(</span><span class="kt">Prim</span> <span class="p">[</span><span class="n">a</span><span class="p">])</span> <span class="n">b</span>
<span class="n">cons</span> <span class="n">h</span> <span class="n">t</span> <span class="o">=</span> <span class="n">liftF</span> <span class="p">(</span><span class="kt">Cons</span> <span class="n">h</span> <span class="n">t</span><span class="p">)</span>
<span class="n">twiddle</span> <span class="o">::</span> <span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">-></span> <span class="p">[</span><span class="n">a</span><span class="p">]</span>
<span class="n">twiddle</span> <span class="o">=</span> <span class="n">futu</span> <span class="n">coalg</span> <span class="kr">where</span>
  <span class="n">coalg</span> <span class="n">r</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">project</span> <span class="n">r</span> <span class="kr">of</span>
    <span class="kt">Nil</span> <span class="o">-></span> <span class="kt">Nil</span>
    <span class="kt">Cons</span> <span class="n">x</span> <span class="n">l</span> <span class="o">-></span> <span class="kr">case</span> <span class="n">project</span> <span class="n">l</span> <span class="kr">of</span>
      <span class="kt">Nil</span> <span class="o">-></span> <span class="kt">Cons</span> <span class="n">x</span> <span class="n">nil</span>
      <span class="kt">Cons</span> <span class="n">h</span> <span class="n">t</span> <span class="o">-></span> <span class="kt">Cons</span> <span class="n">h</span> <span class="o">$</span> <span class="n">cons</span> <span class="n">x</span> <span class="n">t</span>
</code></pre></div></div>
<p>If you’ve been looking to twiddle elements of a recursive type then you’ve
found a classy way to do it:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> take 20 $ twiddle [1..]
[2,1,4,3,6,5,8,7,10,9,12,11,14,13,16,15,18,17,20,19]
</code></pre></div></div>
<p>Enjoy! You can find the code from this article in this
<a href="https://gist.github.com/jtobin/bbb2070f6a63956401b3">gist</a>.</p>
Monadic Recursion Schemes2016-01-20T00:00:00+04:00https://jtobin.io/monadic-recursion-schemes<p>I have another few posts that I’d like to write before cluing up the
whole <a href="/practical-recursion-schemes">recursion schemes</a> kick I’ve been
on. The first is a simple note about monadic versions of the schemes
introduced thus far.</p>
<p>In practice you often want to deal with effectful versions of something like
<em>cata</em>. Take a very simple embedded language, for example (“Hutton’s Razor”,
with variables):</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">{-# LANGUAGE DeriveFunctor #-}</span>
<span class="cp">{-# LANGUAGE DeriveFoldable #-}</span>
<span class="cp">{-# LANGUAGE DeriveTraversable #-}</span>
<span class="cp">{-# LANGUAGE LambdaCase #-}</span>
<span class="kr">import</span> <span class="nn">Control.Monad</span> <span class="p">((</span><span class="o"><=<</span><span class="p">),</span> <span class="nf">liftM2</span><span class="p">)</span>
<span class="kr">import</span> <span class="nn">Control.Monad.Trans.Class</span> <span class="p">(</span><span class="nf">lift</span><span class="p">)</span>
<span class="kr">import</span> <span class="nn">Control.Monad.Trans.Reader</span> <span class="p">(</span><span class="kt">ReaderT</span><span class="p">,</span> <span class="nf">ask</span><span class="p">,</span> <span class="nf">runReaderT</span><span class="p">)</span>
<span class="kr">import</span> <span class="nn">Data.Functor.Foldable</span> <span class="k">hiding</span> <span class="p">(</span><span class="kt">Foldable</span><span class="p">,</span> <span class="kt">Unfoldable</span><span class="p">)</span>
<span class="kr">import</span> <span class="k">qualified</span> <span class="nn">Data.Functor.Foldable</span> <span class="k">as</span> <span class="n">RS</span> <span class="p">(</span><span class="kt">Foldable</span><span class="p">,</span> <span class="kt">Unfoldable</span><span class="p">)</span>
<span class="kr">import</span> <span class="nn">Data.Map.Strict</span> <span class="p">(</span><span class="kt">Map</span><span class="p">)</span>
<span class="kr">import</span> <span class="k">qualified</span> <span class="nn">Data.Map.Strict</span> <span class="k">as</span> <span class="n">Map</span>
<span class="kr">data</span> <span class="kt">ExprF</span> <span class="n">r</span> <span class="o">=</span>
    <span class="kt">VarF</span> <span class="kt">String</span>
  <span class="o">|</span> <span class="kt">LitF</span> <span class="kt">Int</span>
  <span class="o">|</span> <span class="kt">AddF</span> <span class="n">r</span> <span class="n">r</span>
  <span class="kr">deriving</span> <span class="p">(</span><span class="kt">Show</span><span class="p">,</span> <span class="kt">Functor</span><span class="p">,</span> <span class="kt">Foldable</span><span class="p">,</span> <span class="kt">Traversable</span><span class="p">)</span>
<span class="kr">type</span> <span class="kt">Expr</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="kt">ExprF</span>
<span class="n">var</span> <span class="o">::</span> <span class="kt">String</span> <span class="o">-></span> <span class="kt">Expr</span>
<span class="n">var</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="o">.</span> <span class="kt">VarF</span>
<span class="n">lit</span> <span class="o">::</span> <span class="kt">Int</span> <span class="o">-></span> <span class="kt">Expr</span>
<span class="n">lit</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="o">.</span> <span class="kt">LitF</span>
<span class="n">add</span> <span class="o">::</span> <span class="kt">Expr</span> <span class="o">-></span> <span class="kt">Expr</span> <span class="o">-></span> <span class="kt">Expr</span>
<span class="n">add</span> <span class="n">a</span> <span class="n">b</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="p">(</span><span class="kt">AddF</span> <span class="n">a</span> <span class="n">b</span><span class="p">)</span>
</code></pre></div></div>
<p>(<strong>Note</strong>: Make sure you import ‘Data.Functor.Foldable.Foldable’ with a
qualifier because GHC’s ‘DeriveFoldable’ pragma will become confused if there
are multiple ‘Foldables’ in scope.)</p>
<p>Take proper error handling over an expression of type ‘Expr’ as an example; at
present we’d have to write an ‘eval’ function as something like</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">eval</span> <span class="o">::</span> <span class="kt">Expr</span> <span class="o">-></span> <span class="kt">Int</span>
<span class="n">eval</span> <span class="o">=</span> <span class="n">cata</span> <span class="o">$</span> <span class="nf">\</span><span class="kr">case</span>
  <span class="kt">LitF</span> <span class="n">j</span> <span class="o">-></span> <span class="n">j</span>
  <span class="kt">AddF</span> <span class="n">i</span> <span class="n">j</span> <span class="o">-></span> <span class="n">i</span> <span class="o">+</span> <span class="n">j</span>
  <span class="kt">VarF</span> <span class="kr">_</span> <span class="o">-></span> <span class="n">error</span> <span class="s">"free variable in expression"</span>
</code></pre></div></div>
<p>This is a bit of a non-starter in a serious or production implementation, where
errors are typically handled using a higher-kinded type like ‘Maybe’ or
‘Either’ instead of by just blowing up the program on the spot. If we hit an
unbound variable during evaluation, we’d be better suited to return an error
<em>value</em> that can be dealt with in a more appropriate place.</p>
<p>Look at the algebra used in ‘eval’; what would be useful is something like</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">monadicAlgebra</span> <span class="o">=</span> <span class="nf">\</span><span class="kr">case</span>
  <span class="kt">LitF</span> <span class="n">j</span> <span class="o">-></span> <span class="n">return</span> <span class="n">j</span>
  <span class="kt">AddF</span> <span class="n">i</span> <span class="n">j</span> <span class="o">-></span> <span class="n">return</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span><span class="p">)</span>
  <span class="kt">VarF</span> <span class="n">v</span> <span class="o">-></span> <span class="kt">Left</span> <span class="p">(</span><span class="kt">FreeVar</span> <span class="n">v</span><span class="p">)</span>
<span class="kr">data</span> <span class="kt">Error</span> <span class="o">=</span>
    <span class="kt">FreeVar</span> <span class="kt">String</span>
  <span class="kr">deriving</span> <span class="kt">Show</span>
</code></pre></div></div>
<p>This won’t fly with <em>cata</em> as-is, and <em>recursion-schemes</em> doesn’t appear to
include any support for monadic variants out of the box. But we can produce a
monadic <em>cata</em> - as well as monadic versions of the other schemes I’ve talked
about to date - without a lot of trouble.</p>
<p>To begin, I’ll stoop to a level I haven’t yet descended to and include a
commutative diagram that defines a catamorphism:</p>
<p><img src="/images/cata.png" alt="cata" class="center-image" /></p>
<p>To read it, start in the bottom left corner and work your way to the bottom
right. You can see that we can go from a value of type ‘t’ to one of type ‘a’
by either applying ‘cata alg’ directly, or by composing a bunch of other
functions together.</p>
<p>If we’re trying to <strong>define</strong> <em>cata</em>, we’ll obviously want to do it in terms
of the compositions:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cata</span><span class="o">::</span> <span class="p">(</span><span class="kt">RS</span><span class="o">.</span><span class="kt">Foldable</span> <span class="n">t</span><span class="p">)</span> <span class="o">=></span> <span class="p">(</span><span class="kt">Base</span> <span class="n">t</span> <span class="n">a</span> <span class="o">-></span> <span class="n">a</span><span class="p">)</span> <span class="o">-></span> <span class="n">t</span> <span class="o">-></span> <span class="n">a</span>
<span class="n">cata</span> <span class="n">alg</span> <span class="o">=</span> <span class="n">alg</span> <span class="o">.</span> <span class="n">fmap</span> <span class="p">(</span><span class="n">cata</span> <span class="n">alg</span><span class="p">)</span> <span class="o">.</span> <span class="n">project</span>
</code></pre></div></div>
<p>Note that in practice it’s typically <a href="http://johantibell.com/files/haskell-performance-patterns.html#(7)">more
efficient</a>
to write recursive functions using a non-recursive wrapper, like so:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cata</span><span class="o">::</span> <span class="p">(</span><span class="kt">RS</span><span class="o">.</span><span class="kt">Foldable</span> <span class="n">t</span><span class="p">)</span> <span class="o">=></span> <span class="p">(</span><span class="kt">Base</span> <span class="n">t</span> <span class="n">a</span> <span class="o">-></span> <span class="n">a</span><span class="p">)</span> <span class="o">-></span> <span class="n">t</span> <span class="o">-></span> <span class="n">a</span>
<span class="n">cata</span> <span class="n">alg</span> <span class="o">=</span> <span class="n">c</span> <span class="kr">where</span> <span class="n">c</span> <span class="o">=</span> <span class="n">alg</span> <span class="o">.</span> <span class="n">fmap</span> <span class="n">c</span> <span class="o">.</span> <span class="n">project</span>
</code></pre></div></div>
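<p>GHC won’t inline a self-recursive binding, so hiding the recursion in a local
helper leaves a non-recursive ‘cata’ that can be inlined at its call sites.
Indeed, this is the version that <em>recursion-schemes</em> uses internally.</p>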
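<p>To see the wrapper version run without pulling in the library, here is a
minimal sketch. The local ‘Fix’ and ‘unfix’ are stand-ins for the
<em>recursion-schemes</em> machinery, and ‘IntListF’, ‘fromInts’, and ‘total’
are hypothetical names made up for illustration:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}

-- A local fixed-point type, standing in for the library's.
newtype Fix f = Fix (f (Fix f))

unfix :: Fix f -> f (Fix f)
unfix (Fix x) = x

-- Same shape as the wrapper version, with 'unfix' playing 'project'.
cata :: Functor f => (f a -> a) -> Fix f -> a
cata alg = c where c = alg . fmap c . unfix

-- A hypothetical base functor for lists of Ints.
data IntListF r = NilF | ConsF Int r
  deriving Functor

fromInts :: [Int] -> Fix IntListF
fromInts = foldr (\x acc -> Fix (ConsF x acc)) (Fix NilF)

-- Sum a list by folding one layer at a time.
total :: Fix IntListF -> Int
total = cata $ \layer -> case layer of
  NilF      -> 0
  ConsF x s -> x + s
</code></pre></div></div>
<p>Under these assumptions, ‘total (fromInts [1, 2, 3, 4])’ evaluates to 10.</p>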
<p>To get to a monadic version we need to support a monadic algebra - that is, a
function with type ‘Base t a -> m a’ for appropriate ‘t’. To translate the
commutative diagram, we need to replace ‘fmap’ with ‘traverse’ (requiring a
‘Traversable’ instance) and the final composition with monadic (or <em>Kleisli</em>)
composition:</p>
<p><img src="/images/cataM.png" alt="cataM" class="center-image" /></p>
<p>The resulting function can be read straight off the diagram, modulo additional
constraints on type variables. I’ll go ahead and write it directly in the
inline-friendly way:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cataM</span>
  <span class="o">::</span> <span class="p">(</span><span class="kt">Monad</span> <span class="n">m</span><span class="p">,</span> <span class="kt">Traversable</span> <span class="p">(</span><span class="kt">Base</span> <span class="n">t</span><span class="p">),</span> <span class="kt">RS</span><span class="o">.</span><span class="kt">Foldable</span> <span class="n">t</span><span class="p">)</span>
  <span class="o">=></span> <span class="p">(</span><span class="kt">Base</span> <span class="n">t</span> <span class="n">a</span> <span class="o">-></span> <span class="n">m</span> <span class="n">a</span><span class="p">)</span> <span class="o">-></span> <span class="n">t</span> <span class="o">-></span> <span class="n">m</span> <span class="n">a</span>
<span class="n">cataM</span> <span class="n">alg</span> <span class="o">=</span> <span class="n">c</span> <span class="kr">where</span>
  <span class="n">c</span> <span class="o">=</span> <span class="n">alg</span> <span class="o"><=<</span> <span class="n">traverse</span> <span class="n">c</span> <span class="o">.</span> <span class="n">project</span>
</code></pre></div></div>
<p>Going back to the previous example, we can now define a proper ‘eval’ as
follows:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">eval</span> <span class="o">::</span> <span class="kt">Expr</span> <span class="o">-></span> <span class="kt">Either</span> <span class="kt">Error</span> <span class="kt">Int</span>
<span class="n">eval</span> <span class="o">=</span> <span class="n">cataM</span> <span class="o">$</span> <span class="nf">\</span><span class="kr">case</span>
  <span class="kt">LitF</span> <span class="n">j</span> <span class="o">-></span> <span class="n">return</span> <span class="n">j</span>
  <span class="kt">AddF</span> <span class="n">i</span> <span class="n">j</span> <span class="o">-></span> <span class="n">return</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span><span class="p">)</span>
  <span class="kt">VarF</span> <span class="n">v</span> <span class="o">-></span> <span class="kt">Left</span> <span class="p">(</span><span class="kt">FreeVar</span> <span class="n">v</span><span class="p">)</span>
</code></pre></div></div>
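<p>For a version that loads on its own, here is a sketch that fills in the
surrounding definitions. The shapes of ‘ExprF’ and ‘Error’ are assumptions
inferred from how the constructors are used here, ‘lit’ is a hypothetical
helper, and the local ‘Fix’ replaces the library’s ‘project’ with ‘unfix’:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveTraversable #-}
{-# LANGUAGE LambdaCase #-}

import Control.Monad ((<=<))

newtype Fix f = Fix (f (Fix f))

unfix :: Fix f -> f (Fix f)
unfix (Fix x) = x

-- Assumed expression functor: variables, literals, and addition.
data ExprF r = VarF String | LitF Int | AddF r r
  deriving (Functor, Foldable, Traversable)

type Expr = Fix ExprF

newtype Error = FreeVar String
  deriving (Eq, Show)

var :: String -> Expr
var = Fix . VarF

lit :: Int -> Expr
lit = Fix . LitF

add :: Expr -> Expr -> Expr
add a b = Fix (AddF a b)

-- cataM over the local Fix: traverse the children, then run the algebra.
cataM :: (Monad m, Traversable f) => (f a -> m a) -> Fix f -> m a
cataM alg = c where
  c = alg <=< traverse c . unfix

eval :: Expr -> Either Error Int
eval = cataM $ \case
  LitF j   -> return j
  AddF i j -> return (i + j)
  VarF v   -> Left (FreeVar v)
</code></pre></div></div>
<p>With these definitions, ‘eval (add (lit 1) (lit 2))’ gives ‘Right 3’, while
‘eval (add (lit 1) (var "x"))’ gives ‘Left (FreeVar "x")’.</p>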
<p>This will of course work for any monad. A common pattern for an ‘eval’ function
is to additionally slap on a ‘ReaderT’ layer to supply an environment, for
example:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">eval</span> <span class="o">::</span> <span class="kt">Expr</span> <span class="o">-></span> <span class="kt">ReaderT</span> <span class="p">(</span><span class="kt">Map</span> <span class="kt">String</span> <span class="kt">Int</span><span class="p">)</span> <span class="p">(</span><span class="kt">Either</span> <span class="kt">Error</span><span class="p">)</span> <span class="kt">Int</span>
<span class="n">eval</span> <span class="o">=</span> <span class="n">cataM</span> <span class="o">$</span> <span class="nf">\</span><span class="kr">case</span>
  <span class="kt">LitF</span> <span class="n">j</span> <span class="o">-></span> <span class="n">return</span> <span class="n">j</span>
  <span class="kt">AddF</span> <span class="n">i</span> <span class="n">j</span> <span class="o">-></span> <span class="n">return</span> <span class="p">(</span><span class="n">i</span> <span class="o">+</span> <span class="n">j</span><span class="p">)</span>
  <span class="kt">VarF</span> <span class="n">v</span> <span class="o">-></span> <span class="kr">do</span>
    <span class="n">env</span> <span class="o"><-</span> <span class="n">ask</span>
    <span class="kr">case</span> <span class="kt">Map</span><span class="o">.</span><span class="n">lookup</span> <span class="n">v</span> <span class="n">env</span> <span class="kr">of</span>
      <span class="kt">Nothing</span> <span class="o">-></span> <span class="n">lift</span> <span class="p">(</span><span class="kt">Left</span> <span class="p">(</span><span class="kt">FreeVar</span> <span class="n">v</span><span class="p">))</span>
      <span class="kt">Just</span> <span class="n">j</span> <span class="o">-></span> <span class="n">return</span> <span class="n">j</span>
</code></pre></div></div>
<p>And just an example of how that works:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> let open = add (var "x") (var "y")
> runReaderT (eval open) (Map.singleton "x" 1)
Left (FreeVar "y")
> runReaderT (eval open) (Map.fromList [("x", 1), ("y", 5)])
Right 6
</code></pre></div></div>
<p>You can follow the same formula to create the other monadic recursion schemes.
Here’s monadic <em>ana</em>:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">anaM</span>
  <span class="o">::</span> <span class="p">(</span><span class="kt">Monad</span> <span class="n">m</span><span class="p">,</span> <span class="kt">Traversable</span> <span class="p">(</span><span class="kt">Base</span> <span class="n">t</span><span class="p">),</span> <span class="kt">RS</span><span class="o">.</span><span class="kt">Unfoldable</span> <span class="n">t</span><span class="p">)</span>
  <span class="o">=></span> <span class="p">(</span><span class="n">a</span> <span class="o">-></span> <span class="n">m</span> <span class="p">(</span><span class="kt">Base</span> <span class="n">t</span> <span class="n">a</span><span class="p">))</span> <span class="o">-></span> <span class="n">a</span> <span class="o">-></span> <span class="n">m</span> <span class="n">t</span>
<span class="n">anaM</span> <span class="n">coalg</span> <span class="o">=</span> <span class="n">a</span> <span class="kr">where</span>
  <span class="n">a</span> <span class="o">=</span> <span class="p">(</span><span class="n">return</span> <span class="o">.</span> <span class="n">embed</span><span class="p">)</span> <span class="o"><=<</span> <span class="n">traverse</span> <span class="n">a</span> <span class="o"><=<</span> <span class="n">coalg</span>
</code></pre></div></div>
<p>and monadic <em>para</em>, <em>apo</em>, and <em>hylo</em> follow in much the same way:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">paraM</span>
  <span class="o">::</span> <span class="p">(</span><span class="kt">Monad</span> <span class="n">m</span><span class="p">,</span> <span class="kt">Traversable</span> <span class="p">(</span><span class="kt">Base</span> <span class="n">t</span><span class="p">),</span> <span class="kt">RS</span><span class="o">.</span><span class="kt">Foldable</span> <span class="n">t</span><span class="p">)</span>
  <span class="o">=></span> <span class="p">(</span><span class="kt">Base</span> <span class="n">t</span> <span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">a</span><span class="p">)</span> <span class="o">-></span> <span class="n">m</span> <span class="n">a</span><span class="p">)</span> <span class="o">-></span> <span class="n">t</span> <span class="o">-></span> <span class="n">m</span> <span class="n">a</span>
<span class="n">paraM</span> <span class="n">alg</span> <span class="o">=</span> <span class="n">p</span> <span class="kr">where</span>
  <span class="n">p</span> <span class="o">=</span> <span class="n">alg</span> <span class="o"><=<</span> <span class="n">traverse</span> <span class="n">f</span> <span class="o">.</span> <span class="n">project</span>
  <span class="n">f</span> <span class="n">t</span> <span class="o">=</span> <span class="n">liftM2</span> <span class="p">(,)</span> <span class="p">(</span><span class="n">return</span> <span class="n">t</span><span class="p">)</span> <span class="p">(</span><span class="n">p</span> <span class="n">t</span><span class="p">)</span>

<span class="n">apoM</span>
  <span class="o">::</span> <span class="p">(</span><span class="kt">Monad</span> <span class="n">m</span><span class="p">,</span> <span class="kt">Traversable</span> <span class="p">(</span><span class="kt">Base</span> <span class="n">t</span><span class="p">),</span> <span class="kt">RS</span><span class="o">.</span><span class="kt">Unfoldable</span> <span class="n">t</span><span class="p">)</span>
  <span class="o">=></span> <span class="p">(</span><span class="n">a</span> <span class="o">-></span> <span class="n">m</span> <span class="p">(</span><span class="kt">Base</span> <span class="n">t</span> <span class="p">(</span><span class="kt">Either</span> <span class="n">t</span> <span class="n">a</span><span class="p">)))</span> <span class="o">-></span> <span class="n">a</span> <span class="o">-></span> <span class="n">m</span> <span class="n">t</span>
<span class="n">apoM</span> <span class="n">coalg</span> <span class="o">=</span> <span class="n">a</span> <span class="kr">where</span>
  <span class="n">a</span> <span class="o">=</span> <span class="p">(</span><span class="n">return</span> <span class="o">.</span> <span class="n">embed</span><span class="p">)</span> <span class="o"><=<</span> <span class="n">traverse</span> <span class="n">f</span> <span class="o"><=<</span> <span class="n">coalg</span>
  <span class="n">f</span> <span class="o">=</span> <span class="n">either</span> <span class="n">return</span> <span class="n">a</span>

<span class="n">hyloM</span>
  <span class="o">::</span> <span class="p">(</span><span class="kt">Monad</span> <span class="n">m</span><span class="p">,</span> <span class="kt">Traversable</span> <span class="n">t</span><span class="p">)</span>
  <span class="o">=></span> <span class="p">(</span><span class="n">t</span> <span class="n">b</span> <span class="o">-></span> <span class="n">m</span> <span class="n">b</span><span class="p">)</span> <span class="o">-></span> <span class="p">(</span><span class="n">a</span> <span class="o">-></span> <span class="n">m</span> <span class="p">(</span><span class="n">t</span> <span class="n">a</span><span class="p">))</span> <span class="o">-></span> <span class="n">a</span> <span class="o">-></span> <span class="n">m</span> <span class="n">b</span>
<span class="n">hyloM</span> <span class="n">alg</span> <span class="n">coalg</span> <span class="o">=</span> <span class="n">h</span>
  <span class="kr">where</span> <span class="n">h</span> <span class="o">=</span> <span class="n">alg</span> <span class="o"><=<</span> <span class="n">traverse</span> <span class="n">h</span> <span class="o"><=<</span> <span class="n">coalg</span>
</code></pre></div></div>
<p>These are straightforward extensions of the basic schemes. A good exercise
is to try putting together the commutative diagrams corresponding to each
scheme yourself, and then use them to derive the monadic versions. That’s
pretty fun to do for <em>para</em> and <em>apo</em> in particular.</p>
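<p>As a small illustration of ‘hyloM’, which needs no fixed-point type at all,
here is a sketch of a factorial that unfolds and refolds inside ‘Either’. The
names ‘ListF’, ‘countdown’, ‘productAlg’, and ‘factorial’ are hypothetical and
exist only for this example:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveTraversable #-}

import Control.Monad ((<=<))

hyloM
  :: (Monad m, Traversable t)
  => (t b -> m b) -> (a -> m (t a)) -> a -> m b
hyloM alg coalg = h
  where h = alg <=< traverse h <=< coalg

data ListF e r = NilF | ConsF e r
  deriving (Functor, Foldable, Traversable)

-- Unfold n into the layers n, n-1, ..., 1, refusing negative seeds.
countdown :: Int -> Either String (ListF Int Int)
countdown n
  | n < 0     = Left ("negative seed: " ++ show n)
  | n == 0    = Right NilF
  | otherwise = Right (ConsF n (n - 1))

-- Multiply the layers back together.
productAlg :: ListF Int Int -> Either String Int
productAlg NilF        = Right 1
productAlg (ConsF x p) = Right (x * p)

factorial :: Int -> Either String Int
factorial = hyloM productAlg countdown
</code></pre></div></div>
<p>Here ‘factorial 5’ yields ‘Right 120’, and ‘factorial (-1)’ short-circuits
with ‘Left "negative seed: -1"’ before the algebra ever runs.</p>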
<p>If you’re using these monadic versions in your own project, you may want to
drop them into a module like ‘Data.Functor.Foldable.Extended’ as <a href="http://jaspervdj.be/posts/2015-01-20-haskell-design-patterns-extended-modules.html">recommended
by</a>
my colleague Jasper Van der Jeugt. Additionally, there is an <a href="https://github.com/ekmett/recursion-schemes/issues/3">old
issue</a> floating around on
the <em>recursion-schemes</em> repo that proposes adding them to the library itself.
So maybe they’ll turn up in there eventually.</p>
Sorting Slower with Style
2016-01-19T00:00:00+04:00
https://jtobin.io/sorting-slower-with-style
<p>I previously wrote about <a href="/sorting-with-style">implementing merge sort</a> using
<a href="/practical-recursion-schemes">recursion schemes</a>. By using a hylomorphism we
could express the algorithm concisely and true to its high-level description.</p>
<p><a href="https://en.wikipedia.org/wiki/Insertion_sort">Insertion sort</a> can be
implemented in a similar way - this time by putting one recursion scheme inside
of another.</p>
<p><img src="/images/xzibit.png" alt="yo dawg, we heard you like morphisms" title="yo dawg, we heard you like morphisms" /></p>
<p>Read on for details.</p>
<h2 id="apomorphisms">Apomorphisms</h2>
<p>These guys don’t seem to get a lot of love in the recursion scheme tutorial du
jour, probably because they might be the first scheme you encounter that looks
truly weird at first glance. But <em>apo</em> is really not bad at all - I’d go so
far as to call apomorphisms straightforward and practical.</p>
<p>So: if you remember your elementary recursion schemes, you can say that <em>apo</em>
is to <em>ana</em> as <em>para</em> is to <em>cata</em>. A paramorphism gives you access to a value
of the original input type at every point of the recursion; an apomorphism lets
you terminate using a value of the original input type at any point of the
recursion.</p>
<p>This is pretty useful - often when traversing some structure you just want to
bail out and return some value on the spot, rather than continuing on recursing
needlessly.</p>
<p>A good introduction is the toy ‘mapHead’ function that maps some other function
over the head of a list and leaves the rest of it unchanged. Let’s first rig
up a hand-rolled list type to illustrate it on:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">{-# LANGUAGE DeriveFunctor #-}</span>
<span class="cp">{-# LANGUAGE TypeFamilies #-}</span>

<span class="kr">import</span> <span class="nn">Data.Functor.Foldable</span>

<span class="kr">data</span> <span class="kt">ListF</span> <span class="n">a</span> <span class="n">r</span> <span class="o">=</span>
    <span class="kt">ConsF</span> <span class="n">a</span> <span class="n">r</span>
  <span class="o">|</span> <span class="kt">NilF</span>
  <span class="kr">deriving</span> <span class="p">(</span><span class="kt">Show</span><span class="p">,</span> <span class="kt">Functor</span><span class="p">)</span>

<span class="kr">type</span> <span class="kt">List</span> <span class="n">a</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="p">(</span><span class="kt">ListF</span> <span class="n">a</span><span class="p">)</span>

<span class="n">fromList</span> <span class="o">::</span> <span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">-></span> <span class="kt">List</span> <span class="n">a</span>
<span class="n">fromList</span> <span class="o">=</span> <span class="n">ana</span> <span class="n">coalg</span> <span class="o">.</span> <span class="n">project</span> <span class="kr">where</span>
  <span class="n">coalg</span> <span class="kt">Nil</span> <span class="o">=</span> <span class="kt">NilF</span>
  <span class="n">coalg</span> <span class="p">(</span><span class="kt">Cons</span> <span class="n">h</span> <span class="n">t</span><span class="p">)</span> <span class="o">=</span> <span class="kt">ConsF</span> <span class="n">h</span> <span class="n">t</span>
</code></pre></div></div>
<p>(I’ll come back to the implementation of ‘fromList’ later, but for now you can
see it’s implemented via an anamorphism.)</p>
<h3 id="example-one">Example One</h3>
<p>Here’s ‘mapHead’ for our hand-rolled list type, implemented via <em>apo</em>:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>mapHead :: (a -> a) -> List a -> List a
mapHead f = apo coalg . project where
  coalg NilF = NilF
  coalg (ConsF h t) = ConsF (f h) (Left t)
</code></pre></div></div>
<p>Before I talk you through it, here’s a trivial usage example:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> fromList [1..3]
Fix (ConsF 1 (Fix (ConsF 2 (Fix (ConsF 3 (Fix NilF))))))
> mapHead succ (fromList [1..3])
Fix (ConsF 2 (Fix (ConsF 2 (Fix (ConsF 3 (Fix NilF))))))
</code></pre></div></div>
<p>Now. Take a look at the coalgebra involved in writing ‘mapHead’. It has the
type ‘a -> Base t (Either t a)’, which for our hand-rolled list case simplifies
to ‘a -> ListF a (Either (List a) a)’.</p>
<p>Just as a reminder, you can check this in GHCi using the ‘:kind!’ command:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> :set -XRankNTypes
> :kind! forall a. a -> Base (List a) (Either (List a) a)
forall a. a -> Base (List a) (Either (List a) a) :: *
= a -> ListF a (Either (List a) a)
</code></pre></div></div>
<p>So, inside any base functor on the right hand side we’re going to be dealing
with some ‘Either’ values. The ‘Left’ branch indicates that we’re going to
terminate the recursion using whatever value we pass, whereas the ‘Right’
branch means we’ll continue recursing as per normal.</p>
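<p>One way to see what the two branches do is to write <em>apo</em> out by hand.
The following is a sketch over a local ‘Fix’ type rather than the library’s
own definition; ‘unfix’, ‘toList’, and the ‘foldr’-based ‘fromList’ are
stand-ins introduced for this example:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}

newtype Fix f = Fix (f (Fix f))

unfix :: Fix f -> f (Fix f)
unfix (Fix x) = x

-- Left t: splice the finished structure t in as-is.
-- Right s: keep unfolding from the seed s.
apo :: Functor f => (s -> f (Either (Fix f) s)) -> s -> Fix f
apo coalg = a where
  a = Fix . fmap (either id a) . coalg

data ListF e r = NilF | ConsF e r
  deriving Functor

type List e = Fix (ListF e)

fromList :: [e] -> List e
fromList = foldr (\x acc -> Fix (ConsF x acc)) (Fix NilF)

toList :: List e -> [e]
toList (Fix NilF)        = []
toList (Fix (ConsF x t)) = x : toList t

-- The tail goes in a Left, so apo never looks at it again.
mapHead :: (e -> e) -> List e -> List e
mapHead f = apo coalg . unfix where
  coalg NilF        = NilF
  coalg (ConsF h t) = ConsF (f h) (Left t)
</code></pre></div></div>
<p>The ‘either id a’ is the whole trick: ‘id’ handles ‘Left’ by returning the
value untouched, and ‘a’ handles ‘Right’ by recursing. With this sketch,
‘toList (mapHead succ (fromList [1, 2, 3]))’ is ‘[2,2,3]’.</p>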
<p>In the case of ‘mapHead’, the coalgebra is saying:</p>
<ul>
<li>deconstruct the list; if it has no elements just return an empty list</li>
<li>if the list has at least one element, return the list constructed by
prepending ‘f h’ to the existing tail.</li>
</ul>
<p>Here the ‘Left’ branch is used to terminate the recursion and just return the
existing tail on the spot.</p>
<h3 id="example-two">Example Two</h3>
<p>That was pretty easy, so let’s take it up a notch and implement list
concatenation:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cat</span> <span class="o">::</span> <span class="kt">List</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">List</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">List</span> <span class="n">a</span>
<span class="n">cat</span> <span class="n">l0</span> <span class="n">l1</span> <span class="o">=</span> <span class="n">apo</span> <span class="n">coalg</span> <span class="p">(</span><span class="n">project</span> <span class="n">l0</span><span class="p">)</span> <span class="kr">where</span>
  <span class="n">coalg</span> <span class="kt">NilF</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">project</span> <span class="n">l1</span> <span class="kr">of</span>
    <span class="kt">NilF</span> <span class="o">-></span> <span class="kt">NilF</span>
    <span class="kt">ConsF</span> <span class="n">h</span> <span class="n">t</span> <span class="o">-></span> <span class="kt">ConsF</span> <span class="n">h</span> <span class="p">(</span><span class="kt">Left</span> <span class="n">t</span><span class="p">)</span>
  <span class="n">coalg</span> <span class="p">(</span><span class="kt">ConsF</span> <span class="n">x</span> <span class="n">l</span><span class="p">)</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">project</span> <span class="n">l</span> <span class="kr">of</span>
    <span class="kt">NilF</span> <span class="o">-></span> <span class="kt">ConsF</span> <span class="n">x</span> <span class="p">(</span><span class="kt">Left</span> <span class="n">l1</span><span class="p">)</span>
    <span class="kt">ConsF</span> <span class="n">h</span> <span class="n">t</span> <span class="o">-></span> <span class="kt">ConsF</span> <span class="n">x</span> <span class="p">(</span><span class="kt">Right</span> <span class="p">(</span><span class="kt">ConsF</span> <span class="n">h</span> <span class="n">t</span><span class="p">))</span>
</code></pre></div></div>
<p>This one is slightly more involved, but the principles are almost entirely the
same. If the first list is empty we just return the second list, and if the
first list has exactly one element we return the list constructed by jamming
the second list onto it. The ‘Left’ branch again just terminates the recursion
and stops everything there.</p>
<p>If the first list has two or more elements? Then we actually do some work and
recurse, which is what the ‘Right’ branch indicates.</p>
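<p>To poke at the edge cases, here is a self-contained sketch of ‘cat’. It
assumes a local ‘Fix’ and a hand-written ‘apo’ in place of the library
versions, with ‘unfix’ standing in for ‘project’ and ‘toList’ added purely for
inspecting results:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}

newtype Fix f = Fix (f (Fix f))

unfix :: Fix f -> f (Fix f)
unfix (Fix x) = x

apo :: Functor f => (s -> f (Either (Fix f) s)) -> s -> Fix f
apo coalg = a where
  a = Fix . fmap (either id a) . coalg

data ListF e r = NilF | ConsF e r
  deriving Functor

type List e = Fix (ListF e)

fromList :: [e] -> List e
fromList = foldr (\x acc -> Fix (ConsF x acc)) (Fix NilF)

toList :: List e -> [e]
toList (Fix NilF)        = []
toList (Fix (ConsF x t)) = x : toList t

cat :: List e -> List e -> List e
cat l0 l1 = apo coalg (unfix l0) where
  -- First list exhausted: hand back the second list and stop.
  coalg NilF = case unfix l1 of
    NilF      -> NilF
    ConsF h t -> ConsF h (Left t)
  -- Last element of the first list: attach the second list and stop.
  -- Otherwise emit x and keep walking the first list.
  coalg (ConsF x l) = case unfix l of
    NilF      -> ConsF x (Left l1)
    ConsF h t -> ConsF x (Right (ConsF h t))
</code></pre></div></div>
<p>With this sketch, ‘toList (cat (fromList [1, 2]) (fromList [3]))’ is
‘[1,2,3]’, and concatenating with an empty list on either side returns the
other list unchanged.</p>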
<p>So hopefully you can see there’s nothing too weird going on - the coalgebras
are really simple once you get used to the Either constructors floating around
in there.</p>
<p>Paramorphisms involve an algebra that gives you access to a value of the
original input type in a <em>pair</em> - a product of two values. Since apomorphisms
are their dual, it’s no surprise that you can give them a value of the original
input type using ‘Either’ - a sum of two values.</p>
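<p>The duality is easiest to see with the two schemes side by side. This is a
sketch over a local ‘Fix’ type, not the library source, and ‘suffixes’ is a
hypothetical example of a paramorphism added for comparison:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}

newtype Fix f = Fix (f (Fix f))

unfix :: Fix f -> f (Fix f)
unfix (Fix x) = x

-- The algebra sees a product: the untouched subterm and its folded result.
para :: Functor f => (f (Fix f, a) -> a) -> Fix f -> a
para alg = p where
  p = alg . fmap (\t -> (t, p t)) . unfix

-- The coalgebra returns a sum: a finished subterm or a seed to unfold.
apo :: Functor f => (a -> f (Either (Fix f) a)) -> a -> Fix f
apo coalg = a where
  a = Fix . fmap (either id a) . coalg

data ListF e r = NilF | ConsF e r
  deriving Functor

type List e = Fix (ListF e)

fromList :: [e] -> List e
fromList = foldr (\x acc -> Fix (ConsF x acc)) (Fix NilF)

toList :: List e -> [e]
toList (Fix NilF)        = []
toList (Fix (ConsF x t)) = x : toList t

-- para in action: every suffix of a list, longest first.
suffixes :: List e -> [[e]]
suffixes = para alg where
  alg NilF                = [[]]
  alg (ConsF x (t, rest)) = (x : toList t) : rest
</code></pre></div></div>
<p>Reading the two type signatures against each other, the pair in the
algebra’s argument becomes an ‘Either’ in the coalgebra’s result, and the
arrows flip. For example, ‘suffixes (fromList [1, 2, 3])’ is
‘[[1,2,3],[2,3],[3],[]]’.</p>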
<h2 id="insertion-sort">Insertion Sort</h2>
<p>So yeah, my favourite example of an apomorphism is for implementing the ‘inner
loop’ of insertion sort, a famous worst-case \(O(n^2)\) comparison-based
sort. Granted, insertion sort itself is a bit of a toy algorithm, but the
pattern used to implement its internals is pretty interesting and more broadly
applicable.</p>
<p>This animation found on
<a href="https://commons.wikimedia.org/wiki/File:Insertion-sort-example-300px.gif">Wikipedia</a>
illustrates how insertion sort works:</p>
<p><img src="/images/insertion-sort.gif" alt="CC-BY-SA 3.0 Swfung8" /></p>
<p>We’ll actually be doing this thing in reverse - starting from the right-hand
side and scanning left - but here’s the inner loop that we’ll be concerned
with: if we’re looking at two elements that are out of sorted order, slide the
offending element to where it belongs by pushing it to the right until it hits
either a bigger element or the end of the list.</p>
<p>As an example, picture the following list:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[3, 1, 1, 2, 4, 3, 5, 1, 6, 2, 1]
</code></pre></div></div>
<p>The first two elements are out of sorted order, so we want to slide the 3
rightwards along the list until it bumps up against a larger element - here
that’s the 4.</p>
<p>The following function describes how to do that in general for our hand-rolled
list type:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">coalg</span> <span class="kt">NilF</span> <span class="o">=</span> <span class="kt">NilF</span>
<span class="n">coalg</span> <span class="p">(</span><span class="kt">ConsF</span> <span class="n">x</span> <span class="n">l</span><span class="p">)</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">project</span> <span class="n">l</span> <span class="kr">of</span>
  <span class="kt">NilF</span> <span class="o">-></span> <span class="kt">ConsF</span> <span class="n">x</span> <span class="p">(</span><span class="kt">Left</span> <span class="n">l</span><span class="p">)</span>
  <span class="kt">ConsF</span> <span class="n">h</span> <span class="n">t</span>
    <span class="o">|</span> <span class="n">x</span> <span class="o"><=</span> <span class="n">h</span> <span class="o">-></span> <span class="kt">ConsF</span> <span class="n">x</span> <span class="p">(</span><span class="kt">Left</span> <span class="n">l</span><span class="p">)</span>
    <span class="o">|</span> <span class="n">otherwise</span> <span class="o">-></span> <span class="kt">ConsF</span> <span class="n">h</span> <span class="p">(</span><span class="kt">Right</span> <span class="p">(</span><span class="kt">ConsF</span> <span class="n">x</span> <span class="n">t</span><span class="p">))</span>
</code></pre></div></div>
<p>It says:</p>
<ul>
<li>deconstruct the list; if it has no elements just return an empty list</li>
<li>if the list has only one element, or has at least two elements that are in
sorted order, terminate with the original list by passing the tail of the
list in the ‘Left’ branch</li>
<li>if the list has at least two elements that are out of sorted order, swap
them and recurse using the ‘Right’ branch</li>
</ul>
<p>And with that in place, we can use an apomorphism to put it to work:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">knockback</span> <span class="o">::</span> <span class="kt">Ord</span> <span class="n">a</span> <span class="o">=></span> <span class="kt">List</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">List</span> <span class="n">a</span>
<span class="n">knockback</span> <span class="o">=</span> <span class="n">apo</span> <span class="n">coalg</span> <span class="o">.</span> <span class="n">project</span> <span class="kr">where</span>
  <span class="n">coalg</span> <span class="kt">NilF</span> <span class="o">=</span> <span class="kt">NilF</span>
  <span class="n">coalg</span> <span class="p">(</span><span class="kt">ConsF</span> <span class="n">x</span> <span class="n">l</span><span class="p">)</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">project</span> <span class="n">l</span> <span class="kr">of</span>
    <span class="kt">NilF</span> <span class="o">-></span> <span class="kt">ConsF</span> <span class="n">x</span> <span class="p">(</span><span class="kt">Left</span> <span class="n">l</span><span class="p">)</span>
    <span class="kt">ConsF</span> <span class="n">h</span> <span class="n">t</span>
      <span class="o">|</span> <span class="n">x</span> <span class="o"><=</span> <span class="n">h</span> <span class="o">-></span> <span class="kt">ConsF</span> <span class="n">x</span> <span class="p">(</span><span class="kt">Left</span> <span class="n">l</span><span class="p">)</span>
      <span class="o">|</span> <span class="n">otherwise</span> <span class="o">-></span> <span class="kt">ConsF</span> <span class="n">h</span> <span class="p">(</span><span class="kt">Right</span> <span class="p">(</span><span class="kt">ConsF</span> <span class="n">x</span> <span class="n">t</span><span class="p">))</span>
</code></pre></div></div>
<p>Check out how it works on our original list, slotting the leading 3 in front of
the 4 as required. I’ll use a regular list here for readability:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> let test = [3, 1, 1, 2, 4, 3, 5, 1, 6, 2, 1]
> knockbackL test
[1, 1, 2, 3, 4, 3, 5, 1, 6, 2, 1]
</code></pre></div></div>
<p>Now to implement insertion sort we just want to do this repeatedly like in the
animation above.</p>
<p>This isn’t something you’d likely notice at first glance, but check out the
type of ‘knockback . embed’:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> :t knockback . embed
knockback . embed :: Ord a => ListF a (List a) -> List a
</code></pre></div></div>
<p>That’s an algebra in the ‘ListF a’ base functor, so we can drop it into <em>cata</em>:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">insertionSort</span> <span class="o">::</span> <span class="kt">Ord</span> <span class="n">a</span> <span class="o">=></span> <span class="kt">List</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">List</span> <span class="n">a</span>
<span class="n">insertionSort</span> <span class="o">=</span> <span class="n">cata</span> <span class="p">(</span><span class="n">knockback</span> <span class="o">.</span> <span class="n">embed</span><span class="p">)</span>
</code></pre></div></div>
<p>And voila, we have our sort.</p>
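<p>For anyone who wants the whole thing in one place, here is a sketch of the
nested schemes with no library dependency. The local ‘Fix’, ‘cata’, and ‘apo’
are hand-written stand-ins for the <em>recursion-schemes</em> versions, with
‘unfix’ and ‘Fix’ playing the roles of ‘project’ and ‘embed’:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}

newtype Fix f = Fix (f (Fix f))

unfix :: Fix f -> f (Fix f)
unfix (Fix x) = x

cata :: Functor f => (f a -> a) -> Fix f -> a
cata alg = c where c = alg . fmap c . unfix

apo :: Functor f => (s -> f (Either (Fix f) s)) -> s -> Fix f
apo coalg = a where a = Fix . fmap (either id a) . coalg

data ListF e r = NilF | ConsF e r
  deriving Functor

type List e = Fix (ListF e)

fromList :: [e] -> List e
fromList = foldr (\x acc -> Fix (ConsF x acc)) (Fix NilF)

toList :: List e -> [e]
toList = cata $ \layer -> case layer of
  NilF      -> []
  ConsF x t -> x : t

-- Inner loop: slide the head rightwards until it meets something >= it.
knockback :: Ord e => List e -> List e
knockback = apo coalg . unfix where
  coalg NilF = NilF
  coalg (ConsF x l) = case unfix l of
    NilF -> ConsF x (Left l)
    ConsF h t
      | x <= h    -> ConsF x (Left l)
      | otherwise -> ConsF h (Right (ConsF x t))

-- Outer loop: sort the tail first, then knock the head into place.
insertionSort :: Ord e => List e -> List e
insertionSort = cata (knockback . Fix)
</code></pre></div></div>
<p>Running ‘toList (insertionSort (fromList [3, 1, 1, 2, 4, 3, 5, 1, 6, 2, 1]))’
against this sketch gives ‘[1,1,1,1,2,2,3,3,4,5,6]’.</p>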
<p>If it’s not clear how the thing works, you can visualize the whole process as
working from the back of the list, knocking back unsorted elements and
recursing towards the front like so:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>[]
[1]
[2, 1] -> [1, 2]
[6, 1, 2] -> [1, 2, 6]
[1, 1, 2, 6]
[5, 1, 1, 2, 6] -> [1, 1, 2, 5, 6]
[3, 1, 1, 2, 5, 6] -> [1, 1, 2, 3, 5, 6]
[4, 1, 1, 2, 3, 5, 6] -> [1, 1, 2, 3, 4, 5, 6]
[2, 1, 1, 2, 3, 4, 5, 6] -> [1, 1, 2, 2, 3, 4, 5, 6]
[1, 1, 1, 2, 2, 3, 4, 5, 6]
[1, 1, 1, 1, 2, 2, 3, 4, 5, 6]
[3, 1, 1, 1, 1, 2, 2, 3, 4, 5, 6] -> [1, 1, 1, 1, 2, 2, 3, 3, 4, 5, 6]
[1, 1, 1, 1, 2, 2, 3, 3, 4, 5, 6]
</code></pre></div></div>
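<p>The sorted states in that trace can be regenerated with a short sketch.
It swaps the apomorphism for plain recursion on regular lists, so ‘slide’ and
‘steps’ are hypothetical stand-ins rather than the functions defined in this
post:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- One knockback pass, written with explicit recursion.
slide :: Ord a => [a] -> [a]
slide (x : h : t)
  | x > h = h : slide (x : t)
slide xs = xs

-- Every intermediate list, from the empty list up to the sorted result.
steps :: Ord a => [a] -> [[a]]
steps = reverse . scanr (\x acc -> slide (x : acc)) []
</code></pre></div></div>
<p>For instance, ‘steps [3, 1, 2]’ is ‘[[],[2],[1,2],[1,2,3]]’. Each entry is
the state after one more element from the back of the input has been knocked
into place.</p>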
<h2 id="wrapping-up">Wrapping Up</h2>
<p>And that’s it! If you’re unlucky you may be sorting asymptotically worse than
if you had used mergesort. But at least you’re doing it with <em>style</em>.</p>
<p>The ‘mapHead’ and ‘cat’ examples come from the unreadable <a href="http://cs.ioc.ee/~tarmo/papers/nwpt97-peas.pdf">Vene and
Uustalu</a> paper that first
described apomorphisms. The insertion sort example comes from Tim Williams’s
<a href="https://www.youtube.com/watch?v=Zw9KeP3OzpU">excellent recursion schemes
talk</a>.</p>
<p>As always, I’ve dumped the code for this article into a
<a href="https://gist.github.com/jtobin/8fe373e19aa1a232f0d3">gist</a>.</p>
<h2 id="addendum-using-regular-lists">Addendum: Using Regular Lists</h2>
<p>You’ll note that the ‘fromList’ and ‘knockbackL’ functions above operate on
regular Haskell lists. The short of it is that it’s easy to do this;
<em>recursion-schemes</em> defines a data family called ‘Prim’ that basically endows
lists with base functor constructors of their own. You just need to use ‘Nil’
in place of ‘[]’ and ‘Cons’ in place of ‘(:)’.</p>
<p>Here’s insertion sort implemented in the same way, but for regular lists:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>knockbackL :: Ord a => [a] -> [a]
knockbackL = apo coalg . project where
  coalg Nil = Nil
  coalg (Cons x l) = case project l of
    Nil -> Cons x (Left l)
    Cons h t
      | x <= h -> Cons x (Left l)
      | otherwise -> Cons h (Right (Cons x t))

insertionSortL :: Ord a => [a] -> [a]
insertionSortL = cata (knockbackL . embed)
</code></pre></div></div>
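<p>On regular lists there is also a familiar Prelude-level counterpart, which
is handy as a cross-check. In this sketch ‘foldr’ plays the role of
<em>cata</em> and ‘insert’ from ‘Data.List’ does the same job as one knockback
pass, assuming a lawful ‘Ord’ instance; ‘insertionSortP’ and ‘agreesWithSort’
are names made up for the comparison:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import Data.List (insert, sort)

-- Insert each element, back to front, into the already-sorted tail.
insertionSortP :: Ord a => [a] -> [a]
insertionSortP = foldr insert []

-- A quick sanity check against the library sort.
agreesWithSort :: [Int] -> Bool
agreesWithSort xs = insertionSortP xs == sort xs
</code></pre></div></div>
<p>Here ‘agreesWithSort [3, 1, 1, 2, 4, 3, 5, 1, 6, 2, 1]’ is ‘True’.</p>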
Yo Dawg We Heard You Like Derivatives
2016-01-08T00:00:00+04:00
https://jtobin.io/ad-via-recursion-schemes
<p>I noticed <a href="http://h2.jaguarpaw.co.uk/posts/symbolic-expressions-can-be-automatically-differentiated/">this
article</a>
by Tom Ellis today that provides an excellent ‘demystified’ introduction to
automatic differentiation. His exposition is exceptionally clear and simple.</p>
<p>Hopefully not in the spirit of re-mystifying things too much, I wanted to
demonstrate that this kind of forward-mode automatic differentiation can be
implemented using a catamorphism, which cleans up the various <code class="language-plaintext highlighter-rouge">let</code> statements
found in Tom’s version (at the expense of slightly more pattern matching).</p>
<p>Let me first duplicate his setup using the standard <a href="/practical-recursion-schemes">recursion
scheme</a> machinery:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">{-# LANGUAGE DeriveFunctor #-}</span>
<span class="cp">{-# LANGUAGE LambdaCase #-}</span>

<span class="kr">import</span> <span class="nn">Data.Functor.Foldable</span>

<span class="kr">data</span> <span class="kt">ExprF</span> <span class="n">r</span> <span class="o">=</span>
    <span class="kt">VarF</span>
  <span class="o">|</span> <span class="kt">ZeroF</span>
  <span class="o">|</span> <span class="kt">OneF</span>
  <span class="o">|</span> <span class="kt">NegateF</span> <span class="n">r</span>
  <span class="o">|</span> <span class="kt">SumF</span> <span class="n">r</span> <span class="n">r</span>
  <span class="o">|</span> <span class="kt">ProductF</span> <span class="n">r</span> <span class="n">r</span>
  <span class="o">|</span> <span class="kt">ExpF</span> <span class="n">r</span>
  <span class="kr">deriving</span> <span class="p">(</span><span class="kt">Show</span><span class="p">,</span> <span class="kt">Functor</span><span class="p">)</span>

<span class="kr">type</span> <span class="kt">Expr</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="kt">ExprF</span>
</code></pre></div></div>
<p>Since my expression type uses a fixed-point wrapper I’ll define my own
embedded language terms to get around it:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">var</span> <span class="o">::</span> <span class="kt">Expr</span>
<span class="n">var</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="kt">VarF</span>
<span class="n">zero</span> <span class="o">::</span> <span class="kt">Expr</span>
<span class="n">zero</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="kt">ZeroF</span>
<span class="n">one</span> <span class="o">::</span> <span class="kt">Expr</span>
<span class="n">one</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="kt">OneF</span>
<span class="n">neg</span> <span class="o">::</span> <span class="kt">Expr</span> <span class="o">-></span> <span class="kt">Expr</span>
<span class="n">neg</span> <span class="n">x</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="p">(</span><span class="kt">NegateF</span> <span class="n">x</span><span class="p">)</span>
<span class="n">add</span> <span class="o">::</span> <span class="kt">Expr</span> <span class="o">-></span> <span class="kt">Expr</span> <span class="o">-></span> <span class="kt">Expr</span>
<span class="n">add</span> <span class="n">a</span> <span class="n">b</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="p">(</span><span class="kt">SumF</span> <span class="n">a</span> <span class="n">b</span><span class="p">)</span>
<span class="n">prod</span> <span class="o">::</span> <span class="kt">Expr</span> <span class="o">-></span> <span class="kt">Expr</span> <span class="o">-></span> <span class="kt">Expr</span>
<span class="n">prod</span> <span class="n">a</span> <span class="n">b</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="p">(</span><span class="kt">ProductF</span> <span class="n">a</span> <span class="n">b</span><span class="p">)</span>
<span class="n">e</span> <span class="o">::</span> <span class="kt">Expr</span> <span class="o">-></span> <span class="kt">Expr</span>
<span class="n">e</span> <span class="n">x</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="p">(</span><span class="kt">ExpF</span> <span class="n">x</span><span class="p">)</span>
</code></pre></div></div>
<p>To implement a corresponding <code class="language-plaintext highlighter-rouge">eval</code> function we can use a catamorphism:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">eval</span> <span class="o">::</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Expr</span> <span class="o">-></span> <span class="kt">Double</span>
<span class="n">eval</span> <span class="n">x</span> <span class="o">=</span> <span class="n">cata</span> <span class="o">$</span> <span class="nf">\</span><span class="kr">case</span>
  <span class="kt">VarF</span> <span class="o">-></span> <span class="n">x</span>
  <span class="kt">ZeroF</span> <span class="o">-></span> <span class="mi">0</span>
  <span class="kt">OneF</span> <span class="o">-></span> <span class="mi">1</span>
  <span class="kt">NegateF</span> <span class="n">a</span> <span class="o">-></span> <span class="n">negate</span> <span class="n">a</span>
  <span class="kt">SumF</span> <span class="n">a</span> <span class="n">b</span> <span class="o">-></span> <span class="n">a</span> <span class="o">+</span> <span class="n">b</span>
  <span class="kt">ProductF</span> <span class="n">a</span> <span class="n">b</span> <span class="o">-></span> <span class="n">a</span> <span class="o">*</span> <span class="n">b</span>
  <span class="kt">ExpF</span> <span class="n">a</span> <span class="o">-></span> <span class="n">exp</span> <span class="n">a</span>
</code></pre></div></div>
<p>Very clear. We just match things mechanically.</p>
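<p>To see what the catamorphism is doing mechanically, here is a minimal self-contained sketch that avoids the <em>recursion-schemes</em> dependency. The local <code class="language-plaintext highlighter-rouge">Fix</code>, <code class="language-plaintext highlighter-rouge">unfix</code>, and <code class="language-plaintext highlighter-rouge">cata</code> are simplified stand-ins assumed for illustration (the library's versions are more general), and <code class="language-plaintext highlighter-rouge">example</code> is a made-up expression:</p>

```haskell
{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE LambdaCase #-}

-- Simplified stand-ins for the recursion-schemes machinery (assumed here).
newtype Fix f = Fix (f (Fix f))

unfix :: Fix f -> f (Fix f)
unfix (Fix f) = f

-- Fold bottom-up: fold every child first, then apply the algebra to the
-- layer whose children have been replaced by their results.
cata :: Functor f => (f a -> a) -> Fix f -> a
cata alg = alg . fmap (cata alg) . unfix

data ExprF r =
    VarF
  | ZeroF
  | OneF
  | NegateF r
  | SumF r r
  | ProductF r r
  | ExpF r
  deriving (Show, Functor)

type Expr = Fix ExprF

eval :: Double -> Expr -> Double
eval x = cata $ \case
  VarF         -> x
  ZeroF        -> 0
  OneF         -> 1
  NegateF a    -> negate a
  SumF a b     -> a + b
  ProductF a b -> a * b
  ExpF a       -> exp a

-- x * x + 1, written with the raw constructors.
example :: Expr
example = Fix (SumF (Fix (ProductF (Fix VarF) (Fix VarF))) (Fix OneF))

-- At x = 3 the algebra sees 'ProductF 3 3' and then 'SumF 9 1'.
exampleValue :: Double
exampleValue = eval 3 example
```

<p>By the time the algebra runs on a node, its children are already plain doubles, which is why each branch of <code class="language-plaintext highlighter-rouge">eval</code> is a one-liner.</p>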
<p>Now, symbolic differentiation. If you refer to the original <code class="language-plaintext highlighter-rouge">diff</code> function
you’ll notice that in cases like <code class="language-plaintext highlighter-rouge">Product</code> or <code class="language-plaintext highlighter-rouge">Exp</code> there are uses of both an
original expression and also its derivative. This can be captured simply by a
paramorphism:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">diff</span> <span class="o">::</span> <span class="kt">Expr</span> <span class="o">-></span> <span class="kt">Expr</span>
<span class="n">diff</span> <span class="o">=</span> <span class="n">para</span> <span class="o">$</span> <span class="nf">\</span><span class="kr">case</span>
  <span class="kt">VarF</span> <span class="o">-></span> <span class="n">one</span>
  <span class="kt">ZeroF</span> <span class="o">-></span> <span class="n">zero</span>
  <span class="kt">OneF</span> <span class="o">-></span> <span class="n">zero</span>
  <span class="kt">NegateF</span> <span class="p">(</span><span class="kr">_</span><span class="p">,</span> <span class="n">x'</span><span class="p">)</span> <span class="o">-></span> <span class="n">neg</span> <span class="n">x'</span>
  <span class="kt">SumF</span> <span class="p">(</span><span class="kr">_</span><span class="p">,</span> <span class="n">x'</span><span class="p">)</span> <span class="p">(</span><span class="kr">_</span><span class="p">,</span> <span class="n">y'</span><span class="p">)</span> <span class="o">-></span> <span class="n">add</span> <span class="n">x'</span> <span class="n">y'</span>
  <span class="kt">ProductF</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">x'</span><span class="p">)</span> <span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">y'</span><span class="p">)</span> <span class="o">-></span> <span class="n">add</span> <span class="p">(</span><span class="n">prod</span> <span class="n">x</span> <span class="n">y'</span><span class="p">)</span> <span class="p">(</span><span class="n">prod</span> <span class="n">x'</span> <span class="n">y</span><span class="p">)</span>
  <span class="kt">ExpF</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">x'</span><span class="p">)</span> <span class="o">-></span> <span class="n">prod</span> <span class="p">(</span><span class="n">e</span> <span class="n">x</span><span class="p">)</span> <span class="n">x'</span>
</code></pre></div></div>
<p>Here the primes indicate derivatives in the usual fashion, and the standard
rules of differentiation are self-explanatory.</p>
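<p>As a quick check of the paramorphism version, the following self-contained sketch differentiates <code class="language-plaintext highlighter-rouge">x * exp x</code> symbolically and evaluates the result. The local <code class="language-plaintext highlighter-rouge">Fix</code>, <code class="language-plaintext highlighter-rouge">cata</code>, and <code class="language-plaintext highlighter-rouge">para</code> are simplified stand-ins assumed for illustration rather than the library's definitions, and <code class="language-plaintext highlighter-rouge">f</code> and <code class="language-plaintext highlighter-rouge">slopeAtTwo</code> are hypothetical names:</p>

```haskell
{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE LambdaCase #-}

newtype Fix f = Fix (f (Fix f))

unfix :: Fix f -> f (Fix f)
unfix (Fix f) = f

cata :: Functor f => (f a -> a) -> Fix f -> a
cata alg = alg . fmap (cata alg) . unfix

-- Like cata, but each child arrives as (original subterm, folded result).
para :: Functor f => (f (Fix f, a) -> a) -> Fix f -> a
para alg = alg . fmap (\t -> (t, para alg t)) . unfix

data ExprF r =
    VarF
  | ZeroF
  | OneF
  | NegateF r
  | SumF r r
  | ProductF r r
  | ExpF r
  deriving (Show, Functor)

type Expr = Fix ExprF

eval :: Double -> Expr -> Double
eval x = cata $ \case
  VarF         -> x
  ZeroF        -> 0
  OneF         -> 1
  NegateF a    -> negate a
  SumF a b     -> a + b
  ProductF a b -> a * b
  ExpF a       -> exp a

-- Primed names are derivatives; unprimed names are the original subterms.
diff :: Expr -> Expr
diff = para $ \case
  VarF                     -> Fix OneF
  ZeroF                    -> Fix ZeroF
  OneF                     -> Fix ZeroF
  NegateF (_, x')          -> Fix (NegateF x')
  SumF (_, x') (_, y')     -> Fix (SumF x' y')
  ProductF (x, x') (y, y') ->
    Fix (SumF (Fix (ProductF x y')) (Fix (ProductF x' y)))
  ExpF (x, x')             -> Fix (ProductF (Fix (ExpF x)) x')

-- f(x) = x * exp x, so f'(x) = x * exp x + exp x.
f :: Expr
f = Fix (ProductF (Fix VarF) (Fix (ExpF (Fix VarF))))

-- Evaluate the symbolic derivative at x = 2; analytically this is 3 * exp 2.
slopeAtTwo :: Double
slopeAtTwo = eval 2 (diff f)
```

<p>The product and exponential cases are the ones that actually need the untouched subterm, which is exactly what the first component of each pair supplies.</p>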
<p>For automatic differentiation we just do sort of the same thing, except we’re
also going to lug around the evaluated function value itself at each point
and evaluate to doubles instead of other expressions.</p>
<p>It’s worth noting here: why doubles? Because the expression type that we’ve
defined has no notion of sharing, and thus the expressions will blow up à la
<code class="language-plaintext highlighter-rouge">diff</code> (to see what I mean, try printing the analogue of <code class="language-plaintext highlighter-rouge">diff bigExpression</code>
in GHCi). This could probably be mitigated by <a href="sharing-in-haskell-edsls">incorporating sharing into the
embedded language</a> in some way, but that’s a topic
for another post.</p>
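<p>To get a rough feel for the growth, here is a small self-contained sketch that counts nodes before and after symbolic differentiation. It uses a cut-down expression functor and local stand-ins for <code class="language-plaintext highlighter-rouge">cata</code> and <code class="language-plaintext highlighter-rouge">para</code>; the helpers <code class="language-plaintext highlighter-rouge">size</code> and <code class="language-plaintext highlighter-rouge">squares</code> are hypothetical and are not the expressions used in Tom's article:</p>

```haskell
{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE LambdaCase #-}

newtype Fix f = Fix (f (Fix f))

unfix :: Fix f -> f (Fix f)
unfix (Fix f) = f

cata :: Functor f => (f a -> a) -> Fix f -> a
cata alg = alg . fmap (cata alg) . unfix

para :: Functor f => (f (Fix f, a) -> a) -> Fix f -> a
para alg = alg . fmap (\t -> (t, para alg t)) . unfix

-- A cut-down expression functor: just enough to show the growth.
data ExprF r =
    VarF
  | ZeroF
  | OneF
  | SumF r r
  | ProductF r r
  deriving Functor

type Expr = Fix ExprF

diff :: Expr -> Expr
diff = para $ \case
  VarF                     -> Fix OneF
  ZeroF                    -> Fix ZeroF
  OneF                     -> Fix ZeroF
  SumF (_, x') (_, y')     -> Fix (SumF x' y')
  ProductF (x, x') (y, y') ->
    Fix (SumF (Fix (ProductF x y')) (Fix (ProductF x' y)))

-- Number of constructors in an expression, counted as a tree.
size :: Expr -> Int
size = cata $ \case
  SumF a b     -> 1 + a + b
  ProductF a b -> 1 + a + b
  _            -> 1

-- Repeated squaring: squares 0 = x, squares (k + 1) = squares k * squares k.
squares :: Int -> Expr
squares 0 = Fix VarF
squares k = let t = squares (k - 1) in Fix (ProductF t t)

-- (size of expression, size of its derivative) for k = 1, 2, 3.
growth :: [(Int, Int)]
growth = [ (size t, size (diff t)) | k <- [1, 2, 3], let t = squares k ]
```

<p>Each product node copies both of its original subterms into the derivative, so the derivative tree outgrows the expression it came from. Evaluating to doubles sidesteps this because nothing symbolic is ever built.</p>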
<p>Anyway, a catamorphism will do the trick:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">ad</span> <span class="o">::</span> <span class="kt">Double</span> <span class="o">-></span> <span class="kt">Expr</span> <span class="o">-></span> <span class="p">(</span><span class="kt">Double</span><span class="p">,</span> <span class="kt">Double</span><span class="p">)</span>
<span class="n">ad</span> <span class="n">x</span> <span class="o">=</span> <span class="n">cata</span> <span class="o">$</span> <span class="nf">\</span><span class="kr">case</span>
  <span class="kt">VarF</span> <span class="o">-></span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="mi">1</span><span class="p">)</span>
  <span class="kt">ZeroF</span> <span class="o">-></span> <span class="p">(</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
  <span class="kt">OneF</span> <span class="o">-></span> <span class="p">(</span><span class="mi">1</span><span class="p">,</span> <span class="mi">0</span><span class="p">)</span>
  <span class="kt">NegateF</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">x'</span><span class="p">)</span> <span class="o">-></span> <span class="p">(</span><span class="n">negate</span> <span class="n">x</span><span class="p">,</span> <span class="n">negate</span> <span class="n">x'</span><span class="p">)</span>
  <span class="kt">SumF</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">x'</span><span class="p">)</span> <span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">y'</span><span class="p">)</span> <span class="o">-></span> <span class="p">(</span><span class="n">x</span> <span class="o">+</span> <span class="n">y</span><span class="p">,</span> <span class="n">x'</span> <span class="o">+</span> <span class="n">y'</span><span class="p">)</span>
  <span class="kt">ProductF</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">x'</span><span class="p">)</span> <span class="p">(</span><span class="n">y</span><span class="p">,</span> <span class="n">y'</span><span class="p">)</span> <span class="o">-></span> <span class="p">(</span><span class="n">x</span> <span class="o">*</span> <span class="n">y</span><span class="p">,</span> <span class="n">x</span> <span class="o">*</span> <span class="n">y'</span> <span class="o">+</span> <span class="n">x'</span> <span class="o">*</span> <span class="n">y</span><span class="p">)</span>
  <span class="kt">ExpF</span> <span class="p">(</span><span class="n">x</span><span class="p">,</span> <span class="n">x'</span><span class="p">)</span> <span class="o">-></span> <span class="p">(</span><span class="n">exp</span> <span class="n">x</span><span class="p">,</span> <span class="n">exp</span> <span class="n">x</span> <span class="o">*</span> <span class="n">x'</span><span class="p">)</span>
</code></pre></div></div>
<p>Take a look at the pairs to the right of the pattern matches; the first element
in each is just the corresponding term from <code class="language-plaintext highlighter-rouge">eval</code>, and the second is just the
corresponding term from <code class="language-plaintext highlighter-rouge">diff</code> (made ‘Double’-friendly). The catamorphism
gives us access to all the terms we need, and we can avoid a lot of work on
the right-hand side by doing some more pattern matching on the left.</p>
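<p>A small worked case may help here. The sketch below is self-contained, with a local <code class="language-plaintext highlighter-rouge">Fix</code> and <code class="language-plaintext highlighter-rouge">cata</code> assumed as simplified stand-ins for the library versions. It runs <code class="language-plaintext highlighter-rouge">ad</code> on <code class="language-plaintext highlighter-rouge">exp (x * x)</code> at <code class="language-plaintext highlighter-rouge">x = 1</code>; the pattern variables are renamed to <code class="language-plaintext highlighter-rouge">a</code> and <code class="language-plaintext highlighter-rouge">b</code> only to avoid shadowing the argument <code class="language-plaintext highlighter-rouge">x</code>, and <code class="language-plaintext highlighter-rouge">expSquare</code> is a hypothetical example:</p>

```haskell
{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE LambdaCase #-}

newtype Fix f = Fix (f (Fix f))

unfix :: Fix f -> f (Fix f)
unfix (Fix f) = f

cata :: Functor f => (f a -> a) -> Fix f -> a
cata alg = alg . fmap (cata alg) . unfix

data ExprF r =
    VarF
  | ZeroF
  | OneF
  | NegateF r
  | SumF r r
  | ProductF r r
  | ExpF r
  deriving (Show, Functor)

type Expr = Fix ExprF

-- Each node folds to (value, derivative), both evaluated at the point x.
ad :: Double -> Expr -> (Double, Double)
ad x = cata $ \case
  VarF                     -> (x, 1)
  ZeroF                    -> (0, 0)
  OneF                     -> (1, 0)
  NegateF (a, a')          -> (negate a, negate a')
  SumF (a, a') (b, b')     -> (a + b, a' + b')
  ProductF (a, a') (b, b') -> (a * b, a * b' + a' * b)
  ExpF (a, a')             -> (exp a, exp a * a')

-- exp (x * x)
expSquare :: Expr
expSquare = Fix (ExpF (Fix (ProductF (Fix VarF) (Fix VarF))))

-- Trace at x = 1:
--   VarF                  gives (1, 1)
--   ProductF (1,1) (1,1)  gives (1, 1 * 1 + 1 * 1) = (1, 2)
--   ExpF (1, 2)           gives (exp 1, exp 1 * 2)
valueAndSlope :: (Double, Double)
valueAndSlope = ad 1 expSquare
```

<p>The result agrees with the analytic derivative <code class="language-plaintext highlighter-rouge">2 * x * exp (x * x)</code> at <code class="language-plaintext highlighter-rouge">x = 1</code>, and no intermediate expression is ever constructed.</p>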
<p>Some sanity checks to make sure that these functions match up with Tom’s:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>*Main> map (snd . (`ad` testSmall)) [0.0009, 1.0, 1.0001]
[0.12254834896191881,1.0,1.0003000600100016]
*Main> map (snd . (`ad` testBig)) [0.00009, 1.0, 1.00001]
[3.2478565715996756e-6,1.0,1.0100754777229357]
</code></pre></div></div>
<p>UPDATE:</p>
<p>I had originally defined <code class="language-plaintext highlighter-rouge">ad</code> using a paramorphism but noticed that we can get
by just fine with <em>cata</em>.</p>
A Tour of Some Useful Recursive Types2015-12-09T00:00:00+04:00https://jtobin.io/tour-of-some-recursive-types<p>I’m presently at <a href="https://nips.cc/">NIPS</a> and so felt like writing about some
appropriate machine learning topic, but along the way I wound up talking about
parameterized recursive types, and here we are. Enjoy!</p>
<p>One starts to see common ‘shapes’ in algebraic data types after working with
them for a while. Take the natural numbers and a standard linked list, for
example:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">Natural</span> <span class="o">=</span>
    <span class="kt">One</span>
  <span class="o">|</span> <span class="kt">Succ</span> <span class="kt">Natural</span>
<span class="kr">data</span> <span class="kt">List</span> <span class="n">a</span> <span class="o">=</span>
    <span class="kt">Empty</span>
  <span class="o">|</span> <span class="kt">Cons</span> <span class="n">a</span> <span class="p">(</span><span class="kt">List</span> <span class="n">a</span><span class="p">)</span>
</code></pre></div></div>
<p>These are similar in some sense. There are some differences - a list has an
additional type parameter, and each recursive point in the list is tagged with
a value of that type - but the nature of the recursion in each is the same.
There is a single recursive point wrapped up in a single constructor, plus a
single base case.</p>
<p>Consider a recursive type that is parameterized by a functor with kind ‘* ->
*’, such that the kind of the resulting type is something like ‘(* -> *) ->
*’ or ‘(* -> *) -> * -> *’ or so on. It’s interesting to look at the
‘shapes’ of some useful types like this and see what kinds of similarities and
differences in recursive structure we can find.</p>
<p>In this article we’ll look at three such recursive types: ‘Fix’, ‘Free’, and
‘Cofree’. I’ll demonstrate that each can be viewed as a kind of program
parameterized by some underlying instruction set.</p>
<h2 id="fix">Fix</h2>
<p>To start, let’s review the famous fixed-point type ‘Fix’. I’ve talked about it
<a href="/practical-recursion-schemes">before</a>, but will go into a bit more detail here.</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">{-# LANGUAGE DeriveFunctor #-}</span>
<span class="cp">{-# LANGUAGE FlexibleContexts #-}</span>
<span class="cp">{-# LANGUAGE StandaloneDeriving #-}</span>
<span class="cp">{-# LANGUAGE UndecidableInstances #-}</span>
<span class="kr">newtype</span> <span class="kt">Fix</span> <span class="n">f</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="p">(</span><span class="n">f</span> <span class="p">(</span><span class="kt">Fix</span> <span class="n">f</span><span class="p">))</span>
<span class="kr">deriving</span> <span class="kr">instance</span> <span class="p">(</span><span class="kt">Show</span> <span class="p">(</span><span class="n">f</span> <span class="p">(</span><span class="kt">Fix</span> <span class="n">f</span><span class="p">)))</span> <span class="o">=></span> <span class="kt">Show</span> <span class="p">(</span><span class="kt">Fix</span> <span class="n">f</span><span class="p">)</span>
</code></pre></div></div>
<p>Note: I’ll omit interpreter output for examples throughout this article, but
feel free to try the code yourself in GHCi. I’ll post some gists at the
bottom. The above code block also contains some pragmas that you can ignore;
they’re just there to help GHC derive some instances for us.</p>
<p>Anyway. ‘Fix’ is in some sense a template recursive structure. It relies on
some underlying functor ‘f’ to define the scope of recursion that you can
expect a value with type ‘Fix f’ to support. There is the degenerate constant
case, for example, which supports <em>no</em> recursion:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">DegenerateF</span> <span class="n">r</span> <span class="o">=</span> <span class="kt">DegenerateF</span>
  <span class="kr">deriving</span> <span class="p">(</span><span class="kt">Functor</span><span class="p">,</span> <span class="kt">Show</span><span class="p">)</span>
<span class="kr">type</span> <span class="kt">Degenerate</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="kt">DegenerateF</span>
<span class="n">degenerate</span> <span class="o">::</span> <span class="kt">Degenerate</span>
<span class="n">degenerate</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="kt">DegenerateF</span>
</code></pre></div></div>
<p>Then you have the case like the one below, where <em>only</em> an infinitely recursive
value exists:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">newtype</span> <span class="kt">InfiniteF</span> <span class="n">r</span> <span class="o">=</span> <span class="kt">InfiniteF</span> <span class="n">r</span>
  <span class="kr">deriving</span> <span class="p">(</span><span class="kt">Functor</span><span class="p">,</span> <span class="kt">Show</span><span class="p">)</span>
<span class="kr">type</span> <span class="kt">Infinite</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="kt">InfiniteF</span>
<span class="n">infinite</span> <span class="o">::</span> <span class="kt">Infinite</span>
<span class="n">infinite</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="p">(</span><span class="kt">InfiniteF</span> <span class="n">infinite</span><span class="p">)</span>
</code></pre></div></div>
<p>But in practice you’ll have something in between; a type with at least one
recursive point or ‘running’ case and also at least one base or ‘terminating’
case. Take the natural numbers, for example:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">NatF</span> <span class="n">r</span> <span class="o">=</span>
    <span class="kt">OneF</span>
  <span class="o">|</span> <span class="kt">SuccF</span> <span class="n">r</span>
  <span class="kr">deriving</span> <span class="p">(</span><span class="kt">Functor</span><span class="p">,</span> <span class="kt">Show</span><span class="p">)</span>
<span class="kr">type</span> <span class="kt">Nat</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="kt">NatF</span>
<span class="n">one</span> <span class="o">::</span> <span class="kt">Nat</span>
<span class="n">one</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="kt">OneF</span>
<span class="n">succ</span> <span class="o">::</span> <span class="kt">Nat</span> <span class="o">-></span> <span class="kt">Nat</span>
<span class="n">succ</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="o">.</span> <span class="kt">SuccF</span>
</code></pre></div></div>
<p>Here ‘NatF’ provides both a ‘running’ case - ‘SuccF’ - and a ‘terminating’ case
- ‘OneF’. ‘Fix’ just lets ‘NatF’ do whatever it wants, having no say of its
own about termination. In fact, we could have defined ‘Fix’ like this:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">Program</span> <span class="n">f</span> <span class="o">=</span> <span class="kt">Running</span> <span class="p">(</span><span class="n">f</span> <span class="p">(</span><span class="kt">Program</span> <span class="n">f</span><span class="p">))</span>
</code></pre></div></div>
<p>Indeed, you can think of ‘Fix’ as defining a program that runs until ‘f’
decides to terminate. In turn, you can think of ‘f’ as an instruction
set for the program. The whole shebang of ‘Fix f’ may only terminate if ‘f’
contains a terminating instruction.</p>
<p>Here’s a simple set of instructions, for example:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">Instruction</span> <span class="n">r</span> <span class="o">=</span>
    <span class="kt">Increment</span> <span class="n">r</span>
  <span class="o">|</span> <span class="kt">Decrement</span> <span class="n">r</span>
  <span class="o">|</span> <span class="kt">Terminate</span>
  <span class="kr">deriving</span> <span class="p">(</span><span class="kt">Functor</span><span class="p">,</span> <span class="kt">Show</span><span class="p">)</span>
<span class="n">increment</span> <span class="o">::</span> <span class="kt">Program</span> <span class="kt">Instruction</span> <span class="o">-></span> <span class="kt">Program</span> <span class="kt">Instruction</span>
<span class="n">increment</span> <span class="o">=</span> <span class="kt">Running</span> <span class="o">.</span> <span class="kt">Increment</span>
<span class="n">decrement</span> <span class="o">::</span> <span class="kt">Program</span> <span class="kt">Instruction</span> <span class="o">-></span> <span class="kt">Program</span> <span class="kt">Instruction</span>
<span class="n">decrement</span> <span class="o">=</span> <span class="kt">Running</span> <span class="o">.</span> <span class="kt">Decrement</span>
<span class="n">terminate</span> <span class="o">::</span> <span class="kt">Program</span> <span class="kt">Instruction</span>
<span class="n">terminate</span> <span class="o">=</span> <span class="kt">Running</span> <span class="kt">Terminate</span>
</code></pre></div></div>
<p>And we can write a sort of stack-based program like so:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">program</span> <span class="o">::</span> <span class="kt">Program</span> <span class="kt">Instruction</span>
<span class="n">program</span> <span class="o">=</span>
    <span class="n">increment</span>
  <span class="o">.</span> <span class="n">increment</span>
  <span class="o">.</span> <span class="n">decrement</span>
  <span class="o">$</span> <span class="n">terminate</span>
</code></pre></div></div>
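<p>A program like this is just data until something interprets it. As a minimal sketch, here is one possible interpreter that threads a counter through the instructions. The function ‘run’, its starting value, and the meaning it gives to each instruction are assumptions made for illustration:</p>

```haskell
data Program f = Running (f (Program f))

data Instruction r =
    Increment r
  | Decrement r
  | Terminate

increment :: Program Instruction -> Program Instruction
increment = Running . Increment

decrement :: Program Instruction -> Program Instruction
decrement = Running . Decrement

terminate :: Program Instruction
terminate = Running Terminate

program :: Program Instruction
program =
    increment
  . increment
  . decrement
  $ terminate

-- Hypothetical interpreter: walk the instructions from the outside in,
-- updating a counter, and return the counter when 'Terminate' is reached.
run :: Int -> Program Instruction -> Int
run n (Running instr) = case instr of
  Increment rest -> run (n + 1) rest
  Decrement rest -> run (n - 1) rest
  Terminate      -> n

-- Two increments and one decrement, starting from zero.
result :: Int
result = run 0 program
```

<p>Note that ‘run’ can only stop on ‘Terminate’: the ‘Running’ wrapper itself offers no other way out, which is the sense in which termination is entirely up to the instruction set.</p>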
<h3 id="richness-of-fix">Richness of ‘Fix’</h3>
<p>It’s worthwhile to review two functions that are useful for working with ‘Fix’,
unimaginatively named ‘fix’ and ‘unfix’:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">fix</span> <span class="o">::</span> <span class="n">f</span> <span class="p">(</span><span class="kt">Fix</span> <span class="n">f</span><span class="p">)</span> <span class="o">-></span> <span class="kt">Fix</span> <span class="n">f</span>
<span class="n">fix</span> <span class="o">=</span> <span class="kt">Fix</span>
<span class="n">unfix</span> <span class="o">::</span> <span class="kt">Fix</span> <span class="n">f</span> <span class="o">-></span> <span class="n">f</span> <span class="p">(</span><span class="kt">Fix</span> <span class="n">f</span><span class="p">)</span>
<span class="n">unfix</span> <span class="p">(</span><span class="kt">Fix</span> <span class="n">f</span><span class="p">)</span> <span class="o">=</span> <span class="n">f</span>
</code></pre></div></div>
<p>You can think of them like so: ‘fix’ embeds a value of type ‘f (Fix f)’ into a
recursive structure by adding a new layer of recursion, while ‘unfix’ projects
a value of type ‘f (Fix f)’ out of a recursive structure by peeling back a layer of
recursion.</p>
<p>This is a pretty rich recursive structure - we have a guarantee that we can
<em>always</em> embed into or project out of something with type ‘Fix f’, no matter
what ‘f’ is.</p>
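<p>One payoff of an always-available ‘unfix’ is that a generic fold can be written once for every ‘Fix f’. Here is a minimal sketch, assuming a simplified ‘cata’ defined locally (not the library version) and a hypothetical ‘natToInt’ helper:</p>

```haskell
{-# LANGUAGE DeriveFunctor #-}

newtype Fix f = Fix (f (Fix f))

fix :: f (Fix f) -> Fix f
fix = Fix

unfix :: Fix f -> f (Fix f)
unfix (Fix f) = f

data NatF r =
    OneF
  | SuccF r
  deriving (Functor, Show)

type Nat = Fix NatF

-- Peel a layer with 'unfix', fold the children, then apply the algebra.
-- This is total precisely because 'unfix' never fails.
cata :: Functor f => (f a -> a) -> Fix f -> a
cata alg = alg . fmap (cata alg) . unfix

-- Naturals here start at one, matching the 'OneF' base case.
natToInt :: Nat -> Int
natToInt = cata alg where
  alg OneF      = 1
  alg (SuccF n) = n + 1

three :: Nat
three = fix (SuccF (fix (SuccF (fix OneF))))

-- Peeling a layer and putting it straight back changes nothing.
roundTrip :: Nat
roundTrip = fix (unfix three)
```

<p>Both ‘natToInt three’ and ‘natToInt roundTrip’ come out as three, since ‘fix’ and ‘unfix’ are inverses.</p>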
<h2 id="free">Free</h2>
<p>Next up is ‘Free’, which is really just ‘Fix’ with some added structure. It is
defined as follows:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">Free</span> <span class="n">f</span> <span class="n">a</span> <span class="o">=</span>
    <span class="kt">Free</span> <span class="p">(</span><span class="n">f</span> <span class="p">(</span><span class="kt">Free</span> <span class="n">f</span> <span class="n">a</span><span class="p">))</span>
  <span class="o">|</span> <span class="kt">Pure</span> <span class="n">a</span>
  <span class="kr">deriving</span> <span class="kt">Functor</span>
<span class="kr">deriving</span> <span class="kr">instance</span> <span class="p">(</span><span class="kt">Show</span> <span class="n">a</span><span class="p">,</span> <span class="kt">Show</span> <span class="p">(</span><span class="n">f</span> <span class="p">(</span><span class="kt">Free</span> <span class="n">f</span> <span class="n">a</span><span class="p">)))</span> <span class="o">=></span> <span class="kt">Show</span> <span class="p">(</span><span class="kt">Free</span> <span class="n">f</span> <span class="n">a</span><span class="p">)</span>
</code></pre></div></div>
<p>The ‘Free’ constructor has an analogous definition to the ‘Fix’ constructor,
meaning we can use ‘Free’ to implement the same things we did previously. Here
are the natural numbers redux, for example:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">type</span> <span class="kt">NatFree</span> <span class="o">=</span> <span class="kt">Free</span> <span class="kt">NatF</span>
<span class="n">oneFree</span> <span class="o">::</span> <span class="kt">NatFree</span> <span class="n">a</span>
<span class="n">oneFree</span> <span class="o">=</span> <span class="kt">Free</span> <span class="kt">OneF</span>
<span class="n">succFree</span> <span class="o">::</span> <span class="kt">NatFree</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">NatFree</span> <span class="n">a</span>
<span class="n">succFree</span> <span class="o">=</span> <span class="kt">Free</span> <span class="o">.</span> <span class="kt">SuccF</span>
</code></pre></div></div>
<p>There’s also another branch here called ‘Pure’, though, that just bluntly wraps
a value of type ‘a’, and has nothing to do with the parameter ‘f’. This has an
interesting consequence: it means that ‘Free’ can have an opinion of its own
about termination, regardless of what ‘f’ might decree:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">type</span> <span class="kt">NotSoInfinite</span> <span class="o">=</span> <span class="kt">Free</span> <span class="kt">InfiniteF</span>
<span class="n">notSoInfinite</span> <span class="o">::</span> <span class="kt">NotSoInfinite</span> <span class="nb">()</span>
<span class="n">notSoInfinite</span> <span class="o">=</span> <span class="kt">Free</span> <span class="p">(</span><span class="kt">InfiniteF</span> <span class="p">(</span><span class="kt">Free</span> <span class="p">(</span><span class="kt">InfiniteF</span> <span class="p">(</span><span class="kt">Pure</span> <span class="nb">()</span><span class="p">))))</span>
</code></pre></div></div>
<p>(Note that here I’ve returned the value of type unit when terminating under the
‘Pure’ branch, but you could pick whatever else you’d like.)</p>
<p>You’ll recall that ‘InfiniteF’ provides no terminating instruction,
and left to its own devices will just recurse endlessly.</p>
<p>So: instead of being forced to choose a branch of the underlying functor to
recurse on, ‘Free’ can just bail out on a whim and return some value wrapped up
in ‘Pure’. We could have defined the whole type like this:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">Program</span> <span class="n">f</span> <span class="n">a</span> <span class="o">=</span>
    <span class="kt">Running</span> <span class="p">(</span><span class="n">f</span> <span class="p">(</span><span class="kt">Program</span> <span class="n">f</span> <span class="n">a</span><span class="p">))</span>
  <span class="o">|</span> <span class="kt">Terminated</span> <span class="n">a</span>
  <span class="kr">deriving</span> <span class="kt">Functor</span>
</code></pre></div></div>
<p>Again, it’s ‘Fix’ with more structure. It’s a program that runs until ‘f’
decides to terminate, <em>or</em> that terminates and returns a value of type ‘a’.</p>
<p>As a quick illustration, take our simple stack-based instruction set again. We
can define the following embedded language terms:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">increment</span> <span class="o">::</span> <span class="kt">Program</span> <span class="kt">Instruction</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">Program</span> <span class="kt">Instruction</span> <span class="n">a</span>
<span class="n">increment</span> <span class="o">=</span> <span class="kt">Running</span> <span class="o">.</span> <span class="kt">Increment</span>
<span class="n">decrement</span> <span class="o">::</span> <span class="kt">Program</span> <span class="kt">Instruction</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">Program</span> <span class="kt">Instruction</span> <span class="n">a</span>
<span class="n">decrement</span> <span class="o">=</span> <span class="kt">Running</span> <span class="o">.</span> <span class="kt">Decrement</span>
<span class="n">terminate</span> <span class="o">::</span> <span class="kt">Program</span> <span class="kt">Instruction</span> <span class="n">a</span>
<span class="n">terminate</span> <span class="o">=</span> <span class="kt">Running</span> <span class="kt">Terminate</span>
<span class="n">sigkill</span> <span class="o">::</span> <span class="kt">Program</span> <span class="n">f</span> <span class="kt">Int</span>
<span class="n">sigkill</span> <span class="o">=</span> <span class="kt">Terminated</span> <span class="mi">1</span>
</code></pre></div></div>
<p>So note that ‘sigkill’ is independent of whatever instruction set we’re working
with. We can thus write another simple program like before, except this time
have ‘sigkill’ terminate it:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">program</span> <span class="o">::</span> <span class="kt">Program</span> <span class="kt">Instruction</span> <span class="kt">Int</span>
<span class="n">program</span> <span class="o">=</span>
    <span class="n">increment</span>
  <span class="o">.</span> <span class="n">increment</span>
  <span class="o">.</span> <span class="n">decrement</span>
  <span class="o">$</span> <span class="n">sigkill</span>
</code></pre></div></div>
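<p>To make the two ways of stopping visible, here is a minimal sketch of an interpreter for this richer ‘Program’ type. The function ‘run’ and its result type are assumptions made for illustration: it returns the final counter together with ‘Just’ the wrapped value if the program stopped via ‘Terminated’, or ‘Nothing’ if the instruction set itself terminated:</p>

```haskell
data Program f a =
    Running (f (Program f a))
  | Terminated a

data Instruction r =
    Increment r
  | Decrement r
  | Terminate

increment :: Program Instruction a -> Program Instruction a
increment = Running . Increment

decrement :: Program Instruction a -> Program Instruction a
decrement = Running . Decrement

terminate :: Program Instruction a
terminate = Running Terminate

sigkill :: Program f Int
sigkill = Terminated 1

program :: Program Instruction Int
program =
    increment
  . increment
  . decrement
  $ sigkill

-- Hypothetical interpreter: thread a counter through the instructions and
-- record which of the two exits was taken.
run :: Int -> Program Instruction a -> (Int, Maybe a)
run n (Terminated a)  = (n, Just a)
run n (Running instr) = case instr of
  Increment rest -> run (n + 1) rest
  Decrement rest -> run (n - 1) rest
  Terminate      -> (n, Nothing)

-- Stops via 'sigkill', so the exit value is reported.
killed :: (Int, Maybe Int)
killed = run 0 program

-- Stops via the instruction set, so there is no exit value.
finished :: (Int, Maybe Int)
finished = run 0 (increment terminate)
```

<p>The ‘Terminated’ case in ‘run’ never looks at an instruction, which mirrors the point that ‘sigkill’ works for any choice of ‘f’.</p>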
<h3 id="richness-of-free">Richness of ‘Free’</h3>
<p>Try to define the equivalent versions of ‘fix’ and ‘unfix’ for ‘Free’. The
equivalent to ‘fix’ is easy:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">free</span> <span class="o">::</span> <span class="n">f</span> <span class="p">(</span><span class="kt">Free</span> <span class="n">f</span> <span class="n">a</span><span class="p">)</span> <span class="o">-></span> <span class="kt">Free</span> <span class="n">f</span> <span class="n">a</span>
<span class="n">free</span> <span class="o">=</span> <span class="kt">Free</span>
</code></pre></div></div>
<p>You’ll hit a wall, though, if you want to implement the (total) analogue to
‘unfix’. One wants a function of type ‘Free f a -> f (Free f a)’, but the
existence of the ‘Pure’ branch makes this impossible to implement totally. In
general there is not going to be an ‘f’ to pluck out:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">unfree</span> <span class="o">::</span> <span class="kt">Free</span> <span class="n">f</span> <span class="n">a</span> <span class="o">-></span> <span class="n">f</span> <span class="p">(</span><span class="kt">Free</span> <span class="n">f</span> <span class="n">a</span><span class="p">)</span>
<span class="n">unfree</span> <span class="p">(</span><span class="kt">Free</span> <span class="n">f</span><span class="p">)</span> <span class="o">=</span> <span class="n">f</span>
<span class="n">unfree</span> <span class="p">(</span><span class="kt">Pure</span> <span class="n">a</span><span class="p">)</span> <span class="o">=</span> <span class="n">error</span> <span class="s">"kaboom"</span>
</code></pre></div></div>
<p>The recursion provided by ‘Free’ is thus a little less rich than that provided
by ‘Fix’. With ‘Fix’ one can <em>always</em> project a value out of its recursive
structure - but that’s not the case with ‘Free’.</p>
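<p>If you want something total, one option is to widen the return type so that the
‘Pure’ case is reported rather than hidden. A minimal sketch, with ‘Free’
restated so that the block stands alone and ‘unfreeEither’ being a hypothetical
name:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data Free f a = Free (f (Free f a)) | Pure a

-- Total: either there is an 'f' layer to pluck out, or we hit a 'Pure' value.
unfreeEither :: Free f a -> Either a (f (Free f a))
unfreeEither (Free f) = Right f
unfreeEither (Pure a) = Left a
</code></pre></div></div>
<p>This is no longer a true analogue of ‘unfix’, of course: the caller is forced to
deal with the ‘Left’ case.</p>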
<p>It’s well-known that ‘Free’ is monadic, and indeed it’s usually called the
‘<a href="http://www.haskellforall.com/2012/06/you-could-have-invented-free-monads.html">free
monad</a>’.
The name ‘free’ comes from an algebraic definition; roughly, a free ‘foo’
is a ‘foo’ that satisfies the minimum possible constraints to make it a ‘foo’,
and nothing else. Check out the
<a href="https://drive.google.com/file/d/0B51SFgxqMDS-NDBOX0ZDdW52dEE/edit">slides</a>
from Dan Piponi’s excellent talk at BayHac a few years back for a deeper dive
on algebraic freeness.</p>
<h2 id="cofree">Cofree</h2>
<p>‘Cofree’ is also like ‘Fix’, but again with some extra structure. It can be
defined as follows:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">Cofree</span> <span class="n">f</span> <span class="n">a</span> <span class="o">=</span> <span class="kt">Cofree</span> <span class="n">a</span> <span class="p">(</span><span class="n">f</span> <span class="p">(</span><span class="kt">Cofree</span> <span class="n">f</span> <span class="n">a</span><span class="p">))</span>
  <span class="kr">deriving</span> <span class="kt">Functor</span>
<span class="kr">deriving</span> <span class="kr">instance</span> <span class="p">(</span><span class="kt">Show</span> <span class="n">a</span><span class="p">,</span> <span class="kt">Show</span> <span class="p">(</span><span class="n">f</span> <span class="p">(</span><span class="kt">Cofree</span> <span class="n">f</span> <span class="n">a</span><span class="p">)))</span> <span class="o">=></span> <span class="kt">Show</span> <span class="p">(</span><span class="kt">Cofree</span> <span class="n">f</span> <span class="n">a</span><span class="p">)</span>
</code></pre></div></div>
<p>Again, part of the definition - the second field of the ‘Cofree’ constructor -
looks just like ‘Fix’. So predictably we can do a redux-redux of the natural
numbers using ‘Cofree’:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">type</span> <span class="kt">NatCofree</span> <span class="o">=</span> <span class="kt">Cofree</span> <span class="kt">NatF</span>
<span class="n">oneCofree</span> <span class="o">::</span> <span class="kt">NatCofree</span> <span class="nb">()</span>
<span class="n">oneCofree</span> <span class="o">=</span> <span class="kt">Cofree</span> <span class="nb">()</span> <span class="kt">OneF</span>
<span class="n">succCofree</span> <span class="o">::</span> <span class="kt">NatCofree</span> <span class="nb">()</span> <span class="o">-></span> <span class="kt">NatCofree</span> <span class="nb">()</span>
<span class="n">succCofree</span> <span class="n">f</span> <span class="o">=</span> <span class="kt">Cofree</span> <span class="nb">()</span> <span class="p">(</span><span class="kt">SuccF</span> <span class="n">f</span><span class="p">)</span>
</code></pre></div></div>
<p>(Note that here I’ve again used unit to fill in the first field - you could
of course choose whatever you’d like.)</p>
<p>This looks a lot like ‘Free’, and in fact it’s the <em>categorical dual</em> of
‘Free’. Whereas ‘Free’ is a sum type with two <em>branches</em>, ‘Cofree’ is a
product type with two <em>fields</em>. In the case of ‘Free’, we could have a program
that either runs an instruction from a set ‘f’, <em>or</em> terminates with a value
having type ‘a’. In the case of ‘Cofree’, we have a program that runs an
instruction from a set ‘f’ <em>and</em> returns a value of type ‘a’.</p>
<p>A ‘Free’ value thus contains at most one recursive point wrapping the value
with type ‘a’, while a ‘Cofree’ value contains potentially infinite recursive
points - each one of which is tagged with a value of type ‘a’.</p>
<p>Rolling with the ‘Program’ analogy, we could have written this alternate
definition for ‘Cofree’:</p>
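<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">Program</span> <span class="n">f</span> <span class="n">a</span> <span class="o">=</span> <span class="kt">Program</span> <span class="p">{</span>
    <span class="n">annotation</span> <span class="o">::</span> <span class="n">a</span>
  <span class="p">,</span> <span class="n">running</span> <span class="o">::</span> <span class="n">f</span> <span class="p">(</span><span class="kt">Program</span> <span class="n">f</span> <span class="n">a</span><span class="p">)</span>
  <span class="p">}</span>
<span class="kr">deriving</span> <span class="kr">instance</span> <span class="p">(</span><span class="kt">Show</span> <span class="n">a</span><span class="p">,</span> <span class="kt">Show</span> <span class="p">(</span><span class="n">f</span> <span class="p">(</span><span class="kt">Program</span> <span class="n">f</span> <span class="n">a</span><span class="p">)))</span> <span class="o">=></span> <span class="kt">Show</span> <span class="p">(</span><span class="kt">Program</span> <span class="n">f</span> <span class="n">a</span><span class="p">)</span>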
</code></pre></div></div>
<p>A ‘Cofree’ value is thus a program in which every instruction is annotated with
a value of type ‘a’. This means that, unlike ‘Free’, it can’t have its own
opinion on termination. Like ‘Fix’, it has to let ‘f’ decide how to do that.</p>
<p>We’ll use the stack-based instruction set example to wrap up. Here we can
annotate instructions with progress about how many instructions remain to
execute. First our new embedded language terms:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">increment</span> <span class="o">::</span> <span class="kt">Program</span> <span class="kt">Instruction</span> <span class="kt">Int</span> <span class="o">-></span> <span class="kt">Program</span> <span class="kt">Instruction</span> <span class="kt">Int</span>
<span class="n">increment</span> <span class="n">p</span> <span class="o">=</span> <span class="kt">Program</span> <span class="p">(</span><span class="n">remaining</span> <span class="n">p</span><span class="p">)</span> <span class="p">(</span><span class="kt">Increment</span> <span class="n">p</span><span class="p">)</span>
<span class="n">decrement</span> <span class="o">::</span> <span class="kt">Program</span> <span class="kt">Instruction</span> <span class="kt">Int</span> <span class="o">-></span> <span class="kt">Program</span> <span class="kt">Instruction</span> <span class="kt">Int</span>
<span class="n">decrement</span> <span class="n">p</span> <span class="o">=</span> <span class="kt">Program</span> <span class="p">(</span><span class="n">remaining</span> <span class="n">p</span><span class="p">)</span> <span class="p">(</span><span class="kt">Decrement</span> <span class="n">p</span><span class="p">)</span>
<span class="n">terminate</span> <span class="o">::</span> <span class="kt">Program</span> <span class="kt">Instruction</span> <span class="kt">Int</span>
<span class="n">terminate</span> <span class="o">=</span> <span class="kt">Program</span> <span class="mi">0</span> <span class="kt">Terminate</span>
</code></pre></div></div>
<p>Notice that two of these terms use a helper function ‘remaining’ that counts
the number of instructions left in the program. It’s defined as follows:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">remaining</span> <span class="o">::</span> <span class="kt">Program</span> <span class="kt">Instruction</span> <span class="kt">Int</span> <span class="o">-></span> <span class="kt">Int</span>
<span class="n">remaining</span> <span class="o">=</span> <span class="n">loop</span> <span class="kr">where</span>
  <span class="n">loop</span> <span class="p">(</span><span class="kt">Program</span> <span class="n">a</span> <span class="n">f</span><span class="p">)</span> <span class="o">=</span> <span class="kr">case</span> <span class="n">f</span> <span class="kr">of</span>
    <span class="kt">Increment</span> <span class="n">p</span> <span class="o">-></span> <span class="n">succ</span> <span class="p">(</span><span class="n">loop</span> <span class="n">p</span><span class="p">)</span>
    <span class="kt">Decrement</span> <span class="n">p</span> <span class="o">-></span> <span class="n">succ</span> <span class="p">(</span><span class="n">loop</span> <span class="n">p</span><span class="p">)</span>
    <span class="kt">Terminate</span> <span class="o">-></span> <span class="n">succ</span> <span class="n">a</span>
</code></pre></div></div>
<p>And we can write our toy program like so:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">program</span> <span class="o">::</span> <span class="kt">Program</span> <span class="kt">Instruction</span> <span class="kt">Int</span>
<span class="n">program</span> <span class="o">=</span>
    <span class="n">increment</span>
  <span class="o">.</span> <span class="n">increment</span>
  <span class="o">.</span> <span class="n">decrement</span>
  <span class="o">$</span> <span class="n">terminate</span>
</code></pre></div></div>
<p>Evaluate it in GHCi to see what the resulting value looks like.</p>
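<p>If you’d rather not read the nested output by eye, a small helper can pull out
just the annotations. The sketch below restates the definitions above so that it
stands alone, assumes ‘Instruction’ has the three constructors used throughout,
and adds a hypothetical ‘annotations’ function:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Assumed shape of the instruction set used throughout.
data Instruction r = Increment r | Decrement r | Terminate

data Program f a = Program {
    annotation :: a
  , running    :: f (Program f a)
  }

-- Counts the instructions left in a program, as defined above.
remaining :: Program Instruction Int -> Int
remaining (Program a f) = case f of
  Increment p -> succ (remaining p)
  Decrement p -> succ (remaining p)
  Terminate   -> succ a

increment, decrement :: Program Instruction Int -> Program Instruction Int
increment p = Program (remaining p) (Increment p)
decrement p = Program (remaining p) (Decrement p)

terminate :: Program Instruction Int
terminate = Program 0 Terminate

program :: Program Instruction Int
program = increment . increment . decrement $ terminate

-- Hypothetical helper: collect annotations, outermost instruction first.
annotations :: Program Instruction a -> [a]
annotations (Program a f) = a : case f of
  Increment p -> annotations p
  Decrement p -> annotations p
  Terminate   -> []
</code></pre></div></div>
<p>Here ‘annotations program’ gives ‘[3,2,1,0]’: each instruction is tagged with
the number of instructions that follow it.</p>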
<h3 id="richness-of-cofree">Richness of ‘Cofree’</h3>
<p>If you try and implement the ‘fix’ and ‘unfix’ analogues for ‘Cofree’ you’ll
rapidly infer that we have the opposite situation to ‘Free’ here. Implementing
the ‘unfix’ analogue is easy:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">uncofree</span> <span class="o">::</span> <span class="kt">Cofree</span> <span class="n">f</span> <span class="n">a</span> <span class="o">-></span> <span class="n">f</span> <span class="p">(</span><span class="kt">Cofree</span> <span class="n">f</span> <span class="n">a</span><span class="p">)</span>
<span class="n">uncofree</span> <span class="p">(</span><span class="kt">Cofree</span> <span class="kr">_</span> <span class="n">f</span><span class="p">)</span> <span class="o">=</span> <span class="n">f</span>
</code></pre></div></div>
<p>But implementing a total function corresponding to ‘fix’ is impossible - we
can’t just come up with something of arbitrary type ‘a’ to tag an instruction
‘f’ with, so, like before, we can’t do any better than define something
partially:</p>
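<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cofree</span> <span class="o">::</span> <span class="n">f</span> <span class="p">(</span><span class="kt">Cofree</span> <span class="n">f</span> <span class="n">a</span><span class="p">)</span> <span class="o">-></span> <span class="kt">Cofree</span> <span class="n">f</span> <span class="n">a</span>
<span class="n">cofree</span> <span class="n">f</span> <span class="o">=</span> <span class="kt">Cofree</span> <span class="p">(</span><span class="n">error</span> <span class="s">"kaboom"</span><span class="p">)</span> <span class="n">f</span>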
</code></pre></div></div>
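<p>The total alternatives all involve getting the annotation from somewhere. A
minimal sketch of two options, with ‘Cofree’ restated so that the block stands
alone; ‘cofreeWith’, ‘cofreeMempty’, and ‘annotationOf’ are hypothetical names:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>data Cofree f a = Cofree a (f (Cofree f a))

-- Total: the caller supplies the annotation explicitly.
cofreeWith :: a -> f (Cofree f a) -> Cofree f a
cofreeWith = Cofree

-- Total whenever the annotation type has a sensible default, e.g. a Monoid.
cofreeMempty :: Monoid a => f (Cofree f a) -> Cofree f a
cofreeMempty = Cofree mempty

-- Read back the annotation at the outermost layer.
annotationOf :: Cofree f a -> a
annotationOf (Cofree a _) = a
</code></pre></div></div>
<p>Neither has the type of ‘fix’, though: the first takes an extra argument and
the second constrains ‘a’.</p>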
<p>Just as how ‘Free’ forms a monad, ‘Cofree’ forms a comonad. It’s thus known as
the ‘cofree comonad’, though I can’t claim to really have any idea what the
algebraic notion of ‘cofreeness’ captures, exactly.</p>
<h2 id="wrapping-up">Wrapping Up</h2>
<p>So: ‘Fix’, ‘Free’, and ‘Cofree’ all share a similar sort of recursive structure
that makes them useful for encoding programs, given some instruction set. And
while their definitions are similar, ‘Fix’ supports the richest recursion of
the three in some sense - it can both ‘embed’ things into <em>and</em> ‘project’
things out of its recursive structure, while ‘Free’ supports only embedding and
‘Cofree’ supports only projecting.</p>
<p>This has a practical implication: it means one can’t make use of certain
recursion schemes for ‘Free’ and ‘Cofree’ in the same way that one can for
‘Fix’. There do exist analogues, but they’re sort of out-of-scope for this
post.</p>
<p>I haven’t actually mentioned any truly practical uses of ‘Free’ and ‘Cofree’
here, but they’re wonderful things to keep in your toolkit if you’re doing any
work with embedded languages, and I’ll likely write more about them in the
future. In the meantime, Dave Laing wrote an excellent <a href="http://dlaing.org/cofun/posts/free_and_cofree.html">series of
posts</a> on ‘Free’ and
‘Cofree’ that are more than worth reading. They go into much more interesting
detail than I’ve done here - in particular he details a nice pairing that
exists between ‘Free’ and ‘Cofree’ (also
<a href="http://blog.sigfpe.com/2014/05/cofree-meets-free.html">discussed</a> by Dan
Piponi), plus a whack of examples.</p>
<p>You can also find industrial-strength infrastructure for both ‘Free’ and
‘Cofree’ in Edward Kmett’s excellent
<a href="https://hackage.haskell.org/package/free">free</a> library, and for ‘Fix’ in
<a href="https://hackage.haskell.org/package/recursion-schemes">recursion-schemes</a>.</p>
<p>I’ve dumped the code for this article into a few gists.
<a href="https://gist.github.com/jtobin/c95efd75bd8b894d10c0">Here’s</a> one of everything
excluding the running ‘Program’ examples, and here are the corresponding
‘Program’ examples for the
<a href="https://gist.github.com/jtobin/0b173dae5bdc46cf3fa3">Fix</a>,
<a href="https://gist.github.com/jtobin/0a609ae3f4704fafc611">Free</a>, and
<a href="https://gist.github.com/jtobin/ba992310771bd499e457">Cofree</a> cases
respectively.</p>
<p>Thanks to Fredrik Olsen for review and great feedback.</p>
Sorting with Style2015-12-02T00:00:00+04:00https://jtobin.io/sorting-with-style<p><a href="https://en.wikipedia.org/wiki/Merge_sort">Merge sort</a> is a famous
comparison-based sorting algorithm that starts by first recursively dividing a
collection of orderable elements into smaller subcollections, and then finishes
by recursively sorting and merging the smaller subcollections together to
reconstruct the (now sorted) original.</p>
<p>A clear implementation of mergesort should by definition be as faithful to that
high-level description as possible. We can get pretty close to that using the
whole <a href="/practical-recursion-schemes">recursion schemes</a>
business that I’ve talked about in the past. Near the end of that article I
briefly mentioned the idea of implementing mergesort via a
<a href="https://en.wikipedia.org/wiki/Hylomorphism_(computer_science)">hylomorphism</a>,
and here I just want to elaborate on that a little.</p>
<p>Start with a collection of orderable elements. We can divide the collection
into a bunch of smaller collections by using a binary tree:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">{-# LANGUAGE DeriveFunctor #-}</span>
<span class="kr">import</span> <span class="nn">Data.Functor.Foldable</span> <span class="p">(</span><span class="nf">hylo</span><span class="p">)</span>
<span class="kr">import</span> <span class="nn">Data.List.Ordered</span> <span class="p">(</span><span class="nf">merge</span><span class="p">)</span>
<span class="kr">data</span> <span class="kt">Tree</span> <span class="n">a</span> <span class="n">r</span> <span class="o">=</span>
    <span class="kt">Empty</span>
  <span class="o">|</span> <span class="kt">Leaf</span> <span class="n">a</span>
  <span class="o">|</span> <span class="kt">Node</span> <span class="n">r</span> <span class="n">r</span>
  <span class="kr">deriving</span> <span class="kt">Functor</span>
</code></pre></div></div>
<p>The idea is that each node in the tree holds two subtrees, each of which
contains half of the remaining elements. We can build a tree like this from a
collection - say, a basic Haskell list. The following <code class="language-plaintext highlighter-rouge">unfolder</code> function
defines what part of a tree to build for any corresponding part of a list:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">unfolder</span> <span class="kt">[]</span> <span class="o">=</span> <span class="kt">Empty</span>
<span class="n">unfolder</span> <span class="p">[</span><span class="n">x</span><span class="p">]</span> <span class="o">=</span> <span class="kt">Leaf</span> <span class="n">x</span>
<span class="n">unfolder</span> <span class="n">xs</span> <span class="o">=</span> <span class="kt">Node</span> <span class="n">l</span> <span class="n">r</span> <span class="kr">where</span>
  <span class="p">(</span><span class="n">l</span><span class="p">,</span> <span class="n">r</span><span class="p">)</span> <span class="o">=</span> <span class="n">splitAt</span> <span class="p">(</span><span class="n">length</span> <span class="n">xs</span> <span class="p">`</span><span class="n">div</span><span class="p">`</span> <span class="mi">2</span><span class="p">)</span> <span class="n">xs</span>
</code></pre></div></div>
<p>On the other hand, we can also collapse an existing tree back into a list. The
following <code class="language-plaintext highlighter-rouge">folder</code> function defines how to collapse any given part of a tree
into the corresponding part of a list; again we just pattern match on whatever
part of the tree we’re looking at, and construct the complementary list:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">folder</span> <span class="kt">Empty</span> <span class="o">=</span> <span class="kt">[]</span>
<span class="n">folder</span> <span class="p">(</span><span class="kt">Leaf</span> <span class="n">x</span><span class="p">)</span> <span class="o">=</span> <span class="p">[</span><span class="n">x</span><span class="p">]</span>
<span class="n">folder</span> <span class="p">(</span><span class="kt">Node</span> <span class="n">l</span> <span class="n">r</span><span class="p">)</span> <span class="o">=</span> <span class="n">merge</span> <span class="n">l</span> <span class="n">r</span>
</code></pre></div></div>
<p>Now to sort a list we can just glue these instructions together using
a hylomorphism:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">mergesort</span> <span class="o">::</span> <span class="kt">Ord</span> <span class="n">a</span> <span class="o">=></span> <span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">-></span> <span class="p">[</span><span class="n">a</span><span class="p">]</span>
<span class="n">mergesort</span> <span class="o">=</span> <span class="n">hylo</span> <span class="n">folder</span> <span class="n">unfolder</span>
</code></pre></div></div>
<p>And it works just like you’d expect:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> mergesort [1,10,3,4,5]
[1,3,4,5,10]
> mergesort "aloha"
"aahlo"
> mergesort [True, False, False, True, False]
[False,False,False,True,True]
</code></pre></div></div>
<p>Pretty concise!</p>
<p>The code is eminently clean and faithful to the high-level algorithm
description: first recursively divide a collection into smaller subcollections
(via a binary tree and <code class="language-plaintext highlighter-rouge">unfolder</code>), and then recursively sort and merge the
subcollections to reconstruct the (now sorted) original one (via <code class="language-plaintext highlighter-rouge">folder</code>).</p>
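<p>If you don’t have <em>recursion-schemes</em> and <em>data-ordlist</em> handy, the whole thing
can be sketched using only ‘base’. The <code class="language-plaintext highlighter-rouge">hylo</code> and <code class="language-plaintext highlighter-rouge">merge</code> below are hand-rolled
stand-ins written for this example, not the library definitions, and the type
signatures on <code class="language-plaintext highlighter-rouge">unfolder</code> and <code class="language-plaintext highlighter-rouge">folder</code> are my own additions:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}

data Tree a r =
    Empty
  | Leaf a
  | Node r r
  deriving Functor

-- Hand-rolled hylomorphism: unfold one layer, recurse into it, fold it back up.
hylo :: Functor f => (f b -> b) -> (a -> f a) -> a -> b
hylo alg coalg = alg . fmap (hylo alg coalg) . coalg

-- Hand-rolled merge of two already-sorted lists.
merge :: Ord a => [a] -> [a] -> [a]
merge [] ys = ys
merge xs [] = xs
merge (x : xs) (y : ys)
  | x > y     = y : merge (x : xs) ys
  | otherwise = x : merge xs (y : ys)

-- Divide: split a list into halves, one tree layer at a time.
unfolder :: [a] -> Tree a [a]
unfolder []  = Empty
unfolder [x] = Leaf x
unfolder xs  = Node l r where
  (l, r) = splitAt (length xs `div` 2) xs

-- Conquer: merge the sorted halves, one tree layer at a time.
folder :: Ord a => Tree a [a] -> [a]
folder Empty      = []
folder (Leaf x)   = [x]
folder (Node l r) = merge l r

mergesort :: Ord a => [a] -> [a]
mergesort = hylo folder unfolder
</code></pre></div></div>
<p>Note that the intermediate tree is never built as a whole here: <code class="language-plaintext highlighter-rouge">hylo</code> unfolds a
layer and folds it back up as it goes.</p>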
<p>A version of this post originally appeared on the <a href="https://blog.fugue.co/">Fugue
blog</a>.</p>
Markov Chains à la Carte2015-10-14T00:00:00+04:00https://jtobin.io/markov-chains-a-la-carte<p>I’ve released a number of libraries for doing Markov Chain Monte Carlo (MCMC)
in Haskell.</p>
<p>You can get at them via a ‘frontend’ library,
<a href="https://hackage.haskell.org/package/declarative">declarative</a>, but each can
also be used fruitfully on its own. À la carte, if you will.</p>
<p>Some background: MCMC is a family of stateful algorithms for sampling from a
large class of probability distributions. Typically one is interested in doing
this to approximate difficult integrals; instead of choosing some suitable grid
of points in parameter space over which to approximate an integral, just
offload the problem to probability theory and use a Markov chain to find them
for you.</p>
<p>For an excellent introduction to MCMC you won’t find better than <a href="http://videolectures.net/mlss09uk_murray_mcmc/">Iain Murray’s
lectures</a> from MLSS ’09 in
Cambridge, so check those out if you’re interested in more details.</p>
<p>I’ve put together a handful of popular MCMC algorithms as well as an easy way
to glue them together in a couple of useful ways. At present these
implementations are useful in cases where you can write your target function in
closed form, and that’s pretty much all that’s required (aside from the
standard algorithm-specific tuning parameters).</p>
<p>The API should be pretty easy to work with — write your target as a function of
its parameters, specify a start location, and away you go. It’s also cool if
your target accepts its parameters via most common traversable
functors — lists, vectors, sequences, maps, etc.</p>
<p>That’s sort of the goal of this first release: if you can give me a target
function, I’ll do my best to give you samples from it. Less is more and all
that.</p>
<h2 id="whats-in-the-box">What’s In The Box</h2>
<p>There are a number of libraries involved. I have a few more in the queue and
there are a number of additional features I plan to support for these ones in
particular, but without further ado:</p>
<ul>
<li><a href="https://hackage.haskell.org/package/mwc-probability">mwc-probability</a>, a
sampling-function based probability monad implemented as a thin wrapper over
the excellent <a href="https://hackage.haskell.org/package/mwc-random">mwc-random</a>
library.</li>
<li><a href="https://hackage.haskell.org/package/mcmc-types">mcmc-types</a>, housing a
number of types used by the whole family.</li>
<li><a href="https://hackage.haskell.org/package/mighty-metropolis">mighty-metropolis</a>,
an implementation of the famous Metropolis algorithm.</li>
<li><a href="https://hackage.haskell.org/package/speedy-slice">speedy-slice</a>, a slice
sampling implementation suitable for both continuous & discrete parameter
spaces.</li>
<li><a href="https://hackage.haskell.org/package/hasty-hamiltonian">hasty-hamiltonian</a>,
an implementation of the gradient-based Hamiltonian Monte Carlo algorithm.</li>
<li><a href="https://hackage.haskell.org/package/declarative">declarative</a>, the one ring
to rule them all.</li>
</ul>
<p>Pull down <em>declarative</em> if you just want to have access to all of them. If
you’re a Haskell neophyte you can find installation instructions at the <a href="https://github.com/jtobin/declarative">GitHub
repo</a>.</p>
<h2 id="motivation">Motivation</h2>
<p>MCMC is fundamentally about observing <em>Markov chains</em> over probability spaces.
In this context a chain is a stochastic process that wanders around a state
space, eventually visiting regions of the space in proportion to their
probability.</p>
<p>Markov chains are constructed by <em>transition operators</em> that obey the Markov
property: that the probability of transitioning to the next
location — conditional on the history of the chain — depends only on the
current location. For MCMC we’re also interested in operators that satisfy the
<em>reversibility</em> property: that, once the chain has reached its stationary
distribution, a transition from state A to state B is just as probable as a
transition from state B to state A.
A chain is characterized by a transition operator T that drives it from state
to state, and for MCMC we want the stationary or limiting distribution of the
chain to be the distribution we’re sampling from.</p>
<p>One of the major cottage industries in Bayesian research is inventing new
transition operators to drive the Markov chains used in MCMC. This has been
fruitful, but it could likely be aided by a practical way to make existing
transition operators work together.</p>
<p>This is easy to do in theory: there are a couple of ways to combine transition
operators such that the resulting composite operator preserves all the
properties we’re interested in for MCMC — the stationary distribution,
reversibility, and Markov property. See <a href="http://www.stat.umn.edu/geyer/f05/8931/n1998.pdf">Geyer,
2005</a> for details here, but
the crux is that we can establish the following simple grammar for transition
operators:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>transition ::= primitive <transition>
             | concat transition transition
             | sample transition transition
</code></pre></div></div>
<p>A transition is either some primitive operator, a deterministic concatenation
of operators (via ‘concat’), or a probabilistic concatenation of operators (via
‘sample’). A deterministic concatenation works by just transitioning through
two operators one after the other; a probabilistic concatenation works by
randomly choosing one transition operator or the other to use on any given
transition. These kinds of concatenation preserve all the properties we’re
interested in.</p>
<p>We can trivially generalize this further by adding a term that concatenates n
transition operators together deterministically, or another for
probabilistically concatenating a bunch of operators according to some desired
probability distribution.</p>
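<p>To make the grammar concrete, here’s a minimal sketch of it as a data type with
a toy interpreter. Everything in it is an assumption made for illustration: none
of the names are taken from <em>declarative</em>, the primitives are plain
deterministic functions rather than real stochastic operators, and the
randomness is stood in for by an explicit list of coin flips:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- The grammar as a data type; 'p' is whatever a primitive operator is.
data Transition p =
    Primitive p
  | Concat (Transition p) (Transition p)
  | Sample (Transition p) (Transition p)

-- Toy interpreter over states 's', consuming coin flips from a supplied list.
step :: Transition (s -> s) -> ([Bool], s) -> ([Bool], s)
step (Primitive f)  (coins, s) = (coins, f s)
-- Deterministic concatenation: run one operator, then the other.
step (Concat t0 t1) acc        = step t1 (step t0 acc)
-- Probabilistic concatenation: a coin flip picks which operator to run.
step (Sample t0 t1) (coins, s) = case coins of
  []       -> (coins, s)  -- out of coin flips: leave the state where it is
  c : rest -> step (if c then t0 else t1) (rest, s)

-- The n-ary generalization of 'Concat': run every operator, left to right.
concatAll :: Transition p -> [Transition p] -> Transition p
concatAll = foldl Concat
</code></pre></div></div>
<p>For example, stepping ‘Concat (Primitive (+ 1)) (Sample (Primitive (* 2))
(Primitive (subtract 3)))’ from the state 1 gives 4 if the coin comes up ‘True’
and -1 otherwise.</p>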
<p>The idea here is that there are tradeoffs involved in different transition
operators. Some may be more computationally expensive than others (perhaps
requiring a gradient evaluation, or evaluation of some inner loop) but have
better ability to make ‘good’ transitions in certain situations. Other
operators are cheap, but can be inefficient (taking a long time to visit
certain regions of the space).</p>
<p>By employing deterministic or probabilistic concatenation, one can concoct a
Markov chain that uses a varied range of tuning parameters, for example. Or
only occasionally employs a computationally expensive transition, otherwise
preferring some cheaper, reliable operator.</p>
<h2 id="usage">Usage</h2>
<p>The <em>declarative</em> library implements this simple language for transition
operators, and the <em>mighty-metropolis</em>, <em>speedy-slice</em>, and <em>hasty-hamiltonian</em>
libraries provide some primitive transitions that you can combine as needed.</p>
<p>The Metropolis and slice sampling transitions are cheap and require little
information, whereas Hamiltonian Monte Carlo exploits information about the
target’s gradient and also involves evaluation of an inner loop (the length of
which is determined by a tuning parameter). Feel free to use one that suits
your problem, or combine them together using the combinators supplied in
<em>declarative</em> to build a custom solution.</p>
<p>As an example, the <a href="https://en.wikipedia.org/wiki/Rosenbrock_function">Rosenbrock
density</a> is a great test
dummy as it’s simple, low-dimensional, and can be easily visualized, but it
still exhibits a pathological anisotropic structure that makes it somewhat
tricky to sample from.</p>
<p>Getting started via declarative is pretty simple:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">import</span> <span class="nn">Numeric.MCMC</span>
</code></pre></div></div>
<p>You’ll want to supply a target to sample over, and if you want to use an
algorithm like Hamiltonian Monte Carlo you’ll also need to provide a gradient.
If you can’t be bothered to calculate gradients by hand, you can always turn to
your friend <a href="/automasymbolic-differentiation">automatic
differentiation</a>:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">import</span> <span class="nn">Numeric.AD</span>
</code></pre></div></div>
<p>The Rosenbrock log-density and its gradient can then be written as follows:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">target</span> <span class="o">::</span> <span class="kt">Num</span> <span class="n">a</span> <span class="o">=></span> <span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">-></span> <span class="n">a</span>
<span class="n">target</span> <span class="p">[</span><span class="n">x0</span><span class="p">,</span> <span class="n">x1</span><span class="p">]</span> <span class="o">=</span> <span class="n">negate</span> <span class="p">(</span><span class="mi">100</span> <span class="o">*</span> <span class="p">(</span><span class="n">x1</span> <span class="o">-</span> <span class="n">x0</span> <span class="o">^</span> <span class="mi">2</span><span class="p">)</span> <span class="o">^</span> <span class="mi">2</span> <span class="o">+</span> <span class="p">(</span><span class="mi">1</span> <span class="o">-</span> <span class="n">x0</span><span class="p">)</span> <span class="o">^</span> <span class="mi">2</span><span class="p">)</span>
<span class="n">gTarget</span> <span class="o">::</span> <span class="kt">Num</span> <span class="n">a</span> <span class="o">=></span> <span class="p">[</span><span class="n">a</span><span class="p">]</span> <span class="o">-></span> <span class="p">[</span><span class="n">a</span><span class="p">]</span>
<span class="n">gTarget</span> <span class="o">=</span> <span class="n">grad</span> <span class="n">target</span>
</code></pre></div></div>
<p>All you need to do here is provide a function <em>proportional</em> to a
log-probability density. The logarithmic scale is important; various internals
expect to be passed (something proportional to) a log-probability density.</p>
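<p>For comparison, here’s roughly what calculating the gradient by hand looks
like. This is a sketch that needs nothing beyond ‘base’; <code class="language-plaintext highlighter-rouge">target</code> is restated
so that the block stands alone, and <code class="language-plaintext highlighter-rouge">gradByHand</code> is a hypothetical name for the
partial derivatives worked out with the chain rule:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- The Rosenbrock log-density, as above.
target :: Num a => [a] -> a
target [x0, x1] = negate (100 * (x1 - x0 ^ 2) ^ 2 + (1 - x0) ^ 2)
target _        = error "target: expected exactly two parameters"

-- Partial derivatives of 'target' with respect to x0 and x1, by hand.
gradByHand :: Num a => [a] -> [a]
gradByHand [x0, x1] =
  [ 400 * x0 * (x1 - x0 ^ 2) + 2 * (1 - x0)  -- d/dx0
  , negate (200 * (x1 - x0 ^ 2))             -- d/dx1
  ]
gradByHand _ = error "gradByHand: expected exactly two parameters"
</code></pre></div></div>
<p>The gradient vanishes at the mode, so <code class="language-plaintext highlighter-rouge">gradByHand [1, 1]</code> is <code class="language-plaintext highlighter-rouge">[0, 0]</code>, which
makes for a quick sanity check against whatever automatic differentiation
hands back.</p>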
<p>To package these guys up together we can wrap them in a <code class="language-plaintext highlighter-rouge">Target</code>. Note that we
don’t always care about including a gradient, so that part is optional:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">rosenbrock</span> <span class="o">::</span> <span class="kt">Target</span> <span class="p">[</span><span class="kt">Double</span><span class="p">]</span>
<span class="n">rosenbrock</span> <span class="o">=</span> <span class="kt">Target</span> <span class="n">target</span> <span class="p">(</span><span class="kt">Just</span> <span class="n">gTarget</span><span class="p">)</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">Target</code> type is parameterized over the shape of the parameter space. You
could similarly have a <code class="language-plaintext highlighter-rouge">Target (Seq Double)</code>, <code class="language-plaintext highlighter-rouge">Target (Map String Double)</code>, and
so on. Your target may be implemented using a boxed vector for efficiency, for
example. Or using a Map or HashMap with string/text keys such that parameter
names are preserved. They should work just fine.</p>
<p>Given a target, we can sample from it a bunch of times using a simple
Metropolis transition via the <em>mcmc</em> function. Aside from the target and a
PRNG, provide it with the desired number of transitions, a starting point, and
the transition operator to use:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">></span> <span class="c1">-- haskell</span>
<span class="o">></span> <span class="n">prng</span> <span class="o"><-</span> <span class="n">create</span>
<span class="o">></span> <span class="n">mcmc</span> <span class="mi">10000</span> <span class="p">[</span><span class="mi">0</span><span class="p">,</span> <span class="mi">0</span><span class="p">]</span> <span class="p">(</span><span class="n">metropolis</span> <span class="mi">1</span><span class="p">)</span> <span class="n">rosenbrock</span> <span class="n">prng</span>
</code></pre></div></div>
<p>In return you’ll get the desired trace of the chain dumped to stdout:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>8.136972300105949e-2,0.273896953404261
0.4657348148676972,0.17462596647788464
-0.48609414127836326,9.465052854751566e-2
-0.49781488399832785,0.42092910345708523
-0.3019713424699155,0.39135350029173566
0.12058426470979189,0.12485407390388925
..
</code></pre></div></div>
<p>The intent is for the chain to be processed elsewhere — if you’re me, that will
usually be in R. Libraries like
<a href="https://cran.r-project.org/web/packages/coda/coda.pdf">coda</a> have a ton of
functionality useful for working with Markov chain traces, and
<a href="http://ggplot2.org/">ggplot2</a> as a library for static statistical graphics
can’t really be beat:</p>
<div class="language-r highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">></span><span class="w"> </span><span class="c1"># r</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">d</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="n">read.csv</span><span class="p">(</span><span class="s1">'rosenbrock-trace.dat'</span><span class="p">,</span><span class="w"> </span><span class="n">header</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nb">F</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="nf">names</span><span class="p">(</span><span class="n">d</span><span class="p">)</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="nf">c</span><span class="p">(</span><span class="s1">'x'</span><span class="p">,</span><span class="w"> </span><span class="s1">'y'</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">require</span><span class="p">(</span><span class="n">ggplot2</span><span class="p">)</span><span class="w">
</span><span class="o">></span><span class="w"> </span><span class="n">ggplot</span><span class="p">(</span><span class="n">d</span><span class="p">,</span><span class="w"> </span><span class="n">aes</span><span class="p">(</span><span class="n">x</span><span class="p">,</span><span class="w"> </span><span class="n">y</span><span class="p">))</span><span class="w"> </span><span class="o">+</span><span class="w"> </span><span class="n">geom_point</span><span class="p">(</span><span class="n">colour</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="s1">'darkblue'</span><span class="p">,</span><span class="w"> </span><span class="n">alpha</span><span class="w"> </span><span class="o">=</span><span class="w"> </span><span class="m">0.2</span><span class="p">)</span><span class="w">
</span></code></pre></div></div>
<p>You get the following trace over the Rosenbrock density, taken for 10k
iterations. This is using a Metropolis transition with standard deviation 1:</p>
<p><img src="/images/rosenbrock-trace.png" alt="metropolis" /></p>
<p>If you do want to work with chains in memory in Haskell you can do that by
writing your own handling code around the supplied transition operators. I’ll
probably make this a little easier in later versions.</p>
<p>The implementations are reasonably quick and don’t leak memory — the traces are
streamed to stdout as the chains are traversed. Compiling the above with ‘-O2’
and running it for 100k iterations yields the following performance
characteristics on my mid-2011 model MacBook Air:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>$ ./test/Rosenbrock +RTS -s > /dev/null
3,837,201,632 bytes allocated in the heap
8,453,696 bytes copied during GC
89,600 bytes maximum residency (2 sample(s))
23,288 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
INIT time 0.000s ( 0.000s elapsed)
MUT time 3.539s ( 3.598s elapsed)
GC time 0.049s ( 0.058s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 3.591s ( 3.656s elapsed)
%GC time 1.4% (1.6% elapsed)
Alloc rate 1,084,200,280 bytes per MUT second
Productivity 98.6% of total user, 96.8% of total elapsed
</code></pre></div></div>
<p>The beauty is that rather than running a chain solely on something like the
simple Metropolis operator used above, you can sort of ‘hedge your sampling
risk’ and use a composite operator that proposes transitions in a multitude
of ways. Consider this guy, for example:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">transition</span> <span class="o">=</span>
<span class="n">concatT</span>
<span class="p">(</span><span class="n">sampleT</span> <span class="p">(</span><span class="n">metropolis</span> <span class="mf">0.5</span><span class="p">)</span> <span class="p">(</span><span class="n">metropolis</span> <span class="mf">1.0</span><span class="p">))</span>
<span class="p">(</span><span class="n">sampleT</span> <span class="p">(</span><span class="n">slice</span> <span class="mf">2.0</span><span class="p">)</span> <span class="p">(</span><span class="n">slice</span> <span class="mf">3.0</span><span class="p">))</span>
</code></pre></div></div>
<p>Here <code class="language-plaintext highlighter-rouge">concatT</code> and <code class="language-plaintext highlighter-rouge">sampleT</code> correspond to the <code class="language-plaintext highlighter-rouge">concat</code> and <code class="language-plaintext highlighter-rouge">sample</code> terms in
the BNF description in the previous section. This operator performs two
transitions back-to-back: the first is a Metropolis transition with standard
deviation 0.5 or 1, chosen at random, and the second is a slice sampling
transition with a step size of 2 or 3, also chosen at random.</p>
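<p>To make the composition semantics concrete, here’s a toy model of the two
combinators that needs nothing beyond ‘base’. It is only a sketch: transitions
are plain functions that consume coin flips from a list rather than values in a
probability monad, and <code class="language-plaintext highlighter-rouge">Toy</code>, <code class="language-plaintext highlighter-rouge">concatToy</code>, <code class="language-plaintext highlighter-rouge">sampleToy</code>, and <code class="language-plaintext highlighter-rouge">shift</code> are
hypothetical names, not the library’s:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- A toy transition: update a state, consuming coin flips from a supply.
type Toy s = s -> [Bool] -> (s, [Bool])

-- Deterministic concatenation: run the first, then the second.
concatToy :: Toy s -> Toy s -> Toy s
concatToy t0 t1 s coins =
  let (s', coins') = t0 s coins
  in  t1 s' coins'

-- Probabilistic choice: flip a coin, then run exactly one of the two.
sampleToy :: Toy s -> Toy s -> Toy s
sampleToy t0 t1 s (heads : coins) = if heads then t0 s coins else t1 s coins
sampleToy _  _  s []              = (s, [])

-- A primitive that ignores the coins and just shifts the state.
shift :: Int -> Toy Int
shift k s coins = (s + k, coins)

-- Same shape as the composite operator above.
composite :: Toy Int
composite =
  concatToy
    (sampleToy (shift 1) (shift 10))
    (sampleToy (shift 100) (shift 1000))
</code></pre></div></div>
<p>With the coin supply <code class="language-plaintext highlighter-rouge">[True, False]</code>, <code class="language-plaintext highlighter-rouge">fst (composite 0 [True, False])</code> is
<code class="language-plaintext highlighter-rouge">1001</code>: the first choice picks <code class="language-plaintext highlighter-rouge">shift 1</code> and the second picks <code class="language-plaintext highlighter-rouge">shift 1000</code>.</p>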
<p>Running it for 5000 iterations (to keep the total computation approximately
constant), we see a chain that has traversed the space a little better:</p>
<div class="language-plaintext highlighter-rouge"><div class="highlight"><pre class="highlight"><code>> mcmc 5000 [0, 0] transition rosenbrock prng
</code></pre></div></div>
<p><img src="/images/rosenbrock-composite-trace.png" alt="composite" /></p>
<p>It’s worth noting that I didn’t put any work into coming up with this composite
transition: this was just the first example I thought up, and a lot of the
benefits here probably come primarily from including the eminently-reliable
slice sampling transition. But from informal experimentation, chains driven by
composite transitions involving numerous operators and tuning parameter
settings often seem to perform better on average than a given chain driven by
a single (poorly-selected) transition.</p>
<p>I know exactly how meticulous proofs and benchmarks must be so I haven’t
rigorously established any properties around this, but hey: it ‘seems to be the
case’, and intuitively, including varied transition operators surely hedges
your bets when compared to using a single one.</p>
<p>Try it out and see how your mileage varies, and be sure to let me know if you
find some killer apps where composite transitions really seem to win.</p>
<h2 id="implementation-notes">Implementation Notes</h2>
<p>If you’re just interested in using the libraries you can skip the following
section, but I just want to point out how easy this is to implement.</p>
<p>The implementations are defined using a small set of types living in
<em>mcmc-types</em>:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">type</span> <span class="kt">Transition</span> <span class="n">m</span> <span class="n">a</span> <span class="o">=</span> <span class="kt">StateT</span> <span class="n">a</span> <span class="p">(</span><span class="kt">Prob</span> <span class="n">m</span><span class="p">)</span> <span class="nb">()</span>
<span class="kr">data</span> <span class="kt">Chain</span> <span class="n">a</span> <span class="n">b</span> <span class="o">=</span> <span class="kt">Chain</span> <span class="p">{</span>
<span class="n">chainTarget</span> <span class="o">::</span> <span class="kt">Target</span> <span class="n">a</span>
<span class="p">,</span> <span class="n">chainScore</span> <span class="o">::</span> <span class="kt">Double</span>
<span class="p">,</span> <span class="n">chainPosition</span> <span class="o">::</span> <span class="n">a</span>
<span class="p">,</span> <span class="n">chainTunables</span> <span class="o">::</span> <span class="kt">Maybe</span> <span class="n">b</span>
<span class="p">}</span>
<span class="kr">data</span> <span class="kt">Target</span> <span class="n">a</span> <span class="o">=</span> <span class="kt">Target</span> <span class="p">{</span>
<span class="n">lTarget</span> <span class="o">::</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">Double</span>
<span class="p">,</span> <span class="n">glTarget</span> <span class="o">::</span> <span class="kt">Maybe</span> <span class="p">(</span><span class="n">a</span> <span class="o">-></span> <span class="n">a</span><span class="p">)</span>
<span class="p">}</span>
</code></pre></div></div>
<p>Most important here is the <code class="language-plaintext highlighter-rouge">Transition</code> type, which is just a state transformer
over a probability monad (itself defined in mwc-probability). The probability
monad is the source of randomness used to define transition operators useful
for MCMC, and values with type <code class="language-plaintext highlighter-rouge">Transition</code> are the transition operators in
question.</p>
<p>The <code class="language-plaintext highlighter-rouge">Chain</code> type is the state of the Markov chain at any given iteration. All
that’s really required here is the <code class="language-plaintext highlighter-rouge">chainPosition</code> field, which represents the
location of the chain in parameter space. But adding some additional
information here is convenient; <code class="language-plaintext highlighter-rouge">chainScore</code> caches the most recent score of
the chain (which is typically used in internal calculations, and caching avoids
recomputing things needlessly) and <code class="language-plaintext highlighter-rouge">chainTunables</code> is an optional record
intended to be used for stateful tuning parameters (used by adaptive algorithms
or in burn-in phases and the like). Additionally the target being sampled from
itself — <code class="language-plaintext highlighter-rouge">chainTarget</code> — is included in the state.</p>
<p>Undisciplined use of <code class="language-plaintext highlighter-rouge">chainTarget</code> and <code class="language-plaintext highlighter-rouge">chainTunables</code> can have all sorts of
nasty consequences — you can use them to change the stationary distribution
you’re sampling from or invalidate the Markov property — but keeping them
around is useful for implementing some desirable features. Tweaking
<code class="language-plaintext highlighter-rouge">chainTarget</code>, for example, allows one to easily implement annealing, which can
be very useful for sampling from annoying multi-modal densities.</p>
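<p>As a sketch of what tweaking a target for annealing might look like, assume
that annealing means tempering the density, i.e. raising it to the power
<em>1/temp</em>. On the log scale that just divides the score and its gradient by the
temperature. The <code class="language-plaintext highlighter-rouge">anneal</code> function below is hypothetical, and <code class="language-plaintext highlighter-rouge">Target</code> is
redeclared locally so the snippet stands alone:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>-- Local stand-in for the Target record from mcmc-types.
data Target a = Target {
    lTarget  :: a -> Double
  , glTarget :: Maybe (a -> a)
  }

-- Temper a target: raising a density to the power 1/temp divides its
-- log-density (and each component of its gradient) by temp.
anneal :: Functor f => Double -> Target (f Double) -> Target (f Double)
anneal temp (Target l mg) = Target l' mg' where
  l'  = (/ temp) . l
  mg' = fmap (\g -> fmap (/ temp) . g) mg
</code></pre></div></div>
<p>A temperature above 1 flattens the density, which makes it easier for a chain
to hop between modes, and a temperature of exactly 1 leaves the target
unchanged.</p>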
<p>Setting everything up like this makes it trivial to mix-and-match transition
operators as required — the state and probability monad stack provides
everything we need. Deterministic concatenation is implemented as follows, for
example:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">concatT</span> <span class="o">=</span> <span class="p">(</span><span class="o">>></span><span class="p">)</span>
</code></pre></div></div>
<p>and a generalized version of probabilistic concatenation just requires a coin
flip:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">bernoulliT</span> <span class="n">p</span> <span class="n">t0</span> <span class="n">t1</span> <span class="o">=</span> <span class="kr">do</span>
<span class="n">heads</span> <span class="o"><-</span> <span class="n">lift</span> <span class="p">(</span><span class="kt">MWC</span><span class="o">.</span><span class="n">bernoulli</span> <span class="n">p</span><span class="p">)</span>
<span class="kr">if</span> <span class="n">heads</span> <span class="kr">then</span> <span class="n">t0</span> <span class="kr">else</span> <span class="n">t1</span>
</code></pre></div></div>
<p>A uniform probabilistic concatenation over two operators, implemented in
<code class="language-plaintext highlighter-rouge">sampleT</code>, is then just <code class="language-plaintext highlighter-rouge">bernoulliT 0.5</code>.</p>
<p>The difficulty of implementing primitive operators just depends on the operator
itself; the surrounding framework is extremely lightweight. Here’s the
Metropolis transition, for example (with type signatures omitted to keep the
noise down):</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">metropolis</span> <span class="n">radial</span> <span class="o">=</span> <span class="kr">do</span>
<span class="kt">Chain</span> <span class="p">{</span><span class="o">..</span><span class="p">}</span> <span class="o"><-</span> <span class="n">get</span>
<span class="n">proposal</span> <span class="o"><-</span> <span class="n">lift</span> <span class="p">(</span><span class="n">propose</span> <span class="n">radial</span> <span class="n">chainPosition</span><span class="p">)</span>
<span class="kr">let</span> <span class="n">proposalScore</span> <span class="o">=</span> <span class="n">lTarget</span> <span class="n">chainTarget</span> <span class="n">proposal</span>
<span class="n">acceptProb</span> <span class="o">=</span> <span class="n">whenNaN</span> <span class="mi">0</span>
<span class="p">(</span><span class="n">exp</span> <span class="p">(</span><span class="n">min</span> <span class="mi">0</span> <span class="p">(</span><span class="n">proposalScore</span> <span class="o">-</span> <span class="n">chainScore</span><span class="p">)))</span>
<span class="n">accept</span> <span class="o"><-</span> <span class="n">lift</span> <span class="p">(</span><span class="kt">MWC</span><span class="o">.</span><span class="n">bernoulli</span> <span class="n">acceptProb</span><span class="p">)</span>
<span class="n">when</span> <span class="n">accept</span>
<span class="p">(</span><span class="n">put</span> <span class="p">(</span><span class="kt">Chain</span> <span class="n">chainTarget</span> <span class="n">proposalScore</span> <span class="n">proposal</span> <span class="n">chainTunables</span><span class="p">))</span>
<span class="n">propose</span> <span class="n">radial</span> <span class="o">=</span> <span class="n">traverse</span> <span class="n">perturb</span> <span class="kr">where</span>
<span class="n">perturb</span> <span class="n">m</span> <span class="o">=</span> <span class="kt">MWC</span><span class="o">.</span><span class="n">normal</span> <span class="n">m</span> <span class="n">radial</span>
</code></pre></div></div>
<p>And the excellent <a href="https://hackage.haskell.org/package/pipes">pipes</a> library is
used to generate a Markov chain:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">chain</span> <span class="n">radial</span> <span class="o">=</span> <span class="n">loop</span> <span class="kr">where</span>
<span class="n">loop</span> <span class="n">state</span> <span class="n">prng</span> <span class="o">=</span> <span class="kr">do</span>
<span class="n">next</span> <span class="o"><-</span> <span class="n">lift</span>
<span class="p">(</span><span class="kt">MWC</span><span class="o">.</span><span class="n">sample</span> <span class="p">(</span><span class="n">execStateT</span> <span class="p">(</span><span class="n">metropolis</span> <span class="n">radial</span><span class="p">)</span> <span class="n">state</span><span class="p">)</span> <span class="n">prng</span><span class="p">)</span>
<span class="n">yield</span> <span class="n">next</span>
<span class="n">loop</span> <span class="n">next</span> <span class="n">prng</span>
</code></pre></div></div>
<p>The <code class="language-plaintext highlighter-rouge">mcmc</code> functions are also implemented using pipes: they take the first <em>n</em>
iterations of a chain and print them to stdout. It’s that simple.</p>
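<p>As a rough illustration of that ‘take the first <em>n</em> and print’ shape, here’s a
minimal pipes sketch. It assumes the pipes package is installed, and it swaps
the Markov chain for a deterministic counter so that no PRNG is needed;
<code class="language-plaintext highlighter-rouge">counter</code> and <code class="language-plaintext highlighter-rouge">firstN</code> are illustrative names, not the library’s API:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>import Pipes
import qualified Pipes.Prelude as P

-- A stand-in for a Markov chain: an infinite producer of states. The
-- 'transition' here is deterministic, so the example needs no PRNG.
counter :: Monad m => Int -> Producer Int m r
counter n = do
  yield n
  counter (n + 1)

-- Take the first n states and print each one to stdout.
firstN :: Int -> IO ()
firstN n = runEffect (counter 0 >-> P.take n >-> P.print)
</code></pre></div></div>
<p>Because the producer is only ever asked for as many states as the consumer
demands, the infinite loop is harmless and the output is streamed in constant
memory.</p>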
<h2 id="future-work">Future Work</h2>
<p>In the near term I plan on updating some old MCMC implementations I have
kicking around on GitHub (<a href="https://github.com/jtobin/flat-mcmc">flat-mcmc</a>,
<a href="https://github.com/jtobin/lazy-langevin">lazy-langevin</a>,
<a href="https://github.com/jtobin/hnuts">hnuts</a>) and releasing them within this
framework. Additionally I’ve got some code for building annealed operators that
I want to release — it has been useful in some situations when sampling from
things like the <a href="https://en.wikipedia.org/wiki/Himmelblau%27s_function">Himmelblau
density</a>, which has a
few disparate clumps of probability that make it tricky to sample from with
conventional algorithms.</p>
<p>This framework is also useful as an inference backend to languages for working
with directed graphical models (think BUGS/Stan). The idea here is that you
don’t need to specify your target function (typically a posterior density)
explicitly: just describe your model and I’ll give you samples from the
posterior distribution. A similar version has been put to use around the
<a href="http://bayeshive.com/">BayesHive</a> project.</p>
<p>Longer term — I’ll have to see what’s up in terms of demand. There are
performance improvements and straightforward extensions to things like
<a href="https://en.wikipedia.org/wiki/Parallel_tempering">parallel tempering</a>, but I’m
growing more interested in ‘online’ methods like <a href="http://www.stats.ox.ac.uk/~doucet/andrieu_doucet_holenstein_PMCMC.pdf">particle
MCMC</a>
and friends that are proving useful for inference in more general probabilistic
programs (think those expressible by <a href="https://probmods.org/">Church</a> and its
ilk).</p>
<p>Let me know if you get any use out of these things, or please file an issue if
there’s some particular feature you’d like to see supported.</p>
<p>Thanks to Niffe Hermansson for review and helpful comments.</p>
Practical Recursion Schemes2015-09-06T00:00:00+04:00https://jtobin.io/practical-recursion-schemes<p>Recursion schemes are elegant and useful patterns for expressing general
computation. In particular, they allow you to ‘factor recursion out’ of
whatever semantics you may be trying to express when interpreting programs,
keeping your interpreters concise, your concerns separated, and your code more
maintainable.</p>
<p>What’s more, formulating programs in terms of recursion schemes seems to help
suss out particular similarities in structure between what might be seen as
<a href="https://colah.github.io/posts/2015-09-NN-Types-FP/">disparate problems</a> in
other domains. So aside from being a practical computational tool, they seem to
be of some use when it comes to ‘hacking understanding’ in varied areas.</p>
<p>Unfortunately, they come with a pretty forbidding barrier to entry. While there
are a
<a href="http://jozefg.bitbucket.org/posts/2014-05-19-like-recursion-but-cooler.html">few</a>
<a href="https://www.youtube.com/watch?v=Zw9KeP3OzpU">nice</a>
<a href="http://patrickthomson.ghost.io/an-introduction-to-recursion-schemes/">resources</a>
out there for learning about recursion schemes and how they work, most
literature around them is <a href="http://eprints.eemcs.utwente.nl/7281/01/db-utwente-40501F46.pdf">quite
academic</a> and
awash in some astoundingly technical jargon (more on this later). Fortunately,
the accessible resources out there do a great job of explaining what recursion
schemes are and how you might use them, and they go through some effort to
build up the required machinery from scratch.</p>
<p>In this article I want to avoid building up the machinery meticulously and
instead concentrate mostly on understanding and using Edward Kmett’s
<a href="https://hackage.haskell.org/package/recursion-schemes">recursion-schemes</a>
library, which, while lacking in documentation, is very well put together and
implements all the background plumbing one needs to get started.</p>
<p>In particular, to feel comfortable using recursion-schemes I found that there
were a few key patterns worth understanding:</p>
<ul>
<li>Factoring recursion out of your data types using pattern functors and a
fixed-point wrapper.</li>
<li>Using the ‘Foldable’ and ‘Unfoldable’ classes, plus navigating the ‘Base’ type
family.</li>
<li>How to use some of the more common recursion schemes out there for everyday
tasks.</li>
</ul>
<h2 id="the-basics">The Basics</h2>
<p>If you’re following along in GHCi, I’m going to first bring in some
imports and add a useful pragma. I’ll dump a gist at the bottom; note that this
article targets GHC 7.10.2 and recursion-schemes-4.1.2, plus I’ll also require
data-ordlist-0.4.7.0 for an example later. Here’s the requisite boilerplate:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">{-# LANGUAGE DeriveFunctor #-}</span>
<span class="kr">import</span> <span class="nn">Data.Functor.Foldable</span>
<span class="kr">import</span> <span class="nn">Data.List.Ordered</span> <span class="p">(</span><span class="nf">merge</span><span class="p">)</span>
<span class="kr">import</span> <span class="nn">Prelude</span> <span class="k">hiding</span> <span class="p">(</span><span class="kt">Foldable</span><span class="p">,</span> <span class="nf">succ</span><span class="p">)</span>
</code></pre></div></div>
<p>So, let’s get started.</p>
<p>Recursion schemes are applicable to data types that have a suitable recursive
structure. Lists, trees, and natural numbers are illustrative candidates.</p>
<p>Since they’re so dead-simple, let’s take the natural numbers as a toy
example. We can define them recursively as follows:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">Natural</span> <span class="o">=</span>
<span class="kt">Zero</span>
<span class="o">|</span> <span class="kt">Succ</span> <span class="kt">Natural</span>
</code></pre></div></div>
<p>This is a fine definition, but many such recursive structures can also be
defined in a different way: we can first ‘factor out’ the recursion by defining
some base structure, and then ‘add it back in’ by using a recursive wrapper
type.</p>
<p>The price of this abstraction is a slightly more involved type definition, but
it unlocks some nice benefits: namely, the ability to reason about recursion
and base structures separately. This turns out to be a very
useful pattern for getting up and running with recursion schemes.</p>
<p>The trick is to create a different, parameterized type, in which the new
parameter takes the place of all recursive points in the original type. We can
create this kind of base structure for the natural numbers example as follows:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">NatF</span> <span class="n">r</span> <span class="o">=</span>
<span class="kt">ZeroF</span>
<span class="o">|</span> <span class="kt">SuccF</span> <span class="n">r</span>
<span class="kr">deriving</span> <span class="p">(</span><span class="kt">Show</span><span class="p">,</span> <span class="kt">Functor</span><span class="p">)</span>
</code></pre></div></div>
<p>This type must be a functor in this new parameter, so the type is often called
a ‘pattern functor’ for some other type. I like to use the notation
‘&lt;Constructor&gt;F’ when defining constructors for pattern functors.</p>
<p>We can define pattern functors for lists and trees in the same way:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">data</span> <span class="kt">ListF</span> <span class="n">a</span> <span class="n">r</span> <span class="o">=</span>
<span class="kt">NilF</span>
<span class="o">|</span> <span class="kt">ConsF</span> <span class="n">a</span> <span class="n">r</span>
<span class="kr">deriving</span> <span class="p">(</span><span class="kt">Show</span><span class="p">,</span> <span class="kt">Functor</span><span class="p">)</span>
<span class="kr">data</span> <span class="kt">TreeF</span> <span class="n">a</span> <span class="n">r</span> <span class="o">=</span>
<span class="kt">EmptyF</span>
<span class="o">|</span> <span class="kt">LeafF</span> <span class="n">a</span>
<span class="o">|</span> <span class="kt">NodeF</span> <span class="n">r</span> <span class="n">r</span>
<span class="kr">deriving</span> <span class="p">(</span><span class="kt">Show</span><span class="p">,</span> <span class="kt">Functor</span><span class="p">)</span>
</code></pre></div></div>
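<p>To see what ‘a functor in the new parameter’ buys you, note that ‘fmap’ over a
pattern functor only ever touches the recursive slot. A small standalone
sketch (it repeats the ‘ListF’ definition so that it compiles by itself;
‘example’ is just an illustrative name):</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}

data ListF a r =
    NilF
  | ConsF a r
  deriving (Show, Functor)

-- fmap acts on the 'r' slot only; the element in the 'a' slot is untouched.
example :: ListF Char Int
example = fmap (+ 1) (ConsF 'x' 41)
</code></pre></div></div>
<p>Showing ‘example’ in GHCi gives ‘ConsF 'x' 42’: the element is left alone and
only the value standing in for ‘the rest of the list’ is transformed.</p>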
<p>Now, to add recursion to these pattern functors we’re going to use the famous
fixed-point type, ‘Fix’, to wrap them in:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">type</span> <span class="kt">Nat</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="kt">NatF</span>
<span class="kr">type</span> <span class="kt">List</span> <span class="n">a</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="p">(</span><span class="kt">ListF</span> <span class="n">a</span><span class="p">)</span>
<span class="kr">type</span> <span class="kt">Tree</span> <span class="n">a</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="p">(</span><span class="kt">TreeF</span> <span class="n">a</span><span class="p">)</span>
</code></pre></div></div>
<p>‘Fix’ is a standard fixed-point type imported from the recursion-schemes
library. You can get a ton of mileage from it. It introduces the ‘Fix’
constructor everywhere, but that’s actually not much of an issue in practice.
One thing I typically like to do is add some smart constructors to get around
it:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">zero</span> <span class="o">::</span> <span class="kt">Nat</span>
<span class="n">zero</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="kt">ZeroF</span>
<span class="n">succ</span> <span class="o">::</span> <span class="kt">Nat</span> <span class="o">-></span> <span class="kt">Nat</span>
<span class="n">succ</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="o">.</span> <span class="kt">SuccF</span>
<span class="n">nil</span> <span class="o">::</span> <span class="kt">List</span> <span class="n">a</span>
<span class="n">nil</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="kt">NilF</span>
<span class="n">cons</span> <span class="o">::</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">List</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">List</span> <span class="n">a</span>
<span class="n">cons</span> <span class="n">x</span> <span class="n">xs</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="p">(</span><span class="kt">ConsF</span> <span class="n">x</span> <span class="n">xs</span><span class="p">)</span>
</code></pre></div></div>
<p>Then you can write expressions like ‘succ (succ (succ zero))’ without having to
deal with the ‘Fix’ constructor explicitly. Note also that these expressions
are Showable à la carte, for example in GHCi:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">></span> <span class="n">succ</span> <span class="p">(</span><span class="n">succ</span> <span class="p">(</span><span class="n">succ</span> <span class="n">zero</span><span class="p">))</span>
<span class="kt">Fix</span> <span class="p">(</span><span class="kt">SuccF</span> <span class="p">(</span><span class="kt">Fix</span> <span class="p">(</span><span class="kt">SuccF</span> <span class="p">(</span><span class="kt">Fix</span> <span class="p">(</span><span class="kt">SuccF</span> <span class="p">(</span><span class="kt">Fix</span> <span class="kt">ZeroF</span><span class="p">))))))</span>
</code></pre></div></div>
<h3 id="a-short-digression-on-fix">A Short Digression on ‘Fix’</h3>
<p>The ‘Fix’ type is brought into scope from ‘Data.Functor.Foldable’, but it’s
worth looking at it in some detail. It can be defined as follows, along with
two helpful functions for working with it:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">newtype</span> <span class="kt">Fix</span> <span class="n">f</span> <span class="o">=</span> <span class="kt">Fix</span> <span class="p">(</span><span class="n">f</span> <span class="p">(</span><span class="kt">Fix</span> <span class="n">f</span><span class="p">))</span>
<span class="n">fix</span> <span class="o">::</span> <span class="n">f</span> <span class="p">(</span><span class="kt">Fix</span> <span class="n">f</span><span class="p">)</span> <span class="o">-></span> <span class="kt">Fix</span> <span class="n">f</span>
<span class="n">fix</span> <span class="o">=</span> <span class="kt">Fix</span>
<span class="n">unfix</span> <span class="o">::</span> <span class="kt">Fix</span> <span class="n">f</span> <span class="o">-></span> <span class="n">f</span> <span class="p">(</span><span class="kt">Fix</span> <span class="n">f</span><span class="p">)</span>
<span class="n">unfix</span> <span class="p">(</span><span class="kt">Fix</span> <span class="n">f</span><span class="p">)</span> <span class="o">=</span> <span class="n">f</span>
</code></pre></div></div>
<p>‘Fix’ has a simple recursive structure. For a given value, you can think of
‘fix’ as adding one level of recursion to it. ‘unfix’ in turn removes one level
of recursion.</p>
<p>This generic recursive structure is what makes ‘Fix’ so useful: we can write
some nominally recursive type we’re interested in without actually using
recursion, but then package it up in ‘Fix’ to hijack the recursion it provides
automatically.</p>
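<p>Here’s a minimal sketch of that peeling and wrapping done by hand, with plain
explicit recursion and no library support. It redefines ‘Fix’ and ‘NatF’
locally so that it stands alone (load it in a fresh module rather than next to
the imports above), and ‘natToInt’ and ‘intToNat’ are illustrative names:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}

newtype Fix f = Fix (f (Fix f))

unfix :: Fix f -> f (Fix f)
unfix (Fix f) = f

data NatF r = ZeroF | SuccF r
  deriving Functor

type Nat = Fix NatF

-- Peel off one layer at a time with 'unfix', counting the 'SuccF's.
natToInt :: Nat -> Int
natToInt n = case unfix n of
  ZeroF   -> 0
  SuccF m -> 1 + natToInt m

-- Wrap layers back up with the 'Fix' constructor.
intToNat :: Int -> Nat
intToNat k
  | k <= 0    = Fix ZeroF
  | otherwise = Fix (SuccF (intToNat (k - 1)))
</code></pre></div></div>
<p>Round-tripping works as you’d expect: ‘natToInt (intToNat 3)’ is 3.</p>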
<h2 id="understanding-some-internal-plumbing">Understanding Some Internal Plumbing</h2>
<p>If we wrap a pattern functor in ‘Fix’ then the underlying machinery of
recursion-schemes should ‘just work’. Here it’s worth explaining a little as to
why that’s the case.</p>
<p>There are two fundamental type classes involved in recursion-schemes:
‘Foldable’ and ‘Unfoldable’. These serve to tease apart the recursive structure
of something like ‘Fix’ even more: loosely, ‘Foldable’ corresponds to types
that can be ‘unfixed’, and ‘Unfoldable’ corresponds to types that can be
‘fixed’. That is, we can add more layers of recursion to instances of
‘Unfoldable’, and we can peel off layers of recursion from instances of
‘Foldable’.</p>
<p>In particular ‘Foldable’ and ‘Unfoldable’ contain functions called ‘project’
and ‘embed’ respectively, corresponding to more general forms of ‘unfix’ and
‘fix’. Their types are as follows:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">project</span> <span class="o">::</span> <span class="kt">Foldable</span> <span class="n">t</span> <span class="o">=></span> <span class="n">t</span> <span class="o">-></span> <span class="kt">Base</span> <span class="n">t</span> <span class="n">t</span>
<span class="n">embed</span> <span class="o">::</span> <span class="kt">Unfoldable</span> <span class="n">t</span> <span class="o">=></span> <span class="kt">Base</span> <span class="n">t</span> <span class="n">t</span> <span class="o">-></span> <span class="n">t</span>
</code></pre></div></div>
<p>I’ve found it useful while using recursion-schemes to have a decent
understanding of how to interpret the type family ‘Base’. It appears frequently
in type signatures of various recursion schemes and being able to reason about
it can help a lot.</p>
<h2 id="base-and-basic-type-families">‘Base’ and Basic Type Families</h2>
<p>Type families are type-level functions; they take types as input and return
types as output. The ‘Base’ definition in recursion-schemes looks like this:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">type</span> <span class="n">family</span> <span class="kt">Base</span> <span class="n">t</span> <span class="o">::</span> <span class="o">*</span> <span class="o">-></span> <span class="o">*</span>
</code></pre></div></div>
<p>You can interpret this as a function that takes one type ‘t’ as input and
returns a type constructor of kind ‘* -> *’ (the pattern functor for ‘t’). An
implementation of this function is called an instance of the family. The
instance for ‘Fix’, for example, looks like:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="kr">type</span> <span class="kr">instance</span> <span class="kt">Base</span> <span class="p">(</span><span class="kt">Fix</span> <span class="n">f</span><span class="p">)</span> <span class="o">=</span> <span class="n">f</span>
</code></pre></div></div>
<p>In particular, an application of a type family like ‘Base’ is a synonym for
the right-hand side of the matching instance. So using the above example:
anywhere you see something like ‘Base (Fix f)’ you can mentally replace it
with ‘f’.</p>
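<p>Putting the pieces so far together, here is a hand-rolled sketch of how the two classes and the ‘Fix’ instances could fit together. It is not the library’s source: the superclass constraints are left out for brevity, ‘depth’ is a hypothetical helper, and ‘Maybe’ stands in as a pattern functor. Recent releases of recursion-schemes also name these classes ‘Recursive’ and ‘Corecursive’.</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE TypeFamilies #-}

import Prelude hiding (Foldable)
import Data.Kind (Type)

newtype Fix f = Fix (f (Fix f))

-- Maps a recursive type to its pattern functor.
type family Base t :: Type -> Type

-- Types from which one layer can be peeled off.
class Foldable t where
  project :: t -> Base t t

-- Types onto which one layer can be added.
class Unfoldable t where
  embed :: Base t t -> t

type instance Base (Fix f) = f

instance Foldable (Fix f) where
  project (Fix f) = f   -- same as 'unfix'

instance Unfoldable (Fix f) where
  embed = Fix           -- same as 'fix'

-- Hypothetical helper: count layers by projecting repeatedly.
-- 'Nothing' plays the role of zero and 'Just' the role of successor.
depth :: Fix Maybe -> Int
depth t = case project t of
  Nothing -> 0
  Just t' -> 1 + depth t'

main :: IO ()
main = print (depth (embed (Just (embed (Just (embed Nothing))))))  -- prints 2
</code></pre></div></div>
<p>In the body of ‘depth’, ‘project t’ has type ‘Base (Fix Maybe) (Fix Maybe)’, which reduces to ‘Maybe (Fix Maybe)’ by exactly the substitution described above.</p>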
<p>Instances of the ‘Base’ type family have a structure like ‘Fix’, but using
‘Base’ enables all the internal machinery of recursion-schemes to work
out-of-the-box for types other than ‘Fix’ alone. This has a typical Kmettian
flavour: first solve the most general problem, and then recover useful,
specific solutions to it automatically.</p>
<p>Predictably, ‘Fix f’ has instances of ‘Base’, ‘Foldable’, and ‘Unfoldable’ for
any functor ‘f’, so if you use it, you can freely use all of
recursion-schemes’s innards without needing to manually write any instances for
your own data types. But as mentioned above, it’s worth noting that you can
exploit the various typeclass & type family machinery to get by without using
‘Fix’ at all: see e.g. Danny Gratzer’s recursion-schemes post for an example of
this.</p>
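<p>As a rough illustration of that last point, the sketch below gives ordinary Haskell lists a ‘Base’ instance and a ‘project’, with no ‘Fix’ in sight. The class, ‘ListF’, ‘cata’, and ‘sumL’ are all defined locally for the example; they are assumptions for illustration rather than the library’s own definitions:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE TypeFamilies #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE DeriveFunctor #-}

import Prelude hiding (Foldable)
import Data.Kind (Type)

type family Base t :: Type -> Type

class Foldable t where
  project :: t -> Base t t

-- Pattern functor for lists: 'r' replaces the recursive tail.
data ListF a r = NilF | ConsF a r
  deriving Functor

-- Plain lists get their own instance, so no 'Fix' wrapper is needed.
type instance Base [a] = ListF a

instance Foldable [a] where
  project []       = NilF
  project (x : xs) = ConsF x xs

-- A fold written against the class rather than against 'Fix'.
cata :: (Foldable t, Functor (Base t)) => (Base t a -> a) -> t -> a
cata alg = alg . fmap (cata alg) . project

sumL :: [Int] -> Int
sumL = cata alg where
  alg NilF          = 0
  alg (ConsF x acc) = x + acc

main :: IO ()
main = print (sumL [1, 2, 3])   -- prints 6
</code></pre></div></div>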
<h2 id="some-useful-schemes">Some Useful Schemes</h2>
<p>So, with some discussion of the internals out of the way, we can look at some
of the more common and useful recursion schemes. I’ll concentrate on the
following four, as they’re the ones I’ve found the most use for:</p>
<ul>
<li>catamorphisms, implemented via ‘cata’, are generalized folds.</li>
<li>anamorphisms, implemented via ‘ana’, are generalized unfolds.</li>
<li>hylomorphisms, implemented via ‘hylo’, are anamorphisms followed by catamorphisms (corecursive production followed by recursive consumption).</li>
<li>paramorphisms, implemented via ‘para’, are generalized folds with access to the input argument corresponding to the most recent state of the computation.</li>
</ul>
<p>Let me digress slightly on nomenclature.</p>
<p>Yes, the names of these things are celebrations of the ridiculous. There’s no
getting around it; they look like self-parody to almost anyone not
pre-acquainted with categorical concepts. They have been accused — probably
correctly — of being off-putting.</p>
<p>That said, they communicate important technical details and are actually not so
bad when you get used to them. It’s perfectly fine and even encouraged to
arm-wave about folds or unfolds when speaking informally, but the moment
someone distinguishes one particular style of fold from another via a prefix
like e.g. para, I know exactly the relevant technical distinctions required to
understand the discussion. The names might be silly, but they have their place.</p>
<p>Anyway.</p>
<p>There are many other more exotic schemes that I’m sure are quite useful (see
Tim Williams’s recursion schemes talk, for example), but I haven’t made use of
any outside of these four just yet. The recursion-schemes library contains a
plethora of unfamiliar schemes just waiting to be grokked, but in the interim
even cata and ana alone will get you plenty far.</p>
<p>Now let’s use the motley crew of schemes to do some useful computation on our
example data types.</p>
<h3 id="catamorphisms">Catamorphisms</h3>
<p>Take our natural numbers type, ‘Nat’. To start, we can use a catamorphism to
represent a ‘Nat’ as an ‘Int’ by summing it up.</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">natsum</span> <span class="o">::</span> <span class="kt">Nat</span> <span class="o">-></span> <span class="kt">Int</span>
<span class="n">natsum</span> <span class="o">=</span> <span class="n">cata</span> <span class="n">alg</span> <span class="kr">where</span>
  <span class="n">alg</span> <span class="kt">ZeroF</span> <span class="o">=</span> <span class="mi">0</span>
  <span class="n">alg</span> <span class="p">(</span><span class="kt">SuccF</span> <span class="n">n</span><span class="p">)</span> <span class="o">=</span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span>
</code></pre></div></div>
<p>Here ‘alg’ refers to ‘algebra’, which is the function that we use to define our
reducing semantics. Notice that the semantics are not defined recursively! The
recursion present in ‘Nat’ has been decoupled and is handled for us by ‘cata’.
And as a plus, we still don’t have to deal with the ‘Fix’ constructor anywhere.</p>
<p>As a brief aside: I like to write my recursion schemes in this way, but your
mileage may vary. If you’d like to enable the ‘LambdaCase’ extension, then
another option is to avoid naming the algebra altogether by using a very
simple case expression:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="cp">{-# LANGUAGE LambdaCase #-}</span>
<span class="n">natsum</span> <span class="o">::</span> <span class="kt">Nat</span> <span class="o">-></span> <span class="kt">Int</span>
<span class="n">natsum</span> <span class="o">=</span> <span class="n">cata</span> <span class="o">$</span> <span class="nf">\</span><span class="kr">case</span>
  <span class="kt">ZeroF</span> <span class="o">-></span> <span class="mi">0</span>
  <span class="kt">SuccF</span> <span class="n">n</span> <span class="o">-></span> <span class="n">n</span> <span class="o">+</span> <span class="mi">1</span>
</code></pre></div></div>
<p>Some people find this more readable.</p>
<p>To understand how we used ‘cata’ to build this function, take a look at its
type:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">cata</span> <span class="o">::</span> <span class="kt">Foldable</span> <span class="n">t</span> <span class="o">=></span> <span class="p">(</span><span class="kt">Base</span> <span class="n">t</span> <span class="n">a</span> <span class="o">-></span> <span class="n">a</span><span class="p">)</span> <span class="o">-></span> <span class="n">t</span> <span class="o">-></span> <span class="n">a</span>
</code></pre></div></div>
<p>The ‘Base t a -> a’ term is the algebra; ‘t’ is our recursive datatype (i.e.
‘Nat’), and ‘a’ is whatever type we’re reducing a value to.</p>
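<p>For intuition, a version of ‘cata’ specialised to ‘Fix’ fits in one line. This is a sketch that swaps ‘project’ for ‘unfix’; it is not the library’s more general definition:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}

newtype Fix f = Fix (f (Fix f))

unfix :: Fix f -> f (Fix f)
unfix (Fix f) = f

data NatF r = ZeroF | SuccF r
  deriving Functor

type Nat = Fix NatF

-- Peel off a layer, fold the children, then apply the algebra.
cata :: Functor f => (f a -> a) -> Fix f -> a
cata alg = alg . fmap (cata alg) . unfix

natsum :: Nat -> Int
natsum = cata alg where
  alg ZeroF     = 0
  alg (SuccF n) = n + 1

main :: IO ()
main = print (natsum (Fix (SuccF (Fix (SuccF (Fix ZeroF))))))  -- prints 2
</code></pre></div></div>
<p>By the time the algebra sees ‘SuccF n’, the ‘n’ is already an ‘Int’: ‘fmap (cata alg)’ has folded everything underneath it.</p>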
<p>Historically I’ve found ‘Base’ here to be confusing, but here’s a neat trick to
help reason through it.</p>
<p>Remember that ‘Base’ is a type family, so for some appropriate ‘t’ and ‘a’,
‘Base t a’ is going to be a synonym for some other type. To figure out what
‘Base t a’ corresponds to for some concrete ‘t’ and ‘a’, we can ask GHCi via
this lesser-known command that evaluates type families:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">></span> <span class="o">:</span><span class="n">kind</span><span class="o">!</span> <span class="kt">Base</span> <span class="kt">Nat</span> <span class="kt">Int</span>
<span class="kt">Base</span> <span class="kt">Nat</span> <span class="kt">Int</span> <span class="o">::</span> <span class="o">*</span>
<span class="o">=</span> <span class="kt">NatF</span> <span class="kt">Int</span>
</code></pre></div></div>
<p>So in the ‘natsum’ example the algebra used with ‘cata’ must have type ‘NatF
Int -> Int’. This is pretty obvious for ‘cata’, but I initially found that
figuring out exactly what type should be substituted for ‘Base’ could be
confusing for some of the more exotic schemes.</p>
<p>As another example, we can use a catamorphism to implement ‘filter’ for our list type:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">filterL</span> <span class="o">::</span> <span class="p">(</span><span class="n">a</span> <span class="o">-></span> <span class="kt">Bool</span><span class="p">)</span> <span class="o">-></span> <span class="kt">List</span> <span class="n">a</span> <span class="o">-></span> <span class="kt">List</span> <span class="n">a</span>
<span class="n">filterL</span> <span class="n">p</span> <span class="o">=</span> <span class="n">cata</span> <span class="n">alg</span> <span class="kr">where</span>
  <span class="n">alg</span> <span class="kt">NilF</span> <span class="o">=</span> <span class="n">nil</span>
  <span class="n">alg</span> <span class="p">(</span><span class="kt">ConsF</span> <span class="n">x</span> <span class="n">xs</span><span class="p">)</span>
    <span class="o">|</span> <span class="n">p</span> <span class="n">x</span> <span class="o">=</span> <span class="n">cons</span> <span class="n">x</span> <span class="n">xs</span>
    <span class="o">|</span> <span class="n">otherwise</span> <span class="o">=</span> <span class="n">xs</span>
</code></pre></div></div>
<p>It follows the same simple pattern: we define our semantics by interpreting
recursion-less constructors through an algebra, then pump it through ‘cata’.</p>
<h3 id="anamorphisms">Anamorphisms</h3>
<p>These running examples are toys, but even here it’s really annoying to have to
type ‘succ (succ (succ (succ (succ (succ zero)))))’ to get a natural number
corresponding to six for debugging or what have you.</p>
<p>We can use an anamorphism to build a ‘Nat’ value from an ‘Int’:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">nat</span> <span class="o">::</span> <span class="kt">Int</span> <span class="o">-></span> <span class="kt">Nat</span>
<span class="n">nat</span> <span class="o">=</span> <span class="n">ana</span> <span class="n">coalg</span> <span class="kr">where</span>
  <span class="n">coalg</span> <span class="n">n</span>
    <span class="o">|</span> <span class="n">n</span> <span class="o"><=</span> <span class="mi">0</span> <span class="o">=</span> <span class="kt">ZeroF</span>
    <span class="o">|</span> <span class="n">otherwise</span> <span class="o">=</span> <span class="kt">SuccF</span> <span class="p">(</span><span class="n">n</span> <span class="o">-</span> <span class="mi">1</span><span class="p">)</span>
</code></pre></div></div>
<p>Just as a small detail: to be descriptive, here I’ve used ‘coalg’ as the
argument to ‘ana’, for ‘coalgebra’.</p>
<p>Now the expression ‘nat 6’ will do the same for us as the more verbose example
above. As always, recursion is not part of the semantics: the coalgebra only
says that a positive ‘n’ unfolds to the successor of ‘n - 1’, and ‘ana’ takes
care of unfolding the rest.</p>
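<p>As with ‘cata’, a ‘Fix’-specialised ‘ana’ is short enough to sketch inline. The definitions of ‘ana’ and ‘cata’ below are simplified stand-ins for the library’s versions, included so the round trip through ‘nat’ and ‘natsum’ can be checked in isolation:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}

newtype Fix f = Fix (f (Fix f))

unfix :: Fix f -> f (Fix f)
unfix (Fix f) = f

data NatF r = ZeroF | SuccF r
  deriving Functor

type Nat = Fix NatF

-- Unfold one layer from the seed, then keep unfolding the new seeds.
ana :: Functor f => (a -> f a) -> a -> Fix f
ana coalg = Fix . fmap (ana coalg) . coalg

-- The matching fold, specialised to 'Fix'.
cata :: Functor f => (f a -> a) -> Fix f -> a
cata alg = alg . fmap (cata alg) . unfix

nat :: Int -> Nat
nat = ana coalg where
  coalg n
    | n > 0     = SuccF (n - 1)
    | otherwise = ZeroF

natsum :: Nat -> Int
natsum = cata alg where
  alg ZeroF     = 0
  alg (SuccF n) = n + 1

main :: IO ()
main = print (natsum (nat 6))   -- prints 6
</code></pre></div></div>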
<h3 id="paramorphisms">Paramorphisms</h3>
<p>As an example, try to express a factorial on a natural number in terms of
‘cata’. It’s (apparently) doable, but an implementation is not immediately
clear.</p>
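<p>One way to do it with ‘cata’ alone is to have the algebra carry a running count next to the running product and throw the count away at the end. The sketch below uses a ‘Fix’-specialised ‘cata’; ‘natfacCata’ and ‘three’ are hypothetical names chosen for the example:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}

newtype Fix f = Fix (f (Fix f))

unfix :: Fix f -> f (Fix f)
unfix (Fix f) = f

data NatF r = ZeroF | SuccF r
  deriving Functor

type Nat = Fix NatF

cata :: Functor f => (f a -> a) -> Fix f -> a
cata alg = alg . fmap (cata alg) . unfix

-- The carrier is a pair (k, k!): the algebra rebuilds the number
-- itself as it goes, since 'cata' does not hand it over.
natfacCata :: Nat -> Int
natfacCata = snd . cata alg where
  alg ZeroF          = (0, 1)
  alg (SuccF (k, f)) = (k + 1, (k + 1) * f)

-- The natural number three, built by applying the successor three times.
three :: Nat
three = iterate (Fix . SuccF) (Fix ZeroF) !! 3

main :: IO ()
main = print (natfacCata three)   -- prints 6
</code></pre></div></div>
<p>It works, but the bookkeeping in the first component of the pair is exactly the kind of thing a paramorphism provides for free.</p>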
<p>A paramorphism will operate on an algebra that provides access to the input
argument corresponding to the running state of the recursion. Check out the
type of ‘para’ below:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">para</span> <span class="o">::</span> <span class="kt">Foldable</span> <span class="n">t</span> <span class="o">=></span> <span class="p">(</span><span class="kt">Base</span> <span class="n">t</span> <span class="p">(</span><span class="n">t</span><span class="p">,</span> <span class="n">a</span><span class="p">)</span> <span class="o">-></span> <span class="n">a</span><span class="p">)</span> <span class="o">-></span> <span class="n">t</span> <span class="o">-></span> <span class="n">a</span>
</code></pre></div></div>
<p>If we’re implementing a factorial on ‘Nat’ values then ‘t’ is going to
correspond to ‘Nat’ and ‘a’ is going to correspond to (say) ‘Int’. Here the
‘:kind!’ trick is handy for reasoning through the requirements of the algebra.
We can ask GHCi to help us out:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="o">></span> <span class="o">:</span><span class="n">kind</span><span class="o">!</span> <span class="kt">Base</span> <span class="kt">Nat</span> <span class="p">(</span><span class="kt">Nat</span><span class="p">,</span> <span class="kt">Int</span><span class="p">)</span>
<span class="kt">Base</span> <span class="kt">Nat</span> <span class="p">(</span><span class="kt">Nat</span><span class="p">,</span> <span class="kt">Int</span><span class="p">)</span> <span class="o">::</span> <span class="o">*</span>
<span class="o">=</span> <span class="kt">NatF</span> <span class="p">(</span><span class="kt">Nat</span><span class="p">,</span> <span class="kt">Int</span><span class="p">)</span>
</code></pre></div></div>
<p>Side note: after doing this trick a few times you’ll probably find it much
easier to reason about type families sans-GHCi. In any case, we can now
implement an algebra corresponding to the required type:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="n">natfac</span> <span class="o">::</span> <span class="kt">Nat</span> <span class="o">-></span> <span class="kt">Int</span>
<span class="n">natfac</span> <span class="o">=</span> <span class="n">para</span> <span class="n">alg</span> <span class="kr">where</span>
  <span class="n">alg</span> <span class="kt">ZeroF</span> <span class="o">=</span> <span class="mi">1</span>
  <span class="n">alg</span> <span class="p">(</span><span class="kt">SuccF</span> <span class="p">(</span><span class="n">n</span><span class="p">,</span> <span class="n">f</span><span class="p">))</span> <span class="o">=</span> <span class="n">natsum</span> <span class="p">(</span><span class="n">succ</span> <span class="n">n</span><span class="p">)</span> <span class="o">*</span> <span class="n">f</span>
</code></pre></div></div>
<p>Here there are some details to point out.</p>
<p>The type of our algebra is ‘NatF (Nat, Int) -> Int’; the value with the ‘Nat’
type, ‘n’, holds the most recent input argument used to compute the state of
the computation, ‘f’.</p>
<p>If you picture a factorial defined as</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code><span class="mi">0</span><span class="o">!</span> <span class="o">=</span> <span class="mi">1</span>
<span class="p">(</span><span class="n">k</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span><span class="o">!</span> <span class="o">=</span> <span class="p">(</span><span class="n">k</span> <span class="o">+</span> <span class="mi">1</span><span class="p">)</span> <span class="o">*</span> <span class="n">k</span><span class="o">!</span>
</code></pre></div></div>
<p>Then ‘n’ corresponds to ‘k’ and ‘f’ corresponds to ‘k!’. To compute the
factorial of the successor to ‘n’, we just convert ‘succ n’ to an integer (via
‘natsum’) and multiply it by ‘f’.</p>
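<p>To see where the pair comes from, here is a sketch of ‘para’ specialised to ‘Fix’, again a simplified stand-in rather than the library’s definition. Each child is paired with the result of recursing on it before the algebra runs:</p>
<div class="language-haskell highlighter-rouge"><div class="highlight"><pre class="highlight"><code>{-# LANGUAGE DeriveFunctor #-}

newtype Fix f = Fix (f (Fix f))

unfix :: Fix f -> f (Fix f)
unfix (Fix f) = f

data NatF r = ZeroF | SuccF r
  deriving Functor

type Nat = Fix NatF

cata :: Functor f => (f a -> a) -> Fix f -> a
cata alg = alg . fmap (cata alg) . unfix

-- Like 'cata', but every child keeps its original value alongside
-- its folded result.
para :: Functor f => (f (Fix f, a) -> a) -> Fix f -> a
para alg = alg . fmap (\t -> (t, para alg t)) . unfix

natsum :: Nat -> Int
natsum = cata alg where
  alg ZeroF     = 0
  alg (SuccF n) = n + 1

natfac :: Nat -> Int
natfac = para alg where
  alg ZeroF          = 1
  -- 'Fix (SuccF n)' is the successor of 'n', written without a helper.
  alg (SuccF (n, f)) = natsum (Fix (SuccF n)) * f

-- The natural number four, built by applying the successor four times.
four :: Nat
four = iterate (Fix . SuccF) (Fix ZeroF) !! 4

main :: IO ()
main = print (natfac four)   -- prints 24
</code></pre></div></div>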
<p>Paramorphisms tend to be pretty useful for a lot of mund