02 Jul 2018
Why does my blog often feature its typical motley mix of probability,
functional programming, and computer science anyway?
From 2011 through 2017 I slogged through a Ph.D. in statistics, working on it
full time in 2012, and parttime in every other year. It was an interesting
experience. Although everything worked out for me in the end – I managed to
do a lot of good and interesting work in industry while still picking up a
Ph.D. on the side – it’s not something I’d necessarily recommend to others.
The smart strategy is surely to choose one thing and give it one’s maximum
effort; by splitting my time between work and academia, both obviously suffered
to some degree.
That said, at the end of the day I was pretty happy with the results on both
fronts. On the academic side of things, the main product was a dissertation,
Embedded DomainSpecific Languages for Bayesian Modelling and
Inference, supporting my thesis: that novel and useful DSLs for solving
problems in Bayesian statistics can be embedded in staticallytyped, purely
functional programming languages.
It helps to remember that in this day and age, one can still typically graduate
by, uh, “merely” submitting and defending a dissertation. Publishing in
academic venues certainly helps focus one’s work, and is obviously necessary
for a career in academia (or, increasingly, industrial research). But it’s
optional when it comes to getting your degree, so if it doesn’t help you
achieve your goals, you may want to reconsider it, as I did.
The problem with the dissertationfirst approach, of course, is that nobody
reads your work. To some extent I think I’ve mitigated that; most of the
content in my dissertation is merely a fleshedout version of various ideas
I’ve written about on this blog. Here I’ll continue that tradition and write a
brief, informal summary of my dissertation and Ph.D. more broadly – what I
did, how I approached it, and what my thoughts are on everything after the
fact.
The Idea
Following the advice of Olin Shivers (by way of Matt Might), I
oriented my work around a concrete thesis, which wound up more or less
being that embedding DSLs in a Haskelllike language can be a useful technique
for solving statistical problems. This thesis wasn’t born into the world
fullyformed, of course – it began as quite a vague (or misguided) thing, but
matured naturally over time. Using the tools of programming languages and
compilers to do statistics and machine learning is the motivation behind
probabilistic programming in general; what I was interested in was exploring
the problem in the setting of languages embedded in a purely functional host.
Haskell was the obvious choice of host for all of my implementations.
It may sound obvious that putting together a thesis is a good strategy for a
Ph.D. But here I’m talking about a thesis in the original (Greek) sense
of a proposition, i.e. a falsifiable idea or claim (in contrast to a
dissertation, from the Latin disserere, i.e. to examine or to discuss).
Having a central idea to orient your work around can be immensely useful in
terms of focus. When you read a dissertation with a clear thesis, it’s easy to
know what the writer is generally on about – without one it can (increasingly)
be tricky.
My thesis is pretty easy to defend in the abstract. A DSL really exposes the
structure of one’s problem while also constraining it appropriately, and
embedding one in a host language means that one doesn’t have to implement an
entire compiler toolchain to support it. I reckoned that simply pointing the
artillery of “language engineering” at the statistical domain would lead to
some interesting insight on structure, and maybe even produce some useful
tools. And it did!
The Contributions
Of course, one needs to do a little more defending than that to satisfy his or
her examination committee. Doctoral research is supposed to be substantial and
novel. In my experience, reviewers are concerned with your answers to the
following questions:
 What, specifically, are your claims?
 Are they novel contributions to your field?
 Have you backed them up sufficiently?
At the end of the day, I claimed the following advances from my work.

Novel probabilistic interpretations of the Giry monad’s algebraic
structure. The Giry monad (Lawvere, 1962; Giry, 1981) is the
“canonical” probability monad, in a meaningful sense, and I demonstrated that
one can characterise the measuretheoretic notion of image measure by its
functorial structure, as well as the notion of product measure by its
monoidal structure. Having the former around makes it easy to transform the
support of a probability distribution while leaving its density structure
invariant, and the latter lets one encode probabilistic independence, enabling
things like measure convolution and the like. What’s more, the analogous
semantics carry over to other probability monads – for example the wellknown
sampling monad, or more abstract variants.

A novel characterisation of the Giry monad as a restricted continuation
monad. Ramsey & Pfeffer (2002) discussed an “expectation monad,”
and I had independently come up with my own “measure monad” based on
continuations. But I showed both reduce to a restricted form of the
continuation monad of Wadler (1994) – and that indeed, when the
return type of Wadler’s continuation monad is restricted to the reals, it is
the Giry monad.
To be precise it’s actually somewhat more general – it permits integration
with respect to any measure, not only a probability measure – but that
definition strictly subsumes the Giry monad. I also showed that product
measure, via the applicative instance, yields measure convolution and
associated operations.

A novel technique for embedding a staticallytyped probabilistic
programming language in a purely functional language. The general idea
itself is wellknown to those who have worked with DSLs in Haskell: one
constructs a base functor and wraps it in the free monad. But the reason
that technique is appropriate in the probabilistic programming domain is that
probabilistic models are fundamentally monadic constructs – merely recall
the existence of the Giry monad for proof!
To construct the requisite base functor, one maps some core set of concrete
probability distributions denoted by the Giry monad to a collection of
abstract probability distributions represented only by unique names. These
constitute the branches of one’s base functor, which is then wrapped in the
familiar ‘Free’ machinery that gives one access to the functorial,
applicative, and monadic structure that I talked about above. This abstract
representation of a probabilistic model allows one to implement other
probability monads, such as the wellknown sampling monad (Ramsey & Pfeffer,
2002; Park et al., 2008) or the Giry monad, by way of interpreters.
(N.b. Ścibior et al. (2015) did some very similar work to this,
although the monad they used was arguably more operational in its flavour.)

A novel characterisation of execution traces as cofree comonads. The
idea of an “execution trace” is that one runs a probabilistic program
(typically generating a sample) and then records how it executed – what
randomness was used, the execution path of the program, etc. To do Bayesian
inference, one then runs a Markov chain over the space of possible execution
traces, calculating statistics about the resulting distribution in trace
space (Wingate et al., 2011).
Remarkably, a cofree comonad over the same abstract probabilistic base
functor described above allows us to represent an execution trace at the
embedded language level itself. In practical terms, that means one can
denote a probabilistic model, and then run a Markov chain over the space of
possible ways it could have executed, without leaving GHCi. You can
alternatively examine and perturb the way the program executes, stepping
through it piece by piece, as I believe was originally a feature in Venture
(Mansinghka et al., 2014).
(N.b. this really blew my mind when I first started toying with it,
converting programs into execution traces and then manipulating them as
firstclass values, defining other probabilistic programs over spaces of
execution traces, etc. Meta.)

A novel technique for statically encoding conditional independence of
terms in this kind of embedded probabilistic programming language. If you
recall that I previously demonstrated the monoidal (i.e. applicative)
structure of the Giry monad encodes the notion of product measure, it will
not be too surprising to hear that I used the free applicative functor
(Capriotti & Kaposi, 2014) (again, over the same kind of abstract
probabilistic base functor) to reify applicative expressions such that they can
be identified statically.

A novel shallowlyembedded language for building custom transition
operators for use in Markov chain Monte Carlo. MCMC is the defacto
standard way to perform inference on Bayesian models (although it is not
limited to Bayesian models in particular). By wrapping a simple state monad
transformer around a probability monad, one can denote Markov transition
operators, combine them, and transform them in a few ways that are useful for
doing MCMC.
The framework here was inspired by the old parallel “strategies” idea of
Trinder et al. (1998). The idea is that you want to “evaluate” a
posterior via MCMC, and want to choose a strategy by which to do so – e.g.
Metropolis (Metropolis, 1953), slice sampling (Neal, 2003),
Hamiltonian (Neal, 2011), etc. Since Markov transition operators
are closed under composition and convex combinations, it is easy to write a
little shallowlyembedded combinator language for working with them –
effectively building evaluation strategies in a manner familiar to those
who’ve worked with Haskell’s parallel library.
(N.b. although this was the most trivial part of my research, theoretical or
implementationwise, it remains the most useful for daytoday practical
work.)
The Execution
One needs to stitch his or her contributions together in some kind of
overarching narrative that supports the underlying thesis. Mine went
something like this:
The Giry monad is appropriate for denoting probabilistic semantics in
languages with purelyfunctional hosts. Its functorial, applicative, and
monadic structure denote probability distributions, independence, and
marginalisation, respectively, and these are necessary and sufficient for
encoding probabilistic models. An embedded language based on the Giry monad is
typesafe and composable.
Probabilistic models in an embedded language, semantically denoted in terms of
the Giry monad, can be made abstract and interpretationindependent by defining
them in terms of a probabilistic base functor and a free monad instead. They
can be forwardinterpreted using standard free monad recursion schemes in order
to compute probabilities (via a measure intepretation) or samples (via a
sampling interpretation); the latter interpretation is useful for performing
limited forms of Bayesian inference, in particular. These freeencoded models
can also be transformed into cofreeencoded models, under which they represent
execution traces that can be perturbed arbitrarily by standard comonadic
machinery. This representation is amenable to more elaborate forms of Bayesian
inference. To accurately denote conditional independence in the embedded
language, the free applicative functor can also be used.
One can easily construct a shallowlyembedded language for building custom
Markov transitions. Markov chains that use these compound transitions can
outperform those that use only “primitive” transitions in certain settings.
The shallowly embedded language guarantees that transitions can only be
composed in welldefined, typesafe ways that preserve the properties desirable
for MCMC. What’s more, one can implement “transition transformers” for
implementing still more complex inference techniques, e.g. annealing or
tempering, over existing transitions.
Thus: novel and useful domainspecific languages for solving problems in
Bayesian statistics can be embedded in staticallytyped, purelyfunctional
programming languages.
I used the twentyminute talk period of my defence to go through this
narrative and point out my claims, after which I was grilled on them for an
hour or two. The defence was probably the funnest part of my whole Ph.D.
The Product
In the end, I mainly produced a dissertation, a few blog posts, and
some code. By my count, the following repos came out of the work:
If any of this stuff is or was useful to you, that’s great! I still use the
declarative libraries, flatmcmc, mwcprobability, and sampling pretty
regularly. They’re fast and convenient for practical work.
Some of the other stuff, e.g. measurable, is useful for building intuition,
but not so much in practice, and deanie, for example, is a workinprogress
that will probably not see much more progress (from me, at least). Continuing
from where I left off might be a good idea for someone who wants to explore
problems in this kind of setting in the future.
General Thoughts
When I first read about probabilistic (functional) programming in Dan Roy’s
2011 dissertation I was absolutely blown away by the idea. It seemed
that, since there was such an obvious connection between the structure of
Bayesian models and programming languages (via the underlying semantic graph
structure, something that has been exploited to some degree as far back as
BUGS), it was only a matter of time until someone was able to really
create a tool that would revolutionize the practice of Bayesian statistics.
Now I’m much more skeptical. It’s true that probabilistic programming tends to
expose some beautiful structure in statistical models, and that a probabilistic
programming language that was easy to use and “just worked” for inference would
be a very useful tool. But putting something expressive and usable together
that also “just works” for that inference step is very, very difficult. Very
difficult indeed.
Almost every probabilistic programming framework of the past ten years, from
Church down to my own stuff, has more or less wound up as “thesisware,” or
remains the exclusive publicationgenerating mechanism of a single research
group. The exceptions are almost unexceptional in of themselves: JAGS and Stan
are probably the mostused such frameworks, certainly in statistics (I will
mention the very honourable PyMC here as well), but they innovate little, if at
all, over the original BUGS in terms of expressiveness. Similarly it’s very
questionable whether the fancy MCMC algo du jour is really any better than
some combination of MetropolisHastings (even plain Metropolis), Gibbs (or its
approximate variant, slice sampling), or nested sampling in anything
outside of favourablyengineered examples (I will note that Hamiltonian Monte
Carlo could probably be counted in there too, but it can still be quite a pain
to use, its variants are probably overrated, and it is comparatively
expensive).
Don’t get me wrong. I am a militant Bayesian. Bayesian statistics, i.e., as
far as I’m concerned, probability theory, describes the world accurately.
And there’s nothing wrong with thesisware, either. Research is research, and
this is a very thorny problem area. I hope to see more abandoned, innovative
software that moves the ball up the field, or kicks it into another stadium
entirely. Not less. The more ingenious implementations and sampling schemes
out there, the better.
But more broadly, I often find myself in the camp of Leo Breiman, who
in 2001 characterised the two predominant cultures in statistics as those of
data modelling and algorithmic modelling respectively, the latter now known
as machine learning, of course. The crux of the data modelling argument,
which is of course predominant in probabilistic programming research and
Bayesian statistics more generally, is that a practitioner, by means of his or
her ingenuity, is able to suss out the essence of a problem and distill it into
a useful equation or program. Certainly there is something to this: science is
a matter of creating hypotheses, testing them against the world, and iterating
on that, and the “data modelling” procedure is absolutely scientific in
principle. Moreover, with a hat tip to Max Dama, one often wants to
impose a lot of structure on a problem, especially if the problem is in a
domain where there is a tremendous amount of noise. There are many areas where
this approach is just the thing one is looking for.
That said, it seems to me that a lot of the data modellingflavoured side of
probabilistic programming, Bayesian nonparametrics, etc., is to some degree
geared more towards being, uh, “research paper friendly” than anything else.
These are extremely seductive areas for curious folks who like to play at the
intersection of math, statistics, and computer science (raises hand), and one
can easily spend a lifetime chasing this or that exquisite theoretical
construct into any number of rabbit holes. But at the end of the day, the data
modelling culture, per Breiman:
.. has at its heart the belief that a statistician, by imagination
and by looking at the data, can invent a reasonably good parametric class of
models for a complex mechanism devised by nature.
Certainly the traditional statistics that Breiman wrote about in 2001 was very
different from probabilistic programming and similar fields in 2018. But I
think there is the same element of hubris in them, and to some extent, a
similar dissociation from reality. I have cultivated some of the applied bent
of a Breiman, or a Dama, or a Locklin, so perhaps this should not be
too surprising.
I feel that the 2012ish resurgence of neural networks jolted the machine
learning community out of a largescale descent into some rather dubious
Bayesian nonparametrics research, which, much as I enjoy that subject area,
seemed more geared towards generating fun machine learning summer school
lectures and NIPS papers than actually getting much practical work done. I
can’t help but feel that probabilistic programming may share a few of those
same characteristics. When all is said and done, answering the question is
this stuff useful? often feels like a stretch.
So: onward & upward and all that, but my enthusiasm has been tempered somewhat,
is all.
Fini
Administrative headaches and the existential questions associated with grad
school aside, I had a great time working in this area for a few years, if in my
own aloof and eccentric way.
If you ever interacted with this area of my work, I hope you got some utility
out of it: research ideas, use of my code, or just some blog post that you
thought was interesting during a slow day at the office. If you’re working in
the area, or are considering it, I wish you success, whether your goal is to
build practical tools, or to publish sexy papers. :)
27 Jun 2018
Take an iterated integral, e.g. . Fubini’s
Theorem describes the conditions under which the order of integration can
be swapped on this kind of thing while leaving its value invariant. If
Fubini’s conditions are met, you can convert your integral into and be guaranteed to obtain the same result you would have
gotten by going the other way.
What are these conditions? Just that you can glue your individual measures
together as a product measure, and that is integrable with respect to it.
I.e.,
Say you have a Giry monad implementation kicking around and you
want to see how Fubini’s Theorem works in terms of applicative functors,
monads, continuations, and all that. It’s pretty easy. You could start with
my old measurable library that sits on GitHub and attracts curious stars
from time to time and cook up the following example:
import Control.Applicative ((<$>), (<*>))
import Measurable
dx :: Measure Int
dx = bernoulli 0.5
dy :: Measure Double
dy = beta 1 1
dprod :: Measure (Int, Double)
dprod = (,) <$> dx <*> dy
Note that ‘dprod’ is clearly a product measure (I’ve constructed it using the
Applicative instance for the Giry monad, so it must be a product
measure) and take a simple, obviously integrable function:
add :: (Int, Double) > Double
add (m, x) = fromIntegral m + x
Since ‘dprod’ is a product measure, Fubini’s Theorem guarantees that the
following are equivalent:
i0 :: Double
i0 = integrate add dprod
i1 :: Double
i1 = integrate (\x > integrate (curry add x) dy) dx
i2 :: Double
i2 = integrate (\y > integrate (\x > curry add x y) dx) dy
And indeed they are – you can verify them yourself if you don’t believe me (or
our boy Fubini).
For an example of a where interchanging the order of integration would be
impossible, we can construct some other measure:
dpair :: Measure (Int, Double)
dpair = do
x < dx
y < fmap (* fromIntegral x) dy
return (x, y)
It can be integrated as follows:
i3 :: Double
i3 = integrate (\x > integrate (curry add x) (fmap (* fromIntegral x) dy)) dx
But notice how ‘dpair’ is constructed: it is strictly monadic, not
applicative, so the order of the expressions matters. Since ‘dpair’ can’t be
expressed as a product measure (i.e. by an applicative expression), Fubini says
that swapping the order of integration is a nono.
Note that if you were to just look at the types of ‘dprod’ and ‘dpair’ – both
‘Measure (Int, Double)’ – you wouldn’t be able to tell immediately that one
represents a product measure while the other one does not. If being able to
tell these things apart statically is important to you (say, you want to
statically apply orderofintegration optimisations to integral expressions or
what have you), you need look no further than the free applicative
functor to help you out.
Fun fact: there is a wellknown variant of Fubini’s Theorem, called Tonelli’s
Theorem, that was developed by another Italian guy at around the same time.
I’m not sure how early20th century Italy became so strong in
orderofintegration research, exactly.
22 Jan 2018
You can recognize truth by its beauty and simplicity.
– Richard Feynman (attributed)
In one of his early emails on the Cryptography mailing list, Satoshi
claimed that the proofofwork chain is a solution to the Byzantine Generals
Problem (BGP). He describes this via an example where a bunch of
generals – Byzantine ones, of course – collude to break a king’s wifi.
It’s interesting to look at this a little closer in the language of the
originallystated BGP itself. One doesn’t need to be too formal to glean
useful intuition here.
What, more precisely, did Satoshi claim?
The Decentralized Timestamp Server
Satoshi’s problem is that of a decentralized timestamp server (DTS).
Namely, he posits that any number of nodes, following some protocol, can
together act as a timestamping server – producing some consistent ordering on
what we’ll consider to be abstract ‘blocks’.
The decentralized timestamp server reduces to an instance of the Byzantine
Generals Problem as follows. There are a bunch of nodes, who could each be
honest or dishonest. All honest nodes want to agree on some ordering – a
history – of blocks, and a small number of dishonest nodes should not easily
be able to compromise that history – say, by convincing the honest nodes to
adopt some alternate one of their choosing.
(N.b. it’s unimportant here to be concerned about the contents of blocks.
Since the decentralized timestamp server problem is only concerned about
block orderings, we don’t need to consider the case of invalid transactions
within blocks or what have you, and can safely assume that any history must
be internally consistent. We only need to assume that child blocks depend
utterly on their parents, so that rewriting a history by altering some parent
block also necessitates rewriting its children, and that honest nodes are
constantly trying to append blocks.)
As demonstrated in the introduction to the original paper, the Byzantine
Generals Problem can be reduced to the problem of how any given node
communicates its information to others. In our context, it reduces to the
following:
Byzantine Generals Problem (DTS)
A node must broadcast a history of blocks to its peers, such that:
 (IC1) All honest peers agree on the history.
 (IC2) If the node is honest, then all honest peers agree with the history
it broadcasts.
To produce consensus, every node will communicate its history to others by
using a solution to the Byzantine Generals Problem.
Longest ProofofWork Chain
Satoshi’s proposed solution to the BGP has since come to be known as ‘Nakamoto
Consensus’. It is the following protocol:
Nakamoto Consensus
 Always use the longest history.
 Appending a block to any history requires a proof that a certain amount of
work – proportional in expectation to the total ‘capability’ of the
network – has been completed.
To examine how it works, consider an abstract network and communication medium.
We can assume that messages are communicated instantly (it suffices that
communication is dwarfed in time by actually producing a proof of work) and
that the network is static and fixed, so that only active or ‘live’ nodes
actually contribute to consensus.
The crux of Nakamoto consensus is that nodes must always use the longest
available history – the one that provably has the largest amount of work
invested in it – and appending to any history requires a nontrivial amount of
work in of itself. Consider a set of nodes, each having some (not necessarily
shared) history. Whenever any node broadcasts a oneblock longer history,
all honest nodes will immediately agree on it, and conditions (IC1) and (IC2)
are thus automatically satisfied whether or not the broadcasting node is
honest. Nakamoto Consensus trivially solves the BGP in this most important
case; we can examine other cases by examining how they reduce to this one.
If two or more nodes broadcast longer histories at approximately the same time,
then honest nodes may not agree on a single history for as long as it takes a
longer history to be produced and broadcast. As soon as this occurs (which,
in all probability, is only a matter of time), we reduce to the previous case
in which all honest nodes agree with each other again, and the BGP is resolved.
The ‘bad’ outcome we’re primarily concerned about is that of dishonest nodes
rewriting history in their favour, i.e. by replacing some history by another one that somehow benefits them. The idea here is that some dishonest
node (or nodes) intends to use block as some sort of commitment, but
later wants to renege. To do so, the node needs to rewrite not only ,
but all other blocks that depend on (here , etc.), ultimately
producing a longer history than is currently agreed upon by honest peers.
Moreover, it needs to do this faster than honest nodes are able to produce
longer histories on their own. Catching up to and exceeding the honest nodes
becomes exponentially unlikely in the number of blocks to be rewritten, and so
a measure of confidence can be ascribed to agreement on the state of any
subhistory that has been ‘buried’ by a certain number of blocks (see the
penultimate section of Satoshi’s paper for details).
Dishonest nodes that seek to replace some wellestablished, agreedupon history
with another will thus find it effectively impossible (i.e. the probability is
negligible) unless they control a majority of the network’s capability – at
which point they no longer constitute a small number of peers.
Summary
So in the language of the originallystated BGP: Satoshi claimed that the
decentralized timestamp server is an instance of the Byzantine Generals
Problem, and that Nakamoto Consensus (as it came to be known) is a solution to
the Byzantine Generals Problem. Because Nakamoto Consensus solves the BGP,
honest nodes that always use the longest proofofwork history in the
decentralized timestamp network will eventually come to consensus on the
ordering of blocks.
01 Mar 2017
Last week Dan Peebles asked me on Twitter if I knew of any writing on the use
of recursion schemes for expressing stochastic processes or other probability
distributions. And I don’t! So I’ll write some of what I do know myself.
There are a number of popular statistical models or stochastic processes that
have an overtly recursive structure, and when one has some recursive structure
lying around, the elegant way to represent it is by way of a recursion scheme.
In the case of stochastic processes, this typically boils down to using an
anamorphism to drive things. Or, if you actually want to be able to observe
the thing (note: you do), an apomorphism.
By representing a stochastic process in this way one can really isolate the
probabilistic phenomena involved in it. One bundles up the essence of a
process in a coalgebra, and then drives it via some appropriate recursion
scheme.
Let’s take a look at three stochastic processes and examine their probabilistic
and recursive structures.
Foundations
To start, I’m going to construct a simple embedded language in the spirit of
the ones used in my simple probabilistic programming and comonadic
inference posts. Check those posts out if this stuff looks too
unfamiliar. Here’s a preamble that constitutes the skeleton of the code we’ll
be working with.
{# LANGUAGE DeriveFunctor #}
{# LANGUAGE FlexibleContexts #}
{# LANGUAGE LambdaCase #}
{# LANGUAGE RankNTypes #}
{# LANGUAGE TypeFamilies #}
import Control.Monad
import Control.Monad.Free
import qualified Control.Monad.Trans.Free as TF
import Data.Functor.Foldable
import Data.Random (RVar, sample)
import qualified Data.Random.Distribution.Bernoulli as RF
import qualified Data.Random.Distribution.Beta as RF
import qualified Data.Random.Distribution.Normal as RF
 probabilistic instruction set, program definitions
data ModelF a r =
BernoulliF Double (Bool > r)
 GaussianF Double Double (Double > r)
 BetaF Double Double (Double > r)
 DiracF a
deriving Functor
type Program a = Free (ModelF a)
type Model b = forall a. Program a b
type Terminating a = Program a a
 core language terms
bernoulli :: Double > Model Bool
bernoulli p = liftF (BernoulliF vp id) where
vp
 p < 0 = 0
 p > 1 = 1
 otherwise = p
gaussian :: Double > Double > Model Double
gaussian m s
 s <= 0 = error "gaussian: variance out of bounds"
 otherwise = liftF (GaussianF m s id)
beta :: Double > Double > Model Double
beta a b
 a <= 0  b <= 0 = error "beta: parameter out of bounds"
 otherwise = liftF (BetaF a b id)
dirac :: a > Program a b
dirac x = liftF (DiracF x)
 interpreter
rvar :: Program a a > RVar a
rvar = iterM $ \case
BernoulliF p f > RF.bernoulli p >>= f
GaussianF m s f > RF.normal m s >>= f
BetaF a b f > RF.beta a b >>= f
DiracF x > return x
 utilities
free :: Functor f => Fix f > Free f a
free = cata Free
affine :: Num a => a > a > a > a
affine translation scale = (+ translation) . (* scale)
Just as a quick review, we’ve got:
 A probabilistic instruction set defined by ‘ModelF’. Each constructor
represents a foundational probability distribution that we can use in our
embedded programs.
 Three types corresponding to probabilistic programs. The ‘Program’ type
simply wraps our instruction set up in a naïve free monad. The ‘Model’
type denotes probabilistic programs that may not necessarily
terminate (in some weak sense), while the ‘Terminating’ type denotes
probabilistic programs that terminate (ditto).
 A bunch of embedded language terms. These are just probability
distributions; here we’ll manage with the Bernouli, Gaussian, and beta
distributions. We also have a ‘dirac’ term for constructing a Dirac
distribution at a point.
 A single interpeter ‘rvar’ that interprets a probabilistic program into a
random variable (where the ‘RVar’ type is provided by randomfu).
Typically I use mwcprobability for this but randomfu is quite
nice. When a program has been interpreted into a random variable we can use
‘sample’ to sample from it.
So: we can write simple probabilistic programs in standard monadic fashion,
like so:
betaBernoulli :: Double > Double > Model Bool
betaBernoulli a b = do
p < beta a b
bernoulli p
and then interpret them as needed:
> replicateM 10 (sample (rvar (betaBernoulli 1 8)))
[False,False,False,False,False,False,False,True,True,False]
The Geometric Distribution
The geometric distribution is not a stochastic process per se, but it
can be represented by one. If we repeatedly flip a coin and then count the
number of flips until the first head, and then consider the probability
distribution over that count, voilà. That’s the geometric distribution. You
might see a head right away, or you might be infinitely unlucky and never see
a head. So the distribution is supported over the entirety of the natural
numbers.
For illustration, we can encode the coin flipping process in a straightforward
recursive manner:
simpleGeometric :: Double > Terminating Int
simpleGeometric p = loop 1 where
loop n = do
accept < bernoulli p
if accept
then dirac n
else loop (n + 1)
We start flipping Bernoullidistributed coins, and if we observe a head we stop
and return the number of coins flipped thus far. Otherwise we keep flipping.
The underlying probabilistic phenomena here are the Bernoulli draw, which
determines if we’ll terminate, and the dependent Dirac return, which will wrap
a terminating value in a point mass. The recursive procedure itself has the
pattern of:
 If some condition is met, abort the recursion and return a value.
 Otherwise, keep recursing.
This pattern describes an apomorphism, and the recursionschemes type
signature of ‘apo’ is:
apo :: Corecursive t => (a > Base t (Either t a)) > a > t
It takes a coalgebra that returns an ‘Either’ value wrapped up in a base
functor, and uses that coalgebra to drive the recursion. A ‘Left’returned
value halts the recursion, while a ‘Right’returned value keeps it going.
Don’t be put off by the type of the coalgebra if you’re unfamiliar with
apomorphisms  its bark is worse than its bite. Check out my older post on
apomorphisms for a brief introduction to them.
With reference to the ‘apo’ type signature, The main thing to choose here is
the recursive type that we’ll use to wrap up the ‘ModelF’ base functor.
‘Fix’ might be conceivably simpler to start, so I’ll begin with that. The
coalgebra defining the model looks like this:
geoCoalg p n = BernoulliF p (\accept >
if accept
then Left (Fix (DiracF n))
else Right (n + 1))
Then given the coalgebra, we can just wrap it up in ‘apo’ to represent the
geometric distribution.
geometric :: Double > Terminating Int
geometric p = free (apo (geoCoalg p) 1)
Since the geometric distribution (weakly) terminates, the program has return
type ‘Terminating Int’.
Since we’ve encoded the coalgebra using ‘Fix’, we have to explicitly convert
to ‘Free’ via the ‘free’ utility function I defined in the preamble. Recent
versions of recursionschemes have added a ‘Corecursive’ instance for ‘Free’,
though, so the superior alternative is to just use that:
geometric :: Double > Terminating Int
geometric p = apo coalg 1 where
coalg n = TF.Free (BernoulliF p (\accept >
if accept
then Left (dirac n)
else Right (n + 1)))
The point of all this is that we can isolate the core probabilistic phenomena
of the recursive process by factoring it out into a coalgebra. The recursion
itself takes the form of an apomorphism, which knows nothing about probability
or flipping coins or what have you  it just knows how to recurse, or stop.
For illustration, here’s a histogram of samples drawn from the geometric via:
> replicateM 100 (sample (rvar (geometric 0.2)))
An Autoregressive Process
Autoregressive (AR) processes simply use a previous epoch’s output as the
current epoch’s input; the number of previous epochs used as input on any given
epoch is called the order of the process. An AR(1) process looks like this,
for example:
Here are independent and identicallydistributed random
variables that follow some error distribution. In other words, in this model
the value follows some probability distribution
given the last epoch’s output and some parameters and
.
An autoregressive process doesn’t have any notion of termination built into it,
so the purest way to represent one is via an anamorphism. We’ll focus on AR(1)
processes in this example:
ar1 :: Double > Double > Double > Double > Model Double
ar1 a b s = ana coalg where
coalg x = TF.Free (GaussianF (affine a b x) s (affine a b))
Each epoch is just a Gaussiandistributed affine transformation of the previous
epochs’s output. But the problem with using an anamorphism here is that it will
just shoot off to infinity, recursing endlessly. This doesn’t do us a ton of
good if we want to actually observe the process, so if we want to do that
we’ll need to bake in our own conditions for termination. Again we’ll rely on
an apomorphism for this; we can just specify how many periods we want to
observe the process for, and stop recursing as soon as we exceed that.
There are two ways to do this. We can either get a view of the process at
periods in the future, or we can get a view of the process over
periods in the future. I’ll write both, for illustration. The coalgebra for
the first is simpler, and looks like:
arCoalg (n, x) = TF.Free (GaussianF (affine a b x) s (\y >
if n <= 0
then Left (dirac x)
else Right (pred m, y)))
The coalgebra is saying:
 Given , let have a Gaussian distribution with mean and standard deviation .
 If we’re on the last epoch, return as a Dirac point mass.
 Otherwise, continue recursing with as input to the next epoch.
Now, to observe the process over the next periods we can just collect
the observations we’ve seen so far in a list. An implementation of the
process, apomorphism and all, looks like this:
ar :: Int > Double > Double > Double > Double > Terminating [Double]
ar n a b s origin = apo coalg (n, [origin]) where
coalg (epochs, history@(x:_)) =
TF.Free (GaussianF (affine a b x) s (\y >
if epochs <= 0
then Left (dirac (reverse history))
else Right (pred epochs, y:history)))
(Note that I’m deliberately not handling the error condition here so as to
focus on the essence of the coalgebra.)
We can generate some traces for it in the standard way. Here’s how we’d sample
a 100long trace from an AR(1) process originating at 0 with ,
, and :
> sample (rvar (ar 100 0 1 1 0))
and here’s a visualization of 10 of those traces:
The StickBreaking Process
The stick breaking process is one of any number of whimsical stochastic
processes used as prior distributions in nonparametric Bayesian models. The
idea here is that we want to take a stick and endlessly break it into smaller
and smaller pieces. Every time we break a stick, we recursively take the rest
of the stick and break it again, ad infinitum.
Again, if we wanted to represent this endless process very faithfully, we’d use
an anamorphism to drive it. But in practice we’re going to only want to break
a stick some finite number of times, so we’ll follow the same pattern as the AR
process and use an apomorphism to do that:
sbp :: Int > Double > Terminating [Double]
sbp n a = apo coalg (n, 1, []) where
coalg (epochs, stick, sticks) = TF.Free (BetaF 1 a (\p >
if epochs <= 0
then Left (dirac (reverse (stick : sticks)))
else Right (pred epochs, (1  p) * stick, (p * stick):sticks)))
The coalgebra that defines the process says the following:
 Let the location of the break on the next (normalized) stick be
betadistributed.
 If we’re on the last epoch, return all the pieces of the stick that we broke
as a Dirac point mass.
 Otherwise, break the stick again and recurse.
Here’s a plot of five separate draws from a stick breaking process with
, each one observed for five breaks. Note that each draw
encodes a categorical distribution over the set ; the stick
breaking process is a ‘distribution over distributions’ in that sense:
The stick breaking process is useful for developing mixture models with an
unknown number of components, for example. The parameter can be
tweaked to concentrate or disperse probability mass as needed.
Conclusion
This seems like enough for now. I’d be interested in exploring other models
generated by recursive processes just to see how they can be encoded, exactly.
Basically all of Bayesian nonparametrics is based on using recursive processses
as prior distributions, so the Dirichlet process, Chinese Restaurant Process,
Indian Buffet Process, etc. should work beautifully in this setting.
Fun fact: back in 2011 before neural networks deep learning had taken over
machine learning, Bayesian nonparametrics was probably the hottest research
area in town. I used to joke that I’d create a new prior called the Malaysian
Takeaway Process for some esoteric nonparametric model and thus achieve machine
learning fame, but never did get around to that.
Addendum
I got a question about how I produce these plots. And the answer is the only
sane way when it comes to visualization in Haskell: dump the output to disk and
plot it with something else. I use R for most of my interactive/exploratory
data sciencefiddling, as well as for visualization. Python with matplotlib is
obviously a good choice too.
Here’s how I made the autoregressive process plot, for example. First, I just
produced the actual samples in GHCi:
> samples < replicateM 10 (sample (rvar (ar 100 0 1 1 0)))
Then I wrote them to disk:
> let render handle = hPutStrLn handle . filter (`notElem` "[]") . show
> withFile "trace.dat" WriteMode (\handle > mapM_ (render handle) samples)
The following R script will then get you the plot:
require(ggplot2)
require(reshape2)
raw = read.csv('trace.dat', header = F)
d = data.frame(t(raw), x = seq_along(raw))
m = melt(d, id.vars = 'x')
ggplot(m, aes(x, value, colour = variable)) + geom_line()
I used ggplot2 for the other plots as well; check out the ggplot2
functions geom_histogram
, geom_bar
, and facet_grid
in particular.
26 Feb 2017
In my last two posts about the Giry monad I derived the thing
from its categorical and measuretheoretic foundations. I kind of thought that
those posts wouldn’t be of much interest to people but they turned out to be a
hit. I clearly can’t tell what the internet likes.
Anyway, something I left out was the theoretical foundation of the Giry monad’s
Applicative instance, which seemed a bit odd. I also pointed out that
applicativeness in the context of probability implies independence between
probability measures.
In this article I’m going to look at each of these issues. After playing
around with the foundations, it looks like the applicative instance for the
Giry monad can be put on a sound footing in terms of the standard
measuretheoretic concept of product measure. Also, it turns out that the
claim I made of applicativeness independence is somewhat
illposed. But, using the shiny new intuition we’ll glean from a better
understanding of the applicative structure, we can put that on a solid footing
too.
So let’s look at both of these things and see what’s up.
Monoidal Functors
The foundational categorical concept behind applicative functors is the
monoidal functor, which is a functor between monoidal categories that
preserves monoidal structure.
Formally: for monoidal categories and ,
a monoidal functor is a functor and associated natural
transformations and that satisfy some coherence conditions that I won’t mention here.
Notably, if and are isomorphisms (i.e. are invertible) then
is called a strong monoidal functor. Otherwise it’s called lax.
Applicative functors in particular are lax monoidal functors.
This can be made much clearer for endofunctors on a monoidal category . Then you only have and to worry about. If we sub in the Giry monad
from the last couple of posts, we’d want and .
Does the category of measurable spaces have a monoidal
structure? Yup. Take measurable spaces and . From the Giry monad derivation we already have that the
monoidal identity corresponds to a Dirac measure
at a point, so that’s well and good. And we can define the tensor product
between and as follows: let be the
standard Cartesian product on and and let be the smallest algebra generated by the Cartesian
product of measurable sets and . Then is a
measurable space, and so is monoidal.
Recall that and  the space of measures
over and respectively  are themselves objects in
. So, clearly is a
measurable space, and if is monoidal then there must exist a
natural transformation that can take us from there to . This is the space of measures over the product .
So the question is: does have the required monoidal structure?
Yes. It must, since is a monad, and any monad can generate the
required natural transformation. Let be the monadic ‘join’ operator
and be the monadic identity
. We have, evaluating righttoleft:
Using makes this much easier to read:
or in code, just:
phi :: Monad m => (m a, m b) > m (a, b)
phi (m, n) = liftM2 (,) m n
So with that we have that is a (lax) monoidal
functor. And you can glean a monadgenerated applicative operator from
that immediately (this leads to the function called ‘ap’ in Control.Monad
):
ap :: Monad m => m (a > b) > m a > m b
ap f x = fmap (\(g, z) > g z) (phi f x)
(Note: I won’t refer to as the join operator from this point out in
order to free it up for denoting measures.)
Probabilistic Interpretations
Product Measure
The correct probabilistic interpretation here is that takes a pair of
probability measures to the product measure over the appropriate product
space. For probability measures and on measurable spaces
and respectively, the product measure is the (unique) measure on such that:
for a measurable set in .
Going through the monoidal functor route seems to put the notion of the Giry
applicative instance on a more firm measuretheoretic foundation. Instead of
considering the following from the Giry monad foundations article:
which is defined in terms of the dubious space of measures over measurable
functions , we can better view things using the monoidal
structurepreserving natural transformation . For measures and
on and respectively, we have:
and then for we can use the functor structure of
to do:
which corresponds to a standard applicative expression g <$> mu <*> nu
. I
suspect there’s then some sort of Yoneda argument or something that makes
currying and partial function application acceptable.
Independence
Now. What does this have to say about independence?
In particular, it’s too fast and loose to claim measures can be ‘independent’
at all. Independence is a property of measurable sets, measurable functions,
and algebras. Not of measures! But there is a really useful
connection, so let’s illustrate that.
First, let’s define independence formally as follows. Take a probability space
. Then any measurable sets and
in are independent if
That’s the simplest notion.
Next, consider two subalgebras and
of (a subalgebra is just a a subset of a
algebra that itself happens to be a algebra). Then
and are independent if, for any in
and any in , we have that and
are independent.
The final example is independence of measurable functions. Take measurable
functions and both from to the real numbers equipped with some
appropriate algebra . Then each of these functions
generates a sub algebra of as follows:
Then and are independent if the generated algebras
and are independent.
Note that in every case independence is defined in terms of a single
measure, . We can’t talk about different measures being
independent. To paraphrase Terry Tao here:
The notion of independence between [measurable functions] does not make sense
if the [measurable functions] are being modeled by separate probability
spaces; they have to be coupled together into a single probability space
before independence becomes a meaningful notion.
To be pedantic and explicitly specify the measure by which some things are
independent, some authors state that measurable functions and are
independent, for example.
We can see a connection to independence when we look at convolution and
associated operators. Recall that for measures and on the same
measurable space that supports some notion of
addition, their convolution looks like:
The probabilistic interpretation here (see Terry Tao on this too) is
that is the measure corresponding to the sum of independent
measurable functions and with corresponding measures and
respectively.
That looks weird though, since we clearly defined independence between
measurable functions using a single probability measure. How is it we can talk
about independent measurable functions and having different
corresponding measures?
We first need to couple everything together into a single probability space as
per Terry’s quote. Complete with some abstract probability measure
to form the probability space .
Now we have and measurable functions from to .
To say that and are independent is to say that their generated
algebras are independent. And the measures that they
correspond to are the pushforwards of under and
respectively. So, and . The result is that the measurable functions correspond to
different (pushforward) measures and , but are independent with
respect to the same underlying probability measure .
The monoidal structure of then gets us to convolution. Given a
product of measures and each on we can
immediately retrieve their product measure via
. And from there we can get to via the functor structure
of  we just find the pushforward of with
respect to a function that collapses a product via addition. So
is defined as:
and then the convolution is thus:
Other operations can be defined similarly, e.g. for we get:
The crux of all this is whenever we apply a measurable function to a product
measure, we can always extract notions of independent measurable functions
from the result. And the measures corresponding to those independent
measurable functions will be the components of the product measure
respectively.
This is super useful and lets one claim something stronger than what the
monadic structure gives you. In an expression like g <$> mu <*> nu <*> rho
,
you are guaranteed that the corresponding random variables ,
, (for suitable projections) are independent. The same
cannot be said if you use the monadic structure to do something like g mu nu
rho
where the product structure is not enforced  in that case you’re not
guaranteed anything of the sort. This is why the applicative structure is
useful for encoding independence in a way that the monadic structure is
not.
Conclusion
So there you have it. Applicativeness can seemingly be put on a
straightforward measuretheoretic grounding and has some useful implications
for independence.
It’s worth noting that, in the case of the Giry monad, we don’t need to go
through its monadic structure in order to recover an applicative instance. We
can do so entirely by hacking together continuations without using a single
monadic bind. This is actually how I defined the applicative instance in the
Giry monad implementation article previously:
instance Applicative Measure where
pure x = Measure (\f > f x)
Measure g <*> Measure h = Measure $ \f >
g (\k > h (f . k))
Teasing out the exact structure of this and its relation to the codensity monad
is again something I’ll leave to others.