26 Oct 2016
Some time ago I came across a way to perform inference on certain
probabilistic programs, at least in principle, using comonadic structures and operations.
I decided to dig it up and try to use it to extend the simple probabilistic
programming language I talked about a few days ago with a stateful,
experimental inference backend. In this post we’ll
- Represent probabilistic programs as recursive types parameterized by
a terminating instruction set.
- Represent execution traces of probabilistic programs via a simple
transformation of our program representation.
- Implement the Metropolis-Hastings algorithm over this space of execution
traces and thus do some inference.
Let’s get started!
Representing Programs That Terminate
I like thinking of embedded languages in terms of instruction sets. That is:
I want to be able to construct my embedded language by first defining a
collection of abstract instructions and then using some appropriate recursive
structure to represent programs over that set.
In the case of probabilistic programs, our instructions are probability
distributions. Last time we used the following simple instruction set to
define our embedded language:
data ModelF r =
    BernoulliF Double (Bool -> r)
  | BetaF Double Double (Double -> r)
  deriving Functor
We then created an embedded language by just wrapping it up in the
higher-kinded Free
type to denote programs of type Model
.
data Free f a =
    Pure a
  | Free (f (Free f a))

type Model = Free ModelF
Recall that Free
represents programs that can terminate, either by some
instruction in the underlying instruction set, or via the Pure
constructor of
the Free
type itself. The language defined by Free ModelF
is expressive
enough to easily construct a ‘forward-sampling’ interpreter, as well as a
simple rejection sampler for performing inference.
Notice that we don’t have a terminating instruction in ModelF
itself - if
we’re using it, then we need to rely on the Pure
constructor of Free
to
terminate programs. Otherwise they’d just have to recurse forever. This can
be a bit limiting if we want to transform a program of type Free ModelF
to
something else that doesn’t have a notion of termination baked-in (Fix
, for
example).
Let’s tweak the ModelF
type to get the following:
data ModelF a r =
    BernoulliF Double (Bool -> r)
  | BetaF Double Double (Double -> r)
  | NormalF Double Double (Double -> r)
  | DiracF a
  deriving Functor
Aside from adding another foundational distribution - NormalF
- we’ve also
added a new constructor, DiracF
, which carries a parameter with type a
. We
need to incorporate this carrier type in the overall type of ModelF
as well,
so ModelF
itself also gets a new type parameter to carry around.
The DiracF
instruction is a terminating instruction; it has no recursive
point and just terminates with a value of type a
when reached. It’s
structurally equivalent to the Pure a
branch of Free
that we were relying
on to terminate our programs previously - the only thing we’ve done is add it
to our instruction set proper.
Why DiracF
? A Dirac distribution places the entirety of its
probability mass on a single point, and this is the exact probabilistic
interpretation of the applicative pure
or monadic return
that one
encounters with an appropriate probability type. Intuitively, if I sample a
value \(x\) from a uniform distribution, then that is indistinguishable from
sampling \(x\) from said uniform distribution and then sampling from a Dirac
distribution with parameter \(x\).
Make sense? If not, it might be helpful to note that there is no difference
between any of the following (to which uniform
and dirac
are analogous):
> action :: m a
> action >>= return :: m a
> action >>= return >>= return >>= return :: m a
Wrapping ModelF a
up in Free
, we get the following general type for our
programs:
type Program a = Free (ModelF a)
And we can construct a bunch of embedded language terms in the standard way:
beta :: Double -> Double -> Program a Double
beta a b = liftF (BetaF a b id)
bernoulli :: Double -> Program a Bool
bernoulli p = liftF (BernoulliF p id)
normal :: Double -> Double -> Program a Double
normal m s = liftF (NormalF m s id)
dirac :: a -> Program a b
dirac x = liftF (DiracF x)
Program
is a general type, capturing both terminating and nonterminating
programs via its type parameters. What do I mean by this? Note that in
Program a b
, the a
type parameter can only be concretely instantiated via
use of the terminating dirac
term. On the other hand, the b
type parameter
is unaffected by the dirac
term; it can only be instantiated by the other
nonterminating terms: beta
, bernoulli
, normal
, or compound expressions of
these.
We can thus distinguish between terminating and nonterminating programs at the
type level, like so:
type Terminating a = Program a Void
type Model b = forall a. Program a b
Void
is the uninhabited type, brought into scope via Data.Void
or simply
defined via data Void = Void Void
. Any program that ends via a dirac
instruction must be Terminating
, and any program that doesn’t end with a
dirac
instruction can not be Terminating
. We’ll just continue to call
a nonterminating program a Model
, as before.
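For example, here’s a small terminating program built from the terms above, capping a beta model off with the dirac instruction (a quick sketch; the terminated name is just for illustration):

-- a Model made Terminating by ending it with the dirac instruction
terminated :: Terminating Double
terminated = beta 2 3 >>= dirac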
Good. So if it’s not clear: from a user’s perspective, nothing has changed.
We still write probabilistic programs using simple monadic language terms.
Here’s a Gaussian mixture model where the mixing parameter follows a beta
distribution, for example:
mixture :: Double -> Double -> Model Double
mixture a b = do
  prob   <- beta a b
  accept <- bernoulli prob
  if accept
    then normal (negate 2) 0.5
    else normal 2 0.5
Meanwhile the syntax tree generated looks something like the following. It’s
more or less a traditional probabilistic graphical model description of our
program:

It’s important to note that in this embedded framework, the only pieces of the
syntax tree that we can observe are those related directly to our primitive
instructions. For our purposes this is excellent - we can focus on programs
entirely at the level of their probabilistic components, and ignore the
deterministic parts that would otherwise be distractions.
To collect samples from mixture
, we can first interpret it into a sampling
function, and then simulate from it. The toSampler
function from last
time doesn’t change much:
toSampler :: Program a a -> Prob IO a
toSampler = iterM $ \case
  BernoulliF p f -> Prob.bernoulli p >>= f
  BetaF a b f    -> Prob.beta a b >>= f
  NormalF m s f  -> Prob.normal m s >>= f
  DiracF x       -> return x
Sampling from mixture 2 3 a thousand times yields the following:
> simulate (toSampler (mixture 2 3))

Note that the rightmost component gets more traffic due to the hyperparameter
combination of 2 and 3 that we provided to mixture
.
Also, a note - since we have general recursion in Haskell, so-called
‘terminating’ programs here can actually.. uh, fail to terminate. They must
only terminate as far as we can express the sentiment at the embedded language
level. Consider the following, for example:
foo :: Terminating a
foo = (loop 1) >>= dirac where
  loop a = do
    p <- beta a 1
    loop p
foo
here doesn’t actually terminate. But at least this kind of weird case
can be picked up in the types:
> :t simulate (toSampler foo)
simulate (toSampler foo) :: IO Void
If you try to sample from a distribution over Void
or forall a. a
then I
can’t be held responsible for what you get up to. But there are other cases,
sadly, where we’re also out of luck:
trollGeometric :: Double -> Model Int
trollGeometric p = loop where
  loop = do
    accept <- return False
    if accept
      then return 1
      else fmap succ loop
A geometric distribution that actually used its argument \(p\), for \(0 < p
\leq 1\), could be guaranteed to terminate with probability 1. This one
doesn’t, so trollGeometric undefined >>= dirac
won’t.
At the end of the day we’re stuck with what our host language offers us. So,
take the termination guarantees for our embedded language with a grain of salt.
Stateful Inference
In the previous post we used a simple rejection sampler to sample from
a conditional distribution. ‘Vanilla’ Monte Carlo algorithms like rejection
and importance sampling are stateless. This makes them nice in some ways -
they tend to be simple to implement and are embarrassingly parallel, for
example. But the curse of dimensionality prevents them from scaling
well to larger problems. I won’t go into detail on that here - for a deep dive
on the topic, you probably won’t find anything better than this phenomenal
couple of talks on MCMC that Iain Murray gave at an MLSS session in
Cambridge in 2009. I think they’re unparalleled to this day.
The point is that in higher dimensions we tend to get a lot out of state.
Essentially, if one finds an interesting region of high-dimensional parameter
space, then it’s better to remember where that is, rather than forgetting it
exists as soon as one stumbles onto it. The manifold hypothesis conjectures
that interesting regions of space tend to be near other interesting regions
of space, so exploring neighbourhoods of interesting places tends to pay off.
Stateful Monte Carlo methods - namely, the family of Markov chain Monte Carlo
algorithms - handle exactly this, by using a Markov chain to wander over
parameter space. I’ve written on MCMC in the past -
you can check out some of those articles if you’re interested.
In the stateless rejection sampler we just performed conditional inference via
the following algorithm:
- Sample from a parameter model.
- Sample from a data model, using the sample from the parameter model as
input.
- If the sample from the data model matches the provided observations, return
the sample from the parameter model.
By repeating this many times we get a sample of arbitrary size from the
appropriate conditional, inverse, or posterior distribution (whatever you want
to call it).
In a stateful inference routine - here, the good old Metropolis-Hastings
algorithm - we’re instead going to do the following repeatedly:
- Sample from a parameter model, recording the way the program executed in
order to return the sample that it did.
- Compute the cost, in some sense, of generating the provided observations,
using the sample from the parameter model as input.
- Propose a new sample from the parameter model by perturbing the way the
program executed and recording the new sample the program outputs.
- Compute the cost of generating the provided observations using this new
sample from the parameter model as input.
- Compare the costs of generating the provided observations under the
respective samples from the parameter models.
- With probability depending on the ratio of the costs, flip a coin. If you
see a head, then move to the new, proposed execution trace of the program.
Otherwise, stay at the old execution trace.
This procedure generates a Markov chain over the space of possible execution
traces of the program - essentially, plausible ways that the program could have
executed in order to generate the supplied observations.
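For concreteness, here’s a hedged sketch of the ‘flip a coin with probability depending on the ratio of the costs’ step, assuming costs are measured as log-likelihoods (so a cheaper trace has a larger cost value) and ignoring any correction for asymmetric proposals. The acceptProbability name is mine; the actual calculation is elided in the code later on:

-- sketch: probability of moving to the proposed trace, on the log scale
acceptProbability :: Double -> Double -> Double
acceptProbability currentCost proposalCost =
  min 1 (exp (proposalCost - currentCost))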
Implementations of Church use variations of this method to do
inference, the most famous of which is a low-overhead transformational
compilation procedure described in a great and influential 2011 paper
by David Wingate et al.
Representing Running Programs
To perform inference on probabilistic programs according to the aforementioned
Metropolis-Hastings algorithm, we need to represent executing programs
somehow, in a form that enables us to examine and modify their internal state.
How can we do that? We’ll pluck another useful recursive structure from our
repertoire and consider the humble Cofree
:
data Cofree f a = a :< f (Cofree f a)
Recall that Cofree
allows one to annotate programs with arbitrary
information at each internal node. This is a great feature; if we can annotate
each internal node with important information about its state - its current
value, the current state of its generator, the ‘cost’ associated with it - then
we can walk through the program and examine it as required. So, it can capture
a ‘running’ program in exactly the way we need.
Let’s describe running programs as values having the following Execution
type:
type Execution a = Cofree (ModelF a) Node
The Node
type is what we’ll use to describe the internal state of each node
on the program. I’ll define it like so:
data Node = Node {
    nodeCost    :: Double
  , nodeValue   :: Dynamic
  , nodeSeed    :: MWC.Seed
  , nodeHistory :: [Dynamic]
  } deriving Show
I’ll elaborate on this type below, but you can see that it captures a bunch of
information about the state of each node.
One can mechanically transform any Free
-encoded program into a
Cofree
-encoded program, so long as the original Free
-encoded program can
terminate of its own accord, i.e. on the level of its own instructions. Hence
the need for our Terminating
type and all that.
In our case, setting everything up just right takes a bit of code, mainly
around handling pseudo-random number generators in a pure fashion. So
I won’t talk about every little detail of it right here. The general idea is
to write a function that takes instructions to the appropriate state captured
by a Node
value, like so:
initialize :: Typeable a => MWC.Seed -> ModelF a b -> Node
initialize seed = \case
  BernoulliF p _ -> runST $ do
    (nodeValue, nodeSeed) <- samplePurely (Prob.bernoulli p) seed
    let nodeCost    = logDensityBernoulli p (unsafeFromDyn nodeValue)
        nodeHistory = mempty
    return Node {..}

  BetaF a b _ -> runST $ do
    (nodeValue, nodeSeed) <- samplePurely (Prob.beta a b) seed
    let nodeCost    = logDensityBeta a b (unsafeFromDyn nodeValue)
        nodeHistory = mempty
    return Node {..}
...
You can see that for each node, I sample from it, calculate its cost, and then
initialize its ‘history’ as an empty list.
Here it’s worth going into a brief aside.
There are two mildly annoying things we have to deal with in this situation.
First, individual nodes in the program typically sample values at different
types, and second, we can’t easily use effects when annotating. This means
that we have to pack heterogeneously-typed things into a homogeneously-typed
container, and also use pure random number generation facilities to sample
them.
A quick-and-dirty answer for the first case is to just use dynamic typing when
storing the values. It works and is easy, but of course is subject to the
standard caveats. I use a function called unsafeFromDyn
to convert
dynamically-typed values back to a typed form, so you can gauge the safety of
all this for yourself.
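For reference, unsafeFromDyn could plausibly be defined in terms of Data.Dynamic’s fromDynamic, something like the following (an assumption on my part - the post doesn’t show the actual definition):

import Data.Dynamic  (Dynamic, fromDynamic)
import Data.Typeable (Typeable)

-- hypothetical definition: recover a typed value, erroring on a type mismatch
unsafeFromDyn :: Typeable a => Dynamic -> a
unsafeFromDyn dyn = case fromDynamic dyn of
  Just value -> value
  Nothing    -> error "unsafeFromDyn: type mismatch"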
For the second case, I just use the ST
monad, along with manual state
snapshotting, to execute and iterate a random number generator. Pretty
simple.
Also: in terms of efficiency, keeping a node’s history on-site at each
execution falls into the ‘completely insane’ category, but let’s not worry much
about efficiency right now. Prototypes gonna prototype and all that.
Anyway.
Given this initialize
function, we can transform a terminating program into a
running program by simple recursion. Again, we can only transform programs
with type Terminating a
because we need to rule out the case of ever visiting
the Pure
constructor of Free
. We handle that by the absurd
function
provided by Data.Void
:
execute :: Typeable a => Terminating a -> Execution a
execute = annotate defaultSeed where
  defaultSeed = (42, 108512)

  annotate seeds term = case term of
    Pure r -> absurd r
    Free instruction ->
      let (nextSeeds, generator) = xorshift seeds
          seed = MWC.toSeed (V.singleton generator)
          node = initialize seed instruction
      in  node :< fmap (annotate nextSeeds) instruction
And there you have it - execute
takes a terminating program as input and
returns a running program - an execution trace - as output. The syntax tree we
had previously gets turned into something like this:

Perturbing Running Programs
Given an execution trace, we’re able to step through it sequentially and
investigate the program’s internal state. But to do inference we need to be
able to modify it as well. What’s the answer here?
Just as Free
has a monadic structure that allows us to write embedded
programs using built-in monadic combinators and do-notation, Cofree
has a
comonadic structure that is amenable to use with the various comonadic
combinators found in Control.Comonad
. The most important for our purposes is
the comonadic ‘extend’ operation that’s dual to monad’s ‘bind’:
extend :: Comonad w => (w a -> b) -> w a -> w b
extend f = fmap f . duplicate
To perturb a running program, we can thus write a function that perturbs any
given annotated node, and then extend
it over the entire execution trace.
The perturbNode
function can be similar to the initialize
function from
earlier; it describes how to perturb every node based on the instruction found
there:
perturbNode :: Execution a -> Node
perturbNode (node@Node {..} :< cons) = case cons of
  BernoulliF p _ -> runST $ do
    (nvalue, nseed) <- samplePurely (Prob.bernoulli p) nodeSeed
    let nscore = logDensityBernoulli p (unsafeFromDyn nvalue)
    return $! Node nscore nvalue nseed nodeHistory

  BetaF a b _ -> runST $ do
    (nvalue, nseed) <- samplePurely (Prob.beta a b) nodeSeed
    let nscore = logDensityBeta a b (unsafeFromDyn nvalue)
    return $! Node nscore nvalue nseed nodeHistory
...
Note that this is a very crude way to perturb nodes - we’re just sampling from
whatever distribution we find at each one. A more refined procedure would
sample from each node on a more local basis, sampling from its respective
domain in a neighbourhood of its current location. For example, to perturb a
BetaF
node we might sample from a tiny Gaussian bubble around its current
location, repeating the process if we happen to ‘fall off’ the support. I’ll
leave matters like that for another post.
Perturbing an entire trace is then as easy as I claimed it to be:
perturb :: Execution a -> Execution a
perturb = extend perturbNode
For some comonadic intuition: when we ‘extend’ a function over an execution,
the trace itself gets ‘duplicated’ in a comonadic context. Each node in the
program becomes annotated with a view of the rest of the execution trace from
that point forward. It can be difficult to visualize at first, but I reckon
the following image is pretty faithful:

Each annotation then has perturbNode
applied to it, which reduces the trace
back to the standard annotated version we saw before.
Iterating the Markov Chain
So: to move around in parameter space, we’ll propose state changes by
perturbing the current state, and then accept or reject proposals according to
local economic conditions.
If you already have no idea what I’m talking about, then the phrase ‘local
economic conditions’ probably didn’t help you much. But it’s a useful analogy
to have in one’s head. Each state in parameter space has a cost associated
with it - the cost of generating the observations that we’re conditioning on
while doing inference. If certain parameter values yield a data model that is
unlikely to generate the provided observations, then those observations will be
expensive to generate when measured in terms of log-likelihood. Parameter
values that yield data models more likely to generate the supplied observations
will be comparatively cheaper.
If a proposed execution trace is significantly cheaper than the trace we’re
currently at, then we usually want to move to it. We allow some randomness in
our decision to keep everything nice and measure-preserving.
We can thus construct the conditional distribution over execution traces using
the following invert
function, using the same nomenclature as the rejection
sampler we used previously. To focus on the main points, I’ll elide some of
its body:
invert
  :: (Eq a, Typeable a, Typeable b)
  => Int -> [a] -> Model b -> (b -> a -> Double)
  -> Model (Execution b)
invert epochs obs prior ll = loop epochs (execute terminated) where
  terminated = prior >>= dirac

  loop n current
    | n == 0    = return current
    | otherwise = do
        let proposal = perturb current
        -- calculate costs and movement probability here
        accept <- bernoulli prob
        let next = if accept then proposal else stepGenerators current
        loop (pred n) (snapshot next)
There are a few things to comment on here.
First, notice how the return type of invert
is Model (Execution b)
? Using
the semantics of our embedded language, it’s literally a standard model over
execution traces. The above function returns a first-class value that is
completely uninterpreted and abstract. Cool.
We’re also dealing with things a little differently from the rejection sampler
that we built previously. Here, the data model is expressed by a cost
function; that is, a function that takes a parameter value and observation as
input, and returns the cost of generating the observation (conditional on the
supplied parameter value) as output. This is the approach used in the
excellent Practical Probabilistic Programming with Monads paper by Adam
Scibior et al and also mentioned by Dan Roy in his recent talk at the
Simons Institute. Ideally we’d just reify the cost function here from the
description of a model directly (to keep the interface similar to the one used
in the rejection sampler implementation), but I haven’t yet found a way to do
this in a type-safe fashion.
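For what it’s worth, tallying the total cost of the provided observations from such a cost function might look something like the following (a hypothetical helper of my own, not part of the actual implementation):

-- hypothetical: total log-scale cost of the observations, given a parameter
-- value and a per-observation log-likelihood function
observationCost :: (b -> a -> Double) -> b -> [a] -> Double
observationCost ll parameter = sum . map (ll parameter)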
Regardless of whether or not we accept a proposed move, we need to snapshot the
current value of each node and add it to that node’s history. This can be done
using another comonadic extend:
snapshotValue :: Cofree f Node -> Node
snapshotValue (Node {..} :< cons) = Node { nodeHistory = history, .. } where
  history = nodeValue : nodeHistory

snapshot :: Functor f => Cofree f Node -> Cofree f Node
snapshot = extend snapshotValue
The other point of note is minor, but an extremely easy detail to overlook.
Since we’re handling random value generation at each node purely, using on-site
PRNGs, we need to iterate the generators forward a step in the event that we
don’t accept a proposal. Otherwise we’d propose a new execution based on the
same generator states that we’d used previously! For now I’ll just iterate the
generators by forcing a sample of a uniform variate at each node, and then
throwing away the result. To do this we can use the now-standard comonadic
pattern:
stepGenerator :: Cofree f Node -> Node
stepGenerator (Node {..} :< cons) = runST $ do
  (nval, nseed) <- samplePurely (Prob.beta 1 1) nodeSeed
  return Node { nodeSeed = nseed, .. }

stepGenerators :: Functor f => Cofree f Node -> Cofree f Node
stepGenerators = extend stepGenerator
Inspecting Execution Traces
Alright so let’s see how this all works. Let’s write a model, condition it
on some observations, and do inference.
We’ll choose our simple Gaussian mixture model from earlier, where the mixing
probability follows a beta distribution, and cluster assignment itself follows
a Bernoulli distribution. We thus choose the ‘leftmost’ component of the
mixture with the appropriate mixture probability.
We can break the mixture model up as follows:
prior :: Double -> Double -> Model Bool
prior a b = do
  p <- beta a b
  bernoulli p

likelihood :: Bool -> Model Double
likelihood left
  | left      = normal (negate 2) 0.5
  | otherwise = normal 2 0.5
Let’s take a look at some samples from the marginal distribution. This time
I’ll flip things and assign hyperparameters of 3 and 2 for the prior:
> simulate (toSampler (prior 3 2 >>= likelihood))

It looks like we’re slightly more likely to sample from the left mixture
component than the right one. Again, this makes sense - the mean of a beta(3,
2) distribution is 0.6.
Now, what about inference? I’ll define the conditional model as follows:
posterior :: Model (Execution Bool)
posterior = invert 1000 obs (prior 3 2) ll where
  obs = [ -1.7, -1.8, -2.01, -2.4
        , 1.9, 1.8
        ]

  ll left
    | left      = logDensityNormal (negate 2) 0.5
    | otherwise = logDensityNormal 2 0.5
Here we have four observations that presumably arise from the leftmost
component, and only two that match up with the rightmost. Note also that I’ve
replaced the likelihood
model with its appropriate cost function due to
reasons I mentioned in the last section. (It would be easy to reify this
particular model as its cost function, but doing it for general models is trickier.)
Anyway, let’s sample from the conditional distribution:
> simulate (toSampler posterior)
Sampling returns a running program, of course, and we can step through it to
examine its structure. We can use the supplied values recorded at each node
to ‘automatically’ step through execution, or we can supply our own values to
investigate arbitrary branches.
The conditional distribution we’ve found over the mixing probability is as
follows:

Looks like we’re in the right ballpark.
We can examine the traces of other elements of the program as well. Here’s the
recorded distribution over component assignments, for example - note that the
rightmost bar here corresponds to the leftmost component in the mixture:

You can see that whenever we wandered into the rightmost component, we’d
swiftly wind up jumping back out of it:

This is a fun take on probabilistic programming. In particular I find a few
aspects of the whole setup to be pretty attractive:
We use a primitive, limited instruction set to parameterize both programs - via
Free
- and running programs - via Cofree
. These off-the-shelf recursive
types are used to wrap things up and provide most of our required control flow
automatically. It’s easy to transparently add structure to embedded programs
built in this way; for example, we can statically encode independence
by replacing our ModelF a
type with something like:
data InstructionF a = Coproduct (ModelF a) (Ap (ModelF a))
This can be hidden from the user so that we’re left with the same simple
monadic syntax we presently enjoy, but we also get to take independence into
account when performing inference, or any other structural interpretation for
that matter.
When it comes to inference, the program representation is completely separate
from whatever inference backend we choose to augment it with. We can deal with
traces as first-class values that can be directly stored, inspected,
manipulated, and so on. And everything is done in a typed and
purely-functional framework. I’ve used dynamic typing functionality from
Data.Dynamic
to store values in execution traces here, but we could similarly
just define a concrete Value
type with the appropriate constructors for
integers, doubles, bools, etc., and use that to store everything.
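A minimal sketch of what such a type might look like (an assumption - it isn’t defined in the post):

data Value =
    VInt Int
  | VDouble Double
  | VBool Bool
  deriving (Eq, Show)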
At the same time, this is a pretty early concept - doing inference
efficiently in this setting is another matter, and there are a couple of
computational and statistical issues here that need to be ironed out to make
further progress.
The current way I’ve organized Markov chain generation and iteration is just
woefully inefficient. Storing the history of each node on-site is needlessly
costly and I’m sure results in a ton of unnecessary allocation. On a semantic
level, it also ‘complects’ state and identity: why, after all, should a single
execution trace know anything about traces that preceded it? Clearly this
should be accumulated in another data structure. There is a lot of other
low-hanging fruit around strictness and PRNG management as well.
From a more statistical angle, the present implementation does a poor job when
it comes to perturbing execution traces. Some changes - such as improving the
proposal mechanism for a given instruction - are easy to implement, and
representing distributions as instructions indeed makes it easy to tailor local
proposal distributions in a context-independent way. But another problem is
that, by using a ‘blunt’ comonadic extend
, we perturb an execution by
perturbing every node in it. In general it’s better to make small
perturbations rather than large ones to ensure a reasonable acceptance ratio,
but to do that we’d need to perturb single nodes (or at least subsets of nodes)
at a time.
There may be some inroads here via comonad transformers like StoreT
or
lenses that would allow us to zoom in on a particular node and perturb it,
rather than perturbing everything at once. But my comonad-fu is not yet quite
at the required level to evaluate this, so I’ll come back to that idea some
other time.
I’m interested in playing with this concept some more in the future, though I’m
not yet sure how much I expect it to be a tenable way to do inference at scale.
If you’re interested in playing with it, I’ve dumped the code from this post
into this gist.
Thanks to Niffe Hermansson and Fredrik Olsen for reviewing a draft of this
post and providing helpful comments.
17 Oct 2016
What does a dead-simple probabilistic programming language look like? The
simplest thing I can imagine involves three components:
- A representation for probabilistic models.
- A way to simulate from those models (‘forward’ sampling).
- A way to sample from a conditional model (‘backward’ sampling).
Rob Zinkov wrote an article on this type of thing around a year ago,
and Dan Roy recently gave a talk on the topic as well. In the spirit
of unabashed unoriginality, I’ll give a sort of composite example of the two.
Most of the material here comes directly from Dan’s talk; definitely check it
out if you’re curious about this whole probabilistic programming mumbojumbo.
Let’s whip together a highly-structured, typed, embedded probabilistic
programming language - the core of which will encompass a tiny amount of code.
Some preliminaries - note that you’ll need my simple little
mwc-probability library handy for when it comes time to do sampling:
{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE LambdaCase #-}
import Control.Monad
import Control.Monad.Free
import qualified System.Random.MWC.Probability as MWC
Representing Probabilistic Models
Step one is to represent the fundamental constructs found in probabilistic
programs. These are abstract probability distributions; I like to call them
models:
data ModelF r =
    BernoulliF Double (Bool -> r)
  | BetaF Double Double (Double -> r)
  deriving Functor

type Model = Free ModelF
Each foundational probability distribution we want to consider is represented
as a constructor of the ModelF
type. You can think of them as probabilistic
instructions, in a sense. A Model
itself is a program parameterized
by this probabilistic instruction set.
In a more sophisticated implementation you’d probably want to add more
primitives, but you can get pretty far with the beta and Bernoulli
distributions alone. Here are some embedded language terms, only two of which
correspond one-to-one with the constructors themselves:
bernoulli :: Double -> Model Bool
bernoulli p = liftF (BernoulliF p id)
beta :: Double -> Double -> Model Double
beta a b = liftF (BetaF a b id)
uniform :: Model Double
uniform = beta 1 1
binomial :: Int -> Double -> Model Int
binomial n p = fmap count coins where
  count = length . filter id
  coins = replicateM n (bernoulli p)

betaBinomial :: Int -> Double -> Double -> Model Int
betaBinomial n a b = do
  p <- beta a b
  binomial n p
You can build a lot of other useful distributions by just starting from the
beta and Bernoulli as well. And technically I guess the more foundational
distributions to use here would be the Dirichlet and
categorical, of which the beta and Bernoulli are special cases. But I
digress. The point is that other distributions are easy to construct from a
set of reliable primitives; you can check out the old lambda-naught
paper by Park et al for more examples.
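As another quick example in the same spirit (a sketch of my own, not from the original post): a geometric distribution built from repeated Bernoulli trials, which terminates with probability 1 for \(0 < p \leq 1\):

-- count the number of Bernoulli trials until the first success
geometric :: Double -> Model Int
geometric p = loop where
  loop = do
    accept <- bernoulli p
    if accept
      then return 1
      else fmap succ loop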
See how binomial
and betaBinomial
are defined? In the case of binomial
we’re using the property that models have a functorial structure by just
mapping a counting function over the result of a bunch of Bernoulli
random variables. For betaBinomial
we’re directly making use of our monadic
structure, first describing a weight parameter via a beta distribution and then
using it as an input to a binomial distribution.
Note in particular that we’ve expressed betaBinomial
by binding a parameter
model to a data model. This is a foundational pattern in Bayesian
statistics; in the more usual lingo, the parameter model corresponds to the
prior distribution, and the data model is the likelihood.
Forward-Mode Sampling
So we have our representation. Next up, we want to simulate from these
models. Thus far they’re purely abstract, and don’t encode any information
about probability or sampling or what have you. We have to ascribe that
ourselves.
mwc-probability defines a monadic sampling-based probability distribution
type called Prob
, and we can use a basic recursion scheme on free
monads to adapt our own model type to that:
toSampler :: Model a -> MWC.Prob IO a
toSampler = iterM $ \case
  BernoulliF p f -> MWC.bernoulli p >>= f
  BetaF a b f    -> MWC.beta a b >>= f
We can glue that around the relevant mwc-probability functionality to
simulate from models directly:
simulate :: Model a -> IO a
simulate model = MWC.withSystemRandom . MWC.asGenIO $
  MWC.sample (toSampler model)
And this can be used with standard monadic combinators like replicateM
to
collect larger samples:
> replicateM 10 $ simulate (betaBinomial 10 1 4)
[5,7,1,4,4,1,1,0,4,2]
Reverse-Mode Sampling
Now. Here we want to condition our model on some observations and then recover
the conditional distribution over its internal parameters.
This part - inference - is what makes probabilistic programming hard, and doing
it really well remains an unsolved problem. One of the neat theoretical
results in this space due to Ackerman, Freer, and Roy is that in the
general case the problem is actually unsolvable, in that one can encode as a
probabilistic program a conditional distribution that computes the halting
problem. Similarly, in general it’s impossible to do this sort of thing
efficiently even for computable conditional distributions. Consider the case
of a program that returns the hash of a random n-long binary string, and then
try to infer the distribution over strings given some hashes, for example.
This is never going to be a tractable problem.
For now let’s use a simple rejection sampler to encode a conditional
distribution. We’ll require some observations, a proposal distribution, and
the model that we want to invert:
invert :: (Monad m, Eq b) => m a -> (a -> m b) -> [b] -> m a
invert proposal model observed = loop where
  loop = do
    parameters <- proposal
    generated  <- replicateM (length observed) (model parameters)
    if generated == observed
      then return parameters
      else loop
Let’s use it to compute the posterior or inverse model of an (apparently)
biased coin, given a few observations. We’ll just use a uniform distribution
as our proposal:
posterior :: Model Double
posterior = invert uniform bernoulli [True, True, False, True]
Let’s grab some samples from the posterior distribution:
> replicateM 1000 (simulate posterior)

The central tendency of the posterior floats about 0.75, which is what we’d
expect, given our observations. This has been inferred from only four
points; let’s try adding a few more. But before we do that, note that the
present way the rejection sampling algorithm works is:
- Propose a parameter value according to the supplied proposal distribution.
- Generate a sample from the model, of equal size to the supplied observations.
- Compare the collected sample to the supplied observations. If they’re equal,
then return the proposed parameter value. Otherwise start over.
Rejection sampling isn’t exactly efficient in nontrivial settings anyway, but
it’s supremely inefficient for our present case. The random variables we’re
interested in are exchangeable, so what we’re concerned about is the
total number of True
or False
values observed - not any specific order they
appear in.
We can add an ‘assistance’ function to the rejection sampler to help us out in
this case:
invertWithAssistance
  :: (Monad m, Eq c) => ([a] -> c) -> m b -> (b -> m a) -> [a] -> m b
invertWithAssistance assister proposal model observed = loop where
  loop = do
    parameters <- proposal
    generated  <- replicateM (length observed) (model parameters)
    if assister generated == assister observed
      then return parameters
      else loop
The assister summarizes both our observations and collected sample to ensure
they’re efficiently comparable. In our situation, we can use a simple counting
function to tally up the number of True
values we observe:
count :: [Bool] -> Int
count = length . filter id
Now let’s create another posterior by conditioning on a few more observations:
posterior0 :: Model Double
posterior0 = invertWithAssistance count uniform bernoulli obs where
  obs =
    [True, True, True, False, True, True, False, True, True, True, True, False]
and collect another thousand samples from it. This would likely take an
annoying amount of time without the use of our count
function for assistance
above:
> replicateM 1000 (simulate posterior0)

Note that with more information to condition on, we get a more informative
posterior.
Conclusion
This is a really basic formulation - too basic to be useful in any meaningful
way - but it illustrates some of the most important concepts in probabilistic
programming. Representation, simulation, and inference.
I think it’s also particularly nice to do this in Haskell, rather than something
like Python (which Dan used in his talk) - it provides us with a lot of
extensible structure in a familiar framework for language hacking. It sort of
demands you’re a fan of all these higher-kinded types and structured recursions
and all that, but if you’re reading this blog then you’re probably in that camp
anyway.
I’ll probably write a few more little articles like this over time. There are
a ton of improvements that we can make to this basic setup - encoding
independence, sampling via MCMC, etc. - and it might be fun to
grow everything out piece by piece.
I’ve dropped the code from this post into this gist.
01 Oct 2016
Randomness is a constant nuisance point for Haskell beginners who may be coming
from a language like Python or R. While in Python you can just get away with
something like:
In [2]: numpy.random.rand(3)
Out[2]: array([ 0.61426175, 0.05309224, 0.38861597])
or in R:
> runif(3)
[1] 0.49473012 0.68436352 0.04135914
In Haskell, the situation is more complicated. It’s not too much worse when
you get the hang of things, but it’s certainly one of those things that throws
beginners for a loop - and for good reason.
In this article I want to provide a simple guide, with examples, for getting
started and becoming comfortable with randomness in Haskell. Hopefully it
helps!
I’m writing this from a hotel during my girlfriend’s birthday, so it’s being
slapped together very rapidly with a kind of get-it-done attitude. If anything
is unclear or you have any questions, feel free to shoot me a ping and I’ll try
to improve it when I get a chance.
Randomness on Computers in General
Check out the R code I posted previously. If you just open R and type
runif(3)
on your machine, then odds are you’ll get a different triple of
numbers than what I got above.
These numbers are being generated based on R’s global random number generator
(RNG), which, absent any fiddling by the user, is initialized as needed based
on the system time and ID of the R process. So: if you open up the R
interpreter and call runif(3)
, then behind the scenes R will initialize the
RNG based on the time and process ID, and then use a particular algorithm to
generate random numbers based on that initialized value (called the ‘seed’).
These numbers aren’t truly random - they’re pseudo-random, which means
they’re generated by a deterministic algorithm such that the resulting values
appear random over time. The default algorithm used by R, for example, is
the famous Mersenne Twister, which you can verify as follows:
> RNGkind()
[1] "Mersenne-Twister" "Inversion"
You can also set the seed yourself in R, using the set.seed
function. Then
if you type something like runif(3)
, R will use this initialized RNG rather
than coming up with its own seed based on the time and process ID. Setting
the seed allows you to reproduce operations involving pseudo-random numbers;
just re-set the seed and perform the same operations again:
> set.seed(42)
> runif(3)
[1] 0.9148060 0.9370754 0.2861395
> set.seed(42)
> runif(3)
[1] 0.9148060 0.9370754 0.2861395
(It’s good practice to always initialize the RNG using some known seed before
running an experiment, simulation, and so on.)
So the big thing to notice here, in any case, is that R uses a global RNG.
It maintains the state of this RNG implicitly and behind the scenes. When you
type runif(3)
, R consults this implicit RNG, gives you your pseudo-random
numbers based on its value, and updates the global RNG without you needing to
worry about any of this plumbing yourself. The same is generally true for
randomness in most programming languages - Python, C, Ruby, and so on.
Explicit RNG Management
But let’s come back to Haskell. Haskell, unlike R or Python, is
purely-functional. State, or effects in general, are never implicit in the
same way that R updates its global RNG. We need to either explicitly pass
around a RNG ourselves, or at least allow some explicit monad to do it for us.
Passing around a RNG manually is annoying, so in practice this means everyone
uses a monad to handle RNG state. This means that one needs to be
comfortable working with monadic code in order to practically use random
numbers in Haskell, which presents a big hurdle for beginners who may have
been able to ignore monads thus far on their Haskell journey.
Let’s see what I mean by all of this by going through a few examples. Make
sure you have stack installed, and then grab a few libraries that we’ll
make use of in the remainder of this post:
$ stack install random mwc-random primitive
The Really Annoying Method - Manual RNG Management
Let me demonstrate the simplest conceptual method for dealing with random
numbers: manually grabbing and passing around a RNG without involving any
monads whatsoever.
First, open up GHCi, and let’s get some quick preliminaries out of the way:
Prelude> :set prompt "> "
> import System.Random
> import Control.Monad
> let runif_pure = randomR (0 :: Double, 1)
> let runif n = replicateM n (randomRIO (0 :: Double, 1))
> let set_seed = setStdGen . mkStdGen
We’ll first use the basic System.Random
module for illustration. To
initialize a RNG, we can make one by providing the mkStdGen
function
with an integer seed:
> let rng = mkStdGen 42
> rng
43 1
We can use this thing to generate random numbers. A simple function to do that
is randomR
, which will generate pseudo-random values for some ordered
type in a given range. We’ll use the runif_pure
alias for it that we defined
previously, just to make things look similar to the previous R example and also
emphasize that this one is a pure function:
> runif_pure rng
(1.0663729393723398e-2,2060101257 2103410263)
You can see that we got back a pair of values, the first element of which is
our random number 1.0663729393723398e-2
. Cool. Let’s try to generate
another:
> runif_pure rng
(1.0663729393723398e-2,2060101257 2103410263)
Hmm. We generated the same number again. This is because the value of rng
hasn’t changed - it’s still the same value we made via mkStdGen 42
. Since
we’re using the same random number generator to generate a pseudo-random value,
we get the same pseudo-random value.
If we want to make new random numbers, then we need to use a different
generator. And the second element of the pair returned from our call to
runif_pure
is exactly that - an updated RNG that we can use to generate
additional random numbers.
Let’s try that all again, using the generator we get back from the first
function call as an input to the second:
> let (x, rng1) = runif_pure rng
> x
1.0663729393723398e-2
> let (y, rng2) = runif_pure rng1
> y
0.9827538369038856
Success!
I mean.. sort of. It works and all, and it does constitute a general-purpose
solution. But manually binding updated RNG states to names and swapping those
in for new values is still pretty annoying.
You could also generate an infinite list of random numbers using the
randomRs
function and just take from it as needed, but you still
probably need to manage that list to make sure you don’t re-use any numbers.
You kind of trade off managing the RNG for managing an infinite list of random
numbers, which isn’t much better.
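For reference, the infinite-list approach looks something like this in GHCi:

> let xs = randomRs (0 :: Double, 1) (mkStdGen 42)
> take 3 xs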
The Less-Annoying Method - Get A Monad To Do It
The good news is that we can offload the job of managing the RNG state to a
monad. I won’t actually explain how that works in detail here - I think most
people facing this problem are initially more concerned with getting something
working, rather than deeply grokking monads off the bat - so I’ll just claim
that we can get a monad to handle the RNG state for us, and that will hopefully
(mostly) suffice for now.

Still rolling with the System.Random
module for the time being, we’ll use the
runif
alias for the randomRIO
function that we defined previously to
generate some new random numbers:
> runif 3
[0.9873934690803106,0.3794382930121829,0.2285653405908732]
> runif 3
[0.7651878964537555,0.2623159001635825,0.7683468476766804]
Simpler! Notice we haven’t had to do anything with a generator manually - we
just ask for random numbers and then get them, just like in R. And if we want
to set the value of the RNG being used here, we can use the setStdGen
function with an RNG that we’ve already created. Here let’s just use the
set_seed
alias we defined earlier, to mimic R’s set.seed
function:
> set_seed 42
> runif 3
[1.0663729393723398e-2,0.9827538369038856,0.7042944187434987]
> set_seed 42
> runif 3
[1.0663729393723398e-2,0.9827538369038856,0.7042944187434987]
So things are similar to how they work in R here - we have a global RNG of
sorts, and we can set its state as desired using the set_seed
function. But
since this is Haskell, the effects of creating and updating the generator state
must still be explicit. And they are explicit - it’s just that they’re
explicit in the type of runif
:
> :t runif
runif :: Int -> IO [Double]
Note that runif
returns a value that’s wrapped up in IO
. This is how we
indicate explicitly - at the type level - that something is being done with the
generator in the background. IO
is a monad, and it happens to be the thing
that’s dealing with the generator for us here.
What this means for you, the practitioner, is that you can’t just mix values of
some type a
with values of type IO a
willy-nilly. You may be writing a
function f
with type [Double] -> Double
, where the input list of doubles is
intended to be randomly-generated. But if you just go ahead and generate a
list xs
of random numbers, they’ll have type IO [Double]
, and you’ll stare
in confusion at some type error from GHC when you try to apply f
to xs
.
Here’s what I mean. Take the example of just generating some random numbers
and then summing them up. First, in R:
> xs = runif(3)
> sum(xs)
[1] 1.20353
And now in Haskell, using the same mechanism we tried earlier:
> let xs = runif 3
> :t xs
xs :: IO [Double]
> sum xs
<interactive>:16:1:
No instance for (Num [Double]) arising from a use of ‘sum’
In the expression: sum xs
In an equation for ‘it’: it = sum xs
This means that to deal with the numbers we generate, we have to treat them a
little differently than we would in R, or compared to the situation where we
were managing the RNG explicitly in Haskell. Concretely: if we use a monad to
manage the RNG for us, then the numbers we generate will be ‘tagged’ by the
monad. So we need to do something or other to make those tagged numbers work
with ‘untagged’ numbers, or functions designed to work with ‘untagged’ numbers.
This is where things get confusing for beginners. Here’s how we could add up
some random numbers in GHCi:
> xs <- runif 3
> sum xs
1.512024272587933
We’ve used the <-
symbol to bind the result of runif 3
to the name xs
,
rather than let xs = ...
. But this is sort of particular to running code in
GHCi; if you try to do this in a generic Haskell function, you’ll possibly wind
up with some more weird type errors. To do this in regular ol’ Haskell code,
you need to both use <-
-style binding and also acknowledge the ‘tagged’
nature of randomly-generated values.
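Here’s a minimal sketch of what that looks like in an ordinary source file (the sumOfUniforms name is just for illustration) - the random values are generated and bound inside an IO action, and only then handed to the pure sum:

import Control.Monad (replicateM)
import System.Random (randomRIO)

sumOfUniforms :: Int -> IO Double
sumOfUniforms n = do
  xs <- replicateM n (randomRIO (0 :: Double, 1))  -- generate (monadic)
  return (sum xs)                                  -- use (pure)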
The crux is that, when you’re using a monad to generate random numbers in
Haskell, you need to separate generating them from using them. Rather than
try to explain what I mean here precisely, let’s rely on example, and implement
a simple Metropolis sampler for illustration.
A Metropolis Sampler
The Metropolis algorithm will help you approximate expectations over
certain probability spaces. Here’s how it works. Picture yourself strolling
around some bumpy landscape; you want to walk around it in such a fashion that
you visit regions of it with probability proportional to their altitude. To do
that, you can repeatedly:
- Pick a random point near your current location.
- Compare your present altitude to the altitude of that point you picked.
Calculate a probability based on their ratio.
- Flip a coin where the chance of observing a head is equal to that
probability. If you get a head, move to the location you picked.
Otherwise, stay put.
Let’s implement it in Haskell, using a monadic random number generator to do
so. This time we’re going to use mwc-random
- a more industrial-strength
randomness library that you can confidently use in production code.
mwc-random
uses Marsaglia’s multiply-with-carry algorithm to generate
pseudo-random numbers. It requires you to explicitly create and pass a RNG to
functions that need to generate random numbers, but it uses a monad to update
the RNG state itself. This winds up being pretty nice; let’s dive in to see.
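As a quick taste of that interface before we dive in (a hedged GHCi sketch): we create a generator once and then pass it explicitly to anything that needs randomness:

> import qualified System.Random.MWC as MWC
> gen <- MWC.createSystemRandom
> MWC.uniform gen :: IO Double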
Create a module called Metropolis.hs
and get some imports out of the way:
module Metropolis where
import Control.Monad
import Control.Monad.Primitive
import System.Random.MWC as MWC
import System.Random.MWC.Distributions as MWC
Step One
The first thing we want to do is implement point (1) from above:
Pick a random point near your current location.
We’ll just use a standard normal distribution of the appropriate dimension to
do this - we just want to take a location, perturb it, and return the perturbed
location.
propose :: [Double] -> Gen RealWorld -> IO [Double]
propose location rng = traverse (perturb rng) location where
  perturb gen x = MWC.normal x 1 gen
So at finer detail: we’re walking over the coordinates of the current location
and generating a normally-distributed value centered at each coordinate. The
MWC.normal
function will do this for a given mean and standard
deviation, and we can use the traverse
function to walk over each coordinate.
Note that we pass a mwc-random
RNG - the value with type Gen RealWorld
- to
the propose
function. We need to supply this generator anywhere we want to
generate random numbers, but we don’t need to manually worry about tracking and
updating its state. The IO
monad will do that for us. The resulting
randomly-generated values will be tagged with IO
, so we’ll need to deal with
that appropriately.
Step Two
Now let’s implement point (2):
Compare your present altitude to the altitude of that point you picked.
Calculate a probability based on their ratio.
So, we need a function that will compare the altitude of our current point to
the altitude of a proposed point and compute a probability from that. The
following will do: it takes a function that will compute a (log-scale) altitude
for us, as well as the current and proposed locations, and returns a
probability.
moveProbability :: ([Double] -> Double) -> [Double] -> [Double] -> Double
moveProbability altitude current proposed =
    whenNaN 0 (exp (min 0 (altitude proposed - altitude current)))
  where
    whenNaN val x
      | isNaN x   = val
      | otherwise = x
Step Three
Finally, the third step of the algorithm:
Flip a coin where the chance of observing a head is equal to that
probability. If you get a head, move to the location you picked. Otherwise
stay put.
So let’s get to it:
decide :: [Double] -> [Double] -> Double -> Gen RealWorld -> IO [Double]
decide current proposed prob rng = do
  accept <- MWC.bernoulli prob rng
  return $
    if accept
      then proposed
      else current
Here we need to flip a coin, so we require a source of randomness again. The
decide
function thus takes another generator of type Gen RealWorld
that we
then supply to the MWC.bernoulli
function, and the result - the final
location - is once again wrapped in IO
.
This function clearly demonstrates the typical way that you’ll deal with random
numbers in Haskell code. decide
is a monadic function, so it proceeds using
do-notation. When you need to generate a random value - here we generate a
random True
or False
value according to a Bernoulli distribution - you bind
the result to a name using the <-
symbol. Then afterwards, in the scope of
the function, you can use the bound value as if it were pure. But the entire
function must still return a ‘wrapped-up’ value that makes the effect of
passing the generator explicit at the type level; right here, that means that
the value will be wrapped up in IO
.
Putting Everything Together
The final Metropolis transition is a combination of steps one through three.
We can put them together like so:
metropolis :: ([Double] -> Double) -> [Double] -> Gen RealWorld -> IO [Double]
metropolis altitude current rng = do
  proposed <- propose current rng
  let prob = moveProbability altitude current proposed
  decide current proposed prob rng
Again, metropolis
is monadic, so we start off with a do
to make monadic
programming easy on us. Whenever we need a random value, we bind the result of
a random number-returning function using the <-
notation.
The propose
function returns a random location, so we bind its result to the
name proposed
using the <-
symbol. The moveProbability
function, on the
other hand, is pure - so we bind that using a let prob = ...
expression. The
decide
function returns a random value, so we can just plop it right on the
end here. The entire result of the metropolis
function is random, so it is
wrapped up in IO
.
The result of metropolis
is just a single transition of the Metropolis
algorithm, which involves doing this kind of thing over and over. If we do
that, we observe a bunch of points that trace out a particular realization of a
Markov chain, which we can generate as follows:
chain
  :: Int -> ([Double] -> Double) -> [Double] -> Gen RealWorld -> IO [[Double]]
chain epochs altitude origin rng = loop epochs [origin] where
  loop n history@(current:_)
    | n <= 0    = return history
    | otherwise = do
        next <- metropolis altitude current rng
        loop (n - 1) (next:history)
An Example
Now that we have our chain
function, we can use it to trace out a collection
of points visited on a realization of a Markov chain. Remember that we’re
supposed to be wandering over some particular abstract landscape; here, let’s
stroll over the one defined by the following function:
landscape :: [Double] -> Double
landscape [x0, x1] =
  -0.5 * (x0 ^ 2 * x1 ^ 2 + x0 ^ 2 + x1 ^ 2 - 8 * x0 - 8 * x1)
What we’ll now do is pick an origin to start from, wander over the landscape
for some number of steps, and then print the resulting realization of the
Markov chain to stdout. We’ll do all that through the following main
function:
main :: IO ()
main = do
  rng <- MWC.createSystemRandom
  let origin = [-0.2, 0.3]
  trace <- chain 1000 landscape origin rng
  mapM_ print trace
Running that will dump a trace to stdout. If you clean it up and plot it,
you’ll see that the visited points have traced out a rough approximation of the
landscape:

Fini
Hopefully this gives a broad idea of how to go about using random numbers in
Haskell. I’ve talked about:
- Why randomness in Haskell isn’t as simple as randomness in (say) Python or R.
- How to handle randomness in Haskell, either by manual generator management or
by offloading that job to a monad.
- How to get things done when a monad manages the generator for you - separating
random number generation from random number processing.
- Doing all the above with an industrial-strength RNG, using a simple
Metropolis algorithm as an example.
Hopefully the example gives you an idea of how to work with random numbers in
practice.
I’ll be the first to admit that randomness in Haskell requires more work than
randomness in a language like R, which to this day remains my go-to interactive
data analysis language of choice. Using randomness effectively in Haskell
requires a decent understanding of how to work with monadic code, even if one
doesn’t quite understand monads entirely yet.
What I can say is that when one has developed some intuition for monads -
acquiring a ‘feel’ for how to work with monadic functions and values - the
difficulty and awkwardness drop off a bit, and working with randomness feels no
different than working with any other effect.
Happy generating! I’ve dumped the code for the Metropolis example into a
gist.
For a more production-quality Metropolis sampler, you can check out my
mighty-metropolis library, which is a member of the declarative
suite of MCMC algos.
18 Jul 2016
.. this one is pretty dry, I’ll admit. David Williams said it
best:
.. Measure theory, that most arid of subjects when done for its own sake,
becomes amazingly more alive when used in probability, not only because it is
then applied, but also because it is immensely enriched.
Unfortunately for you, dear reader, we won’t be talking about probability.
Moving on. What does it mean for something to be measurable in the
mathematical sense? Take some arbitrary collection \(X\) and slap an
appropriate algebraic structure \(\mathcal{X}\) on it - usually an
algebra or \(\sigma\)-algebra, etc. Then we can
refer to a few different objects as ‘measurable’, going roughly as follows.
The elements of the structure \(\mathcal{X}\) are called measurable sets.
They’re called this because they can literally be assigned a notion of measure,
which is a kind of generalized volume. If we’re just talking about some subset
of \(X\) out of the context of its structure then we can be pedantic and call
it measurable with respect to \(\mathcal{X}\), say. You could also call a
set \(\mathcal{X}\)-measurable, to be similarly precise.
The pair consisting of the original collection and its associated structure \((X,
\mathcal{X})\) is called a measurable space. It’s called that because it can
be completed with a measuring function \(\mu\) - itself called a measure - that
assigns notions of measure to measurable sets.
Now take some other measurable space \((Y, \mathcal{Y})\) and consider a
function \(f\) from \(X\) to \(Y\). This is a measurable function if it
satisfies the following technical requirement: that for any
\(\mathcal{Y}\)-measurable set, its preimage under \(f\) is an element of
\(\mathcal{X}\) (so: the preimage under \(f\) is \(\mathcal{X}\)-measurable).
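In symbols, and just restating that requirement: \(f\) is measurable if
\[f^{-1}(B) = \{ x \in X : f(x) \in B \} \in \mathcal{X}\]
for every \(B \in \mathcal{Y}\), where \(f^{-1}(B)\) denotes the preimage of
\(B\) under \(f\).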
The concept of measurability for functions probably feels the least intuitive
of the three - like one of those dry taxonomical classifications that we just
have to keep on the books. The ‘make sure your function is measurable and
everything will be ok’ heuristic will get you pretty far. But there is some
good intuition available, if you want to look for it.
Here’s an example: define a set \(X\) that consists of the elements \(A\),
\(B\), and \(C\). To talk about measurable functions, we first need to define
our measurable sets. The de-facto default structure used for this is a
\(\sigma\)-algebra, and we can always generate one from some
underlying class of sets. Let’s do that from the following plain old
partition that splits the original collection into a couple of disjoint
‘slices’:
\[H = \{\{A, B\}, \{C\}\}\]
The \(\sigma\)-algebra \(\mathcal{X}\) generated from this partition will just
be the partition itself, completed with the whole set \(X\) and the empty set.
To be clear, it’s the following:
\[\mathcal{X} = \left\{\{A, B, C\}, \{A, B\}, \{C\}, \emptyset\right\}\]
The resulting measurable space is \((X, \mathcal{X})\). So we could assign a
notion of generalized volume to any element of \(\mathcal{X}\), though I won’t
actually worry about doing that here.
Now. Let’s think about some functions from \(X\) to the real numbers, which
we’ll assume to be endowed with a suitable \(\sigma\)-algebra of their own (one
typically assumes the standard topology on \(\mathbb{R}\) and the
associated Borel \(\sigma\)-algebra).
How about this - a simple indicator function on the slice containing \(C\):
\[f(x) =
\begin{cases}
0, \, x \in \{A, B\} \\
1, \, x \in \{C\}
\end{cases}\]
Is it measurable? That’s easy to check. The preimage of \(\{0\}\) is \(\{A,
B\}\), the preimage of \(\{1\}\) is \(\{C\}\), and the preimage of \(\{0, 1\}\)
is \(X\) itself. Those are all in \(\mathcal{X}\), and the preimage of the
empty set is the empty set, so we’re good.
Ok. What about this one:
\[g(x) =
\begin{cases}
0, \, x \in \{A\} \\
1, \, x \in \{B\} \\
2, \, x \in \{C\}
\end{cases}\]
Check the preimage of \(\{1, 2\}\) and you’ll find it’s \(\{B, C\}\). But
that’s not a member of \(\mathcal{X}\), so \(g\) is not measurable!
What happened here? Failing to satisfy technical requirements aside: what,
intuitively, made \(f\) measurable where \(g\) wasn’t?
The answer is a problem of resolution. Look again at \(\mathcal{X}\):
\[\left\{\{A, B, C\}, \{A, B\}, \{C\}, \emptyset\right\}\]
The structure \(\mathcal{X}\) that we’ve endowed our collection \(X\) with is
too coarse to permit distinguishing between elements of the slice \(\{A,
B\}\). There is no measurable set \(A\), nor a measurable set \(B\) - just
a measurable set \(\{A, B\}\). And as a result, if we define a function that
says something about either \(A\) or \(B\) without saying the same thing about
the other, that function won’t be measurable. The function \(f\) assigned
the same value to both \(A\) and \(B\), so we didn’t have any problem there.
If we want to be able to distinguish between \(A\) and \(B\), we’ll need to
equip \(X\) with some structure that has a finer resolution. You can check
that if you make a measurable space out of \(X\) and its power set (the set of
all subsets of \(X\)) then \(g\) will be measurable there, for example.
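If, like me, you trust a check you can actually run, here’s a throwaway Haskell
sketch - not part of the original discussion, with names made up for
illustration - that brute-forces the preimage checks above over the finite set
\(X\):
import Data.List (nub, sort, subsequences)

data Point = A | B | C deriving (Eq, Ord, Show)

points :: [Point]
points = [A, B, C]

-- the sigma-algebra generated by the partition {{A, B}, {C}}
sigmaX :: [[Point]]
sigmaX = [[A, B, C], [A, B], [C], []]

-- preimage of a set of output values under a function on X
preimage :: Eq b => (Point -> b) -> [b] -> [Point]
preimage h ys = [x | x <- points, h x `elem` ys]

-- since X is finite, it suffices to check preimages of subsets of the range
measurableWrt :: Eq b => [[Point]] -> (Point -> b) -> Bool
measurableWrt sigma h = all inSigma (subsequences (nub (map h points))) where
  inSigma ys = sort (preimage h ys) `elem` map sort sigma

f, g :: Point -> Int
f x = if x == C then 1 else 0
g A = 0
g B = 1
g C = 2

-- measurableWrt sigmaX f                 => True
-- measurableWrt sigmaX g                 => False
-- measurableWrt (subsequences points) g  => True (the power set has finer resolution)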
So if we’re using partitions to define our measurable sets, we get a neat
little property: for any measurable function, elements in the same slice of the
partition must have the same value when passed through the function. So if
you have a function \(h : X \to H\) that takes an element to its respective
slice in a partition, you know that \(h(x_{0}) = h(x_{1})\) for any \(x_{0}\),
\(x_{1}\) in \(X\) implies that \(f(x_{0}) = f(x_{1})\) for any measurable
function \(f\).
Addendum
Whipping together a measurable space using a \(\sigma\)-algebra generated by a
partition of sets occurs naturally when we talk about correlated
equilibrium, a solution concept in non-cooperative game theory. It’s
common to say a function - in that context a correlated strategy - must be
measurable ‘with respect to the partition’, which sort of elides the fact that
we still need to generate a \(\sigma\)-algebra from it anyway.
Some oldschool authors (Halmos, at least) developed their measure theory using
\(\sigma\)-rings, but this doesn’t seem very popular nowadays.
Since a ring doesn’t require including the entire set \(X\), you need to go
through an awkward extra hoop when defining measurability on functions. But
regardless, it’s interesting to think about what happens when one uses
different structures to define measurable sets!
20 Apr 2016
Suppose you’re in the derivatives business. You are interested in making a
market on some events; say, whether or not your friend Jay will win tomorrow
night’s poker game, or that the winning pot will be at least USD 100. Let’s
examine some rules about how you should do business if you want this venture to
succeed.
What do I mean by ‘make a market’? I mean that you will be willing to buy and
sell units of a particular security that will be redeemable from the seller at
some particular value after tomorrow’s poker game has ended (you will be making
a simple prediction market, in other words). You can make bid offers to buy
securities at some price, and ask offers to sell securities at some price.
To keep things simple let’s say you’re doing this gratis; society rewards you
extrinsically for facilitating the market - your friends will give you free
pizza at the game, maybe - so you won’t levy any transaction fees for making
trades. Also scarcity isn’t a huge issue, so you’re willing to buy or sell any
finite number of securities.
Consider the possible outcomes of the game (one and only one of which must
occur):
- (A) Jay wins and the pot is at least USD 100.
- (B) Jay wins and the pot is less than USD 100.
- (C) Jay loses and the pot is at least USD 100.
- (D) Jay loses and the pot is less than USD 100.
The securities you are making a market on pay USD 1 if an event occurs, and USD
0 otherwise. So: if I buy 5 securities on outcome \(A\) from you, and outcome
\(A\) occurs, I’ll be able to go to you and redeem my securities for a total of
USD 5. Alternatively, if I sell you 5 securities on outcome \(A\), and outcome
\(A\) occurs, you’ll be able to come to me and redeem your securities for a
total of USD 5.
Consider what that implies: as a market maker, you face the prospect of making
hefty payments to customers who redeem valuable securities. For example,
imagine the situation where you charge USD 0.50 for a security on outcome
\(A\), but outcome \(A\) is almost certain to occur in some sense (Jay is a
beast when it comes to poker and a lot of high rollers are playing); if your
customers exclusively load up on 100 cheap securities on outcome \(A\), and
outcome \(A\) occurs, then you stand to owe them a total payment of USD 100
against the USD 50 that they have paid for the securities. You thus have a
heavy incentive to price your securities as accurately as possible - where
‘accurate’ means to minimize your expected loss.
It may well be the case, however, that it is difficult to price your
securities accurately. For example, if some customer has more information than
you (say, she privately knows that Jay is unusually bad at poker) then she
potentially stands to profit from that private information, given your
ignorance on the matter (an ignorance reflected in your prices). Such is life for a market
maker. But there are particular prices you could offer - independent of any
participant’s private information - that are plainly stupid or ruinous for you
(a set of prices like this is sometimes called a Dutch
book). Consider selling securities
on outcome \(A\) for the price of USD -1; then anyone who buys one of these
securities not only stands to redeem USD 1 in the event outcome \(A\) occurs,
but also gains USD 1 simply from the act of buying the security in the first
place.
Setting a negative price like this is irrational on your part; customers will
realize an arbitrage opportunity on securities for outcome \(A\) and will
happily buy as many as they can get their hands on, to your ruin. In other
words - and to nobody’s surprise - by setting a negative price, you can be
made a sure loser in the market.
There are other prices you should avoid setting as well, if you want to avoid
arbitrage opportunities like these. For starters:
- For any outcome \(E\), you must set the price of a security on \(E\) to be at
least USD 0.
- For any certain outcome \(E\), you must set the price of a security on \(E\) to
be exactly USD 1.
The first condition rules out negative prices, and the second ensures that your
books balance when it comes time to settle payment for securities on a certain
event.
What’s more, the price that you set on any given security doesn’t exist in
isolation. Given the outcomes \(A\), \(B\), \(C\), and \(D\) listed previously, at
least one must occur. So as per the second rule, the price of a synthetic
derivative on the outcome “Jay wins or loses, and the pot is any value” must be
set to USD 1. This places constraints on the prices that you can set for
individual securities. It suffices that:
- For any countable set of mutually exclusive outcomes \(E_{1}, E_{2}, \ldots\),
you must set the price of the security on outcome “\(E_{1}\) or \(E_{2}\) or..”
to exactly the sum of the prices of the individual outcomes.
This eliminates the possibility that your customers will make you a certain
loser by buying elaborate combinations of securities on different outcomes.
There are other rules that your prices must obey as well, but they fall out as
corollaries of these three. If you broke any of them you’d also be breaking
one of these.
It turns out that you cannot be made a sure loser if, and only if, your prices
obey these three rules. That is:
- If your prices follow these rules, then you will offer customers no arbitrage
opportunities.
- Any market free of arbitrage opportunities must have prices that conform
to these rules.
These prices are called coherent, and absence of coherence implies the
existence of arbitrage opportunities for your customers.
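To make the rules concrete, here’s a tiny Haskell sketch - mine, not part of
the original argument, with made-up names - that checks the conditions for a
set of prices on the four mutually exclusive, exhaustive outcomes above:
data Outcome = OutcomeA | OutcomeB | OutcomeC | OutcomeD
  deriving (Eq, Show, Enum, Bounded)

outcomes :: [Outcome]
outcomes = [minBound .. maxBound]

type Prices = Outcome -> Double

-- Rule 1: no security gets a negative price.
nonNegative :: Prices -> Bool
nonNegative p = all (\o -> p o >= 0) outcomes

-- Rules 2 and 3, specialized to this finite example: the four outcomes are
-- mutually exclusive and exhaustive, so the synthetic security on
-- "A or B or C or D" is certain, and its price - the sum of the individual
-- prices - must be exactly USD 1.
sumsToOne :: Prices -> Bool
sumsToOne p = abs (sum (map p outcomes) - 1) < 1e-9

coherent :: Prices -> Bool
coherent p = nonNegative p && sumsToOne p

-- coherent (const 0.25)                                 => True
-- coherent (\o -> if o == OutcomeA then 0.7 else 0.05)  => False (sums to 0.85)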
But Why Male Models
The trick, of course, is that these prices correspond to probabilities, and
the rules for avoiding arbitrage correspond to the standard Kolmogorov
axioms of probability
theory.
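For reference, those axioms require that a probability measure \(P\) assign
\(P(E) \geq 0\) to every event \(E\), assign \(P(\Omega) = 1\) to the certain
event \(\Omega\), and satisfy
\[P\left(\bigcup_{i} E_{i}\right) = \sum_{i} P(E_{i})\]
for any countable collection of mutually exclusive events \(E_{1}, E_{2},
\ldots\) - one axiom per pricing rule above.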
The consequence is that if your description of uncertain phenomena does not
involve probability theory, or does not behave exactly like probability theory,
then it is an incoherent representation of information you have about those
phenomena.
As a result, probability theory should be your tool of choice when it comes
to describing uncertain phenomena. Granted you may not have to worry about
market making in return for pizza, but you’d like to be assured that there are
no structural problems with your description.
This is a summary of the development of probability presented in Jay Kadane’s
brilliant Principles of Uncertainty. The
original argument was developed by de Finetti and Savage in the mid-20th
century.
Kadane’s book makes for an exceptional read, and it’s free to boot. I
recommend checking it out if it has flown under your radar.
An interesting characteristic of this development of probability is that there
is no way to guarantee the nonexistence of arbitrage opportunities for a
countably infinite number of purchased securities. That is: if you’re a market
maker, you could be made a sure loser in the market when it came time for you
to settle a countably infinite number of redemption claims. The quirk here is
that you could just as well be made a sure winner; whether you win or lose with
certainty depends on the order in which the claims are settled! (Fortunately
this doesn’t tend to be an issue in practice.)
Thanks to Fredrik Olsen for reviewing a draft of this post.
References