26 Feb 2017
In my last two posts about the Giry monad I derived the thing
from its categorical and measure-theoretic foundations. I kind of thought that
those posts wouldn’t be of much interest to people but they turned out to be a
hit. I clearly can’t tell what the internet likes.
Anyway, something I left out was the theoretical foundation of the Giry monad’s
Applicative instance, which seemed a bit odd. I also pointed out that
applicativeness in the context of probability implies independence between
probability measures.
In this article I’m going to look at each of these issues. After playing
around with the foundations, it looks like the applicative instance for the
Giry monad can be put on a sound footing in terms of the standard
measure-theoretic concept of product measure. Also, it turns out that the
claim I made of applicativeness independence is somewhat
ill-posed. But, using the shiny new intuition we’ll glean from a better
understanding of the applicative structure, we can put that on a solid footing
too.
So let’s look at both of these things and see what’s up.
Monoidal Functors
The foundational categorical concept behind applicative functors is the
monoidal functor, which is a functor between monoidal categories that
preserves monoidal structure.
Formally: for monoidal categories and ,
a monoidal functor is a functor and associated natural
transformations and that satisfy some coherence conditions that I won’t mention here.
Notably, if and are isomorphisms (i.e. are invertible) then
is called a strong monoidal functor. Otherwise it’s called lax.
Applicative functors in particular are lax monoidal functors.
This can be made much clearer for endofunctors on a monoidal category . Then you only have and to worry about. If we sub in the Giry monad
from the last couple of posts, we’d want and .
Does the category of measurable spaces have a monoidal
structure? Yup. Take measurable spaces and . From the Giry monad derivation we already have that the
monoidal identity corresponds to a Dirac measure
at a point, so that’s well and good. And we can define the tensor product
between and as follows: let be the
standard Cartesian product on and and let be the smallest -algebra generated by the Cartesian
product of measurable sets and . Then is a
measurable space, and so is monoidal.
Recall that and - the space of measures
over and respectively - are themselves objects in
. So, clearly is a
measurable space, and if is monoidal then there must exist a
natural transformation that can take us from there to . This is the space of measures over the product .
So the question is: does have the required monoidal structure?
Yes. It must, since is a monad, and any monad can generate the
required natural transformation. Let be the monadic ‘join’ operator
and be the monadic identity
. We have, evaluating right-to-left:
Using makes this much easier to read:
or in code, just:
phi :: Monad m => (m a, m b) -> m (a, b)
phi (m, n) = liftM2 (,) m n
So with that we have that is a (lax) monoidal
functor. And you can glean a monad-generated applicative operator from
that immediately (this leads to the function called ‘ap’ in Control.Monad
):
ap :: Monad m => m (a -> b) -> m a -> m b
ap f x = fmap (\(g, z) -> g z) (phi f x)
(Note: I won’t refer to as the join operator from this point out in
order to free it up for denoting measures.)
Probabilistic Interpretations
Product Measure
The correct probabilistic interpretation here is that takes a pair of
probability measures to the product measure over the appropriate product
space. For probability measures and on measurable spaces
and respectively, the product measure is the (unique) measure on such that:
for a measurable set in .
Going through the monoidal functor route seems to put the notion of the Giry
applicative instance on a more firm measure-theoretic foundation. Instead of
considering the following from the Giry monad foundations article:
which is defined in terms of the dubious space of measures over measurable
functions , we can better view things using the monoidal
structure-preserving natural transformation . For measures and
on and respectively, we have:
and then for we can use the functor structure of
to do:
which corresponds to a standard applicative expression g <$> mu <*> nu
. I
suspect there’s then some sort of Yoneda argument or something that makes
currying and partial function application acceptable.
Independence
Now. What does this have to say about independence?
In particular, it’s too fast and loose to claim measures can be ‘independent’
at all. Independence is a property of measurable sets, measurable functions,
and -algebras. Not of measures! But there is a really useful
connection, so let’s illustrate that.
First, let’s define independence formally as follows. Take a probability space
. Then any measurable sets and
in are independent if
That’s the simplest notion.
Next, consider two sub--algebras and
of (a sub--algebra is just a a subset of a
-algebra that itself happens to be a algebra). Then
and are independent if, for any in
and any in , we have that and
are independent.
The final example is independence of measurable functions. Take measurable
functions and both from to the real numbers equipped with some
appropriate -algebra . Then each of these functions
generates a sub- algebra of as follows:
Then and are independent if the generated -algebras
and are independent.
Note that in every case independence is defined in terms of a single
measure, . We can’t talk about different measures being
independent. To paraphrase Terry Tao here:
The notion of independence between [measurable functions] does not make sense
if the [measurable functions] are being modeled by separate probability
spaces; they have to be coupled together into a single probability space
before independence becomes a meaningful notion.
To be pedantic and explicitly specify the measure by which some things are
independent, some authors state that measurable functions and are
-independent, for example.
We can see a connection to independence when we look at convolution and
associated operators. Recall that for measures and on the same
measurable space that supports some notion of
addition, their convolution looks like:
The probabilistic interpretation here (see Terry Tao on this too) is
that is the measure corresponding to the sum of independent
measurable functions and with corresponding measures and
respectively.
That looks weird though, since we clearly defined independence between
measurable functions using a single probability measure. How is it we can talk
about independent measurable functions and having different
corresponding measures?
We first need to couple everything together into a single probability space as
per Terry’s quote. Complete with some abstract probability measure
to form the probability space .
Now we have and measurable functions from to .
To say that and are independent is to say that their generated
-algebras are -independent. And the measures that they
correspond to are the pushforwards of under and
respectively. So, and . The result is that the measurable functions correspond to
different (pushforward) measures and , but are independent with
respect to the same underlying probability measure .
The monoidal structure of then gets us to convolution. Given a
product of measures and each on we can
immediately retrieve their product measure via
. And from there we can get to via the functor structure
of - we just find the pushforward of with
respect to a function that collapses a product via addition. So
is defined as:
and then the convolution is thus:
Other operations can be defined similarly, e.g. for we get:
The crux of all this is whenever we apply a measurable function to a product
measure, we can always extract notions of independent measurable functions
from the result. And the measures corresponding to those independent
measurable functions will be the components of the product measure
respectively.
This is super useful and lets one claim something stronger than what the
monadic structure gives you. In an expression like g <$> mu <*> nu <*> rho
,
you are guaranteed that the corresponding random variables ,
, (for suitable projections) are independent. The same
cannot be said if you use the monadic structure to do something like g mu nu
rho
where the product structure is not enforced - in that case you’re not
guaranteed anything of the sort. This is why the applicative structure is
useful for encoding independence in a way that the monadic structure is
not.
Conclusion
So there you have it. Applicativeness can seemingly be put on a
straightforward measure-theoretic grounding and has some useful implications
for independence.
It’s worth noting that, in the case of the Giry monad, we don’t need to go
through its monadic structure in order to recover an applicative instance. We
can do so entirely by hacking together continuations without using a single
monadic bind. This is actually how I defined the applicative instance in the
Giry monad implementation article previously:
instance Applicative Measure where
pure x = Measure (\f -> f x)
Measure g <*> Measure h = Measure $ \f ->
g (\k -> h (f . k))
Teasing out the exact structure of this and its relation to the codensity monad
is again something I’ll leave to others.
13 Feb 2017
In my last post I went over the categorical and measure-theoretic
foundations of the Giry monad, the ‘canonical’ probability monad that operates
on the level of probability measures.
In this post I’ll pick up from where I left off and talk about a neat and
faithful (if impractical) implementation of the Giry monad that one can put
together in Haskell.
Measure, Integral, and Continuation
So. For a quick review, we’ve established the Giry monad as a triple
, where is an endofunctor on the
category of measurable spaces , is a marginalizing
integration operation defined by:
and is a monoidal identity, defined by the Dirac measure at a point:
How do we actually implement this beast? If we’re looking to be suitably
general then it is unlikely that we’re going to be able to easily represent
something like a -algebra over some space of measures on a computer,
so that route is sort of a non-starter.
But it can be done. The key to implementing a general-purpose Giry monad is to
notice that the fundamental operation involved in it is integration, and that
we can avoid working with -algebras and measurable spaces directly if
we focus on dealing with measurable functions instead of measurable sets.
Consider the integration map on measurable functions that we’ve been
using this whole time. For some measurable function , takes a
measure on some measurable space and uses it to
integrate over . In other words:
A measure in has type , so
has corresponding type .
This might look familiar to you; it’s very similar to the type signature for a
continuation:
newtype Cont a r = Cont ((a -> r) -> r)
Indeed, if we restrict the carrier type of ‘Cont’ to the reals, we can be
really faithful to the type:
newtype Integral a = Integral ((a -> Double) -> Double)
Now, let’s overload notation and call the integration map itself a
measure. That is, is a mapping , so
we’ll just interpret the notation to mean the same thing -
. This is convenient because we can dispense with
and just pretend measures can be applied directly to measurable functions.
There’s no way we can get confused here; measures operate on sets, not
functions, so notation like is not currently in use. We just set
and that’s that. Let’s rename the ‘Integral’ type
to match:
newtype Measure a = Measure ((a -> Double) -> Double)
We can extract a very nice shallowly-embedded language for integration here,
the core of which is a single term:
integrate :: (a -> Double) -> Measure a -> Double
integrate f (Measure nu) = nu f
Note that this is the same way we’d express integration mathematically; we
specify that we want to integrate a measurable function with respect to
some measure :
The only subtle difference here is that we don’t specify the space we’re
integrating over in the integral expression - instead, we’ll bake that into the
definition of the measures we create themselves. Details in a bit.
What’s interesting here is that the Giry monad is the continuation monad with
the carrier type restricted to the reals. This isn’t surprising when you think
about what’s going on here - we’re representing measures as integration
procedures, that is, programs that take a measurable function as input and
then compute its integral in some particular way. A measure, as we’ve
implemented it here, is just a ‘program with a missing piece’. And this is
exactly the essence of the continuation monad in Haskell.
Typeclass Instances
We can fill out the functor, applicative, and monad instances mechanically by
reference to the a standard continuation monad implementation, and each
instance gives us some familiar conceptual structure or operation on
probability measures. Let’s take a look.
The functor instance lets us transform the support of a measurable space
while keeping its density structure invariant. If we have:
then mapping a measurable function over the measure corresponds to:
The functor structure allows us to precisely express a pushforward measure or
distribution of under . It lets us ‘adapt’ a measure to other
measurable spaces, just like a good functor should.
In Haskell, the functor instance corresponds exactly to the math:
instance Functor Measure where
fmap g nu = Measure $ \f ->
integrate (f . g) nu
The monad instance is exactly the Giry monad structure that we developed
previously, and it allows us to sequence probability measures together by
marginalizing one into another. We’ll write it in terms of bind, of course,
which went like:
The Haskell translation is verbatim:
instance Monad Measure where
return x = Measure (\f -> f x)
rho >>= g = Measure $ \f ->
integrate (\m -> integrate f (g m)) rho
Finally there’s the Applicative instance, which as I mentioned in the last
post is sort of conceptually weird here. So in the spirit of that
comment, I’m going to dodge any formal justification for now and just use the
following instance which works in practice:
instance Applicative Measure where
pure x = Measure (\f -> f x)
Measure g <*> Measure h = Measure $ \f ->
g (\k -> h (f . k))
Conceptual Example
It’s worth taking a look at an example of how things should conceivably work
here. Consider the following probabilistic model:
It’s a standard hierarchical presentation. A ‘compound’ measure can be
obtained here by marginalizing over the beta measure , and that’s called
the beta-binomial measure. Let’s find it.
The beta distribution has support on the subset of the reals, and
the binomial distribution with argument has support on the subset of the integers, so we know that things should proceed like so:
Eliding some theory of integration, I can tell you that is absolutely
continuous with respect to Lebesgue measure and that
is absolutely continuous w/respect to counting measure for appropriate
. So, admits a density and
admits a density , defined as:
and
respectively, for the beta function and a
binomial coefficient. Again, we can reduce the integral as follows,
transforming the outermost integral into a standard Riemann integral
and the innermost integral into a simple sum of products:
where denotes Lebesgue measure. I could expand this further or simplify
things a little more (the beta and binomial are conjugates) but you get
the point, which is that we have a way to evaluate the integral.
What is really required here then is to be able to encode into the
definitions of measures like and the method of integration
to use when evaluating them. For measures absolutely continuous w/respect to
Lebesgue measure, we can use the Riemann integral over the reals. For measures
absolutely continuous w/respect to counting measure, we can use a sum of
products. In both cases, we’ll also need to supply the density or mass
function by which the integral should be evaluated.
Creating Measures
Recall that we are representing measures as integration procedures. So to
create one is to define a program by which we’ll perform integration.
Let’s start with the conceptually simpler case of a probability measure that’s
absolutely continuous with respect to counting measure. We need to provide
a support (the region for which probability is greater than 0) and a
probability mass function (so that we can weight every point appropriately).
Then we just want to integrate a function by evaluating it at every point in
the support, multiplying the result by that point’s probability mass, and
summing everything up. In code, this translates to:
fromMassFunction :: (a -> Double) -> [a] -> Measure a
fromMassFunction f support = Measure $ \g ->
foldl' (\acc x -> acc + f x * g x) 0 support
So if we want to construct a binomial measure, we can do that like so (where
choose
comes from Numeric.SpecFunctions
):
binomial :: Int -> Double -> Measure Int
binomial n p = fromMassFunction (pmf n p) [0..n] where
pmf n p x
| x < 0 || n < x = 0
| otherwise = choose n x * p ^^ x * (1 - p) ^^ (n - x)
The second example involves measures over the real line that are absolutely
continuous with respect to Lebesgue measure. In this case we want to evaluate
a Riemann integral over the entire real line, which is going to necessitate
approximation on our part. There are a bunch of methods out there for
approximating integrals, but a simple one for one-dimensional problems like
this is quadrature, an implementation for which Ed Kmett has handily
packaged up in his integration package:
fromDensityFunction :: (Double -> Double) -> Measure Double
fromDensityFunction d = Measure $ \f ->
quadratureTanhSinh (\x -> f x * d x)
where
quadratureTanhSinh = result . last . everywhere trap
Here we’re using quadrature to approximate the integral, but otherwise it has
a similar form as ‘fromMassFunction’. The difference here is that we’re
integrating over the entire real line, and so don’t have to supply a support
explicitly.
We can use this to create a beta measure (where logBeta
again comes from
Numeric.SpecFunctions
):
beta :: Double -> Double -> Measure Double
beta a b = fromDensityFunction (density a b) where
density a b p
| p < 0 || p > 1 = 0
| otherwise = 1 / exp (logBeta a b) * p ** (a - 1) * (1 - p) ** (b - 1)
Note that since we’re going to be integrating over the entire real line and the
beta distribution has support only over , we need to implicitly
define the support here by specifying which regions of the domain will lead to
a density of 0.
In any case, now that we’ve constructed those things we can just use
a monadic bind to create the beta-binomial measure we described before. It
masks a lot of under-the-hood complexity.
betaBinomial :: Int -> Double -> Double -> Measure Int
betaBinomial n a b = beta a b >>= binomial n
There are a couple of other useful ways to create measures, but the most
notable is to use a sample in order to create an empirical measure.
This is equivalent to passing in a specific support for which the mass function
assigns equal probability to every element; I’ll use Gabriel Gonzalez’s
foldl package here as it’s pretty elegant:
fromSample :: Foldable f => f a -> Measure a
fromSample = Measure . flip weightedAverage
weightedAverage :: (Foldable f, Fractional r) => (a -> r) -> f a -> r
weightedAverage f = Foldl.fold (weightedAverageFold f) where
weightedAverageFold :: Fractional r => (a -> r) -> Fold a r
weightedAverageFold f = Foldl.premap f averageFold
averageFold :: Fractional a => Fold a a
averageFold = (/) <$> Foldl.sum <*> Foldl.genericLength
Using ‘fromSample’ you can create an empirical measure using just about
anything you’d like:
data Foo = Foo | Bar | Baz
foos :: [Foo]
foos = [Foo, Foo, Bar, Foo, Baz, Foo, Bar, Foo, Foo, Foo, Bar]
nu :: Measure Foo
nu = fromSample foos
Though I won’t demonstrate it here, you can use this approach to also create
measures from sampling functions or random variables that use a source of
randomness - just draw a sample from the function and pipe the result into
‘fromSample’.
Querying Measures
To query a measure is to simply get some result out of it, and we do that by
integrating some measurable function against it. The easiest thing to do is to
just take a straightforward expectation by integrating the identity function;
for example, here’s the expected value of a beta(10, 10) measure:
> integrate id (beta 10 10)
0.49999999999501316
The expected value of a beta(, ) distribution is , so we can verify analytically that the result should be
0.5. We observe a bit of numerical imprecision here because, if you’ll recall,
we’re just approximating the integral via quadrature. For measures created
via ‘fromMassFunction’ we don’t need to use quadrature, so we won’t observe the
same kind of approximation error. Here’s the expected value of a binomial(10,
0.5) measure, for example:
> integrate fromIntegral (binomial 10 0.5)
5.0
Note here that we’re integrating the ‘fromIntegral’ function against the
binomial measure. This is because the binomial measure is defined over the
integers, rather than the reals, and we always need to evaluate to a real
when we integrate. That’s part of the definition of a measure!
Let’s calculate the expectation of the beta-binomial distribution with , , and :
> integrate fromIntegral (betaBinomial 10 1 8)
1.108635884924813
Neato. And since we can integrate like this, we can really compute any of the
moments of a measure. The first raw moment is what we’ve been doing
here, and is called the expectation:
expectation :: Measure Double -> Double
expectation = integrate id
The second (central) moment is the variance. Here I mean variance in the
moment-based sense, rather than as the possibly better-known sample variance:
variance :: Measure Double -> Double
variance nu = integrate (^ 2) nu - expectation nu ^ 2
The variance of a binomial(, ) distribution is known to be
, so for and we should get 2.5:
> variance (binomial 10 0.5)
<interactive>:87:11: error:
• Couldn't match type ‘Int’ with ‘Double’
Expected type: Measure Double
Actual type: Measure Int
• In the first argument of ‘variance’, namely ‘(binomial 10 0.5)’
In the expression: variance (binomial 10 0.5)
In an equation for ‘it’: it = variance (binomial 10 0.5)
Ahhh, but remember: the binomial measure is defined over the integers, so we
can’t integrate it directly. No matter - the functorial structure allows us to
adapt it to any other measurable space via a measurable function:
> variance (fmap fromIntegral (binomial 10 0.5))
2.5
Expectation and variance (and other moments) are pretty well-known, but you can
do more exotic things as well. You can calculate the moment generating
function for a measure, for example:
momentGeneratingFunction :: Measure Double -> Double -> Double
momentGeneratingFunction nu t = integrate (\x -> exp (t * x)) nu
and the cumulant generating function follows naturally:
cumulantGeneratingFunction :: Measure Double -> Double -> Double
cumulantGeneratingFunction nu = log . momentGeneratingFunction nu
A particularly useful construct is the cumulative distribution function
for a measure, which calculates the probability of a region less than or equal
to some number:
cdf :: Measure Double -> Double -> Double
cdf nu x = integrate (negativeInfinity `to` x) nu
negativeInfinity :: Double
negativeInfinity = negate (1 / 0)
to :: (Num a, Ord a) => a -> a -> a -> a
to a b x
| x >= a && x <= b = 1
| otherwise = 0
The beta(2, 2) distribution is symmetric around its mean 0.5, so the
probability of the region should itself be 0.5. This checks out
as expected, modulo approximation error due to quadrature:
> cdf (beta 2 2) 0.5
0.4951814897381374
Similarly for measurable spaces without any notion of order, there’s a simple
CDF analogue that calculates the probability of a region that contains the
given points:
containing :: (Num a, Eq b) => [b] -> b -> a
containing xs x
| x `elem` xs = 1
| otherwise = 0
And probably the least interesting query of all is the simple ‘volume’, which
calculates the total measure of a space. For any probability measure this must
obviously be one, so it can at least be used as a quick sanity check:
volume :: Measure Double -> Double
volume = integrate (const 1)
Convolution and Friends
I mentioned in the last post that applicativeness corresponds to
independence in some sense, and that independent measures over the same
measurable space can be convolved together, à la:
for measures and on . In Haskell-land it’s well-known
that any applicative instance gives you a free ‘Num’ instance, and the story is
no different here:
instance Num a => Num (Measure a) where
(+) = liftA2 (+)
(-) = liftA2 (-)
(*) = liftA2 (*)
abs = fmap abs
signum = fmap signum
fromInteger = pure . fromInteger
There are a few neat ways to demonstrate this kind of thing. Let’s use a
Gaussian measure here as a running example:
gaussian :: Double -> Double -> Measure Double
gaussian m s = fromDensityFunction (density m s) where
density m s x
| s <= 0 = 0
| otherwise =
1 / (s * sqrt (2 * pi)) * exp (negate ((x - m) ^^ 2) / (2 * (s ^^ 2)))
First, consider a chi-squared measure with degrees of freedom.
We could create this directly using a density function, but instead we can
represent it by summing up independent squared Gaussian measures:
chisq :: Int -> Measure Double
chisq k = sum (replicate k normal) where
normal = fmap (^ 2) (gaussian 0 1)
To sanity check the result, we can compute the mean and variance of a
measure, which should be and respectively for :
> expectation (chisq 2)
2.0000000000000004
> variance (chisq 2)
4.0
As a second example, consider a product of independent Gaussian measures. This
is a trickier distribution to deal with analytically (see here), but we
can use some well-known identities for general independent measures in order to
verify our results. For any independent measures and , we have:
and
for the expectation and variance of their product. So for a product of
independent Gaussians w/parameters (1, 2) and (2, 3) respectively, we expect to
see 2 for its expectation and 61 for its variance:
> expectation (gaussian 1 2 * gaussian 2 3)
2.0000000000000001
> variance (gaussian 1 2 * gaussian 2 3)
61.00000000000003
Woop!
Wrapping Up
And there you have it, a continuation-based implementation of the Giry monad.
You can find a bunch of code with similar functionality to this packaged up in
my old measurable library on GitHub if you’d like to play around with
the concepts.
That library has accumulated a few stars since I first pushed it up in 2013. I
think a lot of people are curious about these weird measure things, and this
framework at least gives you the ability to play around with a representation
for measures directly. I found it particularly useful for really grokking,
say, that integrating some function against a probability measure
is identical to integrating the identity function against the probability
measure . And there are a few similar concepts
there that I find really pop out when you play with measures directly, rather
than when one just works with them on paper.
But let me now tell you why the Giry monad sucks in practice.
Take a look at this integral expression, which is brought about due to a
monadic bind:
For simplicitly, let’s assume that is discrete and has cardinality
. This means that the integral reduces to
for and the appropriate Radon-Nikodym derivatives. You
can see that the total number of operations involved in the integral is
, and indeed, for monadic binds the computational complexity
involved in evaluating all the integrals involved is exponential, on the order
of . It was no coincidence that I demonstrated a variance calculation
for a distribution instead of for a .
This isn’t really much of a surprise - the cottage industry of approximating
integrals exists because integration is hard in practice, and integration is
surely best avoided whenever one can get away with doing so. Vikash
Mansinghka’s quote on this topic is fitting: “don’t calculate probabilities -
sample good guesses.” I’ll also add: relegate the measures to measure theory,
where they seem to belong.
The Giry monad is a lovely abstract construction for formalizing the monadic
structure of probability, and as canonical probabilistic objects, measures and
integrals are tremendously useful when working theoretically. But they’re a
complete non-starter when it comes to getting anything nontrivial done in
practice. For that, there are far more useful representations for probability
distributions in Haskell - notably, the sampling function or random variable
representation found in things like
mwc-probability/mwc-random-monad and random-fu, or even
better, the structural representation based on free or operational monads like
I’ve written about before, or that you can find in something like
monad-bayes.
The intuitions gleaned from playing with the Giry monad carry over precisely to
other representations for the probability monad. In all cases, ‘return’ will
correspond, semantically, to constructing a Dirac distribution at a point,
while ‘bind’ will correspond to a marginalizing operator. The same is true for
the underlying (applicative) functor structure: ‘fmap’ always corresponds to a
density-preserving transformation of the support, while applicativeness
corresponds to independence (yielding convolution, etc.). And you have to
admit, the connection to continuations is pretty cool.
There is clearly some connection to the codensity monad as well, but I
think I’ll let someone else figure out the specifics of that one. Something
something right-Kan extension..
10 Feb 2017
The Giry monad is the canonical probability monad that operates on the level of
measures, which are the abstract constructs that canonically represent
probability distributions. It’s sort of the baseline by which all other
probability monads can be judged.
In this article I’m going to go through the categorical and measure-theoretic
foundations of the Giry monad. In another article, I’ll describe how you can
implement it in a very faithful sense in Haskell.
I was putting some notes together for another project and wound up writing up
things up in a somewhat blog-friendly style, but this isn’t intended to be a
tutorial per se. Really this isn’t the kind of content I’d usually post
here, but since I’ve jotted everything up, I figured I may as well. If you
like extremely dry mathematics and computer science, you’re in the right place.
I won’t define everything under the sun here - for properties or coherence
conditions or other things that I’ve elided details on, check out something
like Mac Lane or Aliprantis & Border. I’ll include some references at the end.
This is the game plan we’re working with:
- Define monads and their supporting machinery in a categorical sense.
- Define probability measures and some required background around that.
- Construct the functor that maps a measurable space to the collection of all
probability measures on that space.
- Demonstrate that it’s a monad.
Let’s get started.
Categorical Foundations
A category is a collection of objects and morphisms between them.
So if , , , and are objects in , then ,
, and are examples of morphisms. These
morphisms can be composed in the obvious associative way, i.e.
and there exist identity morphisms (or automorphisms) that simply map objects
to themselves.
A functor is a mapping between categories (equivalently, it’s a morphism in
the category of so-called ‘small’ categories). The functor
takes every object in to some object in , and every morphism in
to some morphism in , such that the structure of morphism
composition is preserved. An endofunctor is a functor from a category to
itself, and a bifunctor is a functor from a pair of categories to another
category, i.e. .
A natural transformation is a mapping between functors. So for two functors
, a natural transformation associates
to every object in a morphism in
.
A monoidal category is a category with some additional monoidal
structure, namely an identity object and a bifunctor called the tensor product, plus several natural isomorphisms that
provide the associativity of the tensor product and its right and left identity
with the identity object .
A monoid in a monoidal category is an object
in together with two morphisms (obeying the standard associativity and
identity properties) that make use of the category’s monoidal structure: the
associative binary operator , and the identity
.
A monad is (infamously) a ‘monoid in the category of endofunctors’. So take
the category of endofunctors whose objects are endofunctors and
whose morphisms are natural transformations between them. This is a monoidal
category; there exists an identity endofunctor for all
in , plus a tensor product defined by functor composition such that the
required associativity and identity properties hold. is thus a
monoidal category, and any specific monoid we construct on
it is a specific monad.
Probabilistic Foundations
A measurable space is a set equipped with a
topology-like structure called a -algebra that
essentially contains every well-behaved subset of in some sense. A
measure is a particular kind of set
function from the -algebra to the nonnegative real line. A measure
just assigns a generalized notion of area or volume to well-behaved subsets of
. In particular, if the total possible area or volume of the underlying
set is 1 then we’re dealing with a probability measure. A measurable space
completed with a measure, e.g. is called a measure
space, and a measurable space completed with a probability measure is called a
probability space.
There is a lot of overloaded lingo around the word ‘measurable’. A
‘measurable set’ is an element of a -algebra in a measurable space.
A measurable mapping is a mapping between measurable spaces. Given a
‘source’ measurable space and ‘target’ measurable space
, a measurable mapping is a map with the property that, for any
measurable set in the target, the inverse image is measurable in the source.
Or, formally, for any in , you have that is
in .
The Space of Probability Measures on a Measurable Space
If you consider the collection of all measurable spaces and measurable mappings
between them, you get a category. Define to be the category
of measurable spaces. So, objects are measurable spaces and morphisms are the
measurable mappings between them.
For any specific measurable space in , we can consider
the space of all possible probability measures that could be placed on it and
denote that . To be clear, is a space of
measures - that is, a space in which the points themselves are probability
measures.
What’s remarkable about is that it is itself a measurable
space. Let me explain.
As a probability measure, any element of is a function from
measurable subsets of to the interval in . That
is: if is the measurable space , then a point
in is a function . For any
measurable in , there just naturally exists a sort of ‘evaluation’
mapping I’ll call that takes a
measure on and evaluates it on the set . To be explicit: if
is a measure in , then simply evaluates
. It ‘runs’ the measure in a sense; in Haskell, would be
analogous to a function like \f -> f a
for some a
.
This evaluation map corresponds to an integral. If you have a
measurable space , then for any a subset in
, for
the characteristic or indicator function of (where is
if is in , and is otherwise). And we can actually extend
to operate over measurable mappings from to
, where is
a suitable -algebra on . Here we typically use what’s
called the Borel -algebra, which takes a topology on the set and
then generates a -algebra from the open sets in the topology (for
we can just use the ‘usual’ topology generated by the Euclidean
metric). For a measurable function, we can define the
evaluation mapping as .
We can abuse notation here a bit and just use to refer to ‘duck typed’
mappings that evaluate measures over measurable sets or measurable functions
depending on context. If we treat as a function
, then has type .
If we treat as a function , then
has type . I’ll say
to refer to the mappings that accept either measurable sets or functions.
In any case. For a measurable space , there exists a topology on
called the weak-* topology that makes all the evaluation
mappings continuous for any measurable set or
measurable function . From there, we can generate the Borel
-algebra that makes the evaluation
functions measurable. The result is that
is itself a measurable space,
and thus an object in .
The space actually has all sorts of insane properties that
one wouldn’t expect - there are implications on convexity, completeness,
compactness and such that carry over from . But I digress.
is a Functor
So: for any an object in , we have that
is also an object in . And if you look at
like a functor, you notice that it takes objects of
to objects of . Indeed, you can define an
analogous procedure on morphisms in as follows. Take
to be another object (read: measurable space) in and to be a morphism (read: measurable mapping) between them. Now, for any
measure in we can define (this is called the image, distribution, or pushforward of
under ). For some and ,
thus takes measurable sets in to a value in the interval -
that is, it is a measure on . So we have that:
and so is an endofunctor on .
is a Monad
See where we’re going here? If we can define natural transformations
and such that is a monoid in the category
of endofunctors, we’ll have defined a monad. We thus need to come up with a
suitable monoidal structure, et voilà.
First the identity. We want a natural transformation between the
identity functor and the functor such
that for any measurable
space in . Evaluating the identity functor simplifies
things to .
We can define this concretely as follows. Grab a measurable space in
and define for any point and any measurable set . is thus a
probability measure on - we assign to measurable sets that contain
, and 0 to those that don’t. If we peel away another argument, we have
that , as required.
So takes points in measurable spaces to probability measures on those
spaces. In technical parlance, it takes a point to the Dirac
measure at - the probability measure that places the entirety of its
mass at .
Now for the other part of the monoidal structure, . I initially found
this next part to be a bit of a mind fuck, but let me see what I can do about
that.
Recall that the category of endofunctors, , is monoidal, so
there exists a tensor product that we can deal with, which here just corresponds to functor
composition. We’re looking for a natural transformation:
which is often written as:
Take a measurable space in and then
consider the space of probability measures over it, . Then
take the space of probability measures over the space of probability measures
on , . Since is an
endofunctor, this is again a measurable space, and for any measurable subset
of we again have a family of mappings that take a
probability measure in and evaluate it on
. We want to be the thing that turns a measure over measures
into a plain old probability measure on .
In the context of probability theory, this kind of semigroup action is a
marginalizing operator. We’re taking the ‘uncertainty’ captured in
via the probability measure and
smearing it into the probability measures in .
Take in and some a measurable
subset of . We can define as follows:
Using some lambda calculus notation to see the argument for , we can
expand the integrals to get the following gnarly expression:
Notice what’s happening here. For a measurable space, we’re integrating
over the space of probability measures on , with
respect to the probability measure , which itself is a point in the
space of probability measures over probability measures on ,
. Whew.
The spaces we’re integrating over here are unusual, but is still a
probability measure, so when applied to a measurable set in
it results in a probability in .
So, peeling back an argument, we have that has type . In other words, it’s a probability measure on , and
thus is in . And if we peel back another argument, we find
that:
so, as required, that
It’s also worth noting that we can overload the notation for in the
same way we did for , i.e. to supply measurable functions in addition
to measurable sets:
Combining the three components, we get , the
canonical Giry monad.
In Haskell, when we’re dealing with monads we typically use the bind operator
instead of manually dealing with the functorial structure and
(called ‘join’). Bind has the type:
and for illustration, we can define for the Giry monad like so:
Here is in , is in ,
and is in , so note that we potentially simplify the
outermost integral enormously. It now operates over a general measurable
space, rather than a space of measures in particular, and this will come in
handy when we get to implementation details in the next post.
Wrapping Up
That’s about it for now. It’s worth noting as a kind of footnote here that the
existence of the Giry monad also obviously implies the existence of a Giry
applicative functor. But the official situation for applicative functors seems
kind of weird in this context, and I’m not yet up to the task of dealing with
it formally.
Intuitively, one should be able to define the binary applicative operator
characteristic of its lax monoidal structure as follows:
But this has some really weird measure-theoretic implications - namely, that it
assumes the existence of a space of probability measures over the space of
all measurable functions , which is not trivial to define
and indeed may not even exist. It seems like some people are looking into this
problem as I just happened to stumble on this paper on the arXiv while
doing some googling. I notice that some people on e.g. nLab require categories
with additional structure beyond for the development of the
Giry monad as well, for example the category of Polish (separable, completely
metrizable) spaces , so maybe the extra structure there takes
care of the quirks.
Anyway. Applicatives are neat here because applicative probability measures
are independent probability measures. And the existence of
applicativeness means you can do all the things with independent probability
measures that you might be used to. Measure convolution and friends are good
examples. Given a measurable space that supports some notion of addition
and two probability measures and in , we
can add measures together via:
where and are both points in . Subtraction and multiplication
translate trivially as well.
In another article I’ll detail how the Giry monad can be implemented in Haskell
and point out some neat extensions. There are some cool connections to
continuations and codensity monads, and seemingly de Finetti’s theorem and
exchangeability. That kind of thing. It’d also be worth trying to justify
independence of probability measures from a categorical perspective, which
seems easier than resolving the nitty-gritty measurability qualms I mentioned
above.
‘Til then! Thanks to Jason Forbes for sifting through this stuff and
providing some great comments.
References:
04 Jan 2017
Here’s a short one.
I use Colin Percival’s Hacker News Daily to catch the top ten articles
of the day on Hacker News. Today an article called Why Recursive Data
Structures? popped up, which illustrates that recursive algorithms can
become both intuitive and borderline trivial when a suitable data structure is
used to implement them. This is exactly the motivation for using recursion
schemes.
In the above article, Reginald rotates squares by representing them via a
quadtree. If we have a square of bits, something like:
then we want to be able to easily rotate it 90 degrees clockwise, for example.
So let’s define a quadtree in Haskell:
{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE LambdaCase #-}
import Data.Functor.Foldable
import Data.List.Split
data QuadTreeF a r =
NodeF r r r r
| LeafF a
| EmptyF
deriving (Show, Functor)
type QuadTree a = Fix (QuadTreeF a)
The four fields of the ‘NodeF’ constructor correspond to the upper left, upper
right, lower right, and lower left quadrants of the tree respectively.
Gimme some embedded language terms:
node :: QuadTree a -> QuadTree a -> QuadTree a -> QuadTree a -> QuadTree a
node ul ur lr ll = Fix (NodeF ul ur lr ll)
leaf :: a -> QuadTree a
leaf = Fix . LeafF
empty :: QuadTree a
empty = Fix EmptyF
That lets us define quadtrees easily. Here’s the tree that the previous
diagram corresponds to:
tree :: QuadTree Bool
tree = node ul ur lr ll where
ul = node (leaf False) (leaf True) (leaf False) (leaf False)
ur = node (leaf False) (leaf False) (leaf False) (leaf True)
lr = node (leaf True) (leaf False) (leaf False) (leaf False)
ll = node (leaf True) (leaf True) (leaf False) (leaf False)
Rotating is then really easy - we rotate each quadrant recursively. Just reach
for a catamorphism:
rotate :: QuadTree a -> QuadTree a
rotate = cata $ \case
NodeF ul ur lr ll -> node ll ul ur lr
LeafF a -> leaf a
EmptyF -> empty
Notice that you just have to shift each field of ‘NodeF’ rightward, with
wraparound. Then if you rotate and render the original tree you get:
Rotating things more times yields predictable results.
If you want to rotate another structure - say, a flat list - you can go through
a quadtree as an intermediate representation using the same pattern I described
in Sorting with Style. Build yourself a coalgebra and algebra pair:
builder :: [a] -> QuadTreeF a [a]
builder = \case
[] -> EmptyF
[x] -> LeafF x
xs -> NodeF a b c d where
[a, b, c, d] = chunksOf (length xs `div` 4) xs
consumer :: QuadTreeF a [a] -> [a]
consumer = \case
EmptyF -> []
LeafF a -> [a]
NodeF ul ur lr ll -> concat [ll, ul, ur, lr]
and then glue them together with a hylomorphism:
rotateList :: [a] -> [a]
rotateList = hylo consumer builder
Neato.
For a recent recursion scheme resource I’ve spotted on the Twitters, check out
Pascal Hartig’s compendium in progress.
26 Nov 2016
To the.. uh, ‘layperson’, pre- and postpromorphisms are probably well into the
WTF category of recursion schemes. This is a mistake - they’re simple and
useful, and I’m going to try and convince you of this in short order.
Preliminaries:
{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE LambdaCase #-}
import Data.Functor.Foldable
import Prelude hiding (sum)
For simplicity, let’s take a couple of standard interpreters on lists. We’ll
define ‘sumAlg’ as an interpreter for adding up list contents and ‘lenAlg’ for
just counting the number of elements present:
sumAlg :: Num a => ListF a a -> a
sumAlg = \case
Cons h t -> h + t
Nil -> 0
lenAlg :: ListF a Int -> Int
lenAlg = \case
Cons h t -> 1 + t
Nil -> 0
Easy-peasy. We can use cata to make these
things useful:
sum :: Num a => [a] -> a
sum = cata sumAlg
len :: [a] -> Int
len = cata lenAlg
Nothing new there; ‘sum [1..10]’ will give you 55 and ‘len [1..10]’ will give
you 10.
An interesting twist is to consider only small elements in some sense; say,
we only want to add or count elements that are less than or equal to 10, and
ignore any others.
We could rewrite the previous interpreters, manually checking for the condition
we’re interested in and handling it accordingly:
smallSumAlg :: (Ord a, Num a) => ListF a a -> a
smallSumAlg = \case
Cons h t ->
if h <= 10
then h + t
else 0
Nil -> 0
smallLenAlg :: (Ord a, Num a) => ListF a Int -> Int
smallLenAlg = \case
Cons h t ->
if h <= 10
then 1 + t
else 0
Nil -> 0
And you get ‘smallSum’ and ‘smallLen’ by using ‘cata’ on them respectively.
They work like you’d expect - ‘smallLen [1, 5, 20]’ ignores the 20 and just
returns 2, for example.
You can do better though. Enter the prepromorphism.
Instead of writing additional special-case interpreters for the ‘small’ case,
consider the following natural transformation on the list base functor. It
maps the list base functor to itself, without needing to inspect the carrier
type:
small :: (Ord a, Num a) => ListF a b -> ListF a b
small Nil = Nil
small term@(Cons h t)
| h <= 10 = term
| otherwise = Nil
A prepromorphism is a ‘cata’-like recursion scheme that proceeds by first
applying a natural transformation before interpreting via a supplied algebra.
That’s.. surprisingly simple. Here are ‘smallSum’ and ‘smallLen’, defined
without needing to clumsily create new special-case algebras:
smallSum :: (Ord a, Num a) => [a] -> a
smallSum = prepro small sumAlg
smallLen :: (Ord a, Num a) => [a] -> Int
smallLen = prepro small lenAlg
They work great:
> smallSum [1..100]
55
> smallLen [1..100]
10
In pseudo category-theoretic notation you visualize how a prepromorphism works
via the following commutative diagram:

The only difference, when compared to a standard
catamorphism, is the presence of the natural
transformation applied via the looping arrow in the top left. The natural
transformation ‘h’ has type ‘forall r. Base t r -> Base t r’, and ‘embed’ has
type ‘Base t t -> t’, so their composition gets you exactly the type you need
for an algebra, which is then the input to ‘cata’ there. Mapping the
catamorphism over the type ‘Base t t’ brings it right back to ‘Base t t’.
A postpromorphism is dual to a prepromorphism. It’s ‘ana’-like; proceed with
your corecursive production, applying natural transformations as you go.
Here’s a streaming coalgebra:
streamCoalg :: Enum a => a -> ListF a a
streamCoalg n = Cons n (succ n)
A normal anamorphism would just send this thing shooting off into infinity, but
we can use the existing ‘small’ natural transformation to cap it at 10:
smallStream :: (Ord a, Num a, Enum a) => a -> [a]
smallStream = postpro small streamCoalg
You get what you might expect:
> smallStream 3
[3,4,5,6,7,8,9,10]
And similarly, you can visualize a postpromorphism like so:

In this case the natural transformation is applied after mapping the
postpromorphism over the base functor (hence the ‘post’ namesake).