# Foundations of the Giry Monad

10 Feb 2017

The Giry monad is the canonical probability monad that operates on the level of measures, which are the abstract constructs that canonically represent probability distributions. It’s sort of the baseline by which all other probability monads can be judged.

In this article I’m going to go through the categorical and measure-theoretic foundations of the Giry monad. In another article, I’ll describe how you can implement it in a very faithful sense in Haskell.

I was putting some notes together for another project and wound up writing
things up in a somewhat blog-friendly style, but this isn’t intended to be a
tutorial *per se*. Really this isn’t the kind of content I’d usually post
here, but since I’ve jotted everything down, I figured I may as well. If you
like extremely dry mathematics and computer science, you’re in the right place.

I won’t define everything under the sun here - for properties or coherence conditions or other things that I’ve elided details on, check out something like Mac Lane or Aliprantis & Border. I’ll include some references at the end.

This is the game plan we’re working with:

- Define monads and their supporting machinery in a categorical sense.
- Define probability measures and some required background around that.
- Construct the functor that maps a measurable space to the collection of all probability measures on that space.
- Demonstrate that it’s a monad.

Let’s get started.

## Categorical Foundations

A *category* \(C\) is a collection of *objects* and *morphisms* between them.
So if \(W\), \(X\), \(Y\), and \(Z\) are objects in \(C\), then \(f : W \to X\),
\(g : X \to Y\), and \(h : Y \to Z\) are examples of morphisms. These
morphisms can be composed in the obvious associative way, i.e.

\[h \circ (g \circ f) = (h \circ g) \circ f\]

and every object has an identity morphism that simply maps the object to
itself and acts as a unit for composition.

A *functor* is a mapping between categories (equivalently, it’s a morphism in
the category of so-called ‘small’ categories). The functor \(F : C \to D\)
takes every object in \(C\) to some object in \(D\), and every morphism in
\(C\) to some morphism in \(D\), such that the structure of morphism
composition is preserved. An *endofunctor* is a functor from a category to
itself, and a *bifunctor* is a functor from a pair of categories to another
category, i.e. \(F : A \times B \to C\).

A *natural transformation* is a mapping between functors. So for two functors
\(F, G : C \to D\), a natural transformation \(\epsilon : F \to G\) associates
to every object \(c\) in \(C\) a morphism \(\epsilon_c : F(c) \to G(c)\) in
\(D\), such that \(\epsilon_{c'} \circ F(f) = G(f) \circ \epsilon_c\) for every
morphism \(f : c \to c'\) in \(C\).
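As a rough Haskell illustration (treating Haskell types and functions as a category, which is only an approximation), a polymorphic function between two functors plays the role of a natural transformation. The names `safeHead` and `naturalitySquare` are hypothetical and chosen just for this sketch.

```haskell
-- A family of maps from the list functor to the Maybe functor,
-- one component for every type a.
safeHead :: [a] -> Maybe a
safeHead []      = Nothing
safeHead (x : _) = Just x

-- Both paths around the naturality square for a given f and input list:
-- mapping f after taking the head, versus taking the head after mapping f.
naturalitySquare :: Eq b => (a -> b) -> [a] -> Bool
naturalitySquare f xs = fmap f (safeHead xs) == safeHead (fmap f xs)
```

Here the two functors are `[]` and `Maybe`, and `naturalitySquare f xs` checks one instance of the commuting condition rather than proving it.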

A *monoidal category* \(C\) is a category with some additional monoidal
structure, namely an identity object \(I\) and a bifunctor \(\otimes : C \times
C \to C\) called the *tensor product*, plus several natural isomorphisms that
provide the associativity of the tensor product and its right and left identity
with the identity object \(I\).

A *monoid* \((M, \mu, \eta)\) in a monoidal category \(C\) is an object \(M\)
in \(C\) together with two morphisms (obeying the standard associativity and
identity properties) that make use of the category’s monoidal structure: the
associative binary operator \(\mu : M \otimes M \to M\), and the identity
\(\eta : I \to M\).

A *monad* is (infamously) a ‘monoid in the category of endofunctors’. So fix a
category \(C\) and take the category of endofunctors \(\mathcal{F}\) whose
objects are endofunctors on \(C\) and whose morphisms are natural
transformations between them. There exists an identity endofunctor
\(1_\mathcal{F}\), satisfying \(1_\mathcal{F} \circ F = F \circ 1_\mathcal{F} =
F\) for all \(F\) in \(\mathcal{F}\), plus a tensor product \(\otimes :
\mathcal{F} \times \mathcal{F} \to \mathcal{F}\) defined by functor composition
such that the required associativity and identity properties hold.
\(\mathcal{F}\) is thus a monoidal category, and any specific monoid \((F, \mu,
\eta)\) we construct on it is a specific monad.
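To make the \((F, \mu, \eta)\) presentation concrete, here is a small sketch using the familiar list functor in Haskell, with `concat` playing the role of \(\mu\) and the singleton constructor playing the role of \(\eta\). The names `mu`, `eta`, `assocLaw`, and `unitLaws` are illustrative only, and the checks test single instances of the laws rather than proving them.

```haskell
-- The list endofunctor as a monad (F, mu, eta):
-- mu flattens one layer of nesting, eta wraps a value in a singleton list.
mu :: [[a]] -> [a]
mu = concat

eta :: a -> [a]
eta x = [x]

-- Associativity: flattening the outer layers first or the inner layers
-- first gives the same result.
assocLaw :: Eq a => [[[a]]] -> Bool
assocLaw xsss = mu (mu xsss) == mu (fmap mu xsss)

-- Left and right identity: wrapping with eta and then flattening is a no-op.
unitLaws :: Eq a => [a] -> Bool
unitLaws xs = mu (eta xs) == xs && mu (fmap eta xs) == xs
```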

## Probabilistic Foundations

A *measurable space* \((X, \mathcal{X})\) is a set \(X\) equipped with a
topology-like structure called a \(\sigma\)-algebra \(\mathcal{X}\): a
collection of subsets of \(X\) that contains \(X\) itself and is closed under
complements and countable unions. It essentially contains every well-behaved
subset of \(X\) in some sense. A *measure* \(\nu : \mathcal{X} \to \mathbb{R}\)
is a particular kind of set function from the \(\sigma\)-algebra to the
nonnegative real line: one that assigns zero to the empty set and is countably
additive over disjoint measurable sets. A measure just assigns a generalized
notion of area or volume to well-behaved subsets of \(X\). In particular, if
the total possible area or volume of the underlying set is 1 then we’re dealing
with a *probability measure*. A measurable space equipped with a measure, i.e.
a triple \((X, \mathcal{X}, \nu)\), is called a *measure space*, and a
measurable space equipped with a probability measure is called a
*probability space*.
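For a concrete instance (an illustrative example, not part of the development that follows), consider a fair six-sided die. Take \(X = \{1, \dots, 6\}\), let \(\mathcal{X}\) be the collection of all subsets of \(X\), and set

\[\nu(A) = \frac{|A|}{6} \quad \text{for } A \in \mathcal{X}, \qquad \text{so that} \quad \nu(\emptyset) = 0, \quad \nu(\{2, 4, 6\}) = \tfrac{1}{2}, \quad \nu(X) = 1.\]

Since \(\nu(X) = 1\), the triple \((X, \mathcal{X}, \nu)\) is a probability space.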

There is a lot of overloaded lingo around the word ‘measurable’. A
‘measurable set’ is an element of a \(\sigma\)-algebra in a measurable space.
A *measurable mapping* is a mapping between measurable spaces. Given a
‘source’ measurable space \((X, \mathcal{X})\) and ‘target’ measurable space
\((Y, \mathcal{Y})\), a measurable mapping \((X, \mathcal{X}) \to (Y,
\mathcal{Y})\) is a map \(T : X \to Y\) with the property that, for any
measurable set in the target, the inverse image is measurable in the source.
Or, formally, for any \(B\) in \(\mathcal{Y}\), you have that \(T^{-1}(B)\) is
in \(\mathcal{X}\).

## The Space of Probability Measures on a Measurable Space

If you consider the collection of all measurable spaces and measurable mappings between them, you get a category. Define \(\textbf{Meas}\) to be the category of measurable spaces. So, objects are measurable spaces and morphisms are the measurable mappings between them.

For any specific measurable space \(M\) in \(\textbf{Meas}\), we can consider
the space of all possible probability measures that could be placed on it and
denote that \(\mathcal{P}(M)\). To be clear, \(\mathcal{P}(M)\) is a *space of
measures* - that is, a space in which the points themselves are probability
measures.

What’s remarkable about \(\mathcal{P}(M)\) is that it is *itself* a measurable
space. Let me explain.

As a probability measure, any element of \(\mathcal{P}(M)\) is a function from
measurable subsets of \(M\) to the interval \([0, 1]\) in \(\mathbb{R}\). That
is: if \(M\) is the measurable space \((X, \mathcal{X})\), then a point \(\nu\)
in \(\mathcal{P}(M)\) is a function \(\mathcal{X} \to \mathbb{R}\). For any
measurable \(A\) in \(M\), there just naturally exists a sort of ‘evaluation’
mapping I’ll call \(\tau_A: \mathcal{P}(M) \to \mathbb{R}\) that takes a
measure on \(M\) and evaluates it on the set \(A\). To be explicit: if \(\nu\)
is a measure in \(\mathcal{P}(M)\), then \(\tau_A\) simply evaluates
\(\nu(A)\). It ‘runs’ the measure in a sense; in Haskell, \(\tau_A\) would be
analogous to a function like `\f -> f a` for some `a`.

This evaluation map \(\tau_A\) corresponds to an *integral*. If you have a
measurable space \((X, \mathcal{X})\), then for any set \(A\) in
\(\mathcal{X}\), \(\tau_A(\nu) = \nu(A) = \int_{X}\chi_A d\nu\) for \(\chi_A\)
the characteristic or indicator function of \(A\) (where \(\chi_A(x)\) is \(1\)
if \(x\) is in \(A\), and is \(0\) otherwise). And we can actually extend
\(\tau\) to operate over measurable mappings from \((X, \mathcal{X})\) to
\((\mathbb{R}, \mathcal{B}(\mathbb{R}))\), where \(\mathcal{B}(\mathbb{R})\) is
a suitable \(\sigma\)-algebra on \(\mathbb{R}\). Here we typically use what’s
called the *Borel* \(\sigma\)-algebra, which takes a topology on the set and
then generates a \(\sigma\)-algebra from the open sets in the topology (for
\(\mathbb{R}\) we can just use the ‘usual’ topology generated by the Euclidean
metric). For \(f : X \to \mathbb{R}\) a measurable function, we can define the
evaluation mapping \(\tau_f : \mathcal{P}(M) \to \mathbb{R}\) as \(\tau_f(\nu)
= \int_X f d\nu\).

We can abuse notation here a bit and just use \(\tau\) to refer to ‘duck typed’ mappings that evaluate measures over measurable sets or measurable functions depending on context. If we treat \(\tau_A(\nu)\) as a function \(\tau(\nu)(A)\), then \(\tau(\nu)\) has type \(\mathcal{X} \to \mathbb{R}\). If we treat \(\tau_f(\nu)\) as a function \(\tau(\nu)(f)\), then \(\tau(\nu)\) has type \((X \to \mathbb{R}) \to \mathbb{R}\). I’ll say \(\tau_{\{A, f\}}\) to refer to the mappings that accept either measurable sets or functions.
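The second reading, where \(\tau(\nu)\) has type \((X \to \mathbb{R}) \to \mathbb{R}\), suggests one way to sketch this in Haskell: represent a measure by its integration functional. The block below is only a sketch under a strong assumption, namely that measures have finite support so integrals reduce to weighted sums. The names `Measure`, `integrate`, `weighted`, `indicator`, `tauF`, `tauA`, and `die` are all made up for illustration, and measurable sets are stood in for by predicates.

```haskell
-- A measure, represented by how it integrates real-valued functions.
newtype Measure a = Measure { integrate :: (a -> Double) -> Double }

-- Hypothetical helper: a finitely supported measure from weighted points.
weighted :: [(a, Double)] -> Measure a
weighted xs = Measure (\f -> sum [w * f x | (x, w) <- xs])

-- Indicator function of a set, with the set given as a predicate.
indicator :: (a -> Bool) -> a -> Double
indicator p x = if p x then 1 else 0

-- tau_f: integrate a function against a measure.
tauF :: (a -> Double) -> Measure a -> Double
tauF f nu = integrate nu f

-- tau_A: evaluate a measure on a set, as the integral of its indicator.
tauA :: (a -> Bool) -> Measure a -> Double
tauA a = tauF (indicator a)

-- Example: a fair six-sided die.
die :: Measure Int
die = weighted [(k, 1 / 6) | k <- [1 .. 6]]
```

With these definitions, `tauA even die` evaluates the die measure on the set of even faces, and `tauF fromIntegral die` computes the expected face value.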

In any case. For a measurable space \(M\), there exists a topology on
\(\mathcal{P}(M)\) called the *weak-\* topology* that makes all the evaluation
mappings \(\tau_{\{A, f\}}\) continuous for any measurable set \(A\) or
measurable function \(f\). From there, we can generate the Borel
\(\sigma\)-algebra \(\mathcal{B}(\mathcal{P}(M))\) that makes the evaluation
functions \(\tau_{\{A, f\}}\) measurable. (Giry’s original construction on
\(\textbf{Meas}\) takes the smallest \(\sigma\)-algebra on \(\mathcal{P}(M)\)
that makes every \(\tau_A\) measurable; measurability of the evaluation maps is
the property we need below.) The result is that
\((\mathcal{P}(M), \mathcal{B}(\mathcal{P}(M)))\) is itself a measurable space,
and thus an object in \(\textbf{Meas}\).

The space \(\mathcal{P}(M)\) actually has all sorts of insane properties that one wouldn’t expect - there are implications on convexity, completeness, compactness and such that carry over from \(M\). But I digress.

## \(\mathcal{P}\) is a Functor

So: for any \(M\) an object in \(\textbf{Meas}\), we have that \(\mathcal{P}(M)\) is also an object in \(\textbf{Meas}\). And if you look at \(\mathcal{P}\) like a functor, you notice that it takes objects of \(\textbf{Meas}\) to objects of \(\textbf{Meas}\). Indeed, you can define an analogous procedure on morphisms in \(\textbf{Meas}\) as follows. Take \(N\) to be another object (read: measurable space) in \(\textbf{Meas}\) and \(T : M \to N\) to be a morphism (read: measurable mapping) between them. Now, for any measure \(\nu\) in \(\mathcal{P}(M)\) we can define \(\mathcal{P}(T)(\nu) = \nu \circ T^{-1}\) (this is called the image, distribution, or pushforward of \(\nu\) under \(T\)). For some \(T\) and \(\nu\), \(\mathcal{P}(T)(\nu)\) thus takes measurable sets in \(N\) to a value in the interval \([0, 1]\) - that is, it is a probability measure on \(N\), and thus a point in \(\mathcal{P}(N)\). So we have that:

\[\mathcal{P}(T) : \mathcal{P}(M) \to \mathcal{P}(N)\]

and, since pushing forward preserves identities and composition, \(\mathcal{P}\) is an endofunctor on \(\textbf{Meas}\).
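In terms of integrals, the pushforward satisfies \(\int_N f \, d(\nu \circ T^{-1}) = \int_M (f \circ T) \, d\nu\), and that identity translates almost directly into code. The following is a sketch only, assuming finitely supported measures represented by their integration functionals; `Measure`, `weighted`, `pushforward`, `probability`, and `die` are hypothetical names for this example.

```haskell
-- A measure, represented by how it integrates real-valued functions.
newtype Measure a = Measure { integrate :: (a -> Double) -> Double }

-- Hypothetical helper: a finitely supported measure from weighted points.
weighted :: [(a, Double)] -> Measure a
weighted xs = Measure (\f -> sum [w * f x | (x, w) <- xs])

-- P(T): integrating f against the pushforward of nu under t is the same
-- as integrating (f . t) against nu.
pushforward :: (a -> b) -> Measure a -> Measure b
pushforward t nu = Measure (\f -> integrate nu (f . t))

instance Functor Measure where
  fmap = pushforward

-- Probability of a set given as a predicate.
probability :: (a -> Bool) -> Measure a -> Double
probability p nu = integrate nu (\x -> if p x then 1 else 0)

-- Example: a fair six-sided die.
die :: Measure Int
die = weighted [(k, 1 / 6) | k <- [1 .. 6]]
```

For instance, ``fmap (`mod` 2) die`` is the distribution of the parity of a die roll, so ``probability (== 0) (fmap (`mod` 2) die)`` asks for the probability of an even face.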

## \(\mathcal{P}\) is a Monad

See where we’re going here? If we can define natural transformations \(\mu\) and \(\eta\) such that \((\mathcal{P}, \mu, \eta)\) is a monoid in the category of endofunctors, we’ll have defined a monad. We thus need to come up with a suitable monoidal structure, et voilà.

First the identity. We want a natural transformation \(\eta\) between the identity functor \(1_{\mathcal{F}}\) and the functor \(\mathcal{P}\) such that \(\eta_M : 1_{\mathcal{F}}(M) \to \mathcal{P}(M)\) for any measurable space \(M\) in \(\textbf{Meas}\). Evaluating the identity functor simplifies things to \(\eta_M : M \to \mathcal{P}(M)\).

We can define this concretely as follows. Grab a measurable space \(M\) in \(\textbf{Meas}\) and define \(\eta(x)(A) = \chi_A(x)\) for any point \(x \in M\) and any measurable set \(A \subseteq M\). \(\eta(x)\) is thus a probability measure on \(M\) - we assign \(1\) to measurable sets that contain \(x\), and 0 to those that don’t. If we peel away another argument, we have that \(\eta : M \to \mathcal{P}(M)\), as required.

So \(\eta\) takes points in measurable spaces to probability measures on those
spaces. In technical parlance, it takes a point \(x\) to the *Dirac
measure* at \(x\) - the probability measure that places the entirety of its
mass at \(x\).
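In the integration-functional picture, the Dirac measure at \(x\) integrates a function simply by evaluating it at \(x\). A minimal sketch, with `Measure`, `dirac`, and `probability` being names assumed here for illustration:

```haskell
-- A measure, represented by how it integrates real-valued functions.
newtype Measure a = Measure { integrate :: (a -> Double) -> Double }

-- eta: the Dirac measure at x puts all of its mass on the point x,
-- so integrating f against it just evaluates f at x.
dirac :: a -> Measure a
dirac x = Measure (\f -> f x)

-- Probability of a set given as a predicate: the integral of its indicator.
probability :: (a -> Bool) -> Measure a -> Double
probability p nu = integrate nu (\x -> if p x then 1 else 0)
```

So `probability (> 2) (dirac 3)` is 1 because the set contains the point 3, while `probability (> 5) (dirac 3)` is 0.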

Now for the other part of the monoidal structure, \(\mu\). I initially found this next part to be a bit of a trip, but let me see what I can do about that.

Recall that the category of endofunctors, \(\mathcal{F}\), is monoidal, so there exists a tensor product \(\otimes : \mathcal{F} \times \mathcal{F} \to \mathcal{F}\) that we can deal with, which here just corresponds to functor composition. We’re looking for a natural transformation:

\[\mu : \mathcal{P} \circ \mathcal{P} \to \mathcal{P}\]

which is often written as:

\[\mu : \mathcal{P}^2 \to \mathcal{P}.\]

Take \(M = (X, \mathcal{X})\) a measurable space in \(\textbf{Meas}\) and then
consider the space of probability measures over it, \(\mathcal{P}(M)\). Then
take the space of probability measures *over the space of probability measures*
on \(M\), \(\mathcal{P}(\mathcal{P}(M))\). Since \(\mathcal{P}\) is an
endofunctor, this is again a measurable space, and for any measurable subset
\(A\) of \(M\) we again have the evaluation mapping \(\tau_A : \mathcal{P}(M)
\to \mathbb{R}\), which is a measurable function on \(\mathcal{P}(M)\) and so
can be integrated with respect to any probability measure in
\(\mathcal{P}(\mathcal{P}(M))\). We want \(\mu\) to be the thing that turns a
measure over measures \(\rho\) into a plain old probability measure on \(M\),
i.e. a point in \(\mathcal{P}(M)\).

In the context of probability theory, this kind of semigroup action is a
*marginalizing* operator. We’re taking the ‘uncertainty’ captured in
\(\mathcal{P}(\mathcal{P}(M))\) via the probability measure \(\rho\) and
smearing it into the probability measures in \(\mathcal{P}(M)\).

Take \(\rho\) in \(\mathcal{P}(\mathcal{P}(M))\) and some \(A\) a measurable subset of \(M\). We can define \(\mu\) as follows:

\[\mu(\rho)(A) = \int_{\mathcal{P}(M)} \tau_A d\rho.\]

Using some lambda calculus notation to see the argument for \(\tau_A\), we can expand the integrals to get the following gnarly expression:

\[\mu(\rho)(A) = \int_{\mathcal{P}(M)} \left\{\lambda \nu . \int_M \chi_A d \nu \right\} d \rho.\]

Notice what’s happening here. For \(M\) a measurable space, we’re integrating over \(\mathcal{P}(M)\) the space of probability measures on \(M\), with respect to the probability measure \(\rho\), which itself is a point in the space of probability measures over probability measures on \(M\), \(\mathcal{P}(\mathcal{P}(M))\). Whew.

The spaces we’re integrating over here are unusual, but \(\rho\) is still a
probability measure, so when applied to a measurable set in
\(\mathcal{B}(\mathcal{P}(M))\) it results in a probability in \([0, 1]\).
So, peeling back an argument, we have that \(\mu(\rho)\) has type \(\mathcal{X}
\to \mathbb{R}\). In other words, it’s a probability measure on \(M\), and
thus is in \(\mathcal{P}(M)\). And if we peel back *another* argument, we find
that:

\[\mu : \mathcal{P}(\mathcal{P}(M)) \to \mathcal{P}(M)\]

so, as required, that

\[\mu : \mathcal{P}^{2} \to \mathcal{P}.\]

It’s also worth noting that we can overload the notation for \(\mu\) in the same way we did for \(\tau\), i.e. to supply measurable functions in addition to measurable sets:

\[\mu(\rho)(f) = \int_{\mathcal{P}(M)} \left\{\lambda \nu . \int_M f d \nu \right\} d \rho.\]

Combining the three components, we get \((\mathcal{P}, \mu, \eta)\), the canonical Giry monad.
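The function-argument form of \(\mu\) can be transcribed nearly symbol for symbol. The following is a sketch rather than a faithful implementation: it assumes finitely supported measures represented by their integration functionals, and `Measure`, `weighted`, `joinMeasure`, `coin`, `mixture`, and `probHeads` are all hypothetical names.

```haskell
-- A measure, represented by how it integrates real-valued functions.
newtype Measure a = Measure { integrate :: (a -> Double) -> Double }

-- Hypothetical helper: a finitely supported measure from weighted points.
weighted :: [(a, Double)] -> Measure a
weighted xs = Measure (\f -> sum [w * f x | (x, w) <- xs])

-- mu: integrate, with respect to rho, the map nu |-> (integral of f d nu).
joinMeasure :: Measure (Measure a) -> Measure a
joinMeasure rho = Measure (\f -> integrate rho (\nu -> integrate nu f))

-- A coin that lands heads (True) with probability p.
coin :: Double -> Measure Bool
coin p = weighted [(True, p), (False, 1 - p)]

-- A measure over measures: we are unsure which of two coins we hold.
mixture :: Measure (Measure Bool)
mixture = weighted [(coin 0.2, 0.5), (coin 0.8, 0.5)]

-- Marginal probability of heads after flattening the mixture.
probHeads :: Double
probHeads = integrate (joinMeasure mixture) (\b -> if b then 1 else 0)
```

Here `joinMeasure mixture` marginalizes over the uncertainty about which coin we have, so `probHeads` should come out to roughly \(0.5 \cdot 0.2 + 0.5 \cdot 0.8 = 0.5\), up to floating-point error.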

In Haskell, when we’re dealing with monads we typically use the bind operator \(\gg\!\!=\) instead of manually dealing with the functorial structure and \(\mu\) (called ‘join’). Bind has the type:

\[\gg\!\!= : \mathcal{P}(M) \to (M \to \mathcal{P}(N)) \to \mathcal{P}(N)\]

and for illustration, we can define \(\gg\!\!=\) for the Giry monad like so:

\[(\rho \gg\!\!= g)(f) = \int_{M} \left\{ \lambda m . \int_N f d g(m) \right\} d\rho.\]

Here \(\rho\) is in \(\mathcal{P}(M)\), \(g\) is in \(M \to \mathcal{P}(N)\),
and \(f\) is in \(N \to \mathbb{R}\), so note that we potentially simplify the
outermost integral enormously. It now operates over a *general* measurable
space, rather than a space of measures in particular, and this will come in
handy when we get to implementation details in the next post.
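As a preview of how that formula reads in code, here is a small sketch of bind. It is not necessarily how the follow-up implementation works: it assumes finitely supported measures represented by integration functionals, and the names `Measure`, `weighted`, `uniform`, `bind`, `twoStage`, and `probOne` are invented for the example.

```haskell
-- A measure, represented by how it integrates real-valued functions.
newtype Measure a = Measure { integrate :: (a -> Double) -> Double }

-- Hypothetical helper: a finitely supported measure from weighted points.
weighted :: [(a, Double)] -> Measure a
weighted xs = Measure (\f -> sum [w * f x | (x, w) <- xs])

-- Uniform measure on a non-empty finite list of points.
uniform :: [a] -> Measure a
uniform xs = weighted [(x, 1 / fromIntegral (length xs)) | x <- xs]

-- Bind: integrate, with respect to rho, the map m |-> (integral of f d g(m)).
bind :: Measure a -> (a -> Measure b) -> Measure b
bind rho g = Measure (\f -> integrate rho (\m -> integrate (g m) f))

-- Two-stage experiment: roll a die to get n, then draw uniformly from 1..n.
twoStage :: Measure Int
twoStage = uniform [1 .. 6] `bind` \n -> uniform [1 .. n]

-- Probability that the second-stage draw equals 1.
probOne :: Double
probOne = integrate twoStage (\k -> if k == 1 then 1 else 0)
```

The outer integral in `bind` runs over the points of the first space rather than over a space of measures, which is the simplification noted above. For this example `probOne` works out to \(\frac{1}{6} \sum_{n=1}^{6} \frac{1}{n} = \frac{49}{120}\), up to floating-point error.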

## Wrapping Up

That’s about it for now. It’s worth noting as a kind of footnote here that the existence of the Giry monad also obviously implies the existence of a Giry applicative functor. But the official situation for applicative functors seems kind of weird in this context, and I’m not yet up to the task of dealing with it formally.

Intuitively, one should be able to define the binary applicative operator characteristic of its lax monoidal structure as follows:

\[(\rho \, \langle \ast \rangle \, \nu)(f) = \int_{M \to N} \left\{\lambda T . \int_{M} (f \circ T) d\nu \right\} d \rho.\]

Here \(\rho\) is a probability measure on the space of measurable functions \(M \to N\), \(\nu\) is in \(\mathcal{P}(M)\), and \(f\) is in \(N \to \mathbb{R}\). But this has some really weird measure-theoretic implications - namely, that it assumes the existence of a space of probability measures over the space of all measurable functions \(M \to N\), which is not trivial to define and indeed may not even exist. It seems like some people are looking into this problem as I just happened to stumble on this paper on the arXiv while doing some googling. I notice that some people on e.g. nLab require categories with additional structure beyond \(\textbf{Meas}\) for the development of the Giry monad as well, for example the category of Polish (separable, completely metrizable) spaces \(\textbf{Pol}\), so maybe the extra structure there takes care of the quirks.
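For intuition only, the applicative operator is easy to write down when every measure has finite support, because then a 'measure over functions' is just a finite weighted list of functions. This sidesteps the measurability questions above rather than answering them. The names `Measure`, `weighted`, `apply`, `randomMap`, `points`, and `expectedValue` are hypothetical.

```haskell
-- A measure, represented by how it integrates real-valued functions.
newtype Measure a = Measure { integrate :: (a -> Double) -> Double }

-- Hypothetical helper: a finitely supported measure from weighted points.
weighted :: [(a, Double)] -> Measure a
weighted xs = Measure (\f -> sum [w * f x | (x, w) <- xs])

-- Integrate over functions t with respect to rho on the outside, and over
-- points with respect to nu on the inside.
apply :: Measure (a -> b) -> Measure a -> Measure b
apply rho nu = Measure (\f -> integrate rho (\t -> integrate nu (f . t)))

-- A finitely supported measure over functions: add one or double.
randomMap :: Measure (Double -> Double)
randomMap = weighted [((+ 1), 0.5), ((* 2), 0.5)]

-- A finitely supported measure over points.
points :: Measure Double
points = weighted [(1, 0.5), (3, 0.5)]

-- Expected value of a randomly chosen map applied to an independent point.
expectedValue :: Double
expectedValue = integrate (apply randomMap points) id
```

In this toy case `expectedValue` averages the four equally likely outcomes 2, 4, 2, and 6.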

Anyway. Applicatives are neat here because applicative probability measures are independent probability measures. And the existence of applicative structure means you can do all the things with independent probability measures that you might be used to. Measure convolution and friends are good examples. Given a measurable space \(M\) that supports some notion of addition and two probability measures \(\nu\) and \(\zeta\) in \(\mathcal{P}(M)\), we can add measures together via:

\[(\nu + \zeta)(f) = \int_{M}\int_{M}f(x + y)d\nu(x)d\zeta(y)\]

where \(x\) and \(y\) are both points in \(M\). Subtraction and multiplication translate trivially as well.
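The convolution formula can be sketched the same way, again assuming finitely supported measures represented by integration functionals. `Measure`, `weighted`, `convolve`, `die`, and `twoDice` are names made up for this example.

```haskell
-- A measure, represented by how it integrates real-valued functions.
newtype Measure a = Measure { integrate :: (a -> Double) -> Double }

-- Hypothetical helper: a finitely supported measure from weighted points.
weighted :: [(a, Double)] -> Measure a
weighted xs = Measure (\f -> sum [w * f x | (x, w) <- xs])

-- Convolution: the inner integral runs over x with respect to nu, and the
-- outer integral runs over y with respect to zeta.
convolve :: Num a => Measure a -> Measure a -> Measure a
convolve nu zeta =
  Measure (\f -> integrate zeta (\y -> integrate nu (\x -> f (x + y))))

-- Example: a fair six-sided die.
die :: Measure Int
die = weighted [(k, 1 / 6) | k <- [1 .. 6]]

-- Distribution of the sum of two independent rolls.
twoDice :: Measure Int
twoDice = convolve die die
```

Integrating the indicator of `(== 7)` against `twoDice` gives the probability of rolling a seven with two dice, and integrating `fromIntegral` gives the expected total.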

In another article I’ll detail how the Giry monad can be implemented in Haskell and point out some neat extensions. There are some cool connections to continuations and codensity monads, and seemingly de Finetti’s theorem and exchangeability. That kind of thing. It’d also be worth trying to justify independence of probability measures from a categorical perspective, which seems easier than resolving the nitty-gritty measurability qualms I mentioned above.

‘Til then! Thanks to Jason Forbes for sifting through this stuff and providing some great comments.