# The Applicative Structure of the Giry Monad

In my last two posts about the Giry monad I derived the thing from its categorical and measure-theoretic foundations. I kind of thought that those posts wouldn’t be of much interest to people but they turned out to be a hit. I clearly can’t tell what the internet likes.

Anyway, something I left out was the theoretical foundation of the Giry monad’s Applicative instance, which seemed a bit odd. I also pointed out that applicativeness in the context of probability implies independence between probability measures.

In this article I’m going to look at each of these issues. After playing around with the foundations, it looks like the applicative instance for the Giry monad can be put on a sound footing in terms of the standard measure-theoretic concept of product measure. Also, it turns out that the claim I made of applicativeness $$\implies$$ independence is somewhat ill-posed. But, using the shiny new intuition we’ll glean from a better understanding of the applicative structure, we can put that on a solid footing too.

So let’s look at both of these things and see what’s up.

## Monoidal Functors

The foundational categorical concept behind applicative functors is the monoidal functor, which is a functor between monoidal categories that preserves monoidal structure.

Formally: for monoidal categories $$(C, \otimes, I)$$ and $$(D, \oplus, J)$$, a monoidal functor $$F : C \to D$$ is a functor and associated natural transformations $$\phi : F(A) \oplus F(B) \to F(A \otimes B)$$ and $$i : J \to F(I)$$ that satisfy some coherence conditions that I won’t mention here. Notably, if $$\phi$$ and $$i$$ are isomorphisms (i.e. are invertible) then $$F$$ is called a strong monoidal functor. Otherwise it’s called lax. Applicative functors in particular are lax monoidal functors.

This can be made much clearer for endofunctors on a monoidal category $$(C, \otimes, I)$$. Then you only have $$F : C \to C$$ and $$\phi : F(A) \otimes F(B) \to F(A \otimes B)$$ to worry about. If we sub in the Giry monad $$\mathcal{P}$$ from the last couple of posts, we’d want $$\mathcal{P} : \textbf{Meas} \to \textbf{Meas}$$ and $$\phi : \mathcal{P}(M) \otimes \mathcal{P}(N) \to \mathcal{P}(M \otimes N)$$.

Does the category of measurable spaces $$\textbf{Meas}$$ have a monoidal structure? Yup. Take measurable spaces $$M = (X, \mathcal{X})$$ and $$N = (Y, \mathcal{Y})$$. From the Giry monad derivation we already have that the monoidal identity $$i : M \to \mathcal{P}(M)$$ corresponds to a Dirac measure at a point, so that’s well and good. And we can define the tensor product $$\otimes$$ between $$M$$ and $$N$$ as follows: let $$X \times Y$$ be the standard Cartesian product on $$X$$ and $$Y$$ and let $$\mathcal{X} \otimes \mathcal{Y}$$ be the smallest $$\sigma$$-algebra generated by the Cartesian product $$A \times B$$ of measurable sets $$A \in \mathcal{X}$$ and $$B \in \mathcal{Y}$$. Then $$(X \times Y, \mathcal{X} \otimes \mathcal{Y})$$ is a measurable space, and so $$(\textbf{Meas}, \otimes, i)$$ is monoidal.

Recall that $$\mathcal{P}(M)$$ and $$\mathcal{P}(N)$$ - the space of measures over $$M$$ and $$N$$ respectively - are themselves objects in $$\textbf{Meas}$$. So, clearly $$\mathcal{P}(M) \otimes \mathcal{P}(N)$$ is a measurable space, and if $$\mathcal{P}$$ is monoidal then there must exist a natural transformation that can take us from there to $$\mathcal{P}(M \otimes N)$$. This is the space of measures over the product $$M \otimes N$$.

So the question is: does $$\mathcal{P}$$ have the required monoidal structure?

Yes. It must, since $$\mathcal{P}$$ is a monad, and any monad can generate the required natural transformation. Let $$\mu$$ be the monadic ‘join’ operator $$\mathcal{P}^2 \to \mathcal{P}$$ and $$\eta$$ be the monadic identity $$I \to \mathcal{P}$$. We have, evaluating right-to-left:

$\phi_{\nu \times \rho} = \mu \mathcal{P} \left\{ \lambda m . \mu \mathcal{P}\left(\lambda n. \eta_{m \times n}\right)\mathcal{P}(\rho) \right\} \mathcal{P}(\nu).$

Using $$\gg\!\!=$$ makes this much easier to read:

$\phi_{\nu \times \rho} = \nu \gg\!\!= \lambda m. \rho \gg\!\!= \lambda n. \eta_{m \times n}$

or in code, just:

phi :: Monad m => (m a, m b) -> m (a, b)
phi (m, n) = liftM2 (,) m n


So with that we have that $$(\mathcal{P}, \phi, i)$$ is a (lax) monoidal functor. And you can glean a monad-generated applicative operator from that immediately (this leads to the function called ‘ap’ in Control.Monad):

ap :: Monad m => m (a -> b) -> m a -> m b
ap f x = fmap ($$g, z) -> g z) (phi f x)  (Note: I won’t refer to \(\mu$$ as the join operator from this point out in order to free it up for denoting measures.)

## Probabilistic Interpretations

### Product Measure

The correct probabilistic interpretation here is that $$\phi$$ takes a pair of probability measures to the product measure over the appropriate product space. For probability measures $$\mu$$ and $$\nu$$ on measurable spaces $$M$$ and $$N$$ respectively, the product measure is the (unique) measure $$\mu \times \nu$$ on $$M \otimes N$$ such that:

$(\mu \times \nu)(A \times B) = \mu(A) \nu(B)$

for $$A \times B$$ a measurable set in $$M \otimes N$$.

Going through the monoidal functor route seems to put the notion of the Giry applicative instance on a more firm measure-theoretic foundation. Instead of considering the following from the Giry monad foundations article:

$(\rho \, \langle \ast \rangle \, \nu)(f) = \int_{\mathcal{P}(M \to N)} \left\{\lambda T . \int_{M \to N} (f \circ T) d\nu \right\} d \rho$

which is defined in terms of the dubious space of measures over measurable functions $$M \to N$$, we can better view things using the monoidal structure-preserving natural transformation $$\phi$$. For measures $$\mu$$ and $$\nu$$ on $$(X, \mathcal{X})$$ and $$(Y, \mathcal{Y})$$ respectively, we have:

$\phi(\mu, \nu)(f) = \int_{X \times Y}f d(\mu \times \nu)$

and then for $$g : Z \to X \otimes Y$$ we can use the functor structure of $$\mathcal{P}$$ to do:

$(\text{fmap} \, g \, \phi(\mu, \nu))(f) = \int_{Z} (f \circ g) d((\mu \times \nu) \circ g^{-1})$

## Conclusion

So there you have it. Applicativeness can seemingly be put on a straightforward measure-theoretic grounding and has some useful implications for independence.

It’s worth noting that, in the case of the Giry monad, we don’t need to go through its monadic structure in order to recover an applicative instance. We can do so entirely by hacking together continuations without using a single monadic bind. This is actually how I defined the applicative instance in the Giry monad implementation article previously:

instance Applicative Measure where
pure x = Measure (\f -> f x)
Measure g <*> Measure h = Measure $\f -> g (\k -> h (f . k))  Teasing out the exact structure of this and its relation to the codensity monad is again something I’ll leave to others. # Implementing the Giry Monad In my last post I went over the categorical and measure-theoretic foundations of the Giry monad, the ‘canonical’ probability monad that operates on the level of probability measures. In this post I’ll pick up from where I left off and talk about a neat and faithful (if impractical) implementation of the Giry monad that one can put together in Haskell. ## Measure, Integral, and Continuation So. For a quick review, we’ve established the Giry monad as a triple $$(\mathcal{P}, \mu, \eta)$$, where $$\mathcal{P}$$ is an endofunctor on the category of measurable spaces $$\textbf{Meas}$$, $$\mu$$ is a marginalizing integration operation defined by: $\mu(\rho)(A) = \int_{\mathcal{P}(M)} \left\{\lambda \nu . \int_M \chi_A d \nu \right\} d \rho$ and $$\eta$$ is a monoidal identity, defined by the Dirac measure at a point: $\eta(x)(A) = \chi_A(x).$ How do we actually implement this beast? If we’re looking to be suitably general then it is unlikely that we’re going to be able to easily represent something like a $$\sigma$$-algebra over some space of measures on a computer, so that route is sort of a non-starter. But it can be done. The key to implementing a general-purpose Giry monad is to notice that the fundamental operation involved in it is integration, and that we can avoid working with $$\sigma$$-algebras and measurable spaces directly if we focus on dealing with measurable functions instead of measurable sets. Consider the integration map on measurable functions $$\tau_f$$ that we’ve been using this whole time. For some measurable function $$f$$, $$\tau_f$$ takes a measure on some measurable space $$M = (X, \mathcal{X})$$ and uses it to integrate $$f$$ over $$X$$. In other words: $\tau_f(\nu) = \int_X f d\nu.$ A measure in $$\mathcal{P}(M)$$ has type $$X \to \mathbb{R}$$, so $$\tau_f$$ has corresponding type $$(X \to \mathbb{R}) \to \mathbb{R}$$. This might look familiar to you; it’s very similar to the type signature for a continuation: newtype Cont a r = Cont ((a -> r) -> r)  Indeed, if we restrict the carrier type of ‘Cont’ to the reals, we can be really faithful to the type: newtype Integral a = Integral ((a -> Double) -> Double)  Now, let’s overload notation and call the integration map $$\tau_f$$ itself a measure. That is, $$\tau_f$$ is a mapping $$\nu \mapsto \int_{X}fd\nu$$, so we’ll just interpret the notation $$\nu(f)$$ to mean the same thing - $$\int_{X}fd\nu$$. This is convenient because we can dispense with $$\tau$$ and just pretend measures can be applied directly to measurable functions. There’s no way we can get confused here; measures operate on sets, not functions, so notation like $$\nu(f)$$ is not currently in use. We just set $$\nu(f) = \tau_f(\nu)$$ and that’s that. Let’s rename the ‘Integral’ type to match: newtype Measure a = Measure ((a -> Double) -> Double)  We can extract a very nice shallowly-embedded language for integration here, the core of which is a single term: integrate :: (a -> Double) -> Measure a -> Double integrate f (Measure nu) = nu f  Note that this is the same way we’d express integration mathematically; we specify that we want to integrate a measurable function $$f$$ with respect to some measure $$\nu$$: $\int f d\nu = \texttt{integrate f nu}.$ The only subtle difference here is that we don’t specify the space we’re integrating over in the integral expression - instead, we’ll bake that into the definition of the measures we create themselves. Details in a bit. What’s interesting here is that the Giry monad is the continuation monad with the carrier type restricted to the reals. This isn’t surprising when you think about what’s going on here - we’re representing measures as integration procedures, that is, programs that take a measurable function as input and then compute its integral in some particular way. A measure, as we’ve implemented it here, is just a ‘program with a missing piece’. And this is exactly the essence of the continuation monad in Haskell. ## Typeclass Instances We can fill out the functor, applicative, and monad instances mechanically by reference to the a standard continuation monad implementation, and each instance gives us some familiar conceptual structure or operation on probability measures. Let’s take a look. The functor instance lets us transform the support of a measurable space while keeping its density structure invariant. If we have: $\nu(f) = \int_X f d\nu$ then mapping a measurable function over the measure corresponds to: $(\texttt{fmap} \, g \, \nu)(f) = \int_{X} (f \circ g) d\nu.$ The functor structure allows us to precisely express a pushforward measure or distribution of $$\nu$$ under $$g$$. It lets us ‘adapt’ a measure to other measurable spaces, just like a good functor should. In Haskell, the functor instance corresponds exactly to the math: instance Functor Measure where fmap g nu = Measure$ \f ->
integrate (f . g) nu


The monad instance is exactly the Giry monad structure that we developed previously, and it allows us to sequence probability measures together by marginalizing one into another. We’ll write it in terms of bind, of course, which went like:

$(\rho \gg\!\!= g)(f) = \int_M \left\{\lambda m . \int_N f dg(m) \right\} d \rho.$

instance Monad Measure where
return x  = Measure (\f -> f x)
rho >>= g = Measure $\f -> integrate (\m -> integrate f (g m)) rho  Finally there’s the Applicative instance, which as I mentioned in the last post is sort of conceptually weird here. So in the spirit of that comment, I’m going to dodge any formal justification for now and just use the following instance which works in practice: instance Applicative Measure where pure x = Measure (\f -> f x) Measure g <*> Measure h = Measure$ \f ->
g (\k -> h (f . k))


## Conceptual Example

It’s worth taking a look at an example of how things should conceivably work here. Consider the following probabilistic model:

\begin{align*} \pi & \sim \text{beta}(\alpha, \beta) \\ \mu \, | \, \pi & \sim \text{binomial}(n, \pi) \end{align*}

It’s a standard hierarchical presentation. A ‘compound’ measure can be obtained here by marginalizing over the beta measure $$\pi$$, and that’s called the beta-binomial measure. Let’s find it.

The beta distribution has support on the $$[0, 1]$$ subset of the reals, and the binomial distribution with argument $$n$$ has support on the $$\{0, \ldots, n\}$$ subset of the integers, so we know that things should proceed like so:

\begin{align*} \psi(f) & = (\pi \gg\!\!= \mu)(f) \\ & = \int_{\mathbb{R}} \left\{\lambda p . \int_{\mathbb{Z}} f d\mu(p) \right\} d \pi. \end{align*}

Eliding some theory of integration, I can tell you that $$\pi$$ is absolutely continuous with respect to Lebesgue measure and that $$\mu(p)$$ is absolutely continuous w/respect to counting measure for appropriate $$p$$. So, $$\pi$$ admits a density $$d\pi/dx = g_\pi$$ and $$\mu(p)$$ admits a density $$d\mu(p)/d\# = g_{\mu(p)}$$, defined as:

$g_\pi(p \, | \, \alpha, \beta) = \frac{1}{B(\alpha, \beta)} p^{\alpha - 1} (1 - p)^{\beta - 1}$

and

$g_{\mu(p)}(x \, | \, n, p) = \binom{n}{x} p^x (1 - p)^{n - x}$

respectively, for $$B$$ the beta function and $$\binom{n}{x}$$ a binomial coefficient. Again, we can reduce the integral as follows, transforming the outermost integral into a standard Riemann integral and the innermost integral into a simple sum of products:

$\psi(f) = \int_{0}^{1} \lambda p. \left\{ \lambda \alpha. \lambda \beta. g_{\pi}(p \, | \alpha, \beta) \sum_{z \in \{0, \ldots, n\}} f(z) \left( \lambda n. g_{\mu(p)}(z \, | \, n, p) \right) \right\} dx.$

where $$dx$$ denotes Lebesgue measure. I could expand this further or simplify things a little more (the beta and binomial are conjugates) but you get the point, which is that we have a way to evaluate the integral.

What is really required here then is to be able to encode into the definitions of measures like $$\pi$$ and $$\mu(p)$$ the method of integration to use when evaluating them. For measures absolutely continuous w/respect to Lebesgue measure, we can use the Riemann integral over the reals. For measures absolutely continuous w/respect to counting measure, we can use a sum of products. In both cases, we’ll also need to supply the density or mass function by which the integral should be evaluated.

## Creating Measures

Recall that we are representing measures as integration procedures. So to create one is to define a program by which we’ll perform integration.

Let’s start with the conceptually simpler case of a probability measure that’s absolutely continuous with respect to counting measure. We need to provide a support (the region for which probability is greater than 0) and a probability mass function (so that we can weight every point appropriately). Then we just want to integrate a function by evaluating it at every point in the support, multiplying the result by that point’s probability mass, and summing everything up. In code, this translates to:

fromMassFunction :: (a -> Double) -> [a] -> Measure a
fromMassFunction f support = Measure $\g -> foldl' (\acc x -> acc + f x * g x) 0 support  So if we want to construct a binomial measure, we can do that like so (where choose comes from Numeric.SpecFunctions): binomial :: Int -> Double -> Measure Int binomial n p = fromMassFunction (pmf n p) [0..n] where pmf n p x | x < 0 || n < x = 0 | otherwise = choose n x * p ^^ x * (1 - p) ^^ (n - x)  The second example involves measures over the real line that are absolutely continuous with respect to Lebesgue measure. In this case we want to evaluate a Riemann integral over the entire real line, which is going to necessitate approximation on our part. There are a bunch of methods out there for approximating integrals, but a simple one for one-dimensional problems like this is quadrature, an implementation for which Ed Kmett has handily packaged up in his integration package: fromDensityFunction :: (Double -> Double) -> Measure Double fromDensityFunction d = Measure$ \f ->
quadratureTanhSinh (\x -> f x * d x)
where
quadratureTanhSinh = result . last . everywhere trap


Here we’re using quadrature to approximate the integral, but otherwise it has a similar form as ‘fromMassFunction’. The difference here is that we’re integrating over the entire real line, and so don’t have to supply a support explicitly.

We can use this to create a beta measure (where logBeta again comes from Numeric.SpecFunctions):

beta :: Double -> Double -> Measure Double
beta a b = fromDensityFunction (density a b) where
density a b p
| p < 0 || p > 1 = 0
| otherwise = 1 / exp (logBeta a b) * p ** (a - 1) * (1 - p) ** (b - 1)


Note that since we’re going to be integrating over the entire real line and the beta distribution has support only over $$[0, 1]$$, we need to implicitly define the support here by specifying which regions of the domain will lead to a density of 0.

In any case, now that we’ve constructed those things we can just use a monadic bind to create the beta-binomial measure we described before. It masks a lot of under-the-hood complexity.

betaBinomial :: Int -> Double -> Double -> Measure Int
betaBinomial n a b = beta a b >>= binomial n


There are a couple of other useful ways to create measures, but the most notable is to use a sample in order to create an empirical measure. This is equivalent to passing in a specific support for which the mass function assigns equal probability to every element; I’ll use Gabriel Gonzalez’s foldl package here as it’s pretty elegant:

fromSample :: Foldable f => f a -> Measure a
fromSample = Measure . flip weightedAverage

weightedAverage :: (Foldable f, Fractional r) => (a -> r) -> f a -> r
weightedAverage f = Foldl.fold (weightedAverageFold f) where
weightedAverageFold :: Fractional r => (a -> r) -> Fold a r
weightedAverageFold f = Foldl.premap f averageFold

averageFold :: Fractional a => Fold a a
averageFold = (/) <\$> Foldl.sum <*> Foldl.genericLength


Using ‘fromSample’ you can create an empirical measure using just about anything you’d like:

data Foo = Foo | Bar | Baz

foos :: [Foo]
foos = [Foo, Foo, Bar, Foo, Baz, Foo, Bar, Foo, Foo, Foo, Bar]

nu :: Measure Foo
nu = fromSample foos


Though I won’t demonstrate it here, you can use this approach to also create measures from sampling functions or random variables that use a source of randomness - just draw a sample from the function and pipe the result into ‘fromSample’.

## Querying Measures

To query a measure is to simply get some result out of it, and we do that by integrating some measurable function against it. The easiest thing to do is to just take a straightforward expectation by integrating the identity function; for example, here’s the expected value of a beta(10, 10) measure:

> integrate id (beta 10 10)
0.49999999999501316


The expected value of a beta($$\alpha$$, $$\beta$$) distribution is $$\alpha / (\alpha + \beta)$$, so we can verify analytically that the result should be 0.5. We observe a bit of numerical imprecision here because, if you’ll recall, we’re just approximating the integral via quadrature. For measures created via ‘fromMassFunction’ we don’t need to use quadrature, so we won’t observe the same kind of approximation error. Here’s the expected value of a binomial(10, 0.5) measure, for example:

> integrate fromIntegral (binomial 10 0.5)
5.0


Note here that we’re integrating the ‘fromIntegral’ function against the binomial measure. This is because the binomial measure is defined over the integers, rather than the reals, and we always need to evaluate to a real when we integrate. That’s part of the definition of a measure!

Let’s calculate the expectation of the beta-binomial distribution with $$n = 10$$, $$\alpha = 1$$, and $$\beta = 8$$:

> integrate fromIntegral (betaBinomial 10 1 8)
1.108635884924813


Neato. And since we can integrate like this, we can really compute any of the moments of a measure. The first raw moment is what we’ve been doing here, and is called the expectation:

expectation :: Measure Double -> Double
expectation = integrate id


The second (central) moment is the variance. Here I mean variance in the moment-based sense, rather than as the possibly better-known sample variance:

variance :: Measure Double -> Double
variance nu = integrate (^ 2) nu - expectation nu ^ 2


The variance of a binomial($$n$$, $$p$$) distribution is known to be $$np(1-p)$$, so for $$n = 10$$ and $$p = 0.5$$ we should get 2.5:

> variance (binomial 10 0.5)
<interactive>:87:11: error:
• Couldn't match type ‘Int’ with ‘Double’
Expected type: Measure Double
Actual type: Measure Int
• In the first argument of ‘variance’, namely ‘(binomial 10 0.5)’
In the expression: variance (binomial 10 0.5)
In an equation for ‘it’: it = variance (binomial 10 0.5)


Ahhh, but remember: the binomial measure is defined over the integers, so we can’t integrate it directly. No matter - the functorial structure allows us to adapt it to any other measurable space via a measurable function:

> variance (fmap fromIntegral (binomial 10 0.5))
2.5


Expectation and variance (and other moments) are pretty well-known, but you can do more exotic things as well. You can calculate the moment generating function for a measure, for example:

momentGeneratingFunction :: Measure Double -> Double -> Double
momentGeneratingFunction nu t = integrate (\x -> exp (t * x)) nu


and the cumulant generating function follows naturally:

cumulantGeneratingFunction :: Measure Double -> Double -> Double
cumulantGeneratingFunction nu = log . momentGeneratingFunction nu


A particularly useful construct is the cumulative distribution function for a measure, which calculates the probability of a region less than or equal to some number:

cdf :: Measure Double -> Double -> Double
cdf nu x = integrate (negativeInfinity to x) nu

negativeInfinity :: Double
negativeInfinity = negate (1 / 0)

to :: (Num a, Ord a) => a -> a -> a -> a
to a b x
| x >= a && x <= b = 1
| otherwise        = 0


The beta(2, 2) distribution is symmetric around its mean 0.5, so the probability of the region $$[0, 0.5]$$ should itself be 0.5. This checks out as expected, modulo approximation error due to quadrature:

> cdf (beta 2 2) 0.5
0.4951814897381374


Similarly for measurable spaces without any notion of order, there’s a simple CDF analogue that calculates the probability of a region that contains the given points:

containing :: (Num a, Eq b) => [b] -> b -> a
containing xs x
| x elem xs = 1
| otherwise   = 0


And probably the least interesting query of all is the simple ‘volume’, which calculates the total measure of a space. For any probability measure this must obviously be one, so it can at least be used as a quick sanity check:

volume :: Measure Double -> Double
volume = integrate (const 1)


## Convolution and Friends

I mentioned in the last post that applicativeness corresponds to independence in some sense, and that independent measures over the same measurable space can be convolved together, à la:

$(\nu + \zeta)(f) = \int_{M}\int_{M}f(x + y)d\nu(x)d\zeta(y)$

for measures $$\nu$$ and $$\zeta$$ on $$M$$. In Haskell-land it’s well-known that any applicative instance gives you a free ‘Num’ instance, and the story is no different here:

instance Num a => Num (Measure a) where
(+)         = liftA2 (+)
(-)         = liftA2 (-)
(*)         = liftA2 (*)
abs         = fmap abs
signum      = fmap signum
fromInteger = pure . fromInteger


There are a few neat ways to demonstrate this kind of thing. Let’s use a Gaussian measure here as a running example:

gaussian :: Double -> Double -> Measure Double
gaussian m s = fromDensityFunction (density m s) where
density m s x
| s <= 0    = 0
| otherwise =
1 / (s * sqrt (2 * pi)) * exp (negate ((x - m) ^^ 2) / (2 * (s ^^ 2)))


First, consider a chi-squared measure with $$k$$ degrees of freedom. We could create this directly using a density function, but instead we can represent it by summing up independent squared Gaussian measures:

chisq :: Int -> Measure Double
chisq k = sum (replicate k normal) where
normal = fmap (^ 2) (gaussian 0 1)


To sanity check the result, we can compute the mean and variance of a $$\chi^2(2)$$ measure, which should be $$k$$ and $$2k$$ respectively for $$k = 2$$:

> expectation (chisq 2)
2.0000000000000004
> variance (chisq 2)
4.0


As a second example, consider a product of independent Gaussian measures. This is a trickier distribution to deal with analytically (see here), but we can use some well-known identities for general independent measures in order to verify our results. For any independent measures $$\mu$$ and $$\nu$$, we have:

$\mathbb{E}(\mu\nu) = \mathbb{E}\mu \mathbb{E}\nu$

and

$\text{var}(\mu\nu) = \text{var}(\mu)\text{var}(\nu) + \text{var}(\mu)(\mathbb{E}\nu)^2 + \text{var}(\nu)(\mathbb{E}\mu)^2$

for the expectation and variance of their product. So for a product of independent Gaussians w/parameters (1, 2) and (2, 3) respectively, we expect to see 2 for its expectation and 61 for its variance:

> expectation (gaussian 1 2 * gaussian 2 3)
2.0000000000000001
> variance (gaussian 1 2 * gaussian 2 3)
61.00000000000003


Woop!

## Wrapping Up

And there you have it, a continuation-based implementation of the Giry monad. You can find a bunch of code with similar functionality to this packaged up in my old measurable library on GitHub if you’d like to play around with the concepts.

That library has accumulated a few stars since I first pushed it up in 2013. I think a lot of people are curious about these weird measure things, and this framework at least gives you the ability to play around with a representation for measures directly. I found it particularly useful for really grokking, say, that integrating some function $$f$$ against a probability measure $$\nu$$ is identical to integrating the identity function against the probability measure $$\texttt{fmap} \, f \, \nu$$. And there are a few similar concepts there that I find really pop out when you play with measures directly, rather than when one just works with them on paper.

But let me now tell you why the Giry monad sucks in practice.

Take a look at this integral expression, which is brought about due to a monadic bind:

$(\nu \gg\!\!= \mu)(f) = \int_{M} \left\{\lambda m . \int_{M} f d\mu(m) \right\} d \nu.$

For simplicitly, let’s assume that $$M$$ is discrete and has cardinality $$|M|$$. This means that the integral reduces to

$(\nu \gg\!\!= \mu)(f) = \underbrace{\sum_{m \in M} d\nu(m) \underbrace{ \sum_{n \in M} f(n) d\mu(m)(n) }_{O(|M|)}}_{O(|M|)}$

for $$d\mu(m)$$ and $$d\nu$$ the appropriate Radon-Nikodym derivatives. You can see that the total number of operations involved in the integral is $$O(|M|^2)$$, and indeed, for $$p$$ monadic binds the computational complexity involved in evaluating all the integrals involved is exponential, on the order of $$|M|^p$$. It was no coincidence that I demonstrated a variance calculation for a $$\chi^2(2)$$ distribution instead of for a $$\chi^2(10)$$.

This isn’t really much of a surprise - the cottage industry of approximating integrals exists because integration is hard in practice, and integration is surely best avoided whenever one can get away with doing so. Vikash Mansinghka’s quote on this topic is fitting: “don’t calculate probabilities - sample good guesses.” I’ll also add: relegate the measures to measure theory, where they seem to belong.

The Giry monad is a lovely abstract construction for formalizing the monadic structure of probability, and as canonical probabilistic objects, measures and integrals are tremendously useful when working theoretically. But they’re a complete non-starter when it comes to getting anything nontrivial done in practice. For that, there are far more useful representations for probability distributions in Haskell - notably, the sampling function or random variable representation found in things like mwc-probability/mwc-random-monad and random-fu, or even better, the structural representation based on free or operational monads like I’ve written about before, or that you can find in something like monad-bayes.

The intuitions gleaned from playing with the Giry monad carry over precisely to other representations for the probability monad. In all cases, ‘return’ will correspond, semantically, to constructing a Dirac distribution at a point, while ‘bind’ will correspond to a marginalizing operator. The same is true for the underlying (applicative) functor structure: ‘fmap’ always corresponds to a density-preserving transformation of the support, while applicativeness corresponds to independence (yielding convolution, etc.). And you have to admit, the connection to continuations is pretty cool.

There is clearly some connection to the codensity monad as well, but I think I’ll let someone else figure out the specifics of that one. Something something right-Kan extension..

# Foundations of the Giry Monad

The Giry monad is the canonical probability monad that operates on the level of measures, which are the abstract constructs that canonically represent probability distributions. It’s sort of the baseline by which all other probability monads can be judged.

In this article I’m going to go through the categorical and measure-theoretic foundations of the Giry monad. In another article, I’ll describe how you can implement it in a very faithful sense in Haskell.

I was putting some notes together for another project and wound up writing up things up in a somewhat blog-friendly style, but this isn’t intended to be a tutorial per se. Really this isn’t the kind of content I’d usually post here, but since I’ve jotted everything up, I figured I may as well. If you like extremely dry mathematics and computer science, you’re in the right place.

I won’t define everything under the sun here - for properties or coherence conditions or other things that I’ve elided details on, check out something like Mac Lane or Aliprantis & Border. I’ll include some references at the end.

This is the game plan we’re working with:

• Define monads and their supporting machinery in a categorical sense.
• Define probability measures and some required background around that.
• Construct the functor that maps a measurable space to the collection of all probability measures on that space.
• Demonstrate that it’s a monad.

Let’s get started.

## Categorical Foundations

A category $$C$$ is a collection of objects and morphisms between them. So if $$W$$, $$X$$, $$Y$$, and $$Z$$ are objects in $$C$$, then $$f : W \to X$$, $$g : X \to Y$$, and $$h : Y \to Z$$ are examples of morphisms. These morphisms can be composed in the obvious associative way, i.e.

$f \circ (g \circ h) = (f \circ g) \circ h$

and there exist identity morphisms (or automorphisms) that simply map objects to themselves.

A functor is a mapping between categories (equivalently, it’s a morphism in the category of so-called ‘small’ categories). The functor $$F : C \to D$$ takes every object in $$C$$ to some object in $$D$$, and every morphism in $$C$$ to some morphism in $$D$$, such that the structure of morphism composition is preserved. An endofunctor is a functor from a category to itself, and a bifunctor is a functor from a pair of categories to another category, i.e. $$F : A \times B \to C$$.

A natural transformation is a mapping between functors. So for two functors $$F, G : C \to D$$, a natural transformation $$\epsilon : F \to G$$ associates to every object $$c$$ in $$C$$ a morphism $$\epsilon_c : F(c) \to G(c)$$ in $$D$$.

A monoidal category $$C$$ is a category with some additional monoidal structure, namely an identity object $$I$$ and a bifunctor $$\otimes : C \times C \to C$$ called the tensor product, plus several natural isomorphisms that provide the associativity of the tensor product and its right and left identity with the identity object $$I$$.

A monoid $$(M, \mu, \eta)$$ in a monoidal category $$C$$ is an object $$M$$ in $$C$$ together with two morphisms (obeying the standard associativity and identity properties) that make use of the category’s monoidal structure: the associative binary operator $$\mu : M \otimes M \to M$$, and the identity $$\eta : I \to M$$.

A monad is (infamously) a ‘monoid in the category of endofunctors’. So take the category of endofunctors $$\mathcal{F}$$ whose objects are endofunctors and whose morphisms are natural transformations between them. This is a monoidal category; there exists an identity endofunctor $$1_\mathcal{F}(F) = F$$ for all $$F$$ in $$\mathcal{F}$$, plus a tensor product $$\otimes : \mathcal{F} \times \mathcal{F} \to \mathcal{F}$$ defined by functor composition such that the required associativity and identity properties hold. $$\mathcal{F}$$ is thus a monoidal category, and any specific monoid $$(F, \mu, \eta)$$ we construct on it is a specific monad.

## Probabilistic Foundations

A measurable space $$(X, \mathcal{X})$$ is a set $$X$$ equipped with a topology-like structure called a $$\sigma$$-algebra $$\mathcal{X}$$ that essentially contains every well-behaved subset of $$X$$ in some sense. A measure $$\nu : \mathcal{X} \to \mathbb{R}$$ is a particular kind of set function from the $$\sigma$$-algebra to the nonnegative real line. A measure just assigns a generalized notion of area or volume to well-behaved subsets of $$X$$. In particular, if the total possible area or volume of the underlying set is 1 then we’re dealing with a probability measure. A measurable space completed with a measure, e.g. $$(X, \mathcal{X}, \nu)$$ is called a measure space, and a measurable space completed with a probability measure is called a probability space.

There is a lot of overloaded lingo around the word ‘measurable’. A ‘measurable set’ is an element of a $$\sigma$$-algebra in a measurable space. A measurable mapping is a mapping between measurable spaces. Given a ‘source’ measurable space $$(X, \mathcal{X})$$ and ‘target’ measurable space $$(Y, \mathcal{Y})$$, a measurable mapping $$(X, \mathcal{X}) \to (Y, \mathcal{Y})$$ is a map $$T : X \to Y$$ with the property that, for any measurable set in the target, the inverse image is measurable in the source. Or, formally, for any $$B$$ in $$\mathcal{Y}$$, you have that $$T^{-1}(B)$$ is in $$\mathcal{X}$$.

## The Space of Probability Measures on a Measurable Space

If you consider the collection of all measurable spaces and measurable mappings between them, you get a category. Define $$\textbf{Meas}$$ to be the category of measurable spaces. So, objects are measurable spaces and morphisms are the measurable mappings between them.

For any specific measurable space $$M$$ in $$\textbf{Meas}$$, we can consider the space of all possible probability measures that could be placed on it and denote that $$\mathcal{P}(M)$$. To be clear, $$\mathcal{P}(M)$$ is a space of measures - that is, a space in which the points themselves are probability measures.

What’s remarkable about $$\mathcal{P}(M)$$ is that it is itself a measurable space. Let me explain.

As a probability measure, any element of $$\mathcal{P}(M)$$ is a function from measurable subsets of $$M$$ to the interval $$[0, 1]$$ in $$\mathbb{R}$$. That is: if $$M$$ is the measurable space $$(X, \mathcal{X})$$, then a point $$\nu$$ in $$\mathcal{P}(M)$$ is a function $$\mathcal{X} \to \mathbb{R}$$. For any measurable $$A$$ in $$M$$, there just naturally exists a sort of ‘evaluation’ mapping I’ll call $$\tau_A: \mathcal{P}(M) \to \mathbb{R}$$ that takes a measure on $$M$$ and evaluates it on the set $$A$$. To be explicit: if $$\nu$$ is a measure in $$\mathcal{P}(M)$$, then $$\tau_A$$ simply evaluates $$\nu(A)$$. It ‘runs’ the measure in a sense; in Haskell, $$\tau_A$$ would be analogous to a function like \f -> f a for some a.

This evaluation map $$\tau_A$$ corresponds to an integral. If you have a measurable space $$(X, \mathcal{X})$$, then for any $$A$$ a subset in $$\mathcal{X}$$, $$\tau_A(\nu) = \nu(A) = \int_{X}\chi_A d\nu$$ for $$\chi$$ the characteristic or indicator function of $$A$$ (where $$\chi(x)$$ is $$1$$ if $$x$$ is in $$A$$, and is $$0$$ otherwise). And we can actually extend $$\tau$$ to operate over measurable mappings from $$(X, \mathcal{X})$$ to $$(\mathbb{R}, \mathcal{B}(\mathbb{R}))$$, where $$\mathcal{B}(\mathbb{R})$$ is a suitable $$\sigma$$-algebra on $$\mathbb{R}$$. Here we typically use what’s called the Borel $$\sigma$$-algebra, which takes a topology on the set and then generates a $$\sigma$$-algebra from the open sets in the topology (for $$\mathbb{R}$$ we can just use the ‘usual’ topology generated by the Euclidean metric). For $$f : X \to \mathbb{R}$$ a measurable function, we can define the evaluation mapping $$\tau_f : \mathcal{P}(M) \to \mathbb{R}$$ as $$\tau_f(\nu) = \int_X f d\nu$$.

We can abuse notation here a bit and just use $$\tau$$ to refer to ‘duck typed’ mappings that evaluate measures over measurable sets or measurable functions depending on context. If we treat $$\tau_A(\nu)$$ as a function $$\tau(\nu)(A)$$, then $$\tau(\nu)$$ has type $$\mathcal{X} \to \mathbb{R}$$. If we treat $$\tau_f(\nu)$$ as a function $$\tau(\nu)(f)$$, then $$\tau(\nu)$$ has type $$(X \to \mathbb{R}) \to \mathbb{R}$$. I’ll say $$\tau_{\{A, f\}}$$ to refer to the mappings that accept either measurable sets or functions.

In any case. For a measurable space $$M$$, there exists a topology on $$\mathcal{P}(M)$$ called the weak-* topology that makes all the evaluation mappings $$\tau_{\{A, f\}}$$ continuous for any measurable set $$A$$ or measurable function $$f$$. From there, we can generate the Borel $$\sigma$$-algebra $$\mathcal{B}(\mathcal{P}(M))$$ that makes the evaluation functions $$\tau_{\{A, f\}}$$ measurable. The result is that $$(\mathcal{P}(M), \mathcal{B}(\mathcal{P}(M)))$$ is itself a measurable space, and thus an object in $$\textbf{Meas}$$.

The space $$\mathcal{P}(M)$$ actually has all sorts of insane properties that one wouldn’t expect - there are implications on convexity, completeness, compactness and such that carry over from $$M$$. But I digress.

## $$\mathcal{P}$$ is a Functor

So: for any $$M$$ an object in $$\textbf{Meas}$$, we have that $$\mathcal{P}(M)$$ is also an object in $$\textbf{Meas}$$. And if you look at $$\mathcal{P}$$ like a functor, you notice that it takes objects of $$\textbf{Meas}$$ to objects of $$\textbf{Meas}$$. Indeed, you can define an analogous procedure on morphisms in $$\textbf{Meas}$$ as follows. Take $$N$$ to be another object (read: measurable space) in $$\textbf{Meas}$$ and $$T : M \to N$$ to be a morphism (read: measurable mapping) between them. Now, for any measure $$\nu$$ in $$\mathcal{P}(M)$$ we can define $$\mathcal{P}(T)(\nu) = \nu \circ T^{-1}$$ (this is called the image, distribution, or pushforward of $$\nu$$ under $$T$$). For some $$T$$ and $$\nu$$, $$\mathcal{P}(T)(\nu)$$ thus takes measurable sets in $$N$$ to a value in the interval $$[0, 1]$$ - that is, it is a measure on $$\mathcal{P}(N)$$. So we have that:

$\mathcal{P}(T) : \mathcal{P}(M) \to \mathcal{P}(N)$

and so $$\mathcal{P}$$ is an endofunctor on $$\textbf{Meas}$$.

## $$\mathcal{P}$$ is a Monad

See where we’re going here? If we can define natural transformations $$\mu$$ and $$\eta$$ such that $$(\mathcal{P}, \mu, \eta)$$ is a monoid in the category of endofunctors, we’ll have defined a monad. We thus need to come up with a suitable monoidal structure, et voilà.

First the identity. We want a natural transformation $$\eta$$ between the identity functor $$1_{\mathcal{F}}$$ and the functor $$\mathcal{P}$$ such that $$\eta_M : 1_{\mathcal{F}}(M) \to \mathcal{P}(M)$$ for any measurable space $$M$$ in $$\textbf{Meas}$$. Evaluating the identity functor simplifies things to $$\eta_M : M \to \mathcal{P}(M)$$.

We can define this concretely as follows. Grab a measurable space $$M$$ in $$\textbf{Meas}$$ and define $$\eta(x)(A) = \chi_A(x)$$ for any point $$x \in M$$ and any measurable set $$A \subseteq M$$. $$\eta(x)$$ is thus a probability measure on $$M$$ - we assign $$1$$ to measurable sets that contain $$x$$, and 0 to those that don’t. If we peel away another argument, we have that $$\eta : M \to \mathcal{P}(M)$$, as required.

So $$\eta$$ takes points in measurable spaces to probability measures on those spaces. In technical parlance, it takes a point $$x$$ to the Dirac measure at $$x$$ - the probability measure that places the entirety of its mass at $$x$$.

Now for the other part of the monoidal structure, $$\mu$$. I initially found this next part to be a bit of a trip, but let me see what I can do about that.

Recall that the category of endofunctors, $$\mathcal{F}$$, is monoidal, so there exists a tensor product $$\otimes : \mathcal{F} \times \mathcal{F} \to \mathcal{F}$$ that we can deal with, which here just corresponds to functor composition. We’re looking for a natural transformation:

$\mu : \mathcal{P} \circ \mathcal{P} \to \mathcal{P}$

which is often written as:

$\mu : \mathcal{P}^2 \to \mathcal{P}.$

Take $$M = (X, \mathcal{X})$$ a measurable space in $$\textbf{Meas}$$ and then consider the space of probability measures over it, $$\mathcal{P}(M)$$. Then take the space of probability measures over the space of probability measures on $$M$$, $$\mathcal{P}(\mathcal{P}(M))$$. Since $$\mathcal{P}$$ is an endofunctor, this is again a measurable space, and for any measurable subset $$A$$ of $$M$$ we again have a family of mappings $$\tau_A$$ that take a probability measure in $$\mathcal{P}(\mathcal{P}(M))$$ and evaluate it on $$A$$. We want $$\mu$$ to be the thing that turns a measure over measures $$\rho$$ into a plain old probability measure on $$\mathcal{P}(M)$$.

In the context of probability theory, this kind of semigroup action is a marginalizing operator. We’re taking the ‘uncertainty’ captured in $$\mathcal{P}(\mathcal{P}(M))$$ via the probability measure $$\rho$$ and smearing it into the probability measures in $$\mathcal{P}(M)$$.

Take $$\rho$$ in $$\mathcal{P}(\mathcal{P}(M))$$ and some $$A$$ a measurable subset of $$M$$. We can define $$\mu$$ as follows:

$\mu(\rho)(A) = \int_{\mathcal{P}(M)} \tau_A d\rho.$

Using some lambda calculus notation to see the argument for $$\tau_A$$, we can expand the integrals to get the following gnarly expression:

$\mu(\rho)(A) = \int_{\mathcal{P}(M)} \left\{\lambda \nu . \int_M \chi_A d \nu \right\} d \rho.$

Notice what’s happening here. For $$M$$ a measurable space, we’re integrating over $$\mathcal{P}(M)$$ the space of probability measures on $$M$$, with respect to the probability measure $$\rho$$, which itself is a point in the space of probability measures over probability measures on $$M$$, $$\mathcal{P}(\mathcal{P}(M))$$. Whew.

The spaces we’re integrating over here are unusual, but $$\rho$$ is still a probability measure, so when applied to a measurable set in $$\mathcal{B}(\mathcal{P}(M))$$ it results in a probability in $$[0, 1]$$. So, peeling back an argument, we have that $$\mu(\rho)$$ has type $$\mathcal{X} \to \mathbb{R}$$. In other words, it’s a probability measure on $$M$$, and thus is in $$\mathcal{P}(M)$$. And if we peel back another argument, we find that:

$\mu_M : \mathcal{P}(\mathcal{P}(M)) \to \mathcal{P}(M)$

so, as required, that

$\mu : \mathcal{P}^{2} \to \mathcal{P}.$

It’s also worth noting that we can overload the notation for $$\mu$$ in the same way we did for $$\tau$$, i.e. to supply measurable functions in addition to measurable sets:

$\mu(\rho)(f) = \int_{\mathcal{P}(M)} \left\{\lambda \nu . \int_M f d \nu \right\} d \rho.$

Combining the three components, we get $$(\mathcal{P}, \mu, \eta)$$, the canonical Giry monad.

In Haskell, when we’re dealing with monads we typically use the bind operator $$\gg\!\!=$$ instead of manually dealing with the functorial structure and $$\mu$$ (called ‘join’). Bind has the type:

$\gg\!\!= : \mathcal{P}(M) \to (M \to \mathcal{P}(N)) \to \mathcal{P}(N)$

and for illustration, we can define $$\gg\!\!=$$ for the Giry monad like so:

$(\rho \gg\!\!= g)(f) = \int_{M} \left\{ \lambda m . \int_N f d g(m) \right\} d\rho.$

Here $$\rho$$ is in $$\mathcal{P}(M)$$, $$g$$ is in $$M \to \mathcal{P}(N)$$, and $$f$$ is in $$N \to \mathbb{R}$$, so note that we potentially simplify the outermost integral enormously. It now operates over a general measurable space, rather than a space of measures in particular, and this will come in handy when we get to implementation details in the next post.

## Wrapping Up

That’s about it for now. It’s worth noting as a kind of footnote here that the existence of the Giry monad also obviously implies the existence of a Giry applicative functor. But the official situation for applicative functors seems kind of weird in this context, and I’m not yet up to the task of dealing with it formally.

Intuitively, one should be able to define the binary applicative operator characteristic of its lax monoidal structure as follows:

$(\rho \, \langle \ast \rangle \, \nu)(f) = \int_{\mathcal{P}(M \to N)} \left\{\lambda T . \int_{M \to N} (f \circ T) d\nu \right\} d \rho.$

But this has some really weird measure-theoretic implications - namely, that it assumes the existence of a space of probability measures over the space of all measurable functions $$M \to N$$, which is not trivial to define and indeed may not even exist. It seems like some people are looking into this problem as I just happened to stumble on this paper on the arXiv while doing some googling. I notice that some people on e.g. nLab require categories with additional structure beyond $$\textbf{Meas}$$ for the development of the Giry monad as well, for example the category of Polish (separable, completely metrizable) spaces $$\textbf{Pol}$$, so maybe the extra structure there takes care of the quirks.

Anyway. Applicatives are neat here because applicative probability measures are independent probability measures. And the existence of applicativeness means you can do all the things with independent probability measures that you might be used to. Measure convolution and friends are good examples. Given a measurable space $$M$$ that supports some notion of addition and two probability measures $$\nu$$ and $$\zeta$$ in $$\mathcal{P}(M)$$, we can add measures together via:

$(\nu + \zeta)(f) = \int_{M}\int_{M}f(x + y)d\nu(x)d\zeta(y)$

where $$x$$ and $$y$$ are both points in $$M$$. Subtraction and multiplication translate trivially as well.

In another article I’ll detail how the Giry monad can be implemented in Haskell and point out some neat extensions. There are some cool connections to continuations and codensity monads, and seemingly de Finetti’s theorem and exchangeability. That kind of thing. It’d also be worth trying to justify independence of probability measures from a categorical perspective, which seems easier than resolving the nitty-gritty measurability qualms I mentioned above.

‘Til then! Thanks to Jason Forbes for sifting through this stuff and providing some great comments.