The Applicative Structure of the Giry Monad

In my last two posts about the Giry monad I derived the thing from its categorical and measure-theoretic foundations. I kind of thought that those posts wouldn’t be of much interest to people but they turned out to be a hit. I clearly can’t tell what the internet likes.

Anyway, something I left out was the theoretical foundation of the Giry monad’s Applicative instance, which seemed a bit odd. I also pointed out that applicativeness in the context of probability implies independence between probability measures.

In this article I’m going to look at each of these issues. After playing around with the foundations, it looks like the applicative instance for the Giry monad can be put on a sound footing in terms of the standard measure-theoretic concept of product measure. Also, it turns out that the claim I made of applicativeness \(\implies\) independence is somewhat ill-posed. But, using the shiny new intuition we’ll glean from a better understanding of the applicative structure, we can put that on a solid footing too.

So let’s look at both of these things and see what’s up.

Monoidal Functors

The foundational categorical concept behind applicative functors is the monoidal functor, which is a functor between monoidal categories that preserves monoidal structure.

Formally: for monoidal categories \((C, \otimes, I)\) and \((D, \oplus, J)\), a monoidal functor \(F : C \to D\) is a functor and associated natural transformations \(\phi : F(A) \oplus F(B) \to F(A \otimes B)\) and \(i : J \to F(I)\) that satisfy some coherence conditions that I won’t mention here. Notably, if \(\phi\) and \(i\) are isomorphisms (i.e. are invertible) then \(F\) is called a strong monoidal functor. Otherwise it’s called lax. Applicative functors in particular are lax monoidal functors.

This can be made much clearer for endofunctors on a monoidal category \((C, \otimes, I)\). Then you only have \(F : C \to C\) and \(\phi : F(A) \otimes F(B) \to F(A \otimes B)\) to worry about. If we sub in the Giry monad \(\mathcal{P}\) from the last couple of posts, we’d want \(\mathcal{P} : \textbf{Meas} \to \textbf{Meas}\) and \(\phi : \mathcal{P}(M) \otimes \mathcal{P}(N) \to \mathcal{P}(M \otimes N)\).

Does the category of measurable spaces \(\textbf{Meas}\) have a monoidal structure? Yup. Take measurable spaces \(M = (X, \mathcal{X})\) and \(N = (Y, \mathcal{Y})\). From the Giry monad derivation we already have that the monoidal identity \(i : M \to \mathcal{P}(M)\) corresponds to a Dirac measure at a point, so that’s well and good. And we can define the tensor product \(\otimes\) between \(M\) and \(N\) as follows: let \(X \times Y\) be the standard Cartesian product on \(X\) and \(Y\) and let \(\mathcal{X} \otimes \mathcal{Y}\) be the smallest \(\sigma\)-algebra generated by the Cartesian product \(A \times B\) of measurable sets \(A \in \mathcal{X}\) and \(B \in \mathcal{Y}\). Then \((X \times Y, \mathcal{X} \otimes \mathcal{Y})\) is a measurable space, and so \((\textbf{Meas}, \otimes, i)\) is monoidal.

Recall that \(\mathcal{P}(M)\) and \(\mathcal{P}(N)\) - the space of measures over \(M\) and \(N\) respectively - are themselves objects in \(\textbf{Meas}\). So, clearly \(\mathcal{P}(M) \otimes \mathcal{P}(N)\) is a measurable space, and if \(\mathcal{P}\) is monoidal then there must exist a natural transformation that can take us from there to \(\mathcal{P}(M \otimes N)\). This is the space of measures over the product \(M \otimes N\).

So the question is: does \(\mathcal{P}\) have the required monoidal structure?

Yes. It must, since \(\mathcal{P}\) is a monad, and any monad can generate the required natural transformation. Let \(\mu\) be the monadic ‘join’ operator \(\mathcal{P}^2 \to \mathcal{P}\) and \(\eta\) be the monadic identity \(I \to \mathcal{P}\). We have, evaluating right-to-left:

\[\phi_{\nu \times \rho} = \mu \mathcal{P} \left\{ \lambda m . \mu \mathcal{P}\left(\lambda n. \eta_{m \times n}\right)\mathcal{P}(\rho) \right\} \mathcal{P}(\nu).\]

Using \(\gg\!\!=\) makes this much easier to read:

\[\phi_{\nu \times \rho} = \nu \gg\!\!= \lambda m. \rho \gg\!\!= \lambda n. \eta_{m \times n}\]

or in code, just:

phi :: Monad m => (m a, m b) -> m (a, b)
phi (m, n) = liftM2 (,) m n

So with that we have that \((\mathcal{P}, \phi, i)\) is a (lax) monoidal functor. And you can glean a monad-generated applicative operator from that immediately (this leads to the function called ‘ap’ in Control.Monad):

ap :: Monad m => m (a -> b) -> m a -> m b
ap f x = fmap (\(g, z) -> g z) (phi f x)

(Note: I won’t refer to \(\mu\) as the join operator from this point out in order to free it up for denoting measures.)

Probabilistic Interpretations

Product Measure

The correct probabilistic interpretation here is that \(\phi\) takes a pair of probability measures to the product measure over the appropriate product space. For probability measures \(\mu\) and \(\nu\) on measurable spaces \(M\) and \(N\) respectively, the product measure is the (unique) measure \(\mu \times \nu\) on \(M \otimes N\) such that:

\[(\mu \times \nu)(A \times B) = \mu(A) \nu(B)\]

for \(A \times B\) a measurable set in \(M \otimes N\).

Going through the monoidal functor route seems to put the notion of the Giry applicative instance on a more firm measure-theoretic foundation. Instead of considering the following from the Giry monad foundations article:

\[(\rho \, \langle \ast \rangle \, \nu)(f) = \int_{\mathcal{P}(M \to N)} \left\{\lambda T . \int_{M \to N} (f \circ T) d\nu \right\} d \rho\]

which is defined in terms of the dubious space of measures over measurable functions \(M \to N\), we can better view things using the monoidal structure-preserving natural transformation \(\phi\). For measures \(\mu\) and \(\nu\) on \((X, \mathcal{X})\) and \((Y, \mathcal{Y})\) respectively, we have:

\[\phi(\mu, \nu)(f) = \int_{X \times Y}f d(\mu \times \nu)\]

and then for \(g : Z \to X \otimes Y\) we can use the functor structure of \(\mathcal{P}\) to do:

\[(\text{fmap} \, g \, \phi(\mu, \nu))(f) = \int_{Z} (f \circ g) d((\mu \times \nu) \circ g^{-1})\]

which corresponds to a standard applicative expression g <$> mu <*> nu. I suspect there’s then some sort of Yoneda argument or something that makes currying and partial function application acceptable.

Independence

Now. What does this have to say about independence?

In particular, it’s too fast and loose to claim measures can be ‘independent’ at all. Independence is a property of measurable sets, measurable functions, and \(\sigma\)-algebras. Not of measures! But there is a really useful connection, so let’s illustrate that.

First, let’s define independence formally as follows. Take a probability space \((X, \mathcal{X}, \mathbb{P})\). Then any measurable sets \(A\) and \(B\) in \(\mathcal{X}\) are independent if

\[\mathbb{P}(A \cap B) = \mathbb{P}(A)\mathbb{P}(B).\]

That’s the simplest notion.

Next, consider two sub-\(\sigma\)-algebras \(\mathcal{A}\) and \(\mathcal{B}\) of \(\mathcal{X}\) (a sub-\(\sigma\)-algebra is just a a subset of a \(\sigma\)-algebra that itself happens to be a \(\sigma\) algebra). Then \(\mathcal{A}\) and \(\mathcal{B}\) are independent if, for any \(A\) in \(\mathcal{A}\) and any \(B\) in \(\mathcal{B}\), we have that \(A\) and \(B\) are independent.

The final example is independence of measurable functions. Take measurable functions \(f\) and \(g\) both from \(X\) to the real numbers equipped with some appropriate \(\sigma\)-algebra \(\mathcal{B}\). Then each of these functions generates a sub-\(\sigma\) algebra of \(\mathcal{X}\) as follows:

\[\begin{align*} \mathcal{X}_{f} & = \{ f^{-1}(B) : B \in \mathcal{B} \} \\ \mathcal{X}_{g} & = \{ g^{-1}(B) : B \in \mathcal{B} \}. \end{align*}\]

Then \(f\) and \(g\) are independent if the generated \(\sigma\)-algebras \(\mathcal{X}_{f}\) and \(\mathcal{X}_{g}\) are independent.

Note that in every case independence is defined in terms of a single measure, \(\mathbb{P}\). We can’t talk about different measures being independent. To paraphrase Terry Tao here:

The notion of independence between [measurable functions] does not make sense if the [measurable functions] are being modeled by separate probability spaces; they have to be coupled together into a single probability space before independence becomes a meaningful notion.

To be pedantic and explicitly specify the measure by which some things are independent, some authors state that measurable functions \(f\) and \(g\) are \(\mathbb{P}\)-independent, for example.

We can see a connection to independence when we look at convolution and associated operators. Recall that for measures \(\mu\) and \(\nu\) on the same measurable space \(M = (X, \mathcal{X})\) that supports some notion of addition, their convolution looks like:

\[(\mu + \nu)(f) = \int_{X}\int_{X} f(x + y) d\mu(x) d\nu(y).\]

The probabilistic interpretation here (see Terry Tao on this too) is that \(\mu + \nu\) is the measure corresponding to the sum of independent measurable functions \(g\) and \(h\) with corresponding measures \(\mu\) and \(\nu\) respectively.

That looks weird though, since we clearly defined independence between measurable functions using a single probability measure. How is it we can talk about independent measurable functions \(g\) and \(h\) having different corresponding measures?

We first need to couple everything together into a single probability space as per Terry’s quote. Complete \(M\) with some abstract probability measure \(\mathbb{P}\) to form the probability space \((X, \mathcal{X}, \mathbb{P})\). Now we have \(g\) and \(h\) measurable functions from \(X\) to \(\mathbb{R}\).

To say that \(g\) and \(h\) are independent is to say that their generated \(\sigma\)-algebras are \(\mathbb{P}\)-independent. And the measures that they correspond to are the pushforwards of \(\mathbb{P}\) under \(g\) and \(h\) respectively. So, \(\mu = \mathbb{P} \circ g^{-1}\) and \(\nu = \mathbb{P} \circ h^{-1}\). The result is that the measurable functions correspond to different (pushforward) measures \(\mu\) and \(\nu\), but are independent with respect to the same underlying probability measure \(\mathbb{P}\).

The monoidal structure of \(\mathcal{P}\) then gets us to convolution. Given a product of measures \(\mu\) and \(\nu\) each on \((X, \mathcal{X})\) we can immediately retrieve their product measure \(\mu \times \nu\) via \(\phi\). And from there we can get to \(\mu + \nu\) via the functor structure of \(\mathcal{P}\) - we just find the pushforward of \(\mu \times \nu\) with respect to a function \(\rho\) that collapses a product via addition. So \(\rho : X \times X \to \mathbb{R}\) is defined as:

\[\rho(a \times b) = a + b\]

and then the convolution \(\mu + \nu\) is thus:

\[\mu + \nu = (\mu \times \nu) \circ \rho^{-1}.\]

Other operations can be defined similarly, e.g. for \(\sigma(a \times b) = a - b\) we get:

\[\mu - \nu = (\mu \times \nu) \circ \sigma^{-1}.\]

The crux of all this is whenever we apply a measurable function to a product measure, we can always extract notions of independent measurable functions from the result. And the measures corresponding to those independent measurable functions will be the components of the product measure respectively.

This is super useful and lets one claim something stronger than what the monadic structure gives you. In an expression like g <$> mu <*> nu <*> rho, you are guaranteed that the corresponding random variables \(g_\mu\), \(g_\nu\), \(g_\rho\) (for suitable projections) are independent. The same cannot be said if you use the monadic structure to do something like g mu nu rho where the product structure is not enforced - in that case you’re not guaranteed anything of the sort. This is why the applicative structure is useful for encoding independence in a way that the monadic structure is not.

Conclusion

So there you have it. Applicativeness can seemingly be put on a straightforward measure-theoretic grounding and has some useful implications for independence.

It’s worth noting that, in the case of the Giry monad, we don’t need to go through its monadic structure in order to recover an applicative instance. We can do so entirely by hacking together continuations without using a single monadic bind. This is actually how I defined the applicative instance in the Giry monad implementation article previously:

instance Applicative Measure where
  pure x = Measure (\f -> f x)
  Measure g <*> Measure h = Measure $ \f ->
    g (\k -> h (f . k))

Teasing out the exact structure of this and its relation to the codensity monad is again something I’ll leave to others.