# Fubini and Applicatives

Take an iterated integral, e.g. $\int_X \int_Y f(x, y) dy dx$. Fubini’s Theorem describes the conditions under which the order of integration can be swapped on this kind of thing while leaving its value invariant. If Fubini’s conditions are met, you can convert your integral into $\int_Y \int_X f(x, y) dx dy$ and be guaranteed to obtain the same result you would have gotten by going the other way.

What are these conditions? Just that you can glue your individual measures together as a product measure, and that $f$ is integrable with respect to it. I.e.,

Say you have a Giry monad implementation kicking around and you want to see how Fubini’s Theorem works in terms of applicative functors, monads, continuations, and all that. It’s pretty easy. You could start with my old measurable library that sits on GitHub and attracts curious stars from time to time and cook up the following example:

import Control.Applicative ((<$>), (<*>)) import Measurable dx :: Measure Int dx = bernoulli 0.5 dy :: Measure Double dy = beta 1 1 dprod :: Measure (Int, Double) dprod = (,) <$> dx <*> dy


Note that ‘dprod’ is clearly a product measure (I’ve constructed it using the Applicative instance for the Giry monad, so it must be a product measure) and take a simple, obviously integrable function:

add :: (Int, Double) -> Double
add (m, x) = fromIntegral m + x


Since ‘dprod’ is a product measure, Fubini’s Theorem guarantees that the following are equivalent:

i0 :: Double

i1 :: Double
i1 = integrate (\x -> integrate (curry add x) dy) dx

i2 :: Double
i2 = integrate (\y -> integrate (\x -> curry add x y) dx) dy


And indeed they are – you can verify them yourself if you don’t believe me (or our boy Fubini).

For an example of a where interchanging the order of integration would be impossible, we can construct some other measure:

dpair :: Measure (Int, Double)
dpair = do
x <- dx
y <- fmap (* fromIntegral x) dy
return (x, y)


It can be integrated as follows:

i3 :: Double
i3 = integrate (\x -> integrate (curry add x) (fmap (* fromIntegral x) dy)) dx


But notice how ‘dpair’ is constructed: it is strictly monadic, not applicative, so the order of the expressions matters. Since ‘dpair’ can’t be expressed as a product measure (i.e. by an applicative expression), Fubini says that swapping the order of integration is a no-no.

Note that if you were to just look at the types of ‘dprod’ and ‘dpair’ – both ‘Measure (Int, Double)’ – you wouldn’t be able to tell immediately that one represents a product measure while the other one does not. If being able to tell these things apart statically is important to you (say, you want to statically apply order-of-integration optimisations to integral expressions or what have you), you need look no further than the free applicative functor to help you out.

Fun fact: there is a well-known variant of Fubini’s Theorem, called Tonelli’s Theorem, that was developed by another Italian guy at around the same time. I’m not sure how early-20th century Italy became so strong in order-of-integration research, exactly.

# Byzantine Generals and Nakamoto Consensus

You can recognize truth by its beauty and simplicity.

– Richard Feynman (attributed)

In one of his early emails on the Cryptography mailing list, Satoshi claimed that the proof-of-work chain is a solution to the Byzantine Generals Problem (BGP). He describes this via an example where a bunch of generals – Byzantine ones, of course – collude to break a king’s wifi.

It’s interesting to look at this a little closer in the language of the originally-stated BGP itself. One doesn’t need to be too formal to glean useful intuition here.

What, more precisely, did Satoshi claim?

## The Decentralized Timestamp Server

Satoshi’s problem is that of a decentralized timestamp server (DTS). Namely, he posits that any number of nodes, following some protocol, can together act as a timestamping server – producing some consistent ordering on what we’ll consider to be abstract ‘blocks’.

The decentralized timestamp server reduces to an instance of the Byzantine Generals Problem as follows. There are a bunch of nodes, who could each be honest or dishonest. All honest nodes want to agree on some ordering – a history – of blocks, and a small number of dishonest nodes should not easily be able to compromise that history – say, by convincing the honest nodes to adopt some alternate one of their choosing.

(N.b. it’s unimportant here to be concerned about the contents of blocks. Since the decentralized timestamp server problem is only concerned about block orderings, we don’t need to consider the case of invalid transactions within blocks or what have you, and can safely assume that any history must be internally consistent. We only need to assume that child blocks depend utterly on their parents, so that rewriting a history by altering some parent block also necessitates rewriting its children, and that honest nodes are constantly trying to append blocks.)

As demonstrated in the introduction to the original paper, the Byzantine Generals Problem can be reduced to the problem of how any given node communicates its information to others. In our context, it reduces to the following:

Byzantine Generals Problem (DTS)

A node must broadcast a history of blocks to its peers, such that:

• (IC1) All honest peers agree on the history.
• (IC2) If the node is honest, then all honest peers agree with the history it broadcasts.

To produce consensus, every node will communicate its history to others by using a solution to the Byzantine Generals Problem.

## Longest Proof-of-Work Chain

Satoshi’s proposed solution to the BGP has since come to be known as ‘Nakamoto Consensus’. It is the following protocol:

Nakamoto Consensus

• Always use the longest history.
• Appending a block to any history requires a proof that a certain amount of work – proportional in expectation to the total ‘capability’ of the network – has been completed.

To examine how it works, consider an abstract network and communication medium. We can assume that messages are communicated instantly (it suffices that communication is dwarfed in time by actually producing a proof of work) and that the network is static and fixed, so that only active or ‘live’ nodes actually contribute to consensus.

The crux of Nakamoto consensus is that nodes must always use the longest available history – the one that provably has the largest amount of work invested in it – and appending to any history requires a nontrivial amount of work in of itself. Consider a set of nodes, each having some (not necessarily shared) history. Whenever any node broadcasts a one-block longer history, all honest nodes will immediately agree on it, and conditions (IC1) and (IC2) are thus automatically satisfied whether or not the broadcasting node is honest. Nakamoto Consensus trivially solves the BGP in this most important case; we can examine other cases by examining how they reduce to this one.

If two or more nodes broadcast longer histories at approximately the same time, then honest nodes may not agree on a single history for as long as it takes a longer history to be produced and broadcast. As soon as this occurs (which, in all probability, is only a matter of time), we reduce to the previous case in which all honest nodes agree with each other again, and the BGP is resolved.

The ‘bad’ outcome we’re primarily concerned about is that of dishonest nodes rewriting history in their favour, i.e. by replacing some history $\{\ldots, B_1, B_2, B_3, \ldots\}$ by another one $\{\ldots, B_1, B_2', B_3', \ldots\}$ that somehow benefits them. The idea here is that some dishonest node (or nodes) intends to use block $B_2$ as some sort of commitment, but later wants to renege. To do so, the node needs to rewrite not only $B_2$, but all other blocks that depend on $B_2$ (here $B_3$, etc.), ultimately producing a longer history than is currently agreed upon by honest peers.

Moreover, it needs to do this faster than honest nodes are able to produce longer histories on their own. Catching up to and exceeding the honest nodes becomes exponentially unlikely in the number of blocks to be rewritten, and so a measure of confidence can be ascribed to agreement on the state of any sub-history that has been ‘buried’ by a certain number of blocks (see the penultimate section of Satoshi’s paper for details).

Dishonest nodes that seek to replace some well-established, agreed-upon history with another will thus find it effectively impossible (i.e. the probability is negligible) unless they control a majority of the network’s capability – at which point they no longer constitute a small number of peers.

## Summary

So in the language of the originally-stated BGP: Satoshi claimed that the decentralized timestamp server is an instance of the Byzantine Generals Problem, and that Nakamoto Consensus (as it came to be known) is a solution to the Byzantine Generals Problem. Because Nakamoto Consensus solves the BGP, honest nodes that always use the longest proof-of-work history in the decentralized timestamp network will eventually come to consensus on the ordering of blocks.

# Recursive Stochastic Processes

Last week Dan Peebles asked me on Twitter if I knew of any writing on the use of recursion schemes for expressing stochastic processes or other probability distributions. And I don’t! So I’ll write some of what I do know myself.

There are a number of popular statistical models or stochastic processes that have an overtly recursive structure, and when one has some recursive structure lying around, the elegant way to represent it is by way of a recursion scheme. In the case of stochastic processes, this typically boils down to using an anamorphism to drive things. Or, if you actually want to be able to observe the thing (note: you do), an apomorphism.

By representing a stochastic process in this way one can really isolate the probabilistic phenomena involved in it. One bundles up the essence of a process in a coalgebra, and then drives it via some appropriate recursion scheme.

Let’s take a look at three stochastic processes and examine their probabilistic and recursive structures.

## Foundations

To start, I’m going to construct a simple embedded language in the spirit of the ones used in my simple probabilistic programming and comonadic inference posts. Check those posts out if this stuff looks too unfamiliar. Here’s a preamble that constitutes the skeleton of the code we’ll be working with.

{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE LambdaCase #-}
{-# LANGUAGE RankNTypes #-}
{-# LANGUAGE TypeFamilies #-}

import Data.Functor.Foldable
import Data.Random (RVar, sample)
import qualified Data.Random.Distribution.Bernoulli as RF
import qualified Data.Random.Distribution.Beta as RF
import qualified Data.Random.Distribution.Normal as RF

-- probabilistic instruction set, program definitions

data ModelF a r =
BernoulliF Double (Bool -> r)
| GaussianF Double Double (Double -> r)
| BetaF Double Double (Double -> r)
| DiracF a
deriving Functor

type Program a = Free (ModelF a)

type Model b = forall a. Program a b

type Terminating a = Program a a

-- core language terms

bernoulli :: Double -> Model Bool
bernoulli p = liftF (BernoulliF vp id) where
vp
| p < 0 = 0
| p > 1 = 1
| otherwise = p

gaussian :: Double -> Double -> Model Double
gaussian m s
| s <= 0    = error "gaussian: variance out of bounds"
| otherwise = liftF (GaussianF m s id)

beta :: Double -> Double -> Model Double
beta a b
| a <= 0 || b <= 0 = error "beta: parameter out of bounds"
| otherwise        = liftF (BetaF a b id)

dirac :: a -> Program a b
dirac x = liftF (DiracF x)

-- interpreter

rvar :: Program a a -> RVar a

### Independence

Now. What does this have to say about independence?

In particular, it’s too fast and loose to claim measures can be ‘independent’ at all. Independence is a property of measurable sets, measurable functions, and $\sigma$-algebras. Not of measures! But there is a really useful connection, so let’s illustrate that.

First, let’s define independence formally as follows. Take a probability space $(X, \mathcal{X}, \mathbb{P})$. Then any measurable sets $A$ and $B$ in $\mathcal{X}$ are independent if

That’s the simplest notion.

Next, consider two sub-$\sigma$-algebras $\mathcal{A}$ and $\mathcal{B}$ of $\mathcal{X}$ (a sub-$\sigma$-algebra is just a a subset of a $\sigma$-algebra that itself happens to be a $\sigma$ algebra). Then $\mathcal{A}$ and $\mathcal{B}$ are independent if, for any $A$ in $\mathcal{A}$ and any $B$ in $\mathcal{B}$, we have that $A$ and $B$ are independent.

The final example is independence of measurable functions. Take measurable functions $f$ and $g$ both from $X$ to the real numbers equipped with some appropriate $\sigma$-algebra $\mathcal{B}$. Then each of these functions generates a sub-$\sigma$ algebra of $\mathcal{X}$ as follows:

Then $f$ and $g$ are independent if the generated $\sigma$-algebras $\mathcal{X}_{f}$ and $\mathcal{X}_{g}$ are independent.

Note that in every case independence is defined in terms of a single measure, $\mathbb{P}$. We can’t talk about different measures being independent. To paraphrase Terry Tao here:

The notion of independence between [measurable functions] does not make sense if the [measurable functions] are being modeled by separate probability spaces; they have to be coupled together into a single probability space before independence becomes a meaningful notion.

To be pedantic and explicitly specify the measure by which some things are independent, some authors state that measurable functions $f$ and $g$ are $\mathbb{P}$-independent, for example.

We can see a connection to independence when we look at convolution and associated operators. Recall that for measures $\mu$ and $\nu$ on the same measurable space $M = (X, \mathcal{X})$ that supports some notion of addition, their convolution looks like:

The probabilistic interpretation here (see Terry Tao on this too) is that $\mu + \nu$ is the measure corresponding to the sum of independent measurable functions $g$ and $h$ with corresponding measures $\mu$ and $\nu$ respectively.

That looks weird though, since we clearly defined independence between measurable functions using a single probability measure. How is it we can talk about independent measurable functions $g$ and $h$ having different corresponding measures?

We first need to couple everything together into a single probability space as per Terry’s quote. Complete $M$ with some abstract probability measure $\mathbb{P}$ to form the probability space $(X, \mathcal{X}, \mathbb{P})$. Now we have $g$ and $h$ measurable functions from $X$ to $\mathbb{R}$.

To say that $g$ and $h$ are independent is to say that their generated $\sigma$-algebras are $\mathbb{P}$-independent. And the measures that they correspond to are the pushforwards of $\mathbb{P}$ under $g$ and $h$ respectively. So, $\mu = \mathbb{P} \circ g^{-1}$ and $\nu = \mathbb{P} \circ h^{-1}$. The result is that the measurable functions correspond to different (pushforward) measures $\mu$ and $\nu$, but are independent with respect to the same underlying probability measure $\mathbb{P}$.

The monoidal structure of $\mathcal{P}$ then gets us to convolution. Given a product of measures $\mu$ and $\nu$ each on $(X, \mathcal{X})$ we can immediately retrieve their product measure $\mu \times \nu$ via $\phi$. And from there we can get to $\mu + \nu$ via the functor structure of $\mathcal{P}$ - we just find the pushforward of $\mu \times \nu$ with respect to a function $\rho$ that collapses a product via addition. So $\rho : X \times X \to \mathbb{R}$ is defined as:

and then the convolution $\mu + \nu$ is thus:

Other operations can be defined similarly, e.g. for $\sigma(a \times b) = a - b$ we get:

The crux of all this is whenever we apply a measurable function to a product measure, we can always extract notions of independent measurable functions from the result. And the measures corresponding to those independent measurable functions will be the components of the product measure respectively.

This is super useful and lets one claim something stronger than what the monadic structure gives you. In an expression like g <$> mu <*> nu <*> rho, you are guaranteed that the corresponding random variables $g_\mu$, $g_\nu$, $g_\rho$ (for suitable projections) are independent. The same cannot be said if you use the monadic structure to do something like g mu nu rho where the product structure is not enforced - in that case you’re not guaranteed anything of the sort. This is why the applicative structure is useful for encoding independence in a way that the monadic structure is not. ## Conclusion So there you have it. Applicativeness can seemingly be put on a straightforward measure-theoretic grounding and has some useful implications for independence. It’s worth noting that, in the case of the Giry monad, we don’t need to go through its monadic structure in order to recover an applicative instance. We can do so entirely by hacking together continuations without using a single monadic bind. This is actually how I defined the applicative instance in the Giry monad implementation article previously: instance Applicative Measure where pure x = Measure (\f -> f x) Measure g <*> Measure h = Measure$ \f ->
g (\k -> h (f . k))


Teasing out the exact structure of this and its relation to the codensity monad is again something I’ll leave to others.

In my last post I went over the categorical and measure-theoretic foundations of the Giry monad, the ‘canonical’ probability monad that operates on the level of probability measures.

In this post I’ll pick up from where I left off and talk about a neat and faithful (if impractical) implementation of the Giry monad that one can put together in Haskell.

## Measure, Integral, and Continuation

So. For a quick review, we’ve established the Giry monad as a triple $(\mathcal{P}, \mu, \eta)$, where $\mathcal{P}$ is an endofunctor on the category of measurable spaces $\textbf{Meas}$, $\mu$ is a marginalizing integration operation defined by:

and $\eta$ is a monoidal identity, defined by the Dirac measure at a point:

How do we actually implement this beast? If we’re looking to be suitably general then it is unlikely that we’re going to be able to easily represent something like a $\sigma$-algebra over some space of measures on a computer, so that route is sort of a non-starter.

But it can be done. The key to implementing a general-purpose Giry monad is to notice that the fundamental operation involved in it is integration, and that we can avoid working with $\sigma$-algebras and measurable spaces directly if we focus on dealing with measurable functions instead of measurable sets.

Consider the integration map on measurable functions $\tau_f$ that we’ve been using this whole time. For some measurable function $f$, $\tau_f$ takes a measure on some measurable space $M = (X, \mathcal{X})$ and uses it to integrate $f$ over $X$. In other words:

A measure in $\mathcal{P}(M)$ has type $X \to \mathbb{R}$, so $\tau_f$ has corresponding type $(X \to \mathbb{R}) \to \mathbb{R}$.

This might look familiar to you; it’s very similar to the type signature for a continuation:

newtype Cont a r = Cont ((a -> r) -> r)


Indeed, if we restrict the carrier type of ‘Cont’ to the reals, we can be really faithful to the type:

newtype Integral a = Integral ((a -> Double) -> Double)


Now, let’s overload notation and call the integration map $\tau_f$ itself a measure. That is, $\tau_f$ is a mapping $\nu \mapsto \int_{X}fd\nu$, so we’ll just interpret the notation $\nu(f)$ to mean the same thing - $\int_{X}fd\nu$. This is convenient because we can dispense with $\tau$ and just pretend measures can be applied directly to measurable functions. There’s no way we can get confused here; measures operate on sets, not functions, so notation like $\nu(f)$ is not currently in use. We just set $\nu(f) = \tau_f(\nu)$ and that’s that. Let’s rename the ‘Integral’ type to match:

newtype Measure a = Measure ((a -> Double) -> Double)


We can extract a very nice shallowly-embedded language for integration here, the core of which is a single term:

integrate :: (a -> Double) -> Measure a -> Double
integrate f (Measure nu) = nu f


Note that this is the same way we’d express integration mathematically; we specify that we want to integrate a measurable function $f$ with respect to some measure $\nu$:

The only subtle difference here is that we don’t specify the space we’re integrating over in the integral expression - instead, we’ll bake that into the definition of the measures we create themselves. Details in a bit.

What’s interesting here is that the Giry monad is the continuation monad with the carrier type restricted to the reals. This isn’t surprising when you think about what’s going on here - we’re representing measures as integration procedures, that is, programs that take a measurable function as input and then compute its integral in some particular way. A measure, as we’ve implemented it here, is just a ‘program with a missing piece’. And this is exactly the essence of the continuation monad in Haskell.

## Typeclass Instances

We can fill out the functor, applicative, and monad instances mechanically by reference to the a standard continuation monad implementation, and each instance gives us some familiar conceptual structure or operation on probability measures. Let’s take a look.

The functor instance lets us transform the support of a measurable space while keeping its density structure invariant. If we have:

then mapping a measurable function over the measure corresponds to:

The functor structure allows us to precisely express a pushforward measure or distribution of $\nu$ under $g$. It lets us ‘adapt’ a measure to other measurable spaces, just like a good functor should.

In Haskell, the functor instance corresponds exactly to the math:

instance Functor Measure where
fmap g nu = Measure $\f -> integrate (f . g) nu  The monad instance is exactly the Giry monad structure that we developed previously, and it allows us to sequence probability measures together by marginalizing one into another. We’ll write it in terms of bind, of course, which went like: The Haskell translation is verbatim: instance Monad Measure where return x = Measure (\f -> f x) rho >>= g = Measure$ \f ->
integrate (\m -> integrate f (g m)) rho


Finally there’s the Applicative instance, which as I mentioned in the last post is sort of conceptually weird here. So in the spirit of that comment, I’m going to dodge any formal justification for now and just use the following instance which works in practice:

instance Applicative Measure where
pure x = Measure (\f -> f x)
Measure g <*> Measure h = Measure $\f -> g (\k -> h (f . k))  ## Conceptual Example It’s worth taking a look at an example of how things should conceivably work here. Consider the following probabilistic model: It’s a standard hierarchical presentation. A ‘compound’ measure can be obtained here by marginalizing over the beta measure $\pi$, and that’s called the beta-binomial measure. Let’s find it. The beta distribution has support on the $[0, 1]$ subset of the reals, and the binomial distribution with argument $n$ has support on the $\{0, \ldots, n\}$ subset of the integers, so we know that things should proceed like so: Eliding some theory of integration, I can tell you that $\pi$ is absolutely continuous with respect to Lebesgue measure and that $\mu(p)$ is absolutely continuous w/respect to counting measure for appropriate $p$. So, $\pi$ admits a density $d\pi/dx = g_\pi$ and $\mu(p)$ admits a density $d\mu(p)/d\# = g_{\mu(p)}$, defined as: and respectively, for $B$ the beta function and $\binom{n}{x}$ a binomial coefficient. Again, we can reduce the integral as follows, transforming the outermost integral into a standard Riemann integral and the innermost integral into a simple sum of products: where $dx$ denotes Lebesgue measure. I could expand this further or simplify things a little more (the beta and binomial are conjugates) but you get the point, which is that we have a way to evaluate the integral. What is really required here then is to be able to encode into the definitions of measures like $\pi$ and $\mu(p)$ the method of integration to use when evaluating them. For measures absolutely continuous w/respect to Lebesgue measure, we can use the Riemann integral over the reals. For measures absolutely continuous w/respect to counting measure, we can use a sum of products. In both cases, we’ll also need to supply the density or mass function by which the integral should be evaluated. ## Creating Measures Recall that we are representing measures as integration procedures. So to create one is to define a program by which we’ll perform integration. Let’s start with the conceptually simpler case of a probability measure that’s absolutely continuous with respect to counting measure. We need to provide a support (the region for which probability is greater than 0) and a probability mass function (so that we can weight every point appropriately). Then we just want to integrate a function by evaluating it at every point in the support, multiplying the result by that point’s probability mass, and summing everything up. In code, this translates to: fromMassFunction :: (a -> Double) -> [a] -> Measure a fromMassFunction f support = Measure$ \g ->
foldl' (\acc x -> acc + f x * g x) 0 support


So if we want to construct a binomial measure, we can do that like so (where choose comes from Numeric.SpecFunctions):

binomial :: Int -> Double -> Measure Int
binomial n p = fromMassFunction (pmf n p) [0..n] where
pmf n p x
| x < 0 || n < x = 0
| otherwise = choose n x * p ^^ x * (1 - p) ^^ (n - x)


The second example involves measures over the real line that are absolutely continuous with respect to Lebesgue measure. In this case we want to evaluate a Riemann integral over the entire real line, which is going to necessitate approximation on our part. There are a bunch of methods out there for approximating integrals, but a simple one for one-dimensional problems like this is quadrature, an implementation for which Ed Kmett has handily packaged up in his integration package:

fromDensityFunction :: (Double -> Double) -> Measure Double
fromDensityFunction d = Measure $\f -> quadratureTanhSinh (\x -> f x * d x) where quadratureTanhSinh = result . last . everywhere trap  Here we’re using quadrature to approximate the integral, but otherwise it has a similar form as ‘fromMassFunction’. The difference here is that we’re integrating over the entire real line, and so don’t have to supply a support explicitly. We can use this to create a beta measure (where logBeta again comes from Numeric.SpecFunctions): beta :: Double -> Double -> Measure Double beta a b = fromDensityFunction (density a b) where density a b p | p < 0 || p > 1 = 0 | otherwise = 1 / exp (logBeta a b) * p ** (a - 1) * (1 - p) ** (b - 1)  Note that since we’re going to be integrating over the entire real line and the beta distribution has support only over $[0, 1]$, we need to implicitly define the support here by specifying which regions of the domain will lead to a density of 0. In any case, now that we’ve constructed those things we can just use a monadic bind to create the beta-binomial measure we described before. It masks a lot of under-the-hood complexity. betaBinomial :: Int -> Double -> Double -> Measure Int betaBinomial n a b = beta a b >>= binomial n  There are a couple of other useful ways to create measures, but the most notable is to use a sample in order to create an empirical measure. This is equivalent to passing in a specific support for which the mass function assigns equal probability to every element; I’ll use Gabriel Gonzalez’s foldl package here as it’s pretty elegant: fromSample :: Foldable f => f a -> Measure a fromSample = Measure . flip weightedAverage weightedAverage :: (Foldable f, Fractional r) => (a -> r) -> f a -> r weightedAverage f = Foldl.fold (weightedAverageFold f) where weightedAverageFold :: Fractional r => (a -> r) -> Fold a r weightedAverageFold f = Foldl.premap f averageFold averageFold :: Fractional a => Fold a a averageFold = (/) <$> Foldl.sum <*> Foldl.genericLength


Using ‘fromSample’ you can create an empirical measure using just about anything you’d like:

data Foo = Foo | Bar | Baz

foos :: [Foo]
foos = [Foo, Foo, Bar, Foo, Baz, Foo, Bar, Foo, Foo, Foo, Bar]

nu :: Measure Foo
nu = fromSample foos


Though I won’t demonstrate it here, you can use this approach to also create measures from sampling functions or random variables that use a source of randomness - just draw a sample from the function and pipe the result into ‘fromSample’.

## Querying Measures

To query a measure is to simply get some result out of it, and we do that by integrating some measurable function against it. The easiest thing to do is to just take a straightforward expectation by integrating the identity function; for example, here’s the expected value of a beta(10, 10) measure:

> integrate id (beta 10 10)
0.49999999999501316


The expected value of a beta($\alpha$, $\beta$) distribution is $\alpha / (\alpha + \beta)$, so we can verify analytically that the result should be 0.5. We observe a bit of numerical imprecision here because, if you’ll recall, we’re just approximating the integral via quadrature. For measures created via ‘fromMassFunction’ we don’t need to use quadrature, so we won’t observe the same kind of approximation error. Here’s the expected value of a binomial(10, 0.5) measure, for example:

> integrate fromIntegral (binomial 10 0.5)
5.0


Note here that we’re integrating the ‘fromIntegral’ function against the binomial measure. This is because the binomial measure is defined over the integers, rather than the reals, and we always need to evaluate to a real when we integrate. That’s part of the definition of a measure!

Let’s calculate the expectation of the beta-binomial distribution with $n = 10$, $\alpha = 1$, and $\beta = 8$:

> integrate fromIntegral (betaBinomial 10 1 8)
1.108635884924813


Neato. And since we can integrate like this, we can really compute any of the moments of a measure. The first raw moment is what we’ve been doing here, and is called the expectation:

expectation :: Measure Double -> Double
expectation = integrate id


The second (central) moment is the variance. Here I mean variance in the moment-based sense, rather than as the possibly better-known sample variance:

variance :: Measure Double -> Double
variance nu = integrate (^ 2) nu - expectation nu ^ 2


The variance of a binomial($n$, $p$) distribution is known to be $np(1-p)$, so for $n = 10$ and $p = 0.5$ we should get 2.5:

> variance (binomial 10 0.5)
<interactive>:87:11: error:
• Couldn't match type ‘Int’ with ‘Double’
Expected type: Measure Double
Actual type: Measure Int
• In the first argument of ‘variance’, namely ‘(binomial 10 0.5)’
In the expression: variance (binomial 10 0.5)
In an equation for ‘it’: it = variance (binomial 10 0.5)


Ahhh, but remember: the binomial measure is defined over the integers, so we can’t integrate it directly. No matter - the functorial structure allows us to adapt it to any other measurable space via a measurable function:

> variance (fmap fromIntegral (binomial 10 0.5))
2.5


Expectation and variance (and other moments) are pretty well-known, but you can do more exotic things as well. You can calculate the moment generating function for a measure, for example:

momentGeneratingFunction :: Measure Double -> Double -> Double
momentGeneratingFunction nu t = integrate (\x -> exp (t * x)) nu


and the cumulant generating function follows naturally:

cumulantGeneratingFunction :: Measure Double -> Double -> Double
cumulantGeneratingFunction nu = log . momentGeneratingFunction nu


A particularly useful construct is the cumulative distribution function for a measure, which calculates the probability of a region less than or equal to some number:

cdf :: Measure Double -> Double -> Double
cdf nu x = integrate (negativeInfinity to x) nu

negativeInfinity :: Double
negativeInfinity = negate (1 / 0)

to :: (Num a, Ord a) => a -> a -> a -> a
to a b x
| x >= a && x <= b = 1
| otherwise        = 0


The beta(2, 2) distribution is symmetric around its mean 0.5, so the probability of the region $[0, 0.5]$ should itself be 0.5. This checks out as expected, modulo approximation error due to quadrature:

> cdf (beta 2 2) 0.5
0.4951814897381374


Similarly for measurable spaces without any notion of order, there’s a simple CDF analogue that calculates the probability of a region that contains the given points:

containing :: (Num a, Eq b) => [b] -> b -> a
containing xs x
| x elem xs = 1
| otherwise   = 0


And probably the least interesting query of all is the simple ‘volume’, which calculates the total measure of a space. For any probability measure this must obviously be one, so it can at least be used as a quick sanity check:

volume :: Measure Double -> Double
volume = integrate (const 1)


## Convolution and Friends

I mentioned in the last post that applicativeness corresponds to independence in some sense, and that independent measures over the same measurable space can be convolved together, à la:

for measures $\nu$ and $\zeta$ on $M$. In Haskell-land it’s well-known that any applicative instance gives you a free ‘Num’ instance, and the story is no different here:

instance Num a => Num (Measure a) where
(+)         = liftA2 (+)
(-)         = liftA2 (-)
(*)         = liftA2 (*)
abs         = fmap abs
signum      = fmap signum
fromInteger = pure . fromInteger


There are a few neat ways to demonstrate this kind of thing. Let’s use a Gaussian measure here as a running example:

gaussian :: Double -> Double -> Measure Double
gaussian m s = fromDensityFunction (density m s) where
density m s x
| s <= 0    = 0
| otherwise =
1 / (s * sqrt (2 * pi)) * exp (negate ((x - m) ^^ 2) / (2 * (s ^^ 2)))


First, consider a chi-squared measure with $k$ degrees of freedom. We could create this directly using a density function, but instead we can represent it by summing up independent squared Gaussian measures:

chisq :: Int -> Measure Double
chisq k = sum (replicate k normal) where
normal = fmap (^ 2) (gaussian 0 1)


To sanity check the result, we can compute the mean and variance of a $\chi^2(2)$ measure, which should be $k$ and $2k$ respectively for $k = 2$:

> expectation (chisq 2)
2.0000000000000004
> variance (chisq 2)
4.0


As a second example, consider a product of independent Gaussian measures. This is a trickier distribution to deal with analytically (see here), but we can use some well-known identities for general independent measures in order to verify our results. For any independent measures $\mu$ and $\nu$, we have:

and

for the expectation and variance of their product. So for a product of independent Gaussians w/parameters (1, 2) and (2, 3) respectively, we expect to see 2 for its expectation and 61 for its variance:

> expectation (gaussian 1 2 * gaussian 2 3)
2.0000000000000001
> variance (gaussian 1 2 * gaussian 2 3)
61.00000000000003


Woop!

## Wrapping Up

And there you have it, a continuation-based implementation of the Giry monad. You can find a bunch of code with similar functionality to this packaged up in my old measurable library on GitHub if you’d like to play around with the concepts.

That library has accumulated a few stars since I first pushed it up in 2013. I think a lot of people are curious about these weird measure things, and this framework at least gives you the ability to play around with a representation for measures directly. I found it particularly useful for really grokking, say, that integrating some function $f$ against a probability measure $\nu$ is identical to integrating the identity function against the probability measure $\texttt{fmap} \, f \, \nu$. And there are a few similar concepts there that I find really pop out when you play with measures directly, rather than when one just works with them on paper.

But let me now tell you why the Giry monad sucks in practice.

Take a look at this integral expression, which is brought about due to a monadic bind:

For simplicitly, let’s assume that $M$ is discrete and has cardinality $|M|$. This means that the integral reduces to

for $d\mu(m)$ and $d\nu$ the appropriate Radon-Nikodym derivatives. You can see that the total number of operations involved in the integral is $O(|M|^2)$, and indeed, for $p$ monadic binds the computational complexity involved in evaluating all the integrals involved is exponential, on the order of $|M|^p$. It was no coincidence that I demonstrated a variance calculation for a $\chi^2(2)$ distribution instead of for a $\chi^2(10)$.

This isn’t really much of a surprise - the cottage industry of approximating integrals exists because integration is hard in practice, and integration is surely best avoided whenever one can get away with doing so. Vikash Mansinghka’s quote on this topic is fitting: “don’t calculate probabilities - sample good guesses.” I’ll also add: relegate the measures to measure theory, where they seem to belong.

The Giry monad is a lovely abstract construction for formalizing the monadic structure of probability, and as canonical probabilistic objects, measures and integrals are tremendously useful when working theoretically. But they’re a complete non-starter when it comes to getting anything nontrivial done in practice. For that, there are far more useful representations for probability distributions in Haskell - notably, the sampling function or random variable representation found in things like mwc-probability/mwc-random-monad and random-fu, or even better, the structural representation based on free or operational monads like I’ve written about before, or that you can find in something like monad-bayes.

The intuitions gleaned from playing with the Giry monad carry over precisely to other representations for the probability monad. In all cases, ‘return’ will correspond, semantically, to constructing a Dirac distribution at a point, while ‘bind’ will correspond to a marginalizing operator. The same is true for the underlying (applicative) functor structure: ‘fmap’ always corresponds to a density-preserving transformation of the support, while applicativeness corresponds to independence (yielding convolution, etc.). And you have to admit, the connection to continuations is pretty cool.

There is clearly some connection to the codensity monad as well, but I think I’ll let someone else figure out the specifics of that one. Something something right-Kan extension..