Embedded DSLs for Bayesian Modelling and Inference: a Retrospective
02 Jul 2018
Why does my blog often feature its typical motley mix of probability, functional programming, and computer science anyway?
From 2011 through 2017 I slogged through a Ph.D. in statistics, working on it full time in 2012, and part-time in every other year. It was an interesting experience. Although everything worked out for me in the end – I managed to do a lot of good and interesting work in industry while still picking up a Ph.D. on the side – it’s not something I’d necessarily recommend to others. The smart strategy is surely to choose one thing and give it one’s maximum effort; by splitting my time between work and academia, both obviously suffered to some degree.
That said, at the end of the day I was pretty happy with the results on both fronts. On the academic side of things, the main product was a dissertation, Embedded Domain-Specific Languages for Bayesian Modelling and Inference, supporting my thesis: that novel and useful DSLs for solving problems in Bayesian statistics can be embedded in statically-typed, purely functional programming languages.
It helps to remember that in this day and age, one can still typically graduate by, uh, “merely” submitting and defending a dissertation. Publishing in academic venues certainly helps focus one’s work, and is obviously necessary for a career in academia (or, increasingly, industrial research). But it’s optional when it comes to getting your degree, so if it doesn’t help you achieve your goals, you may want to reconsider it, as I did.
The problem with the dissertation-first approach, of course, is that nobody reads your work. To some extent I think I’ve mitigated that; most of the content in my dissertation is merely a fleshed-out version of various ideas I’ve written about on this blog. Here I’ll continue that tradition and write a brief, informal summary of my dissertation and Ph.D. more broadly – what I did, how I approached it, and what my thoughts are on everything after the fact.
The Idea
Following the advice of Olin Shivers (by way of Matt Might), I oriented my work around a concrete thesis, which wound up more or less being that embedding DSLs in a Haskell-like language can be a useful technique for solving statistical problems. This thesis wasn’t born into the world fully-formed, of course – it began as quite a vague (or misguided) thing, but matured naturally over time. Using the tools of programming languages and compilers to do statistics and machine learning is the motivation behind probabilistic programming in general; what I was interested in was exploring the problem in the setting of languages embedded in a purely functional host. Haskell was the obvious choice of host for all of my implementations.
It may sound obvious that putting together a thesis is a good strategy for a Ph.D. But here I’m talking about a thesis in the original (Greek) sense of a proposition, i.e. a falsifiable idea or claim (in contrast to a dissertation, from the Latin disserere, i.e. to examine or to discuss). Having a central idea to orient your work around can be immensely useful in terms of focus. When you read a dissertation with a clear thesis, it’s easy to know what the writer is generally on about – without one it can (increasingly) be tricky.
My thesis is pretty easy to defend in the abstract. A DSL really exposes the structure of one’s problem while also constraining it appropriately, and embedding one in a host language means that one doesn’t have to implement an entire compiler toolchain to support it. I reckoned that simply pointing the artillery of “language engineering” at the statistical domain would lead to some interesting insight on structure, and maybe even produce some useful tools. And it did!
The Contributions
Of course, one needs to do a little more defending than that to satisfy his or her examination committee. Doctoral research is supposed to be substantial and novel. In my experience, reviewers are concerned with your answers to the following questions:
- What, specifically, are your claims?
- Are they novel contributions to your field?
- Have you backed them up sufficiently?
At the end of the day, I claimed the following advances from my work.
- Novel probabilistic interpretations of the Giry monad’s algebraic structure. The Giry monad (Lawvere, 1962; Giry, 1981) is the “canonical” probability monad, in a meaningful sense, and I demonstrated that one can characterise the measure-theoretic notion of image measure by its functorial structure, as well as the notion of product measure by its monoidal structure. Having the former around makes it easy to transform the support of a probability distribution while leaving its density structure invariant, and the latter lets one encode probabilistic independence, enabling things like measure convolution. What’s more, the analogous semantics carry over to other probability monads – for example the well-known sampling monad, or more abstract variants.
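To give a flavour of this, here’s a minimal sketch using a hand-rolled toy sampling monad (the names are illustrative; this is not any particular library’s API). The functorial structure transforms a distribution’s support, i.e. image measure, and the applicative structure combines independent distributions, so that e.g. convolution is just a lifted sum:

```haskell
-- Illustrative sketch only: a toy sampling monad, i.e. a state transformer
-- over a pure PRNG.  None of these names come from the dissertation.
import Control.Monad (replicateM)
import System.Random (StdGen, mkStdGen, randomR)

newtype Sampler a = Sampler { runSampler :: StdGen -> (a, StdGen) }

instance Functor Sampler where
  fmap f (Sampler s) = Sampler (\g -> let (a, g') = s g in (f a, g'))

instance Applicative Sampler where
  pure a = Sampler (\g -> (a, g))
  Sampler sf <*> Sampler sa = Sampler (\g ->
    let (f, g')  = sf g
        (a, g'') = sa g'
    in  (f a, g''))

instance Monad Sampler where
  Sampler sa >>= k = Sampler (\g ->
    let (a, g') = sa g
    in  runSampler (k a) g')

uniform :: Sampler Double
uniform = Sampler (randomR (0, 1))

-- functorial structure: push a uniform distribution forward through exp,
-- transforming its support while leaving the underlying randomness alone
expUniform :: Sampler Double
expUniform = fmap exp uniform

-- applicative structure: the sum of two independent uniforms, i.e. their
-- convolution (a triangular distribution)
triangular :: Sampler Double
triangular = (+) <$> uniform <*> uniform

main :: IO ()
main = print (fst (runSampler (replicateM 5 triangular) (mkStdGen 42)))
```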
- A novel characterisation of the Giry monad as a restricted continuation monad. Ramsey & Pfeffer (2002) discussed an “expectation monad,” and I had independently come up with my own “measure monad” based on continuations. But I showed that both reduce to a restricted form of the continuation monad of Wadler (1994) – and that indeed, when the return type of Wadler’s continuation monad is restricted to the reals, it is the Giry monad.
To be precise, it’s actually somewhat more general – it permits integration with respect to any measure, not only a probability measure – but that definition strictly subsumes the Giry monad. I also showed that product measure, via the applicative instance, yields measure convolution and associated operations.
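Concretely, the representation looks something like the following sketch – simplified, and not necessarily what the measurable library actually does, but the same idea – where a measure is just the functional “integrate a function against me”:

```haskell
-- Illustrative sketch: a measure represented by its integration functional,
-- i.e. Wadler's continuation monad with its result type fixed to the reals.
import Control.Applicative (liftA2)

newtype Measure a = Measure { integrate :: (a -> Double) -> Double }

instance Functor Measure where
  fmap f nu = Measure (\g -> integrate nu (g . f))          -- image measure

instance Applicative Measure where
  pure x    = Measure (\g -> g x)                           -- Dirac measure at x
  mu <*> nu = Measure (\g -> integrate mu (\f -> integrate nu (g . f)))

instance Monad Measure where
  mu >>= k  = Measure (\g -> integrate mu (\x -> integrate (k x) g))  -- marginalisation

-- expectations and probabilities are just integrals
expectation :: Measure Double -> Double
expectation mu = integrate mu id

-- the applicative (product measure) structure encodes independence, so
-- measure convolution is a lifted sum
convolve :: Measure Double -> Measure Double -> Measure Double
convolve = liftA2 (+)

-- a discrete example: a fair coin on {0, 1}
coin :: Measure Double
coin = Measure (\g -> 0.5 * g 0 + 0.5 * g 1)
```

Here expectation (convolve coin coin) integrates the identity function against the sum of two independent coins and evaluates to 1.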
- A novel technique for embedding a statically-typed probabilistic programming language in a purely functional language. The general idea itself is well-known to those who have worked with DSLs in Haskell: one constructs a base functor and wraps it in the free monad. But the reason that technique is appropriate in the probabilistic programming domain is that probabilistic models are fundamentally monadic constructs – merely recall the existence of the Giry monad for proof!
To construct the requisite base functor, one maps some core set of concrete probability distributions denoted by the Giry monad to a collection of abstract probability distributions represented only by unique names. These constitute the branches of one’s base functor, which is then wrapped in the familiar ‘Free’ machinery that gives one access to the functorial, applicative, and monadic structure that I talked about above. This abstract representation of a probabilistic model allows one to implement other probability monads, such as the well-known sampling monad (Ramsey & Pfeffer, 2002; Park et al., 2008) or the Giry monad, by way of interpreters.
(N.b. Ścibior et al. (2015) did some very similar work to this, although the monad they used was arguably more operational in its flavour.)
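To sketch the shape of the thing (constructor and function names here are illustrative, not the dissertation’s exact ones): abstract distributions become constructors of a base functor, the free monad supplies the monadic structure, and an interpreter assigns the terms concrete semantics, e.g. forward sampling:

```haskell
{-# LANGUAGE DeriveFunctor #-}

-- Illustrative sketch of the free-monad embedding; uses the 'free' package.
import Control.Monad.Free (Free, liftF, iterM)
import System.Random (randomRIO)

-- abstract, named probability distributions: the base functor
data ModelF r =
    UniformF (Double -> r)
  | BernoulliF Double (Bool -> r)
  deriving Functor

-- a probabilistic model is a program in the free monad over ModelF
type Model = Free ModelF

uniform :: Model Double
uniform = liftF (UniformF id)

bernoulli :: Double -> Model Bool
bernoulli p = liftF (BernoulliF p id)

-- a tiny model: flip a coin whose bias is itself unknown
coinFromPrior :: Model Bool
coinFromPrior = do
  p <- uniform
  bernoulli p

-- one possible interpreter: forward sampling in IO
sample :: Model a -> IO a
sample = iterM go where
  go (UniformF k)     = randomRIO (0, 1) >>= k
  go (BernoulliF p k) = do
    z <- randomRIO (0, 1)
    k (z < p)
```

Running sample coinFromPrior draws a bias uniformly and then flips a coin with that bias; a measure-flavoured interpreter over the very same terms would compute probabilities instead, without changing the model.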
- A novel characterisation of execution traces as cofree comonads. The idea of an “execution trace” is that one runs a probabilistic program (typically generating a sample) and then records how it executed – what randomness was used, the execution path of the program, etc. To do Bayesian inference, one then runs a Markov chain over the space of possible execution traces, calculating statistics about the resulting distribution in trace space (Wingate et al., 2011).
Remarkably, a cofree comonad over the same abstract probabilistic base functor described above allows us to represent an execution trace at the embedded language level itself. In practical terms, that means one can denote a probabilistic model, and then run a Markov chain over the space of possible ways it could have executed, without leaving GHCi. You can alternatively examine and perturb the way the program executes, stepping through it piece by piece, as I believe was originally a feature in Venture (Mansinghka et al., 2014).
(N.b. this really blew my mind when I first started toying with it, converting programs into execution traces and then manipulating them as first-class values, defining other probabilistic programs over spaces of execution traces, etc. Meta.)
- A novel technique for statically encoding conditional independence of terms in this kind of embedded probabilistic programming language. If you recall that I previously demonstrated that the monoidal (i.e. applicative) structure of the Giry monad encodes the notion of product measure, it will not be too surprising to hear that I used the free applicative functor (Capriotti & Kaposi, 2014) (again, over the same kind of abstract probabilistic base functor) to reify applicative expressions such that they can be identified statically.
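A sketch of the flavour, using the free applicative from the free package over another illustrative base functor: because the applicative structure is reified, an expression can be analysed statically – for instance, counting its (conditionally independent) primitive terms without sampling anything:

```haskell
-- Illustrative sketch of the free-applicative encoding; uses the 'free' package.
import Control.Applicative.Free (Ap, liftAp, runAp_)
import Data.Monoid (Sum (..))

-- a hypothetical base functor of primitive distributions
data ProbF r =
    BetaF Double Double (Double -> r)
  | BernoulliF Double (Bool -> r)

beta :: Double -> Double -> Ap ProbF Double
beta a b = liftAp (BetaF a b id)

bernoulli :: Double -> Ap ProbF Bool
bernoulli p = liftAp (BernoulliF p id)

-- terms combined applicatively are independent by construction
pair :: Ap ProbF (Double, Bool)
pair = (,) <$> beta 1 1 <*> bernoulli 0.5

-- static analysis: count the independent primitive terms without running anything
numTerms :: Ap ProbF a -> Int
numTerms = getSum . runAp_ (const (Sum 1))
```

Here numTerms pair evaluates to 2, and richer static analyses over the reified structure follow the same pattern.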
- A novel shallowly-embedded language for building custom transition operators for use in Markov chain Monte Carlo. MCMC is the de facto standard way to perform inference on Bayesian models (although it is not limited to Bayesian models). By wrapping a simple state monad transformer around a probability monad, one can denote Markov transition operators, combine them, and transform them in a few ways that are useful for doing MCMC.
The framework here was inspired by the old parallel “strategies” idea of Trinder et al. (1998). The idea is that you want to “evaluate” a posterior via MCMC, and want to choose a strategy by which to do so – e.g. Metropolis (Metropolis, 1953), slice sampling (Neal, 2003), Hamiltonian (Neal, 2011), etc. Since Markov transition operators are closed under composition and convex combinations, it is easy to write a little shallowly-embedded combinator language for working with them – effectively building evaluation strategies in a manner familiar to those who’ve worked with Haskell’s parallel library.
(N.b. although this was the most trivial part of my research, both theoretically and implementation-wise, it remains the most useful for day-to-day practical work.)
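To illustrate the idea (this is a minimal sketch, not the declarative library’s actual API): a transition operator is a state transformer over a monad that supplies randomness, so sequential composition comes straight from the monad and convex combination from a coin flip:

```haskell
-- Illustrative sketch only; the type and combinators here are stand-ins,
-- not the declarative library's API.
import Control.Monad (replicateM_)
import Control.Monad.IO.Class (MonadIO, liftIO)
import Control.Monad.State.Strict (StateT, execStateT)
import System.Random (randomRIO)

-- a Markov transition operator perturbs the chain's state, drawing whatever
-- randomness it needs from the underlying monad
type Transition m a = StateT a m ()

-- transitions are closed under sequential composition...
interleave :: Monad m => Transition m a -> Transition m a -> Transition m a
interleave = (>>)

-- ...and under convex combination: use the first with probability p
frequency :: MonadIO m => Double -> Transition m a -> Transition m a -> Transition m a
frequency p t u = do
  z <- liftIO (randomRIO (0, 1))
  if z < p then t else u

-- a chain is just an iterated (possibly compound) transition
chain :: Monad m => Int -> Transition m a -> a -> m a
chain n t = execStateT (replicateM_ n t)
```

Primitive transitions (Metropolis, slice, Hamiltonian, and so on) are then written against the same type, and evaluation strategies are built by mixing and matching them per problem.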
The Execution
One needs to stitch his or her contributions together in some kind of over-arching narrative that supports the underlying thesis. Mine went something like this:
The Giry monad is appropriate for denoting probabilistic semantics in languages with purely-functional hosts. Its functorial, applicative, and monadic structures denote distribution transformation (image measure), independence, and marginalisation, respectively, and these are necessary and sufficient for encoding probabilistic models. An embedded language based on the Giry monad is type-safe and composable.
Probabilistic models in an embedded language, semantically denoted in terms of the Giry monad, can be made abstract and interpretation-independent by defining them in terms of a probabilistic base functor and a free monad instead. They can be forward-interpreted using standard free monad recursion schemes in order to compute probabilities (via a measure interpretation) or samples (via a sampling interpretation); the latter interpretation is useful for performing limited forms of Bayesian inference, in particular. These free-encoded models can also be transformed into cofree-encoded models, under which they represent execution traces that can be perturbed arbitrarily by standard comonadic machinery. This representation is amenable to more elaborate forms of Bayesian inference. To accurately denote conditional independence in the embedded language, the free applicative functor can also be used.
One can easily construct a shallowly-embedded language for building custom Markov transitions. Markov chains that use these compound transitions can outperform those that use only “primitive” transitions in certain settings. The shallowly embedded language guarantees that transitions can only be composed in well-defined, type-safe ways that preserve the properties desirable for MCMC. What’s more, one can implement “transition transformers” that layer still more complex inference techniques, e.g. annealing or tempering, over existing transitions.
Thus: novel and useful domain-specific languages for solving problems in Bayesian statistics can be embedded in statically-typed, purely-functional programming languages.
I used the twenty-minute talk period of my defence to go through this narrative and point out my claims, after which I was grilled on them for an hour or two. The defence was probably the funnest part of my whole Ph.D.
The Product
In the end, I mainly produced a dissertation, a few blog posts, and some code. By my count, the following repos came out of the work:
- deanie: An embedded probabilistic programming language. http://github.com/jtobin/deanie
- declarative: DIY Markov Chains. http://github.com/jtobin/declarative
- flat-mcmc: Painless general-purpose sampling. http://github.com/jtobin/flat-mcmc
- hasty-hamiltonian: Speedy traversal through parameter space. http://github.com/jtobin/hasty-hamiltonian
- hnuts: Automatic gradient-based sampling. http://github.com/jtobin/hnuts
- lazy-langevin: Gradient-based diffusion. http://github.com/jtobin/lazy-langevin
- mcmc-types: Common types for implementing MCMC algorithms. https://github.com/jtobin/mcmc-types
- measurable: A shallowly-embedded DSL for basic measure wrangling. http://github.com/jtobin/measurable
- mighty-metropolis: The Metropolis sampling algorithm. http://github.com/jtobin/mighty-metropolis
- mwc-probability: Sampling function-based probability distributions. http://github.com/jtobin/mwc-probability
- sampling: Tools for sampling from collections. https://github.com/jtobin/sampling
- speedy-slice: Speedy slice sampling. http://github.com/jtobin/speedy-slice
If any of this stuff is or was useful to you, that’s great! I still use the declarative libraries, flat-mcmc, mwc-probability, and sampling pretty regularly. They’re fast and convenient for practical work.
Some of the other stuff, e.g. measurable, is useful for building intuition, but not so much in practice, and deanie, for example, is a work-in-progress that will probably not see much more progress (from me, at least). Continuing from where I left off might be a good idea for someone who wants to explore problems in this kind of setting in the future.
General Thoughts
When I first read about probabilistic (functional) programming in Dan Roy’s 2011 dissertation I was absolutely blown away by the idea. It seemed that, since there was such an obvious connection between the structure of Bayesian models and programming languages (via the underlying semantic graph structure, something that has been exploited to some degree as far back as BUGS), it was only a matter of time until someone was able to really create a tool that would revolutionize the practice of Bayesian statistics.
Now I’m much more skeptical. It’s true that probabilistic programming tends to expose some beautiful structure in statistical models, and that a probabilistic programming language that was easy to use and “just worked” for inference would be a very useful tool. But putting something expressive and usable together that also “just works” for that inference step is very, very difficult. Very difficult indeed.
Almost every probabilistic programming framework of the past ten years, from Church down to my own stuff, has more or less wound up as “thesisware,” or remains the exclusive publication-generating mechanism of a single research group. The exceptions are almost unexceptional in and of themselves: JAGS and Stan are probably the most-used such frameworks, certainly in statistics (I will mention the very honourable PyMC here as well), but they innovate little, if at all, over the original BUGS in terms of expressiveness. Similarly, it’s very questionable whether the fancy MCMC algo du jour is really any better than some combination of Metropolis-Hastings (even plain Metropolis), Gibbs (or its approximate variant, slice sampling), or nested sampling in anything outside of favourably-engineered examples (I will note that Hamiltonian Monte Carlo could probably be counted in there too, but it can still be quite a pain to use, its variants are probably overrated, and it is comparatively expensive).
Don’t get me wrong. I am a militant Bayesian. Bayesian statistics, i.e., as far as I’m concerned, probability theory, describes the world accurately. And there’s nothing wrong with thesisware, either. Research is research, and this is a very thorny problem area. I hope to see more abandoned, innovative software that moves the ball up the field, or kicks it into another stadium entirely. Not less. The more ingenious implementations and sampling schemes out there, the better.
But more broadly, I often find myself in the camp of Leo Breiman, who in 2001 characterised the two predominant cultures in statistics as those of data modelling and algorithmic modelling respectively, the latter now known as machine learning, of course. The crux of the data modelling argument, which is of course predominant in probabilistic programming research and Bayesian statistics more generally, is that a practitioner, by means of his or her ingenuity, is able to suss out the essence of a problem and distill it into a useful equation or program. Certainly there is something to this: science is a matter of creating hypotheses, testing them against the world, and iterating on that, and the “data modelling” procedure is absolutely scientific in principle. Moreover, with a hat tip to Max Dama, one often wants to impose a lot of structure on a problem, especially if the problem is in a domain where there is a tremendous amount of noise. There are many areas where this approach is just the thing one is looking for.
That said, it seems to me that a lot of the data modelling-flavoured side of probabilistic programming, Bayesian nonparametrics, etc., is to some degree geared more towards being, uh, “research paper friendly” than anything else. These are extremely seductive areas for curious folks who like to play at the intersection of math, statistics, and computer science (raises hand), and one can easily spend a lifetime chasing this or that exquisite theoretical construct into any number of rabbit holes. But at the end of the day, the data modelling culture, per Breiman:
… has at its heart the belief that a statistician, by imagination and by looking at the data, can invent a reasonably good parametric class of models for a complex mechanism devised by nature.
Certainly the traditional statistics that Breiman wrote about in 2001 was very different from probabilistic programming and similar fields in 2018. But I think there is the same element of hubris in them, and to some extent, a similar dissociation from reality. I have cultivated some of the applied bent of a Breiman, or a Dama, or a Locklin, so perhaps this should not be too surprising.
I feel that the 2012-ish resurgence of neural networks jolted the machine learning community out of a large-scale descent into some rather dubious Bayesian nonparametrics research, which, much as I enjoy that subject area, seemed more geared towards generating fun machine learning summer school lectures and NIPS papers than actually getting much practical work done. I can’t help but feel that probabilistic programming may share a few of those same characteristics. When all is said and done, answering the question “is this stuff useful?” often feels like a stretch.
So: onward & upward and all that, but my enthusiasm has been tempered somewhat, is all.
Fini
Administrative headaches and the existential questions associated with grad school aside, I had a great time working in this area for a few years, if in my own aloof and eccentric way.
If you ever interacted with this area of my work, I hope you got some utility out of it: research ideas, use of my code, or just some blog post that you thought was interesting during a slow day at the office. If you’re working in the area, or are considering it, I wish you success, whether your goal is to build practical tools, or to publish sexy papers. :-)