Yo Dawg We Heard You Like Derivatives

I noticed this article by Tom Ellis today that provides an excellent ‘demystified’ introduction to automatic differentiation. His exposition is exceptionally clear and simple.

Hopefully not in the spirit of re-mystifying things too much, I wanted to demonstrate that this kind of forward-mode automatic differentiation can be implemented using a catamorphism, which cleans up the various let statements found in Tom’s version (at the expense of slightly more pattern matching).

Let me first duplicate his setup using the standard recursion scheme machinery:

{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE LambdaCase #-}

import Data.Functor.Foldable

data ExprF r =
    VarF
  | ZeroF
  | OneF
  | NegateF r
  | SumF r r
  | ProductF r r
  | ExpF r
  deriving (Show, Functor)

type Expr = Fix ExprF

Since my expression type uses a fixed-point wrapper I’ll define my own embedded language terms to get around it:

var :: Expr
var = Fix VarF

zero :: Expr
zero = Fix ZeroF

one :: Expr
one = Fix OneF

neg :: Expr -> Expr
neg x = Fix (NegateF x)

add :: Expr -> Expr -> Expr
add a b = Fix (SumF a b)

prod :: Expr -> Expr -> Expr
prod a b = Fix (ProductF a b)

e :: Expr -> Expr
e x = Fix (ExpF x)

To implement a corresponding eval function we can use a catamorphism:

eval :: Double -> Expr -> Double
eval x = cata $ \case
  VarF         -> x
  ZeroF        -> 0
  OneF         -> 1
  NegateF a    -> negate a
  SumF a b     -> a + b
  ProductF a b -> a * b
  ExpF a       -> exp a

Very clear. We just match things mechanically.
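
As a quick usage example (the expression here is just something I made up for illustration), evaluating x * (x + 1) at x = 2 in GHCi:

*Main> eval 2 (prod var (add var one))
6.0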

Now, symbolic differentiation. If you refer to the original diff function you’ll notice that in cases like Product or Exp there are uses of both an original expression and also its derivative. This can be captured simply by a paramorphism:

diff :: Expr -> Expr
diff = para $ \case
  VarF                     -> one
  ZeroF                    -> zero
  OneF                     -> zero
  NegateF (_, x')          -> neg x'
  SumF (_, x') (_, y')     -> add x' y'
  ProductF (x, x') (y, y') -> add (prod x y') (prod x' y)
  ExpF (x, x')             -> prod (e x) x'

Here the primes indicate derivatives in the usual fashion, and the standard rules of differentiation are self-explanatory.
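
As a quick hedged illustration (the test expression is my own, not from Tom's article), differentiating x * exp x symbolically should produce the usual product-rule expansion:

-- A hypothetical test expression: x * exp x.
testExpr :: Expr
testExpr = prod var (e var)

-- diff testExpr is then the term
--   add (prod var (prod (e var) one)) (prod one (e var))
-- i.e. the symbolic analogue of x * (exp x * 1) + 1 * exp x.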

For automatic differentiation we just do sort of the same thing, except we’re also going to lug around the evaluated function value itself at each point and evaluate to doubles instead of other expressions.

It’s worth noting here: why doubles? Because the expression type that we’ve defined has no notion of sharing, and thus the expressions will blow up à la diff (to see what I mean, try printing the analogue of diff bigExpression in GHCi). This could probably be mitigated by incorporating sharing into the embedded language in some way, but that’s a topic for another post.

Anyway, a catamorphism will do the trick:

ad :: Double -> Expr -> (Double, Double)
ad x = cata $ \case
  VarF                     -> (x, 1)
  ZeroF                    -> (0, 0)
  OneF                     -> (1, 0)
  NegateF (x, x')          -> (negate x, negate x')
  SumF (x, x') (y, y')     -> (x + y, x' + y')
  ProductF (x, x') (y, y') -> (x * y, x * y' + x' * y)
  ExpF (x, x')             -> (exp x, exp x * x')

Take a look at the pairs to the right of the pattern matches; the first element in each is just the corresponding term from eval, and the second is just the corresponding term from diff (made ‘Double’-friendly). The catamorphism gives us access to all the terms we need, and we can avoid a lot of work on the right-hand side by doing some more pattern matching on the left.

Some sanity checks to make sure that these functions match up with Tom’s:

*Main> map (snd . (`ad` testSmall)) [0.0009, 1.0, 1.0001]
[0.12254834896191881,1.0,1.0003000600100016]
*Main> map (snd . (`ad` testBig)) [0.00009, 1.0, 1.00001]
[3.2478565715996756e-6,1.0,1.0100754777229357]
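
As a further hedged check (using the hypothetical testExpr from earlier), the result of ad should agree with evaluating the expression and its symbolic derivative separately:

-- A hedged consistency property: both components of 'ad x' should
-- match 'eval x' applied to the original expression and to its
-- symbolic derivative. Both routes perform the same floating-point
-- operations here, so the match should be exact, but comparing
-- within a small tolerance is safer in general.
adMatchesDiff :: Double -> Expr -> Bool
adMatchesDiff x expr = ad x expr == (eval x expr, eval x (diff expr))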

UPDATE:

I had originally defined ad using a paramorphism but noticed that we can get by just fine with cata.

A Tour of Some Useful Recursive Types

I’m presently at NIPS and so felt like writing about some appropriate machine learning topic, but along the way I wound up talking about parameterized recursive types, and here we are. Enjoy!

One starts to see common ‘shapes’ in algebraic data types after working with them for a while. Take the natural numbers and a standard linked list, for example:

data Natural =
    One
  | Succ Natural

data List a =
    Empty
  | Cons a (List a)

These are similar in some sense. There are some differences - a list has an additional type parameter, and each recursive point in the list is tagged with a value of that type - but the nature of the recursion in each is the same. There is a single recursive point wrapped up in a single constructor, plus a single base case.

Consider a recursive type that is parameterized by a functor with kind ‘* -> *’, such that the kind of the resulting type is something like ‘(* -> *) -> *’ or ‘(* -> *) -> * -> *’ or so on. It’s interesting to look at the ‘shapes’ of some useful types like this and see what kind of similarities and differences in recursive structure that we can find.

In this article we’ll look at three such recursive types: ‘Fix’, ‘Free’, and ‘Cofree’. I’ll demonstrate that each can be viewed as a kind of program parameterized by some underlying instruction set.

Fix

To start, let’s review the famous fixed-point type ‘Fix’. I’ve talked about it before, but will go into a bit more detail here.

{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE StandaloneDeriving #-}
{-# LANGUAGE UndecidableInstances #-}

newtype Fix f = Fix (f (Fix f))

deriving instance (Show (f (Fix f))) => Show (Fix f)

Note: I’ll omit interpreter output for examples throughout this article, but feel free to try the code yourself in GHCi. I’ll post some gists at the bottom. The above code block also contains some pragmas that you can ignore; they’re just there to help GHC derive some instances for us.

Anyway. ‘Fix’ is in some sense a template recursive structure. It relies on some underlying functor ‘f’ to define the scope of recursion that you can expect a value with type ‘Fix f’ to support. There is the degenerate constant case, for example, which supports no recursion:

data DegenerateF r = DegenerateF
  deriving (Functor, Show)

type Degenerate = Fix DegenerateF

degenerate :: Degenerate
degenerate = Fix DegenerateF

Then you have the case like the one below, where only an infinitely recursive value exists:

newtype InfiniteF r = InfiniteF r
  deriving (Functor, Show)

type Infinite = Fix InfiniteF

infinite :: Infinite
infinite = Fix (InfiniteF infinite)

But in practice you’ll have something in between; a type with at least one recursive point or ‘running’ case and also at least one base or ‘terminating’ case. Take the natural numbers, for example:

data NatF r =
    OneF
  | SuccF r
  deriving (Functor, Show)

type Nat = Fix NatF

one :: Nat
one = Fix OneF

succ :: Nat -> Nat
succ = Fix . SuccF

Here ‘NatF’ provides both a ‘running’ case - ‘SuccF’ - and a ‘terminating’ case - ‘OneF’. ‘Fix’ just lets ‘NatF’ do whatever it wants, having no say of its own about termination. In fact, we could have defined ‘Fix’ like this:

data Program f = Running (f (Program f))

Indeed, you can think of ‘Fix’ as defining a program that runs until ‘f’ decides to terminate. In turn, you can think of ‘f’ as an instruction set for the program. The whole shebang of ‘Fix f’ may only terminate if ‘f’ contains a terminating instruction.

Here’s a simple set of instructions, for example:

data Instruction r =
    Increment r
  | Decrement r
  | Terminate
  deriving (Functor, Show)

increment :: Program Instruction -> Program Instruction
increment = Running . Increment

decrement :: Program Instruction -> Program Instruction
decrement = Running . Decrement

terminate :: Program Instruction
terminate = Running Terminate

And we can write a sort of stack-based program like so:

program :: Program Instruction
program =
    increment
  . increment
  . decrement
  $ terminate
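
To make the program analogy concrete, here's a minimal hedged interpreter (my own sketch, not part of the original exposition) that runs such a program against an integer accumulator:

-- A hypothetical interpreter: walk the program, adjusting a counter
-- until 'f' issues the Terminate instruction.
run :: Int -> Program Instruction -> Int
run acc (Running instruction) = case instruction of
  Increment rest -> run (acc + 1) rest
  Decrement rest -> run (acc - 1) rest
  Terminate      -> acc

Evaluating ‘run 0 program’ should then yield 1.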

Richness of ‘Fix’

It’s worthwhile to review two functions that are useful for working with ‘Fix’, unimaginatively named ‘fix’ and ‘unfix’:

fix :: f (Fix f) -> Fix f
fix = Fix

unfix :: Fix f -> f (Fix f)
unfix (Fix f) = f

You can think of them like so: ‘fix’ embeds a value of type ‘f’ into a recursive structure by adding a new layer of recursion, while ‘unfix’ projects a value of type ‘f’ out of a recursive structure by peeling back a layer of recursion.

This is a pretty rich recursive structure - we have a guarantee that we can always embed into or project out of something with type ‘Fix f’, no matter what ‘f’ is.

Free

Next up is ‘Free’, which is really just ‘Fix’ with some added structure. It is defined as follows:

data Free f a =
    Free (f (Free f a))
  | Pure a
  deriving Functor

deriving instance (Show a, Show (f (Free f a))) => Show (Free f a)

The ‘Free’ constructor has an analogous definition to the ‘Fix’ constructor, meaning we can use ‘Free’ to implement the same things we did previously. Here are the natural numbers redux, for example:

type NatFree = Free NatF

oneFree :: NatFree a
oneFree = Free OneF

succFree :: NatFree a -> NatFree a
succFree = Free . SuccF

There’s also another branch here called ‘Pure’, though, that just bluntly wraps a value of type ‘a’, and has nothing to do with the parameter ‘f’. This has an interesting consequence: it means that ‘Free’ can have an opinion of its own about termination, regardless of what ‘f’ might decree:

type NotSoInfinite = Free InfiniteF

notSoInfinite :: NotSoInfinite ()
notSoInfinite = Free (InfiniteF (Free (InfiniteF (Pure ()))))

(Note that here I’ve returned the value of type unit when terminating under the ‘Pure’ branch, but you could pick whatever else you’d like.)

You’ll recall that ‘InfiniteF’ provides no terminating instruction, and left to its own devices will just recurse endlessly.

So: instead of being forced to choose a branch of the underlying functor to recurse on, ‘Free’ can just bail out on a whim and return some value wrapped up in ‘Pure’. We could have defined the whole type like this:

data Program f a =
    Running (f (Program f a))
  | Terminated a
  deriving Functor

Again, it’s ‘Fix’ with more structure. It’s a program that runs until ‘f’ decides to terminate, or that terminates and returns a value of type ‘a’.

As a quick illustration, take our simple stack-based instruction set again. We can define the following embedded language terms:

increment :: Program Instruction a -> Program Instruction a
increment = Running . Increment

decrement :: Program Instruction a -> Program Instruction a
decrement = Running . Decrement

terminate :: Program Instruction a
terminate = Running Terminate

sigkill :: Program f Int
sigkill = Terminated 1

So note that ‘sigkill’ is independent of whatever instruction set we’re working with. We can thus write another simple program like before, except this time have ‘sigkill’ terminate it:

program :: Program Instruction Int
program =
    increment
  . increment
  . decrement
  $ sigkill
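
As with the ‘Fix’ version, a minimal hedged interpreter (again my own sketch) makes the analogy concrete. Either ‘f’ terminates the program via ‘Terminate’, in which case we return the accumulated count, or the program terminates itself and we return the wrapped value:

-- A hypothetical interpreter for the 'Free'-style program.
run :: Int -> Program Instruction Int -> Int
run acc (Running instruction) = case instruction of
  Increment rest -> run (acc + 1) rest
  Decrement rest -> run (acc - 1) rest
  Terminate      -> acc
run _ (Terminated a) = a

Here ‘run 0 program’ should yield 1 - the value supplied to ‘Terminated’ via ‘sigkill’ - regardless of the count accumulated along the way.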

Richness of ‘Free’

Try to define the equivalent versions of ‘fix’ and ‘unfix’ for ‘Free’. The equivalent to ‘fix’ is easy:

free :: f (Free f a) -> Free f a
free = Free

You’ll hit a wall, though, if you want to implement the (total) analogue to ‘unfix’. One wants a function of type ‘Free f a -> f (Free f a)’, but the existence of the ‘Pure’ branch makes this impossible to implement totally. In general there is not going to be an ‘f’ to pluck out:

unfree :: Free f a -> f (Free f a)
unfree (Free f) = f
unfree (Pure a) = error "kaboom"

The recursion provided by ‘Free’ is thus a little less rich than that provided by ‘Fix’. With ‘Fix’ one can always project a value out of its recursive structure - but that’s not the case with ‘Free’.

It’s well-known that ‘Free’ is monadic, and indeed it’s usually called the ‘free monad’. The namesake ‘free’ comes from an algebraic definition; roughly, a free ‘foo’ is a ‘foo’ that satisfies the minimum possible constraints to make it a ‘foo’, and nothing else. Check out the slides from Dan Piponi’s excellent talk from BayHac a few years back for a deeper dive on algebraic freeness.

Cofree

‘Cofree’ is also like ‘Fix’, but again with some extra structure. It can be defined as follows:

data Cofree f a = Cofree a (f (Cofree f a))
  deriving Functor

deriving instance (Show a, Show (f (Cofree f a))) => Show (Cofree f a)

Again, part of the definition - the second field of the ‘Cofree’ constructor - looks just like ‘Fix’. So predictably we can do a redux-redux of the natural numbers using ‘Cofree’:

type NatCofree = Cofree NatF

oneCofree :: NatCofree ()
oneCofree = Cofree () OneF

succCofree :: NatCofree () -> NatCofree ()
succCofree f = Cofree () (SuccF f)

(Note that here I’ve again used unit to fill in the first field - you could of course choose whatever you’d like.)

This looks a lot like ‘Free’, and in fact it’s the categorical dual of ‘Free’. Whereas ‘Free’ is a sum type with two branches, ‘Cofree’ is a product type with two fields. In the case of ‘Free’, we could have a program that either runs an instruction from a set ‘f’, or terminates with a value having type ‘a’. In the case of ‘Cofree’, we have a program that runs an instruction from a set ‘f’ and returns a value of type ‘a’.

A ‘Free’ value thus contains at most one recursive point wrapping the value with type ‘a’, while a ‘Cofree’ value contains potentially infinite recursive points - each one of which is tagged with a value of type ‘a’.

Rolling with the ‘Program’ analogy, we could have written this alternate definition for ‘Cofree’:

data Program f a = Program {
    annotation :: a
  , running    :: f (Program f a)
  } deriving Show

A ‘Cofree’ value is thus a program in which every instruction is annotated with a value of type ‘a’. This means that, unlike ‘Free’, it can’t have its own opinion on termination. Like ‘Fix’, it has to let ‘f’ decide how to do that.

We’ll use the stack-based instruction set example to wrap up. Here we can annotate instructions with progress about how many instructions remain to execute. First our new embedded language terms:

increment :: Program Instruction Int -> Program Instruction Int
increment p = Program (remaining p) (Increment p)

decrement :: Program Instruction Int -> Program Instruction Int
decrement p = Program (remaining p) (Decrement p)

terminate :: Program Instruction Int
terminate = Program 0 Terminate

Notice that two of these terms use a helper function ‘remaining’ that counts the number of instructions left in the program. It’s defined as follows:

remaining :: Program Instruction Int -> Int
remaining = loop where
  loop (Program a f) = case f of
    Increment p -> succ (loop p)
    Decrement p -> succ (loop p)
    Terminate   -> succ a

And we can write our toy program like so:

program :: Program Instruction Int
program =
    increment
  . increment
  . decrement
  $ terminate

Evaluate it in GHCi to see what the resulting value looks like.
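
If I've traced the code correctly, the annotations count down the number of instructions left to run, so the value should look roughly like this (modulo the derived Show output, which will include the record field names):

-- Program 3 (Increment
--   (Program 2 (Increment
--     (Program 1 (Decrement
--       (Program 0 Terminate))))))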

Richness of ‘Cofree’

If you try and implement the ‘fix’ and ‘unfix’ analogues for ‘Cofree’ you’ll rapidly infer that we have the opposite situation to ‘Free’ here. Implementing the ‘unfix’ analogue is easy:

uncofree :: Cofree f a -> f (Cofree f a)
uncofree (Cofree _ f) = f

But implementing a total function corresponding to ‘fix’ is impossible - we can’t just come up with something of arbitrary type ‘a’ to tag an instruction ‘f’ with, so, like before, we can’t do any better than define something partially:

cofree :: f (Cofree f a) -> Cofree f a
cofree f = Cofree (error "kaboom") f

Just as how ‘Free’ forms a monad, ‘Cofree’ forms a comonad. It’s thus known as the ‘cofree comonad’, though I can’t claim to really have any idea what the algebraic notion of ‘cofreeness’ captures, exactly.

Wrapping Up

So: ‘Fix’, ‘Free’, and ‘Cofree’ all share a similar sort of recursive structure that make them useful for encoding programs, given some instruction set. And while their definitions are similar, ‘Fix’ supports the richest recursion of the three in some sense - it can both ‘embed’ things into and ‘project’ things out of its recursive structure, while ‘Free’ supports only embedding and ‘Cofree’ supports only projecting.

This has a practical implication: it means one can’t make use of certain recursion schemes for ‘Free’ and ‘Cofree’ in the same way that one can for ‘Fix’. There do exist analogues, but they’re sort of out-of-scope for this post.

I haven’t actually mentioned any truly practical uses of ‘Free’ and ‘Cofree’ here, but they’re wonderful things to keep in your toolkit if you’re doing any work with embedded languages, and I’ll likely write more about them in the future. In the meantime, Dave Laing wrote an excellent series of posts on ‘Free’ and ‘Cofree’ that are more than worth reading. They go into much more interesting detail than I’ve done here - in particular he details a nice pairing that exists between ‘Free’ and ‘Cofree’ (also discussed by Dan Piponi), plus a whack of examples.

You can also find industrial-strength infrastructure for both ‘Free’ and ‘Cofree’ in Edward Kmett’s excellent free library, and for ‘Fix’ in recursion-schemes.

I’ve dumped the code for this article into a few gists. Here’s one of everything excluding the running ‘Program’ examples, and here are the corresponding ‘Program’ examples for the Fix, Free, and Cofree cases respectively.

Thanks to Fredrik Olsen for review and great feedback.

Sorting with Style

Merge sort is a famous comparison-based sorting algorithm that starts by first recursively dividing a collection of orderable elements into smaller subcollections, and then finishes by recursively sorting and merging the smaller subcollections together to reconstruct the (now sorted) original.

A clear implementation of mergesort should by definition be as faithful to that high-level description as possible. We can get pretty close to that using the whole recursion schemes business that I’ve talked about in the past. Near the end of that article I briefly mentioned the idea of implementing mergesort via a hylomorphism, and here I just want to elaborate on that a little.

Start with a collection of orderable elements. We can divide the collection into a bunch of smaller collections by using a binary tree:

{-# LANGUAGE DeriveFunctor #-}

import Data.Functor.Foldable (hylo)
import Data.List.Ordered (merge)

data Tree a r =
    Empty
  | Leaf a
  | Node r r
  deriving Functor

The idea is that each node in the tree holds two subtrees, each of which contains half of the remaining elements. We can build a tree like this from a collection - say, a basic Haskell list. The following unfolder function defines what part of a tree to build for any corresponding part of a list:

unfolder []  = Empty
unfolder [x] = Leaf x
unfolder xs  = Node l r where
  (l, r) = splitAt (length xs `div` 2) xs

On the other hand, we can also collapse an existing tree back into a list. The following folder function defines how to collapse any given part of a tree into the corresponding part of a list; again we just pattern match on whatever part of the tree we’re looking at, and construct the complementary list:

folder Empty      = []
folder (Leaf x)   = [x]
folder (Node l r) = merge l r
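
For reference - and as a hedged note, since the signatures are left off above - the inferred types of these helpers line up exactly with what a hylomorphism over ‘Tree’ expects:

-- Inferred signatures (assuming the definitions above):
--
--   unfolder :: [a] -> Tree a [a]
--   folder   :: Ord a => Tree a [a] -> [a]
--
-- i.e. a coalgebra for building the tree and an algebra for tearing
-- it back down into a sorted list.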

Now to sort a list we can just glue these instructions together using a hylomorphism:

mergesort :: Ord a => [a] -> [a]
mergesort = hylo folder unfolder

And it works just like you’d expect:

> mergesort [1,10,3,4,5]
[1,3,4,5,10]
> mergesort "aloha"
"aahlo"
> mergesort [True, False, False, True, False]
[False, False, False, True, True]

Pretty concise!

The code is eminently clean and faithful to the high-level algorithm description: first recursively divide a collection into smaller subcollections - via a binary tree and unfolder - and then recursively sort and merge the subcollections to reconstruct the (now sorted) original one - via folder.

A version of this post originally appeared on the Fugue blog.

Markov Chains à la Carte

I’ve released a number of libraries for doing Markov Chain Monte Carlo (MCMC) in Haskell.

You can get at them via a ‘frontend’ library, declarative, but each can also be used fruitfully on its own. À la carte, if you will.

Some background: MCMC is a family of stateful algorithms for sampling from a large class of probability distributions. Typically one is interested in doing this to approximate difficult integrals; instead of choosing some suitable grid of points in parameter space over which to approximate an integral, just offload the problem to probability theory and use a Markov chain to find them for you.

For an excellent introduction to MCMC you won’t find better than Iain Murray’s lectures from MLSS ’09 in Cambridge, so check those out if you’re interested in more details.

I’ve put together a handful of popular MCMC algorithms as well as an easy way to glue them together in a couple of useful ways. At present these implementations are useful in cases where you can write your target function in closed form, and that’s pretty much all that’s required (aside from the standard algorithm-specific tuning parameters).

The API should be pretty easy to work with — write your target as a function of its parameters, specify a start location, and away you go. It’s also cool if your target accepts its parameters via most common traversable functors — lists, vectors, sequences, maps, etc.

That’s sort of the goal of this first release: if you can give me a target function, I’ll do my best to give you samples from it. Less is more and all that.

What’s In The Box

There are a number of libraries involved. I have a few more in the queue and there are a number of additional features I plan to support for these ones in particular, but without further ado:

  • mwc-probability, a sampling-function based probability monad implemented as a thin wrapper over the excellent mwc-random library.
  • mcmc-types, housing a number of types used by the whole family.
  • mighty-metropolis, an implementation of the famous Metropolis algorithm.
  • speedy-slice, a slice sampling implementation suitable for both continuous & discrete parameter spaces.
  • hasty-hamiltonian, an implementation of the gradient-based Hamiltonian Monte Carlo algorithm.
  • declarative, the one ring to rule them all.

Pull down declarative if you just want to have access to all of them. If you’re a Haskell neophyte you can find installation instructions at the Github repo.

Motivation

MCMC is fundamentally about observing Markov chains over probability spaces. In this context a chain is a stochastic process that wanders around a state space, eventually visiting regions of the space in proportion to their probability.

Markov chains are constructed by transition operators that obey the Markov property: that the probability of transitioning to the next location — conditional on the history of the chain — depends only on the current location. For MCMC we’re also interested in operators that satisfy the reversibility property — that the probability a transition from state A to state B occurs is the same as that a transition from state B to state A occurs. A chain is characterized by a transition operator T that drives it from state to state, and for MCMC we want the stationary or limiting distribution of the chain to be the distribution we’re sampling from.

One of the major cottage industries in Bayesian research is inventing new transition operators to drive the Markov chains used in MCMC. This has been fruitful, but it could likely be aided by a practical way to make existing transition operators work together.

This is easy to do in theory: there are a couple of ways to combine transition operators such that the resulting composite operator preserves all the properties we’re interested in for MCMC — the stationary distribution, reversibility, and Markov property. See Geyer, 2005 for details here, but the crux is that we can establish the following simple grammar for transition operators:

transition ::= primitive <transition>
             | concat transition transition
             | sample transition transition

A transition is either some primitive operator, a deterministic concatenation of operators (via ‘concat’), or a probabilistic concatenation of operators (via ‘sample’). A deterministic concatenation works by just transitioning through two operators one after the other; a probabilistic concatenation works by randomly choosing one transition operator or the other to use on any given transition. These kinds of concatenation preserve all the properties we’re interested in.

We can trivially generalize this further by adding a term that concatenates n transition operators together deterministically, or another for probabilistically concatenating a bunch of operators according to some desired probability distribution.

The idea here is that there are tradeoffs involved in different transition operators. Some may be more computationally expensive than others (perhaps requiring a gradient evaluation, or evaluation of some inner loop) but have better ability to make ‘good’ transitions in certain situations. Other operators are cheap, but can be inefficient (taking a long time to visit certain regions of the space).

By employing deterministic or probabilistic concatenation, one can concoct a Markov chain that uses a varied range of tuning parameters, for example. Or only occasionally employs a computationally expensive transition, otherwise preferring some cheaper, reliable operator.

Usage

The declarative library implements this simple language for transition operators, and the mighty-metropolis, speedy-slice, and hasty-hamiltonian libraries provide some primitive transitions that you can combine as needed.

The Metropolis and slice sampling transitions are cheap and require little information, whereas Hamiltonian Monte Carlo exploits information about the target’s gradient and also involves evaluation of an inner loop (the length of which is determined by a tuning parameter). Feel free to use one that suits your problem, or combine them together using the combinators supplied in declarative to build a custom solution.

As an example, the Rosenbrock density is a great test dummy as it’s simple, low-dimensional, and can be easily visualized, but it still exhibits a pathological anisotropic structure that makes it somewhat tricky to sample from.

Getting started via declarative is pretty simple:

import Numeric.MCMC

You’ll want to supply a target to sample over, and if you want to use an algorithm like Hamiltonian Monte Carlo you’ll also need to provide a gradient. If you can’t be bothered to calculate gradients by hand, you can always turn to your friend automatic differentiation:

import Numeric.AD

The Rosenbrock log-density and its gradient can then be written as follows:

target :: Num a => [a] -> a
target [x0, x1] = negate (100 * (x1 - x0 ^ 2) ^ 2 + (1 - x0) ^ 2)

gTarget :: Num a => [a] -> [a]
gTarget = grad target

All you need to do here is provide a function proportional to a log-probability density. The logarithmic scale is important; various internals expect to be passed (something proportional to) a log-probability density.

To package these guys up together we can wrap them in a Target. Note that we don’t always care about including a gradient, so that part is optional:

rosenbrock :: Target [Double]
rosenbrock = Target target (Just gTarget)

The Target type is parameterized over the shape of the parameter space. You could similarly have a Target (Seq Double), Target (Map String Double), and so on. Your target may be implemented using a boxed vector for efficiency, for example. Or using a Map or HashMap with string/text keys such that parameter names are preserved. They should work just fine.

Given a target, we can sample from it a bunch of times using a simple Metropolis transition via the mcmc function. Aside from the target and a PRNG, provide it with the desired number of transitions, a starting point, and the transition operator to use:

> -- haskell
> prng <- create
> mcmc 10000 [0, 0] (metropolis 1) rosenbrock prng

In return you’ll get the desired trace of the chain dumped to stdout:

8.136972300105949e-2,0.273896953404261
0.4657348148676972,0.17462596647788464
-0.48609414127836326,9.465052854751566e-2
-0.49781488399832785,0.42092910345708523
-0.3019713424699155,0.39135350029173566
0.12058426470979189,0.12485407390388925
..

The intent is for the chain to be processed elsewhere — if you’re me, that will usually be in R. Libraries like coda have a ton of functionality useful for working with Markov chain traces, and ggplot2 as a library for static statistical graphics can’t really be beat:

> # r
> d = read.csv('rosenbrock-trace.dat', header = F)
> names(d) = c('x', 'y')
> require(ggplot2)
> ggplot(d, aes(x, y)) + geom_point(colour = 'darkblue', alpha = 0.2)

You get the following trace over the Rosenbrock density, taken for 10k iterations. This is using a Metropolis transition with variance 1:

[Figure: trace of the Metropolis chain over the Rosenbrock density]

If you do want to work with chains in memory in Haskell you can do that by writing your own handling code around the supplied transition operators. I’ll probably make this a little easier in later versions.

The implementations are reasonably quick and don’t leak memory — the traces are streamed to stdout as the chains are traversed. Compiling the above with ‘-O2’ and running it for 100k iterations yields the following performance characteristics on my mid-2011 model MacBook Air:

$ ./test/Rosenbrock +RTS -s > /dev/null

3,837,201,632 bytes allocated in the heap
    8,453,696 bytes copied during GC
       89,600 bytes maximum residency (2 sample(s))
       23,288 bytes maximum slop
         1 MB total memory in use (0 MB lost due to fragmentation)

 INIT time 0.000s ( 0.000s elapsed)
  MUT time 3.539s ( 3.598s elapsed)
   GC time 0.049s ( 0.058s elapsed)
 EXIT time 0.000s ( 0.000s elapsed)
Total time 3.591s ( 3.656s elapsed)

%GC time 1.4% (1.6% elapsed)

Alloc rate 1,084,200,280 bytes per MUT second

Productivity 98.6% of total user, 96.8% of total elapsed

The beauty is that rather than running a chain solely on something like the simple Metropolis operator used above, you can sort of ‘hedge your sampling risk’ and use a composite operator that proposes transitions using a multitude of ways. Consider this guy, for example:

transition =
  concatT
    (sampleT (metropolis 0.5) (metropolis 1.0))
    (sampleT (slice 2.0) (slice 3.0))

Here concatT and sampleT correspond to the concat and sample terms in the BNF description in the previous section. This operator performs two transitions back-to-back; the first is randomly a Metropolis transition with standard deviation 0.5 or 1 respectively, and the second is a slice sampling transition using a step size of 2 or 3, randomly.

Running it for 5000 iterations (to keep the total computation approximately constant), we see a chain that has traversed the space a little better:

> mcmc 5000 [0, 0] transition rosenbrock prng

[Figure: trace of the composite-transition chain over the Rosenbrock density]

It’s worth noting that I didn’t put any work into coming up with this composite transition: this was just the first example I thought up, and a lot of the benefits here probably come primarily from including the eminently-reliable slice sampling transition. But from informal experimentation, it does seem that chains driven by composite transitions involving numerous operators and tuning parameter settings often seem to perform better on average than a given chain driven by a single (poorly-selected) transition.

I know exactly how meticulous proofs and benchmarks must be, so I haven’t rigorously established any properties around this, but hey: it ‘seems to be the case’, and intuitively, including varied transition operators surely hedges your bets when compared to using a single one.

Try it out and see how your mileage varies, and be sure to let me know if you find some killer apps where composite transitions really seem to win.

Implementation Notes

If you’re just interested in using the libraries you can skip the following section, but I just want to point out how easy this is to implement.

The implementations are defined using a small set of types living in mcmc-types:

type Transition m a = StateT a (Prob m) ()

data Chain a b = Chain {
    chainTarget   :: Target a
  , chainScore    :: Double
  , chainPosition :: a
  , chainTunables :: Maybe b
  }

data Target a = Target {
    lTarget  :: a -> Double
  , glTarget :: Maybe (a -> a)
  }

Most important here is the Transition type, which is just a state transformer over a probability monad (itself defined in mwc-probability). The probability monad is the source of randomness used to define transition operators useful for MCMC, and values with type Transition are the transition operators in question.

The Chain type is the state of the Markov chain at any given iteration. All that’s really required here is the chainPosition field, which represents the location of the chain in parameter space. But adding some additional information here is convenient; chainScore caches the most recent score of the chain (which is typically used in internal calculations, and caching avoids recomputing things needlessly) and chainTunables is an optional record intended to be used for stateful tuning parameters (used by adaptive algorithms or in burn-in phases and the like). Additionally the target being sampled from itself — chainTarget — is included in the state.

Undisciplined use of chainTarget and chainTunables can have all sorts of nasty consequences — you can use them to change the stationary distribution you’re sampling from or invalidate the Markov property — but keeping them around is useful for implementing some desirable features. Tweaking chainTarget, for example, allows one to easily implement annealing, which can be very useful for sampling from annoying multi-modal densities.

Setting everything up like this makes it trivial to mix-and-match transition operators as required — the state and probability monad stack provides everything we need. Deterministic concatenation is implemented as follows, for example:

concatT = (>>)

and a generalized version of probabilistic concatenation just requires a coin flip:

bernoulliT p t0 t1 = do
  heads <- lift (MWC.bernoulli p)
  if heads then t0 else t1

A uniform probabilistic concatenation over two operators, implemented in sampleT, is then just bernoulliT 0.5.
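
Spelled out as code - a sketch on my part, assuming the ‘Transition’ type above and a ‘PrimMonad’ constraint for the underlying generator, rather than necessarily the exact signature used in declarative:

-- A hedged sketch: uniform probabilistic concatenation is a fair
-- coin flip between the two transition operators.
sampleT :: PrimMonad m => Transition m a -> Transition m a -> Transition m a
sampleT = bernoulliT 0.5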

The difficulty of implementing primitive operators just depends on the operator itself; the surrounding framework is extremely lightweight. Here’s the Metropolis transition, for example (with type signatures omitted to keep the noise down):

metropolis radial = do
  Chain {..} <- get
  proposal   <- lift (propose radial chainPosition)
  let proposalScore = lTarget chainTarget proposal
      acceptProb    = whenNaN 0
        (exp (min 0 (proposalScore - chainScore)))

  accept <- lift (MWC.bernoulli acceptProb)
  when accept
    (put (Chain chainTarget proposalScore proposal chainTunables))

propose radial = traverse perturb where
  perturb m = MWC.normal m radial

And the excellent pipes library is used to generate a Markov chain:

chain radial = loop where
  loop state prng = do
    next <- lift
      (MWC.sample (execStateT (metropolis radial) state) prng)
    yield next
    loop next prng

The mcmc functions are also implemented using pipes. Take the first n iterations of a chain and print them to stdout. That simple.

Future Work

In the near term I plan on updating some old MCMC implementations I have kicking around on Github (flat-mcmc, lazy-langevin, hnuts) and releasing them within this framework. Additionally I’ve got some code for building annealed operators that I want to release — it has been useful in some situations when sampling from things like the Himmelblau density, which has a few disparate clumps of probability that make it tricky to sample from with conventional algorithms.

This framework is also useful as an inference backend to languages for working with directed graphical models (think BUGS/Stan). The idea here is that you don’t need to specify your target function (typically a posterior density) explicitly: just describe your model and I’ll give you samples from the posterior distribution. A similar version has been put to use around the BayesHive project.

Longer term — I’ll have to see what’s up in terms of demand. There are performance improvements and straightforward extensions to things like parallel tempering, but I’m growing more interested in ‘online’ methods like particle MCMC and friends that are proving useful for inference in more general probabilistic programs (think those expressible by Church and its ilk).

Let me know if you get any use out of these things, or please file an issue if there’s some particular feature you’d like to see supported.

Thanks to Niffe Hermansson for review and helpful comments.

Practical Recursion Schemes

Recursion schemes are elegant and useful patterns for expressing general computation. In particular, they allow you to ‘factor recursion out’ of whatever semantics you may be trying to express when interpreting programs, keeping your interpreters concise, your concerns separated, and your code more maintainable.

What’s more, formulating programs in terms of recursion schemes seems to help suss out particular similarities in structure between what might be seen as disparate problems in other domains. So aside from being a practical computational tool, they seem to be of some use when it comes to ‘hacking understanding’ in varied areas.

Unfortunately, they come with a pretty forbidding barrier to entry. While there are a few nice resources out there for learning about recursion schemes and how they work, most literature around them is quite academic and awash in some astoundingly technical jargon (more on this later). Fortunately, the accessible resources out there do a great job of explaining what recursion schemes are and how you might use them, so they go through some effort to build up the required machinery from scratch.

In this article I want to avoid building up the machinery meticulously and instead concentrate mostly on understanding and using Edward Kmett’s recursion-schemes library, which, while lacking in documentation, is very well put together and implements all the background plumbing one needs to get started.

In particular, to feel comfortable using recursion-schemes I found that there were a few key patterns worth understanding:

  • Factoring recursion out of your data types using pattern functors and a fixed-point wrapper.
  • Using the ‘Foldable’ & ‘Unfoldable’ classes, plus navigating the ‘Base’ type family.
  • How to use some of the more common recursion schemes out there for everyday tasks.

The Basics

If you’re following along in GHCi, I’m going to first bring in some imports and add a useful pragma. I’ll dump a gist at the bottom; note that this article targets GHC 7.10.2 and recursion-schemes-4.1.2, plus I’ll also require data-ordlist-0.4.7.0 for an example later. Here’s the requisite boilerplate:

{-# LANGUAGE DeriveFunctor #-}

import Data.Functor.Foldable
import Data.List.Ordered (merge)
import Prelude hiding (Foldable, succ)

So, let’s get started.

Recursion schemes are applicable to data types that have a suitable recursive structure. Lists, trees, and natural numbers are illustrative candidates.

Being so dead-simple, let’s take the natural numbers as an illustrative/toy example. We can define them recursively as follows:

data Natural =
    Zero
  | Succ Natural

This is a fine definition, but many such recursive structures can also be defined in a different way: we can first ‘factor out’ the recursion by defining some base structure, and then ‘add it back in’ by using a recursive wrapper type.

The price of this abstraction is a slightly more involved type definition, but it unlocks some nice benefits — namely, the ability to reason about recursion and base structures separate from each other. This turns out to be a very useful pattern for getting up and running with recursion schemes.

The trick is to create a different, parameterized type, in which the new parameter takes the place of all recursive points in the original type. We can create this kind of base structure for the natural numbers example as follows:

data NatF r =
    ZeroF
  | SuccF r
  deriving (Show, Functor)

This type must be a functor in this new parameter, so the type is often called a ‘pattern functor’ for some other type. I like to use an ‘F’ suffix when naming constructors for pattern functors.

We can define pattern functors for lists and trees in the same way:

data ListF a r =
    NilF
  | ConsF a r
  deriving (Show, Functor)

data TreeF a r =
    EmptyF
  | LeafF a
  | NodeF r r
  deriving (Show, Functor)

Now, to add recursion to these pattern functors we’re going to use the famous fixed-point type, ‘Fix’, to wrap them in:

type Nat    = Fix NatF
type List a = Fix (ListF a)
type Tree a = Fix (TreeF a)

‘Fix’ is a standard fixed-point type imported from the recursion-schemes library. You can get a ton of mileage from it. It introduces the ‘Fix’ constructor everywhere, but that’s actually not much of an issue in practice. One thing I typically like to do is add some smart constructors to get around it:

zero :: Nat
zero = Fix ZeroF

succ :: Nat -> Nat
succ = Fix . SuccF

nil :: List a
nil = Fix NilF

cons :: a -> List a -> List a
cons x xs = Fix (ConsF x xs)

Then you can write expressions like ‘succ (succ (succ zero))’ without having to deal with the ‘Fix’ constructor explicitly. Note also that these expressions are Showable à la carte, for example in GHCi:

> succ (succ (succ zero))
Fix (SuccF (Fix (SuccF (Fix (SuccF (Fix ZeroF))))))

A Short Digression on ‘Fix’

The ‘Fix’ type is brought into scope from ‘Data.Functor.Foldable’, but it’s worth looking at it in some detail. It can be defined as follows, along with two helpful functions for working with it:

newtype Fix f = Fix (f (Fix f))

fix :: f (Fix f) -> Fix f
fix = Fix

unfix :: Fix f -> f (Fix f)
unfix (Fix f) = f

‘Fix’ has a simple recursive structure. For a given value, you can think of ‘fix’ as adding one level of recursion to it. ‘unfix’ in turn removes one level of recursion.

This generic recursive structure is what makes ‘Fix’ so useful: we can write some nominally recursive type we’re interested in without actually using recursion, but then package it up in ‘Fix’ to hijack the recursion it provides automatically.

Understanding Some Internal Plumbing

If we wrap a pattern functor in ‘Fix’ then the underlying machinery of recursion-schemes should ‘just work’. Here it’s worth explaining a little as to why that’s the case.

There are two fundamental type classes involved in recursion-schemes: ‘Foldable’ and ‘Unfoldable’. These serve to tease apart the recursive structure of something like ‘Fix’ even more: loosely, ‘Foldable’ corresponds to types that can be ‘unfixed’, and ‘Unfoldable’ corresponds to types that can be ‘fixed’. That is, we can add more layers of recursion to instances of ‘Unfoldable’, and we can peel off layers of recursion from instances of ‘Foldable’.

In particular ‘Foldable’ and ‘Unfoldable’ contain functions called ‘project’ and ‘embed’ respectively, corresponding to more general forms of ‘unfix’ and ‘fix’. Their types are as follows:

project :: Foldable t   => t -> Base t t
embed   :: Unfoldable t => Base t t -> t

I’ve found it useful while using recursion-schemes to have a decent understanding of how to interpret the type family ‘Base’. It appears frequently in type signatures of various recursion schemes and being able to reason about it can help a lot.

‘Base’ and Basic Type Families

Type families are type-level functions; they take types as input and return types as output. The ‘Base’ definition in recursion-schemes looks like this:

type family Base t :: * -> *

You can interpret this as a function that takes one type ‘t’ as input and returns some other type. An implementation of this function is called an instance of the family. The instance for ‘Fix’, for example, looks like:

type instance Base (Fix f) = f

In particular, a type family like ‘Base’ is a synonym for instances of the family. So using the above example: anywhere you see something like ‘Base (Fix f)’ you can mentally replace it with ‘f’.

Instances of the ‘Base’ type family have a structure like ‘Fix’, but using ‘Base’ enables all the internal machinery of recursion-schemes to work out-of-the-box for types other than ‘Fix’ alone. This has a typical Kmettian flavour: first solve the most general problem, and then recover useful, specific solutions to it automatically.

Predictably, ‘Fix f’ is an instance of ‘Base’, ‘Foldable’, and ‘Unfoldable’ for some functor ‘f’, so if you use it, you can freely use all of recursion-schemes’s innards without needing to manually write any instances for your own data types. But as mentioned above, it’s worth noting that you can exploit the various typeclass & type family machinery to get by without using ‘Fix’ at all: see e.g. Danny Gratzer’s recursion-schemes post for an example of this.

Some Useful Schemes

So, with some discussion of the internals out of the way, we can look at some of the more common and useful recursion schemes. I’ll concentrate on the following four, as they’re the ones I’ve found the most use for:

  • catamorphisms, implemented via ‘cata’, are generalized folds.
  • anamorphisms, implemented via ‘ana’, are generalized unfolds.
  • hylomorphisms, implemented via ‘hylo’, are anamorphisms followed by catamorphisms (corecursive production followed by recursive consumption).
  • paramorphisms, implemented via ‘para’, are generalized folds with access to the input argument corresponding to the most recent state of the computation.

Let me digress slightly on nomenclature.

Yes, the names of these things are celebrations of the ridiculous. There’s no getting around it; they look like self-parody to almost anyone not pre-acquainted with categorical concepts. They have been accused — probably correctly — of being off-putting.

That said, they communicate important technical details and are actually not so bad when you get used to them. It’s perfectly fine and even encouraged to arm-wave about folds or unfolds when speaking informally, but the moment someone distinguishes one particular style of fold from another via a prefix like e.g. para, I know exactly the relevant technical distinctions required to understand the discussion. The names might be silly, but they have their place.

Anyway.

There are many other more exotic schemes that I’m sure are quite useful (see Tim Williams’s recursion schemes talk, for example), but I haven’t made use of any outside of these four just yet. The recursion-schemes library contains a plethora of unfamiliar schemes just waiting to be grokked, but in the interim even cata and ana alone will get you plenty far.

Now let’s use the motley crew of schemes to do some useful computation on our example data types.

Catamorphisms

Take our natural numbers type, ‘Nat’. To start, we can use a catamorphism to represent a ‘Nat’ as an ‘Int’ by summing it up.

natsum :: Nat -> Int
natsum = cata alg where
  alg ZeroF     = 0
  alg (SuccF n) = n + 1

Here ‘alg’ refers to ‘algebra’, which is the function that we use to define our reducing semantics. Notice that the semantics are not defined recursively! The recursion present in ‘Nat’ has been decoupled and is handled for us by ‘cata’. And as a plus, we still don’t have to deal with the ‘Fix’ constructor anywhere.
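
For example, in GHCi (using the smart constructors from earlier):

> natsum (succ (succ (succ zero)))
3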

As a brief aside: I like to write my recursion schemes in this way, but your mileage may vary. If you’d like to enable the ‘LambdaCase’ extension, then another option is to elide mentioning the algebra altogether using a very simple case statement:

{-# LANGUAGE LambdaCase #-}

natsum :: Nat -> Int
natsum = cata $ \case
  ZeroF   -> 0
  SuccF n -> n + 1

Some people find this more readable.

To understand how we used ‘cata’ to build this function, take a look at its type:

cata :: Foldable t => (Base t a -> a) -> t -> a

The ‘Base t a -> a’ term is the algebra; ‘t’ is our recursive datatype (i.e. ‘Nat’), and ‘a’ is whatever type we’re reducing a value to.

Historically I’ve found ‘Base’ here to be confusing, but here’s a neat trick to help reason through it.

Remember that ‘Base’ is a type family, so for some appropriate ‘t’ and ‘a’, ‘Base t a’ is going to be a synonym for some other type. To figure out what ‘Base t a’ corresponds to for some concrete ‘t’ and ‘a’, we can ask GHCi via this lesser-known command that evaluates type families:

> :kind! Base Nat Int
Base Nat Int :: *
= NatF Int

So in the ‘natsum’ example the algebra used with ‘cata’ must have type ‘NatF Int -> Int’. This is pretty obvious for ‘cata’, but I initially found that figuring out what type should be replaced for ‘Base’ exactly could be confusing for some of the more exotic schemes.

As another example, we can use a catamorphism to implement ‘filter’ for our list type:

filterL :: (a -> Bool) -> List a -> List a
filterL p = cata alg where
  alg NilF = nil
  alg (ConsF x xs)
    | p x       = cons x xs
    | otherwise = xs

It follows the same simple pattern: we define our semantics by interpreting recursion-less constructors through an algebra, then pump it through ‘cata’.

Anamorphisms

These running examples are toys, but even here it’s really annoying to have to type ‘succ (succ (succ (succ (succ (succ zero)))))’ to get a natural number corresponding to six for debugging or what have you.

We can use an anamorphism to build a ‘Nat’ value from an ‘Int’:

nat :: Int -> Nat
nat = ana coalg where
  coalg n
    | n <= 0    = ZeroF
    | otherwise = SuccF (n - 1)

Just as a small detail: to be descriptive, here I’ve used ‘coalg’ as the argument to ‘ana’, for ‘coalgebra’.

Now the expression ‘nat 6’ will do the same for us as the more verbose example above. As always, recursion is not part of the semantics; to have the integer ‘n’ we pass in correspond to the correct natural number, we use the successor value of ‘n - 1’.
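
A quick check that ‘nat’ and ‘natsum’ behave consistently:

> natsum (nat 6)
6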

Paramorphisms

As an example, try to express a factorial on a natural number in terms of ‘cata’. It’s (apparently) doable, but an implementation is not immediately clear.

A paramorphism will operate on an algebra that provides access to the input argument corresponding to the running state of the recursion. Check out the type of ‘para’ below:

para :: Foldable t => (Base t (t, a) -> a) -> t -> a

If we’re implementing a factorial on ‘Nat’ values then ‘t’ is going to correspond to ‘Nat’ and ‘a’ is going to correspond to (say) ‘Integer’. Here it might help to use the ‘:kind!’ trick to help reason through the requirements of the algebra. We can ask GHCi to help us out:

> :kind! Base Nat (Nat, Int)
Base Nat (Nat, Int) :: *
= NatF (Nat, Int)

Side note: after doing this trick a few times you’ll probably find it much easier to reason about type families sans-GHCi. In any case, we can now implement an algebra corresponding to the required type:

natfac :: Nat -> Int
natfac = para alg where
  alg ZeroF          = 1
  alg (SuccF (n, f)) = natsum (succ n) * f

Here there are some details to point out.

The type of our algebra is ‘NatF (Nat, Int) -> Int’; the value with the ‘Nat’ type, ‘n’, holds the most recent input argument used to compute the state of the computation, ‘f’.

If you picture a factorial defined as

0!       = 1
(k + 1)! = (k + 1) * k!

Then ‘n’ corresponds to ‘k’ and ‘f’ corresponds to ‘k!’. To compute the factorial of the successor to ‘n’, we just convert ‘succ n’ to an integer (via ‘natsum’) and multiply it by ‘f’.
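
A quick sanity check in GHCi:

> natfac (nat 4)
24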

Paramorphisms tend to be pretty useful for a lot of mundane tasks. We can easily implement ‘pred’ on natural numbers via ‘para’:

natpred :: Nat -> Nat
natpred = para alg where
  alg ZeroF          = zero
  alg (SuccF (n, _)) = n

We can also implement ‘tail’ on lists. To check the type of the required algebra we can again get some help from GHCi; here I’ll evaluate a general type family, for illustration:

> :set -XRankNTypes
> :kind! forall a b. Base (List a) (List a, b)
forall a b. Base (List a) (List a, b) :: *
= forall a b. ListF a (Fix (ListF a), b)

Providing an algebra of the correct structure lets ‘tailL’ fall out as follows:

tailL :: List a -> List a
tailL = para alg where
  alg NilF             = nil
  alg (ConsF _ (l, _)) = l

You can check that ‘tailL’ indeed returns the tail of its argument.
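
For example, modulo the exact Show output:

> tailL (cons 1 (cons 2 (cons 3 nil)))
Fix (ConsF 2 (Fix (ConsF 3 (Fix NilF))))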

Hylomorphisms

Hylomorphisms can express general computation — corecursive production followed by recursive consumption. Compared to the other type signatures in recursion-schemes, ‘hylo’ is quite simple:

hylo :: Functor f => (f b -> b) -> (a -> f a) -> a -> b

It doesn’t even require the full structure built up for e.g. ‘cata’ and ‘ana’; just very simple F-{co}algebras.

My favourite example hylomorphism is an absolutely beautiful implementation of mergesort. I think it helps illustrate how recursion schemes can help tease out incredibly simple structure in what could otherwise be a more involved problem.

Our input will be a Haskell list containing some orderable type. We’ll use it to build a balanced binary tree via an anamorphism and then tear it down with a catamorphism, merging lists together and sorting them as we go.

The resulting code looks like this:

mergeSort :: Ord a => [a] -> [a]
mergeSort = hylo alg coalg where
  alg EmptyF      = []
  alg (LeafF c)   = [c]
  alg (NodeF l r) = merge l r

  coalg []  = EmptyF
  coalg [x] = LeafF x
  coalg xs  = NodeF l r where
    (l, r) = splitAt (length xs `div` 2) xs

What’s more, the fusion achieved via this technique is really quite lovely.

Wrapping Up

Hopefully this article helps fuel any ‘programming via structured recursion’ trend that might be ever-so-slowly growing.

When programming in a language like Haskell, a very natural pattern is to write little embedded languages and mini-interpreters or compilers to accomplish tasks. Typically these tiny embedded languages have a recursive structure, and when you’re interpreting a recursive structure, you have all these lovely off-the-shelf strategies for recursion available to you to keep your programs concise, modular, and efficient. The recursion-schemes library really has all the built-in machinery you need to start using these things for real.

If you’re interested in hearing about using recursion schemes ‘for real’ I recommend Tim Williams’s Exotic Tools For Exotic Trades talk (for a motivating example for the use of recursion schemes in production) or his talk on recursion schemes from the London Haskell User’s Group a few years ago.

So happy recursing! I’ve dropped the code from this article into a gist.

Thanks to Maciej Woś for review and helpful comments.