Markov Chains à la Carte

14 Oct 2015

I’ve released a number of libraries for doing Markov Chain Monte Carlo (MCMC) in Haskell.

You can get at them via a ‘frontend’ library, declarative, but each can also be used fruitfully on its own. À la carte, if you will.

Some background: MCMC is a family of stateful algorithms for sampling from a large class of probability distributions. Typically one is interested in doing this to approximate difficult integrals; instead of choosing some suitable grid of points in parameter space over which to approximate an integral, just offload the problem to probability theory and use a Markov chain to find them for you.

For an excellent introduction to MCMC you won’t find better than Iain Murray’s lectures from MLSS ’09 in Cambridge, so check those out if you’re interested in more details.

I’ve put together a handful of popular MCMC algorithms as well as an easy way to glue them together in a couple of useful ways. At present these implementations are useful in cases where you can write your target function in closed form, and that’s pretty much all that’s required (aside from the standard algorithm-specific tuning parameters).

The API should be pretty easy to work with — write your target as a function of its parameters, specify a start location, and away you go. It’s also cool if your target accepts its parameters via most common traversable functors — lists, vectors, sequences, maps, etc.

That’s sort of the goal of this first release: if you can give me a target function, I’ll do my best to give you samples from it. Less is more and all that.

What‘s In The Box

There are a number of libraries involved. I have a few more in the queue and there are a number of additional features I plan to support for these ones in particular, but without further ado:

mwc-probability, a sampling-function based probability monad implemented as a thin wrapper over the excellent mwc-random library.
mcmc-types, housing a number of types used by the the whole family.
mighty-metropolis, an implementation of the famous Metropolis algorithm.
speedy-slice, a slice sampling implementation suitable for both continuous & discrete parameter spaces.
hasty-hamiltonian, an implementation of the gradient-based Hamiltonian Monte Carlo algorithm.
declarative, the one ring to rule them all.

Pull down declarative if you just want to have access to all of them. If you’re a Haskell neophyte you can find installation instructions at the Github repo.

Motivation

MCMC is fundamentally about observing Markov chains over probability spaces. In this context a chain is a stochastic process that wanders around a state space, eventually visiting regions of the space in proportion to their probability.

Markov chains are constructed by transition operators that obey the Markov property: that the probability of transitioning to the next location — conditional on the history of the chain — depends only on the current location. For MCMC we’re also interested in operators that satisfy the reversibility property — that the probability a transition from state A to state B occurs is the same as that a transition from state B to state A occurs. A chain is characterized by a transition operator T that drives it from state to state, and for MCMC we want the stationary or limiting distribution of the chain to be the distribution we’re sampling from.

One of the major cottage industries in Bayesian research is inventing new transition operators to drive the Markov chains used in MCMC. This has been fruitful, but it could likely be aided by a practical way to make existing transition operators work together.

This is easy to do in theory: there are a couple of ways to combine transition operators such that the resulting composite operator preserves all the properties we’re interested in for MCMC — the stationary distribution, reversibility, and Markov property. See Geyer, 2005 for details here, but the crux is that we can establish the following simple grammar for transition operators:

transition ::= primitive <transition>
             | concat transition transition
             | sample transition transition

A transition is either some primitive operator, a deterministic concatenation of operators (via ‘concat’), or a probabilistic concatenation of operators (via ‘sample’). A deterministic concatenation works by just transitioning through two operators one after the other; a probabilistic concatenation works by randomly choosing one transition operator or the other to use on any given transition. These kinds of concatenation preserve all the properties we’re interested in.

We can trivially generalize this further by adding a term that concatenates n transition operators together deterministically, or another for probabilistically concatenating a bunch of operators according to some desired probability distribution.

The idea here is that there are tradeoffs involved in different transition operators. Some may be more computationally expensive than others (perhaps requiring a gradient evaluation, or evaluation of some inner loop) but have better ability to make ‘good’ transitions in certain situations. Other operators are cheap, but can be inefficient (taking a long time to visit certain regions of the space).

By employing deterministic or probabilistic concatenation, one can concoct a Markov chain that uses a varied range of tuning parameters, for example. Or only occasionally employs a computationally expensive transition, otherwise preferring some cheaper, reliable operator.

Usage

The declarative library implements this simple language for transition operators, and the mighty-metropolis, speedy-slice, and hasty-hamiltonian libraries provide some primitive transitions that you can combine as needed.

The Metropolis and slice sampling transitions are cheap and require little information, whereas Hamiltonian Monte Carlo exploits information about the target’s gradient and also involves evaluation of an inner loop (the length of which is determined by a tuning parameter). Feel free to use one that suits your problem, or combine them together using the combinators supplied in declarative to build a custom solution.

As an example, the Rosenbrock density is a great test dummy as it’s simple, low-dimensional, and can be easily visualized, but it still exhibits a pathological anisotropic structure that makes it somewhat tricky to sample from.

Getting started via declarative is pretty simple:

import Numeric.MCMC

You’ll want to supply a target to sample over, and if you want to use an algorithm like Hamiltonian Monte Carlo you’ll also need to provide a gradient. If you can’t be bothered to calculate gradients by hand, you can always turn to your friend automatic differentiation:

import Numeric.AD

The Rosenbrock log-density and its gradient can then be written as follows:

target :: Num a => [a] -> a
target [x0, x1] = negate (100 * (x1 — x0 ^ 2) ^ 2 + (1 — x0) ^ 2)

gTarget :: Num a => [a] -> [a]
gTarget = grad target

All you need to do here is provide a function proportional to a log-probability density. The logarithmic scale is important; various internals expect to be passed (something proportional to) a log-probability density.

To package these guys up together we can wrap them in a Target. Note that we don’t always care about including a gradient, so that part is optional:

rosenbrock :: Target [Double]
rosenbrock = Target target (Just gTarget)

The Target type is parameterized over the shape of the parameter space. You could similarly have a Target (Seq Double), Target (Map String Double), and so on. Your target may be implemented using a boxed vector for efficiency, for example. Or using a Map or HashMap with string/text keys such that parameter names are preserved. They should work just fine.

Given a target, we can sample from it a bunch of times using a simple Metropolis transition via the mcmc function. Aside from the target and a PRNG, provide it with the desired number of transitions, a starting point, and the transition operator to use:

> -- haskell
> prng <- create
> mcmc 10000 [0, 0] (metropolis 1) rosenbrock prng

In return you’ll get the desired trace of the chain dumped to stdout:

8.136972300105949e-2,0.273896953404261
0.4657348148676972,0.17462596647788464
-0.48609414127836326,9.465052854751566e-2
-0.49781488399832785,0.42092910345708523
-0.3019713424699155,0.39135350029173566
0.12058426470979189,0.12485407390388925
..

The intent is for the chain to be processed elsewhere — if you’re me, that will usually be in R. Libraries like coda have a ton of functionality useful for working with Markov chain traces, and ggplot2 as a library for static statistical graphics can’t really be beat:

> # r
> d = read.csv(‘rosenbrock-trace.dat’, header = F)
> names(d) = c(‘x’, ‘y’)
> require(ggplot2)
> ggplot(d, aes(x, y)) + geom_point(colour = ‘darkblue’, alpha = 0.2)

You get the following trace over the Rosenbrock density, taken for 10k iterations. This is using a Metropolis transition with variance 1:

metropolis

If you do want to work with chains in memory in Haskell you can do that by writing your own handling code around the supplied transition operators. I’ll probably make this a little easier in later versions.

The implementations are reasonably quick and don’t leak memory — the traces are streamed to stdout as the chains are traversed. Compiling the above with ‘-O2’ and running it for 100k iterations yields the following performance characteristics on my mid-2011 model MacBook Air:

$ ./test/Rosenbrock +RTS -s > /dev/null

3,837,201,632 bytes allocated in the heap
    8,453,696 bytes copied during GC
       89,600 bytes maximum residency (2 sample(s))
       23,288 bytes maximum slop
         1 MB total memory in use (0 MB lost due to fragmentation)

 INIT time 0.000s ( 0.000s elapsed)
  MUT time 3.539s ( 3.598s elapsed)
   GC time 0.049s ( 0.058s elapsed)
 EXIT time 0.000s ( 0.000s elapsed)
Total time 3.591s ( 3.656s elapsed)

%GC time 1.4% (1.6% elapsed)

Alloc rate 1,084,200,280 bytes per MUT second

Productivity 98.6% of total user, 96.8% of total elapsed

The beauty is that rather than running a chain solely on something like the simple Metropolis operator used above, you can sort of ‘hedge your sampling risk’ and use a composite operator that proposes transitions using a multitude of ways. Consider this guy, for example:

transition =
  concatT
    (sampleT (metropolis 0.5) (metropolis 1.0))
    (sampleT (slice 2.0) (slice 3.0))

Here concatT and sampleT correspond to the concat and sample terms in the BNF description in the previous section. This operator performs two transitions back-to-back; the first is randomly a Metropolis transition with standard deviation 0.5 or 1 respectively, and the second is a slice sampling transition using a step size of 2 or 3, randomly.

Running it for 5000 iterations (to keep the total computation approximately constant), we see a chain that has traversed the space a little better:

> mcmc 5000 [0, 0] transition rosenbrock prng

composite

It’s worth noting that I didn’t put any work into coming up with this composite transition: this was just the first example I thought up, and a lot of the benefits here probably come primarily from including the eminently-reliable slice sampling transition. But from informal experimentation, it does seem that chains driven by composite transitions involving numerous operators and tuning parameter settings often seem to perform better on average than a given chain driven by a single (poorly-selected) transition.

I know exactly how meticulous proofs and benchmarks must be so I haven’t rigorously established any properties around this, but hey: it ‘seems to be the case’, and intuitively, including varied transition operators surely hedges your bets when compared to using a single one.

Try it out and see how your mileage varies, and be sure to let me know if you find some killer apps where composite transitions really seem to win.

Implementation Notes

If you’re just interested in using the libraries you can skip the following section, but I just want to point out how easy this is to implement.

The implementations are defined using a small set of types living in mcmc-types:

type Transition m a = StateT a (Prob m) ()

data Chain a b = Chain {
    chainTarget   :: Target a
  , chainScore    :: Double
  , chainPosition :: a
  , chainTunables :: Maybe b
  }

data Target a = Target {
    lTarget  :: a -> Double
  , glTarget :: Maybe (a -> a)
  }

Most important here is the Transition type, which is just a state transformer over a probability monad (itself defined in mwc-probability). The probability monad is the source of randomness used to define transition operators useful for MCMC, and values with type Transition are the transition operators in question.

The Chain type is the state of the Markov chain at any given iteration. All that’s really required here is the chainPosition field, which represents the location of the chain in parameter space. But adding some additional information here is convenient; chainScore caches the most recent score of the chain (which is typically used in internal calculations, and caching avoids recomputing things needlessly) and chainTunables is an optional record intended to be used for stateful tuning parameters (used by adaptive algorithms or in burn-in phases and the like). Additionally the target being sampled from itself — chainTarget — is included in the state.

Undisciplined use of chainTarget and chainTunables can have all sorts of nasty consequences — you can use them to change the stationary distribution you’re sampling from or invalidate the Markov property — but keeping them around is useful for implementing some desirable features. Tweaking chainTarget, for example, allows one to easily implement annealing, which can be very useful for sampling from annoying multi-modal densities.

Setting everything up like this makes it trivial to mix-and-match transition operators as required — the state and probability monad stack provides everything we need. Deterministic concatenation is implemented as follows, for example:

concatT = (>>)

and a generalized version of probabilistic concatenation just requires a coin flip:

bernoulliT p t0 t1 = do
  heads <- lift (MWC.bernoulli p)
  if heads then t0 else t1

A uniform probabilistic concatenation over two operators, implemented in sampleT, is then just bernoulliT 0.5.

The difficulty of implementing primitive operators just depends on the operator itself; the surrounding framework is extremely lightweight. Here’s the Metropolis transition, for example (with type signatures omitted to keep the noise down):

metropolis radial = do
  Chain {..} <- get
  proposal   <- lift (propose radial chainPosition)
  let proposalScore = lTarget chainTarget proposal
      acceptProb    = whenNaN 0
        (exp (min 0 (proposalScore - chainScore)))

  accept <- lift (MWC.bernoulli acceptProb)
  when accept
    (put (Chain chainTarget proposalScore proposal chainTunables))

propose radial = traverse perturb where
  perturb m = MWC.normal m radial

And the excellent pipes library is used to generate a Markov chain:

chain radial = loop where
  loop state prng = do
    next <- lift
      (MWC.sample (execStateT (metropolis radial) state) prng)
    yield next
    loop next prng

The mcmc functions are also implemented using pipes. Take the first n iterations of a chain and print them to stdout. That simple.

Future Work

In the near term I plan on updating some old MCMC implementations I have kicking around on Github (flat-mcmc, lazy-langevin, hnuts) and releasing them within this framework. Additionally I’ve got some code for building annealed operators that I want to release — it has been useful in some situations when sampling from things like the Himmelblau density, which has a few disparate clumps of probability that make it tricky to sample from with conventional algorithms.

This framework is also useful as an inference backend to languages for working with directed graphical models (think BUGS/Stan). The idea here is that you don’t need to specify your target function (typically a posterior density) explicitly: just describe your model and I’ll give you samples from the posterior distribution. A similar version has been put to use around the BayesHive project.

Longer term — I’ll have to see what’s up in terms of demand. There are performance improvements and straightforward extensions to things like parallel tempering, but I’m growing more interested in ‘online’ methods like particle MCMC and friends that are proving useful for inference in more general probabilistic programs (think those expressible by Church and its ilk).

Let me know if you get any use out of these things, or please file an issue if there’s some particular feature you’d like to see supported.

Thanks to Niffe Hermansson for review and helpful comments.

Practical Recursion Schemes

06 Sep 2015

Recursion schemes are elegant and useful patterns for expressing general computation. In particular, they allow you to ‘factor recursion out’ of whatever semantics you may be trying to express when interpreting programs, keeping your interpreters concise, your concerns separated, and your code more maintainable.

What’s more, formulating programs in terms of recursion schemes seems to help suss out particular similarities in structure between what might be seen as disparate problems in other domains. So aside from being a practical computational tool, they seem to be of some use when it comes to ‘hacking understanding’ in varied areas.

Unfortunately, they come with a pretty forbidding barrier to entry. While there are a few nice resources out there for learning about recursion schemes and how they work, most literature around them is quite academic and awash in some astoundingly technical jargon (more on this later). Fortunately, the accessible resources out there do a great job of explaining what recursion schemes are and how you might use them, so they go through some effort to build up the required machinery from scratch.

In this article I want to avoid building up the machinery meticulously and instead concentrate mostly on understanding and using Edward Kmett’s recursion-schemes library, which, while lacking in documentation, is very well put together and implements all the background plumbing one needs to get started.

In particular, to feel comfortable using recursion-schemes I found that there were a few key patterns worth understanding:

Factoring recursion out of your data types using pattern functors and a fixed-point wrapper.
Using the ‘Foldable’ & ‘Unfoldable’ classes, plus navigating the ‘Base’ type family.
How to use some of the more common recursion schemes out there for everyday tasks.

The Basics

If you’re following along in GHCi, I’m going to first bring in some imports and add a useful pragma. I’ll dump a gist at the bottom; note that this article targets GHC 7.10.2 and recursion-schemes-4.1.2, plus I’ll also require data-ordlist-0.4.7.0 for an example later. Here’s the requisite boilerplate:

{-# LANGUAGE DeriveFunctor #-}

import Data.Functor.Foldable
import Data.List.Ordered (merge)
import Prelude hiding (Foldable, succ)

So, let’s get started.

Recursion schemes are applicable to data types that have a suitable recursive structure. Lists, trees, and natural numbers are illustrative candidates.

Being so dead-simple, let’s take the natural numbers as an illustrative/toy example. We can define them recursively as follows:

data Natural =
    Zero
  | Succ Natural

This is a fine definition, but many such recursive structures can also be defined in a different way: we can first ‘factor out’ the recursion by defining some base structure, and then ‘add it back in’ by using a recursive wrapper type.

The price of this abstraction is a slightly more involved type definition, but it unlocks some nice benefits — namely, the ability to reason about recursion and base structures separate from each other. This turns out to be a very useful pattern for getting up and running with recursion schemes.

The trick is to create a different, parameterized type, in which the new parameter takes the place of all recursive points in the original type. We can create this kind of base structure for the natural numbers example as follows:

data NatF r =
    ZeroF
  | SuccF r
  deriving (Show, Functor)

This type must be a functor in this new parameter, so the type is often called a ‘pattern functor’ for some other type. I like to use the notation ‘F’ when defining constructors for pattern functors.

We can define pattern functors for lists and trees in the same way:

data ListF a r =
    NilF
  | ConsF a r
  deriving (Show, Functor)

data TreeF a r =
    EmptyF
  | LeafF a
  | NodeF r r
  deriving (Show, Functor)

Now, to add recursion to these pattern functors we’re going to use the famous fixed-point type, ‘Fix’, to wrap them in:

type Nat    = Fix NatF
type List a = Fix (ListF a)
type Tree a = Fix (TreeF a)

‘Fix’ is a standard fixed-point type imported from the recursion-schemes library. You can get a ton of mileage from it. It introduces the ‘Fix’ constructor everywhere, but that’s actually not much of an issue in practice. One thing I typically like to do is add some smart constructors to get around it:

zero :: Nat
zero = Fix ZeroF

succ :: Nat -> Nat
succ = Fix . SuccF

nil :: List a
nil = Fix NilF

cons :: a -> List a -> List a
cons x xs = Fix (ConsF x xs)

Then you can write expressions like ‘succ (succ (succ zero))’ without having to deal with the ‘Fix’ constructor explicitly. Note also that these expressions are Showable à la carte, for example in GHCi:

> succ (succ (succ zero))
Fix (SuccF (Fix (SuccF (Fix (SuccF (Fix ZeroF))))))

A Short Digression on ‘Fix’

The ‘Fix’ type is brought into scope from ‘Data.Functor.Foldable’, but it’s worth looking at it in some detail. It can be defined as follows, along with two helpful functions for working with it:

newtype Fix f = Fix (f (Fix f))

fix :: f (Fix f) -> Fix f
fix = Fix

unfix :: Fix f -> f (Fix f)
unfix (Fix f) = f

‘Fix’ has a simple recursive structure. For a given value, you can think of ‘fix’ as adding one level of recursion to it. ‘unfix’ in turn removes one level of recursion.

This generic recursive structure is what makes ‘Fix’ so useful: we can write some nominally recursive type we’re interested in without actually using recursion, but then package it up in ‘Fix’ to hijack the recursion it provides automatically.

Understanding Some Internal Plumbing

If we wrap a pattern functor in ‘Fix’ then the underlying machinery of recursion-schemes should ‘just work’. Here it’s worth explaining a little as to why that’s the case.

There are two fundamental type classes involved in recursion-schemes: ‘Foldable’ and ‘Unfoldable’. These serve to tease apart the recursive structure of something like ‘Fix’ even more: loosely, ‘Foldable’ corresponds to types that can be ‘unfixed’, and ‘Unfoldable’ corresponds to types that can be ‘fixed’. That is, we can add more layers of recursion to instances of ‘Unfoldable’, and we can peel off layers of recursion from instances of ‘Foldable’.

In particular ‘Foldable’ and ‘Unfoldable’ contain functions called ‘project’ and ‘embed’ respectively, corresponding to more general forms of ‘unfix’ and ‘fix’. Their types are as follows:

project :: Foldable t   => t -> Base t t
embed   :: Unfoldable t => Base t t -> t

I’ve found it useful while using recursion-schemes to have a decent understanding of how to interpret the type family ‘Base’. It appears frequently in type signatures of various recursion schemes and being able to reason about it can help a lot.

‘Base’ and Basic Type Families

Type families are type-level functions; they take types as input and return types as output. The ‘Base’ definition in recursion-schemes looks like this:

type family Base t :: * -> *

You can interpret this as a function that takes one type ‘t’ as input and returns some other type. An implementation of this function is called an instance of the family. The instance for ‘Fix’, for example, looks like:

type instance Base (Fix f) = f

In particular, a type family like ‘Base’ is a synonym for instances of the family. So using the above example: anywhere you see something like ‘Base (Fix f)’ you can mentally replace it with ‘f’.

Instances of the ‘Base’ type family have a structure like ‘Fix’, but using ‘Base’ enables all the internal machinery of recursion-schemes to work out-of-the-box for types other than ‘Fix’ alone. This has a typical Kmettian flavour: first solve the most general problem, and then recover useful, specific solutions to it automatically.

Predictably, ‘Fix f’ is an instance of ‘Base’, ‘Foldable’, and ‘Unfoldable’ for some functor ‘f’, so if you use it, you can freely use all of recursion-schemes’s innards without needing to manually write any instances for your own data types. But as mentioned above, it’s worth noting that you can exploit the various typeclass & type family machinery to get by without using ‘Fix’ at all: see e.g. Danny Gratzer’s recursion-schemes post for an example of this.

Some Useful Schemes

So, with some discussion of the internals out of the way, we can look at some of the more common and useful recursion schemes. I’ll concentrate on the following four, as they’re the ones I’ve found the most use for:

catamorphisms, implemented via ‘cata’, are generalized folds.
anamorphisms, implemented via ‘ana’, are generalized unfolds.
hylomorphisms, implemented via ‘hylo’, are anamorphisms followed by catamorphisms (corecursive production followed by recursive consumption).
paramorphisms, implemented via ‘para’, are generalized folds with access to the input argument corresponding to the most recent state of the computation.

Let me digress slightly on nomenclature.

Yes, the names of these things are celebrations of the ridiculous. There’s no getting around it; they look like self-parody to almost anyone not pre-acquainted with categorical concepts. They have been accused — probably correctly — of being off-putting.

That said, they communicate important technical details and are actually not so bad when you get used to them. It’s perfectly fine and even encouraged to arm-wave about folds or unfolds when speaking informally, but the moment someone distinguishes one particular style of fold from another via a prefix like e.g. para, I know exactly the relevant technical distinctions required to understand the discussion. The names might be silly, but they have their place.

Anyway.

There are many other more exotic schemes that I’m sure are quite useful (see Tim Williams’s recursion schemes talk, for example), but I haven’t made use of any outside of these four just yet. The recursion-schemes library contains a plethora of unfamiliar schemes just waiting to be grokked, but in the interim even cata and ana alone will get you plenty far.

Now let’s use the motley crew of schemes to do some useful computation on our example data types.

Catamorphisms

Take our natural numbers type, ‘Nat’. To start, we can use a catamorphism to represent a ‘Nat’ as an ‘Int’ by summing it up.

natsum :: Nat -> Int
natsum = cata alg where
  alg ZeroF     = 0
  alg (SuccF n) = n + 1

Here ‘alg’ refers to ‘algebra’, which is the function that we use to define our reducing semantics. Notice that the semantics are not defined recursively! The recursion present in ‘Nat’ has been decoupled and is handled for us by ‘cata’. And as a plus, we still don’t have to deal with the ‘Fix’ constructor anywhere.

As a brief aside: I like to write my recursion schemes in this way, but your mileage may vary. If you’d like to enable the ‘LambdaCase’ extension, then another option is to elide mentioning the algebra altogether using a very simple case statement:

{-# LANGUAGE LambdaCase #-}

natsum :: Nat -> Int
natsum = cata $ \case ->
  ZeroF   -> 0
  SuccF n -> n + 1

Some people find this more readable.

To understand how we used ‘cata’ to build this function, take a look at its type:

cata :: Foldable t => (Base t a -> a) -> t -> a

The ‘Base t a -> a’ term is the algebra; ‘t’ is our recursive datatype (i.e. ‘Nat’), and ‘a’ is whatever type we’re reducing a value to.

Historically I’ve found ‘Base’ here to be confusing, but here’s a neat trick to help reason through it.

Remember that ‘Base’ is a type family, so for some appropriate ‘t’ and ‘a’, ‘Base t a’ is going to be a synonym for some other type. To figure out what ‘Base t a’ corresponds to for some concrete ‘t’ and ‘a’, we can ask GHCi via this lesser-known command that evaluates type families:

> :kind! Base Nat Int
Base Nat Int :: *
= NatF Int

So in the ‘natsum’ example the algebra used with ‘cata’ must have type ‘NatF Int -> Int’. This is pretty obvious for ‘cata’, but I initially found that figuring out what type should be replaced for ‘Base’ exactly could be confusing for some of the more exotic schemes.

As another example, we can use a catamorphism to implement ‘filter’ for our list type:

filterL :: (a -> Bool) -> List a -> List a
filterL p = cata alg where
  alg NilF = nil
  alg (ConsF x xs)
    | p x       = cons x xs
    | otherwise = xs

It follows the same simple pattern: we define our semantics by interpreting recursion-less constructors through an algebra, then pump it through ‘cata’.

Anamorphisms

These running examples are toys, but even here it’s really annoying to have to type ‘succ (succ (succ (succ (succ (succ zero)))))’ to get a natural number corresponding to six for debugging or what have you.

We can use an anamorphism to build a ‘Nat’ value from an ‘Int’:

nat :: Int -> Nat
nat = ana coalg where
  coalg n
    | n <= 0    = ZeroF
    | otherwise = SuccF (n - 1)

Just as a small detail: to be descriptive, here I’ve used ‘coalg’ as the argument to ‘ana’, for ‘coalgebra’.

Now the expression ‘nat 6’ will do the same for us as the more verbose example above. As always, recursion is not part of the semantics; to have the integer ‘n’ we pass in correspond to the correct natural number, we use the successor value of ‘n — 1’.

Paramorphisms

As an example, try to express a factorial on a natural number in terms of ‘cata’. It’s (apparently) doable, but an implementation is not immediately clear.

A paramorphism will operate on an algebra that provides access to the input argument corresponding to the running state of the recursion. Check out the type of ‘para’ below:

para :: Foldable t => (Base t (t, a) -> a) -> t -> a

If we’re implementing a factorial on ‘Nat’ values then ‘t’ is going to correspond to ‘Nat’ and ‘a’ is going to correspond to (say) ‘Integer’. Here it might help to use the ‘:kind!’ trick to help reason through the requirements of the algebra. We can ask GHCi to help us out:

> :kind! Base Nat (Nat, Int)
Base Nat (Nat, Int) :: *
= NatF (Nat, Int)

Side note: after doing this trick a few times you’ll probably find it much easier to reason about type families sans-GHCi. In any case, we can now implement an algebra corresponding to the required type:

natfac :: Nat -> Int
natfac = para alg where
 alg ZeroF = 1
 alg (SuccF (n, f)) = natsum (succ n) * f

Here there are some details to point out.

The type of our algebra is ‘NatF (Nat, Int) -> Int’; the value with the ‘Nat’ type, ‘n’, holds the most recent input argument used to compute the state of the computation, ‘f’.

If you picture a factorial defined as

0!       = 1
(k + 1)! = (k + 1) * k!

Then ‘n’ corresponds to ‘k’ and ‘f’ corresponds to ‘k!’. To compute the factorial of the successor to ‘n’, we just convert ‘succ n’ to an integer (via ‘natsum’) and multiply it by ‘f’.

Paramorphisms tend to be pretty useful for a lot of mundane tasks. We can easily implement ‘pred’ on natural numbers via ‘para’:

natpred :: Nat -> Nat
natpred = para alg where
  alg ZeroF          = zero
  alg (SuccF (n, _)) = n

We can also implement ‘tail’ on lists. To check the type of the required algebra we can again get some help from GHCi; here I’ll evaluate a general type family, for illustration:

> :set -XRankNTypes
> :kind! forall a b. Base (List a) (List a, b)
forall a b. Base (List a) (List a, b) :: *
= forall a b. ListF a (Fix (ListF a), b)

Providing an algebra of the correct structure lets ‘tailL’ fall out as follows:

tailL :: List a -> List a
tailL = para alg where
  alg NilF             = nil
  alg (ConsF _ (l, _)) = l

You can check that ‘tailL’ indeed returns the tail of its argument.

Hylomorphisms

Hylomorphisms can express general computation — corecursive production followed by recursive consumption. Compared to the other type signatures in recursion-schemes, ‘hylo’ is quite simple:

hylo :: Functor f => (f b -> b) -> (a -> f a) -> a -> b

It doesn’t even require the full structure built up for e.g. ‘cata’ and ‘ana’; just very simple F-{co}algebras.

My favourite example hylomorphism is an absolutely beautiful implementation of mergesort. I think it helps illustrate how recursion schemes can help tease out incredibly simple structure in what could otherwise be a more involved problem.

Our input will be a Haskell list containing some orderable type. We’ll use it to build a balanced binary tree via an anamorphism and then tear it down with a catamorphism, merging lists together and sorting them as we go.

The resulting code looks like this:

mergeSort :: Ord a => [a] -> [a]
mergeSort = hylo alg coalg where
  alg EmptyF      = []
  alg (LeafF c)   = [c]
  alg (NodeF l r) = merge l r

  coalg []  = EmptyF
  coalg [x] = LeafF x
  coalg xs  = NodeF l r where
    (l, r) = splitAt (length xs `div` 2) xs

What’s more, the fusion achieved via this technique is really quite lovely.

Wrapping Up

Hopefully this article helps fuel any ‘programming via structured recursion’ trend that might be ever-so-slowly growing.

When programming in a language like Haskell, a very natural pattern is to write little embedded languages and mini-interpreters or compilers to accomplish tasks. Typically these tiny embedded languages have a recursive structure, and when you’re interpreting a recursive structure, you have use all these lovely off-the-shelf strategies for recursion available to you to keep your programs concise, modular, and efficient. The recursion-schemes library really has all the built-in machinery you need to start using these things for real.

If you’re interested about hearing about using recursion schemes ‘for real’ I recommend Tim Williams’s Exotic Tools For Exotic Trades talk (for a motivating example for the use of recursion schemes in production) or his talk on recursion schemes from the London Haskell User’s Group a few years ago.

So happy recursing! I’ve dropped the code from this article into a gist.

Thanks to Maciej Woś for review and helpful comments.

Automasymbolic Differentiation

06 Jul 2014

Automatic differentiation is one of those things that’s famous for not being as famous as it should be (uh..). It’s useful, it’s convenient, and yet fewer know about it than one would think.

This article (by one of the guys working on Venture) is the single best introduction you’ll probably find to AD, anywhere. It gives a wonderful introduction to the basics, the subtleties, and the gotchas of the subject. You should read it. I’m going to assume you have.

In particular, you should note this part:

[computing gradients] can be done — the method is called reverse mode — but it introduces both code complexity and runtime cost in the form of managing this storage (traditionally called the “tape”). In particular, the space requirements of raw reverse mode are proportional to the runtime of f.

In some applications this can be somewhat inconvenient - picture iterative gradient-based sampling algorithms like Hamiltonian Monte Carlo, its famous auto-tuning version NUTS, and Riemannian manifold variants. Tom noted in this reddit thread that symbolic differentiation - which doesn’t need to deal with tapes and infinitesimal-tracking - can often be orders of magnitude faster than AD for this kind of problem. When running these algos we calculate gradients many, many times, and the additional runtime cost of the reverse-mode AD dance can add up.

An interesting question is whether or not this can this be mitigated at all, and to what degree. In particular: can we use automatic differentiation to implement efficient symbolic differentiation? Here’s a possible attack plan:

use an existing automatic differentiation implementation to calculate the gradient for some target function at a point
capture the symbolic expression for the gradient and optimize it by eliminating common subexpressions or whatnot
reify that expression as a function that we can evaluate for any input
voila, (more) efficient gradient

Ed Kmett’s ‘ad’ library is the best automatic differentiation library I know of, in terms of its balance between power and accessibility. It can be used on arbitrary Haskell types that are instances of the Num typeclass and can carry out automatic differentiation via a number of modes, so it’s very easy to get started with. Rather than rolling my own AD implementation to try to do this sort of thing, I’d much rather use his.

Start with a really basic language. In practice we’d be interested in working with more expressive things, but this is sort of the minimal interesting language capable of illustrating the problem:

import Numeric.AD

data Expr a =
    Lit a
  | Var String
  | Add (Expr a) (Expr a)
  | Sub (Expr a) (Expr a)
  | Mul (Expr a) (Expr a)
  deriving (Eq, Show)

instance Num a => Num (Expr a) where
  fromInteger = Lit . fromInteger
  e0 + e1 = Add e0 e1
  e0 - e1 = Sub e0 e1
  e0 * e1 = Mul e0 e1

We can already use Kmett’s ad library on these expressions to generate symbolic expressions. We just have to write functions we’re interested in generically (using +, -, and *) and then call diff or grad or whatnot on them with a concretely-typed argument. Some examples:

> :t diff (\x -> 2 * x ^ 2)
diff (\x -> 2 * x ^ 2) :: Num a => a -> a

> diff (\x -> 2 * x ^ 2) (Lit 1)
Mul (Add (Mul (Lit 1) (Lit 1)) (Mul (Lit 1) (Lit 1))) (Lit 2)

> grad (\[x, y] -> 2 * x ^ 2 + 3 * y) [Lit 1, Lit 2]
[ Add (Lit 0) (Add (Add (Lit 0) (Mul (Lit 1) (Add (Lit 0) (Mul (Lit 2) (Add
  (Lit 0) (Mul (Lit 1) (Lit 1))))))) (Mul (Lit 1) (Add (Lit 0) (Mul (Lit 2) (Add
  (Lit 0) (Mul (Lit 1) (Lit 1)))))))
, Add (Lit 0) (Add (Lit 0) (Mul (Lit 3) (Add (Lit 0) (Mul (Lit 1) (Lit 1)))))
]

It’s really easy to extract a proper derivative/gradient ‘function’ by doing something like this:

> diff (\x -> 2 * x ^ 2) (Var "x")
Mul (Add (Mul (Var "x") (Lit 1)) (Mul (Lit 1) (Var "x"))) (Lit 2)

and then, given that expression, reifying a direct function for the derivative by substituting over the captured variable and evaluating everything. Here’s some initial machinery to handle that:

-- | Close an expression over some variable.
close :: Expr a -> String -> a -> Expr a
close (Add e0 e1) s x = Add (close e0 s x) (close e1 s x)
close (Sub e0 e1) s x = Sub (close e0 s x) (close e1 s x)
close (Mul e0 e1) s x = Mul (close e0 s x) (close e1 s x)

close (Var v) s x
  | v == s    = Lit x
  | otherwise = Var v

close e _ _ = e

-- | Evaluate a closed expression.
eval :: Num a => Expr a -> a
eval (Lit d) = d
eval (Var _) = error "expression not closed"
eval (Add e0 e1) = eval e0 + eval e1
eval (Sub e0 e1) = eval e0 - eval e1
eval (Mul e0 e1) = eval e0 * eval e1

So, using this on the example above yields

> let testExpr = diff (\x -> 2 * x ^ 2) (Var "x")
> eval . close testExpr "x" $ 1
2

and it looks like a basic toDerivative function for expressions could be implemented as follows:

-- | Do some roundabout AD.
toDerivative expr = eval . close diffExpr "x" where
  diffExpr = diff expr (Var "x")

But that’s a no go. ‘ad’ throws a type error, as presumably we’d be at risk of perturbation confusion:

Couldn't match expected type ‘AD
       s (Numeric.AD.Internal.Forward.Forward (Expr c))
    -> AD s (Numeric.AD.Internal.Forward.Forward (Expr c))’
   with actual type ‘t’
  because type variable ‘s’ would escape its scope
This (rigid, skolem) type variable is bound by
  a type expected by the context:
    AD s (Numeric.AD.Internal.Forward.Forward (Expr c))
    -> AD s (Numeric.AD.Internal.Forward.Forward (Expr c))
  at ParamBasic.hs:174:14-32
Relevant bindings include
  diffExpr :: Expr c (bound at ParamBasic.hs:174:3)
  expr :: t (bound at ParamBasic.hs:173:14)
  toDerivative :: t -> c -> c (bound at ParamBasic.hs:173:1)
In the first argument of ‘diff’, namely ‘expr’
In the expression: diff expr (Var "x")

Instead, we can use ad’s auto combinator to write an alternate eval function:

autoEval :: Mode a => String -> Expr (Scalar a) -> a -> a
autoEval x expr = (`go` expr) where
  go _ (Lit d) = auto d
  go v (Var s)
    | s == x    = v
    | otherwise = error "expression not closed"

  go v (Add e0 e1) = go v e0 + go v e1
  go v (Sub e0 e1) = go v e0 - go v e1
  go v (Mul e0 e1) = go v e0 * go v e1

and using that, implement a working toDerivative:

toDerivative :: Num a => String -> Expr (Expr a) -> Expr a
toDerivative v expr = diff (autoEval v expr)

which, though it has a weird-looking type, typechecks and does the trick:

> let d = toDerivative "x" (Mul (Lit 2) (Mul (Var "x") (Var "x"))) (Var "x")
Mul (Add (Mul (Var "x") (Lit 1)) (Mul (Lit 1) (Var "x"))) (Lit 2)

> eval . close "x" 1 $ d
4

So now we have access to a reified AST for a derivative (or gradient), which can be tweaked and optimized as needed. Cool.

The available optimizations depend heavily on the underlying language. For starters, there’s easy and universal stuff like this:

-- | Reduce superfluous expressions.
elimIdent :: (Num a, Eq a) => Expr a -> Expr a
elimIdent (Add (Lit 0) e) = elimIdent e
elimIdent (Add e (Lit 0)) = elimIdent e
elimIdent (Add e0 e1)     = Add (elimIdent e0) (elimIdent e1)

elimIdent (Sub (Lit 0) e) = elimIdent e
elimIdent (Sub e (Lit 0)) = elimIdent e
elimIdent (Sub e0 e1)     = Sub (elimIdent e0) (elimIdent e1)

elimIdent (Mul (Lit 1) e) = elimIdent e
elimIdent (Mul e (Lit 1)) = elimIdent e
elimIdent (Mul e0 e1)     = Mul (elimIdent e0) (elimIdent e1)

elimIdent e = e

Which lets us do some work up-front:

> let e = 2 * Var "x" ^ 2 + Var "x" ^ 4
Add (Mul (Lit 2) (Mul (Var "x") (Var "x"))) (Mul (Mul (Var "x") (Var "x"))
(Mul (Var "x") (Var "x")))

> let ge = toDerivative "x" e
Add (Mul (Add (Mul (Var "x") (Lit 1)) (Mul (Lit 1) (Var "x"))) (Lit 2))
(Add (Mul (Mul (Var "x") (Var "x")) (Add (Mul (Var "x") (Lit 1)) (Mul
(Lit 1) (Var "x")))) (Mul (Add (Mul (Var "x") (Lit 1)) (Mul (Lit 1)
(Var "x"))) (Mul (Var "x") (Var "x"))))

> let geOptim = elimIdent ge
Add (Mul (Add (Var "x") (Var "x")) (Lit 2)) (Add (Mul (Mul (Var "x")
(Var "x")) (Add (Var "x") (Var "x"))) (Mul (Add (Var "x") (Var "x")) (Mul
(Var "x") (Var "x"))))

> eval . close "x" 1 $ geOptim
8

But there are also some more involved optimizations that can be useful for some languages. The basic language I’ve been using above, for example, has no explicit support for sharing common subexpressions. You’ll recall from one of my previous posts that we have a variety of methods to do that in Haskell EDSLs, including some that allow sharing to be observed without modifying the underlying language. We can use data-reify, for example, to observe any implicit sharing in expressions:

> reifyGraph $ geOptim
let [(1,AddF 2 6),(6,AddF 7 10),(10,MulF 11 12),(12,MulF 4 4),(11,AddF 4 4),
(7,MulF 8 9),(9,AddF 4 4),(8,MulF 4 4),(2,MulF 3 5),(5,LitF 2),(3,AddF 4 4),
(4,VarF "x")] in 1

And even make use of a handy library found on Hackage for performing common subexpression elimination on graphs returned by reifyGraph:

> cse <$> reifyGraph geOptim
let [(5,LitF 2),(1,AddF 2 6),(3,AddF 4 4),(6,AddF 7 10),(2,MulF 3 5),
(10,MulF 3 8),(8,MulF 4 4),(7,MulF 8 3),(4,VarF "x")] in 1

With an appropriate graph evaluator we can cut down the size of the syntax we have to traverse substantially.

Happy automasymbolic differentiating!

Older Newer

jtobin.io