08 Jan 2016
I noticed this
article
by Tom Ellis today that provides an excellent ‘demystified’ introduction to
automatic differentiation. His exposition is exceptionally clear and simple.
Hopefully not in the spirit of re-mystifying things too much, I wanted to
demonstrate that this kind of forward-mode automatic differentiation can be
implemented using a catamorphism, which cleans up the various let
statements
found in Tom’s version (at the expense of slightly more pattern matching).
Let me first duplicate his setup using the standard recursion
scheme machinery:
{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE LambdaCase #-}
import Data.Functor.Foldable
data ExprF r =
VarF
| ZeroF
| OneF
| NegateF r
| SumF r r
| ProductF r r
| ExpF r
deriving (Show, Functor)
type Expr = Fix ExprF
Since my expression type uses a fixed-point wrapper I’ll define my own
embedded language terms to get around it:
var :: Expr
var = Fix VarF
zero :: Expr
zero = Fix ZeroF
one :: Expr
one = Fix OneF
neg :: Expr -> Expr
neg x = Fix (NegateF x)
add :: Expr -> Expr -> Expr
add a b = Fix (SumF a b)
prod :: Expr -> Expr -> Expr
prod a b = Fix (ProductF a b)
e :: Expr -> Expr
e x = Fix (ExpF x)
To implement a corresponding eval
function we can use a catamorphism:
eval :: Double -> Expr -> Double
eval x = cata $ \case
VarF -> x
ZeroF -> 0
OneF -> 1
NegateF a -> negate a
SumF a b -> a + b
ProductF a b -> a * b
ExpF a -> exp a
Very clear. We just match things mechanically.
Now, symbolic differentiation. If you refer to the original diff
function
you’ll notice that in cases like Product
or Exp
there are uses of both an
original expression and also its derivative. This can be captured simply by a
paramorphism:
diff :: Expr -> Expr
diff = para $ \case
VarF -> one
ZeroF -> zero
OneF -> zero
NegateF (_, x') -> neg x'
SumF (_, x') (_, y') -> add x' y'
ProductF (x, x') (y, y') -> add (prod x y') (prod x' y)
ExpF (x, x') -> prod (e x) x'
Here the primes indicate derivatives in the usual fashion, and the standard
rules of differentiation are self-explanatory.
For automatic differentiation we just do sort of the same thing, except we’re
also also going to lug around the evaluated function value itself at each point
and evaluate to doubles instead of other expressions.
It’s worth noting here: why doubles? Because the expression type that we’ve
defined has no notion of sharing, and thus the expressions will blow up à la
diff
(to see what I mean, try printing the analogue of diff bigExpression
in GHCi). This could probably be mitigated by incorporating sharing into the
embedded language in some way, but that’s a topic
for another post.
Anyway, a catamorphism will do the trick:
ad :: Double -> Expr -> (Double, Double)
ad x = cata $ \case
VarF -> (x, 1)
ZeroF -> (0, 0)
OneF -> (1, 0)
NegateF (x, x') -> (negate x, negate x')
SumF (x, x') (y, y') -> (x + y, x' + y')
ProductF (x, x') (y, y') -> (x * y, x * y' + x' * y)
ExpF (x, x') -> (exp x, exp x * x')
Take a look at the pairs to the right of the pattern matches; the first element
in each is just the corresponding term from eval
, and the second is just the
corresponding term from diff
(made ‘Double’-friendly). The catamorphism
gives us access to all the terms we need, and we can avoid a lot of work on
the right-hand side by doing some more pattern matching on the left.
Some sanity checks to make sure that these functions match up with Tom’s:
*Main> map (snd . (`ad` testSmall)) [0.0009, 1.0, 1.0001]
[0.12254834896191881,1.0,1.0003000600100016]
*Main> map (snd . (`ad` testBig)) [0.00009, 1.0, 1.00001]
[3.2478565715996756e-6,1.0,1.0100754777229357]
UPDATE:
I had originally defined ad
using a paramorphism but noticed that we can get
by just fine with cata.
09 Dec 2015
I’m presently at NIPS and so felt like writing about some
appropriate machine learning topic, but along the way I wound up talking about
parameterized recursive types, and here we are. Enjoy!
One starts to see common ‘shapes’ in algebraic data types after working with
them for a while. Take the natural numbers and a standard linked list, for
example:
data Natural =
One
| Succ Natural
data List a =
Empty
| Cons a (List a)
These are similar in some sense. There are some differences - a list has an
additional type parameter, and each recursive point in the list is tagged with
a value of that type - but the nature of the recursion in each is the same.
There is a single recursive point wrapped up in a single constructor, plus a
single base case.
Consider a recursive type that is parameterized by a functor with kind ‘* ->
*’, such that the kind of the resulting type is something like ‘(* -> *) ->
*’ or ‘(* -> *) -> * -> *’ or so on. It’s interesting to look at the
‘shapes’ of some useful types like this and see what kind of similarities and
differences in recursive structure that we can find.
In this article we’ll look at three such recursive types: ‘Fix’, ‘Free’, and
‘Cofree’. I’ll demonstrate that each can be viewed as a kind of program
parameterized by some underlying instruction set.
Fix
To start, let’s review the famous fixed-point type ‘Fix’. I’ve talked about it
before, but will go into a bit more detail here.
{-# LANGUAGE DeriveFunctor #-}
{-# LANGUAGE FlexibleContexts #-}
{-# LANGUAGE StandaloneDeriving #-}
{-# LANGUAGE UndecideableInstances #-}
newtype Fix f = Fix (f (Fix f))
deriving instance (Show (f (Fix f))) => Show (Fix f)
Note: I’ll omit interpreter output for examples throughout this article, but
feel free to try the code yourself in GHCi. I’ll post some gists at the
bottom. The above code block also contains some pragmas that you can ignore;
they’re just there to help GHC derive some instances for us.
Anyway. ‘Fix’ is in some sense a template recursive structure. It relies on
some underlying functor ‘f’ to define the scope of recursion that you can
expect a value with type ‘Fix f’ to support. There is the degenerate constant
case, for example, which supports no recursion:
data DegenerateF r = DegenerateF
deriving (Functor, Show)
type Degenerate = Fix DegenerateF
degenerate :: Degenerate
degenerate = Fix DegenerateF
Then you have the case like the one below, where only an infinitely recursive
value exists:
newtype InfiniteF r = InfiniteF r
deriving (Functor, Show)
type Infinite = Fix InfiniteF
infinite :: Infinite
infinite = Fix (InfiniteF infinite)
But in practice you’ll have something in between; a type with at least one
recursive point or ‘running’ case and also at least one base or ‘terminating’
case. Take the natural numbers, for example:
data NatF r =
OneF
| SuccF r
deriving (Functor, Show)
type Nat = Fix NatF
one :: Nat
one = Fix OneF
succ :: Nat -> Nat
succ = Fix . SuccF
Here ‘NatF’ provides both a ‘running’ case - ‘SuccF’ - and a ‘terminating’ case
in - ‘OneF’. ‘Fix’ just lets ‘NatF’ do whatever it wants, having no say of its
own about termination. In fact, we could have defined ‘Fix’ like this:
data Program f = Running (f (Program f))
Indeed, you can think of ‘Fix’ as defining a program that runs until ‘f’
decides to terminate. In turn, you can think of ‘f’ as an instruction
set for the program. The whole shebang of ‘Fix f’ may only terminate if ‘f’
contains a terminating instruction.
Here’s a simple set of instructions, for example:
data Instruction r =
Increment r
| Decrement r
| Terminate
deriving (Functor, Show)
increment :: Program Instruction -> Program Instruction
increment = Running . Increment
decrement :: Program Instruction -> Program Instruction
decrement = Running . Decrement
terminate :: Program Instruction
terminate = Running Terminate
And we can write a sort of stack-based program like so:
program :: Program Instruction
program =
increment
. increment
. decrement
$ terminate
Richness of ‘Fix’
It’s worthwhile to review two functions that are useful for working with ‘Fix’,
unimaginatively named ‘fix’ and ‘unfix’:
fix :: f (Fix f) -> Fix f
fix = Fix
unfix :: Fix f -> f (Fix f)
unfix (Fix f) = f
You can think of them like so: ‘fix’ embeds a value of type ‘f’ into a
recursive structure by adding a new layer of recursion, while ‘unfix’ projects
a value of type ‘f’ out of a recursive structure by peeling back a layer of
recursion.
This is a pretty rich recursive structure - we have a guarantee that we can
always embed into or project out of something with type ‘Fix f’, no matter
what ‘f’ is.
Free
Next up is ‘Free’, which is really just ‘Fix’ with some added structure. It is
defined as follows:
data Free f a =
Free (f (Free f a))
| Pure a
deriving Functor
deriving instance (Show a, Show (f (Free f a))) => Show (Free f a)
The ‘Free’ constructor has an analogous definition to the ‘Fix’ constructor,
meaning we can use ‘Free’ to implement the same things we did previously. Here
are the natural numbers redux, for example:
type NatFree = Free NatF
oneFree :: NatFree a
oneFree = Free OneF
succFree :: NatFree a -> NatFree a
succFree = Free . SuccF
There’s also another branch here called ‘Pure’, though, that just bluntly wraps
a value of type ‘a’, and has nothing to do with the parameter ‘f’. This has an
interesting consequence: it means that ‘Free’ can have an opinion of its own
about termination, regardless about what ‘f’ might decree:
type NotSoInfinite = Free InfiniteF
notSoInfinite :: NotSoInfinite ()
notSoInfinite = Free (InfiniteF (Free (InfiniteF (Pure ()))))
(Note that here I’ve returned the value of type unit when terminating under the
‘Pure’ branch, but you could pick whatever else you’d like.)
You’ll recall that ‘InfiniteF’ provides no terminating instruction,
and left to its own devices will just recurse endlessly.
So: instead of being forced to choose a branch of the underlying functor to
recurse on, ‘Free’ can just bail out on a whim and return some value wrapped up
in ‘Pure’. We could have defined the whole type like this:
data Program f a =
Running (f (Program f a))
| Terminated a
deriving Functor
Again, it’s ‘Fix’ with more structure. It’s a program that runs until ‘f’
decides to terminate, or that terminates and returns a value of type ‘a’
As a quick illustration, take our simple stack-based instruction set again. We
can define the following embedded language terms:
increment :: Program Instruction a -> Program Instruction a
increment = Running . Increment
decrement :: Program Instruction a -> Program Instruction a
decrement = Running . Decrement
terminate :: Program Instruction a
terminate = Running Terminate
sigkill :: Program f Int
sigkill = Terminated 1
So note that ‘sigkill’ is independent of whatever instruction set we’re working
with. We can thus write another simple program like before, except this time
have ‘sigkill’ terminate it:
program :: Program Instruction Int
program =
increment
. increment
. decrement
$ sigkill
Richness of ‘Free’
Try to define the equivalent versions of ‘fix’ and ‘unfix’ for ‘Free’. The
equivalent to ‘fix’ is easy:
free :: f (Free f a) -> Free f a
free = Free
You’ll hit a wall, though, if you want to implement the (total) analogue to
‘unfix’. One wants a function of type ‘Free f a -> f (Free f a)’, but the
existence of the ‘Pure’ branch makes this impossible to implement totally. In
general there is not going to be an ‘f’ to pluck out:
unfree :: Free f a -> f (Free f a)
unfree (Free f) = f
unfree (Pure a) = error "kaboom"
The recursion provided by ‘Free’ is thus a little less rich than that provided
by ‘Fix’. With ‘Fix’ one can always project a value out of its recursive
structure - but that’s not the case with ‘Free’.
It’s well-known that ‘Free’ is monadic, and indeed it’s usually called the
‘free
monad’.
The namesake ‘free’ comes from an algebraic definition; roughly, a free ‘foo’
is a ‘foo’ that satisfies the minimum possible constraints to make it a ‘foo’,
and nothing else. Check out the
slides
from Dan Piponi’s excellent talk from Bayhac a few years back for a deeper dive
on algebraic freeness.
Cofree
‘Cofree’ is also like ‘Fix’, but again with some extra structure. It can be
defined as follows:
data Cofree f a = Cofree a (f (Cofree f a))
deriving Functor
deriving instance (Show a, Show (f (Cofree f a))) => Show (Cofree f a)
Again, part of the definition - the second field of the ‘Cofree’ constructor -
looks just like ‘Fix’. So predictably we can do a redux-redux of the natural
numbers using ‘Cofree’:
type NatCofree = Cofree NatF
oneCofree :: NatCofree ()
oneCofree = Cofree () OneF
succFree :: NatCofree () -> NatCofree ()
succFree f = Cofree () (SuccF f)
(Note that here I’ve again used unit to fill in the first field - you could
of course choose whatever you’d like.)
This looks a lot like ‘Free’, and in fact it’s the categorical dual of
‘Free’. Whereas ‘Free’ is a sum type with two branches, ‘Cofree’ is a
product type with two fields. In the case of ‘Free’, we could have a program
that either runs an instruction from a set ‘f’, or terminates with a value
having type ‘a’. In the case of ‘Cofree’, we have a program that runs an
instruction from a set ‘f’ and returns a value of type ‘a’.
A ‘Free’ value thus contains at most one recursive point wrapping the value
with type ‘a’, while a ‘Cofree’ value contains potentially infinite recursive
points - each one of which is tagged with a value of type ‘a’.
Rolling with the ‘Program’ analogy, we could have written this alternate
definition for ‘Cofree’:
data Program f a = Program {
annotation :: a
, running :: f (Program f a)
} deriving Show
A ‘Cofree’ value is thus a program in which every instruction is annotated with
a value of type ‘a’. This means that, unlike ‘Free’, it can’t have its own
opinion on termination. Like ‘Fix’, it has to let ‘f’ decide how to do that.
We’ll use the stack-based instruction set example to wrap up. Here we can
annotate instructions with progress about how many instructions remain to
execute. First our new embedded language terms:
increment :: Program Instruction Int -> Program Instruction Int
increment p = Program (remaining p) (Increment p)
decrement :: Program Instruction Int -> Program Instruction Int
decrement p = Program (remaining p) (Decrement p)
terminate :: Program Instruction Int
terminate = Program 0 Terminate
Notice that two of these terms use a helper function ‘remaining’ that counts
the number of instructions left in the program. It’s defined as follows:
remaining :: Program Instruction Int -> Int
remaining = loop where
loop (Program a f) = case f of
Increment p -> succ (loop p)
Decrement p -> succ (loop p)
Terminate -> succ a
And we can write our toy program like so:
program :: Program Instruction Int
program =
increment
. increment
. decrement
$ terminate
Evaluate it in GHCi to see what the resulting value looks like.
Richness of ‘Cofree’
If you try and implement the ‘fix’ and ‘unfix’ analogues for ‘Cofree’ you’ll
rapidly infer that we have the opposite situation to ‘Free’ here. Implementing
the ‘unfix’ analogue is easy:
uncofree :: Cofree f a -> f (Cofree f a)
uncofree (Cofree _ f) = f
But implementing a total function corresponding to ‘fix’ is impossible - we
can’t just come up with something of arbitrary type ‘a’ to tag an instruction
‘f’ with, so, like before, we can’t do any better than define something
partially:
cofree :: f (Cofree f a) -> Cofree f a
cofree f = Cofree (error "kaboom") f
Just as how ‘Free’ forms a monad, ‘Cofree’ forms a comonad. It’s thus known as
the ‘cofree comonad’, though I can’t claim to really have any idea what the
algebraic notion of ‘cofreeness’ captures, exactly.
Wrapping Up
So: ‘Fix’, ‘Free’, and ‘Cofree’ all share a similar sort of recursive structure
that make them useful for encoding programs, given some instruction set. And
while their definitions are similar, ‘Fix’ supports the richest recursion of
the three in some sense - it can both ‘embed’ things into and ‘project’
things out of its recursive structure, while ‘Free’ supports only embedding and
‘Cofree’ supports only projecting.
This has a practical implication: it means one can’t make use of certain
recursion schemes for ‘Free’ and ‘Cofree’ in the same way that one can for
‘Fix’. There do exist analogues, but they’re sort of out-of-scope for this
post.
I haven’t actually mentioned any truly practical uses of ‘Free’ and ‘Cofree’
here, but they’re wonderful things to keep in your toolkit if you’re doing any
work with embedded languages, and I’ll likely write more about them in the
future. In the meantime, Dave Laing wrote an excellent series of
posts on ‘Free’ and
‘Cofree’ that are more than worth reading. They go into much more interesting
detail than I’ve done here - in particular he details a nice pairing that
exists between ‘Free’ and ‘Cofree’ (also
discussed by Dan
Piponi), plus a whack of examples.
You can also find industrial-strength infrastructure for both ‘Free’ and
‘Cofree’ in Edward Kmett’s excellent
free library, and for ‘Fix’ in
recursion-schemes.
I’ve dumped the code for this article into a few gists.
Here’s one of everything
excluding the running ‘Program’ examples, and here are the corresponding
‘Program’ examples for the
Fix,
Free, and
Cofree cases
respectively.
Thanks to Fredrik Olsen for review and great feedback.
02 Dec 2015
Merge sort is a famous
comparison-based sorting algorithm that starts by first recursively dividing a
collection of orderable elements into smaller subcollections, and then finishes
by recursively sorting and merging the smaller subcollections together to
reconstruct the (now sorted) original.
A clear implementation of mergesort should by definition be as faithful to that
high-level description as possible. We can get pretty close to that using the
whole recursion schemes
business that I’ve talked about in the past. Near the end of that article I
briefly mentioned the idea of implementing mergesort via a
hylomorphism,
and here I just want to elaborate on that a little.
Start with a collection of orderable elements. We can divide the collection
into a bunch of smaller collections by using a binary tree:
{-# LANGUAGE DeriveFunctor #-}
import Data.Functor.Foldable (hylo)
import Data.List.Ordered (merge)
data Tree a r =
Empty
| Leaf a
| Node r r
deriving Functor
The idea is that each node in the tree holds two subtrees, each of which
contains half of the remaining elements. We can build a tree like this from a
collection - say, a basic Haskell list. The following unfolder
function
defines what part of a tree to build for any corresponding part of a list:
unfolder [] = Empty
unfolder [x] = Leaf x
unfolder xs = Node l r where
(l, r) = splitAt (length xs `div` 2) xs
On the other hand, we can also collapse an existing tree back into a list. The
following folder
function defines how to collapse any given part of a tree
into the corresponding part of a list; again we just pattern match on whatever
part of the tree we’re looking at, and construct the complementary list:
folder Empty = []
folder (Leaf x) = [x]
folder (Node l r) = merge l r
Now to sort a list we can just glue these instructions together using
a hylomorphism:
mergesort :: Ord a => [a] -> [a]
mergesort = hylo folder unfolder
And it works just like you’d expect:
> mergesort [1,10,3,4,5]
[1,3,4,5,10]
> mergesort "aloha"
"aahlo"
> mergesort [True, False, False, True, False]
[False, False, False, True, True]
Pretty concise!
The code is eminently clean and faithful to the high-level algorithm
description: first recursively divide a collection into smaller subcollections
- via a binary tree and
unfolder
- and then recursively sort and merge the
subcollections to reconstruct the (now sorted) original one - via folder
.
A version of this post originally appeared on the Fugue
blog.
14 Oct 2015
I’ve released a number of libraries for doing Markov Chain Monte Carlo (MCMC)
in Haskell.
You can get at them via a ‘frontend’ library,
declarative, but each can
also be used fruitfully on its own. À la carte, if you will.
Some background: MCMC is a family of stateful algorithms for sampling from a
large class of probability distributions. Typically one is interested in doing
this to approximate difficult integrals; instead of choosing some suitable grid
of points in parameter space over which to approximate an integral, just
offload the problem to probability theory and use a Markov chain to find them
for you.
For an excellent introduction to MCMC you won’t find better than Iain Murray’s
lectures from MLSS ’09 in
Cambridge, so check those out if you’re interested in more details.
I’ve put together a handful of popular MCMC algorithms as well as an easy way
to glue them together in a couple of useful ways. At present these
implementations are useful in cases where you can write your target function in
closed form, and that’s pretty much all that’s required (aside from the
standard algorithm-specific tuning parameters).
The API should be pretty easy to work with — write your target as a function of
its parameters, specify a start location, and away you go. It’s also cool if
your target accepts its parameters via most common traversable
functors — lists, vectors, sequences, maps, etc.
That’s sort of the goal of this first release: if you can give me a target
function, I’ll do my best to give you samples from it. Less is more and all
that.
What‘s In The Box
There are a number of libraries involved. I have a few more in the queue and
there are a number of additional features I plan to support for these ones in
particular, but without further ado:
- mwc-probability, a
sampling-function based probability monad implemented as a thin wrapper over
the excellent mwc-random
library.
- mcmc-types, housing a
number of types used by the the whole family.
- mighty-metropolis,
an implementation of the famous Metropolis algorithm.
- speedy-slice, a slice
sampling implementation suitable for both continuous & discrete parameter
spaces.
- hasty-hamiltonian,
an implementation of the gradient-based Hamiltonian Monte Carlo algorithm.
- declarative, the one ring
to rule them all.
Pull down declarative if you just want to have access to all of them. If
you’re a Haskell neophyte you can find installation instructions at the Github
repo.
Motivation
MCMC is fundamentally about observing Markov chains over probability spaces.
In this context a chain is a stochastic process that wanders around a state
space, eventually visiting regions of the space in proportion to their
probability.
Markov chains are constructed by transition operators that obey the Markov
property: that the probability of transitioning to the next
location — conditional on the history of the chain — depends only on the
current location. For MCMC we’re also interested in operators that satisfy the
reversibility property — that the probability a transition from state A to
state B occurs is the same as that a transition from state B to state A occurs.
A chain is characterized by a transition operator T that drives it from state
to state, and for MCMC we want the stationary or limiting distribution of the
chain to be the distribution we’re sampling from.
One of the major cottage industries in Bayesian research is inventing new
transition operators to drive the Markov chains used in MCMC. This has been
fruitful, but it could likely be aided by a practical way to make existing
transition operators work together.
This is easy to do in theory: there are a couple of ways to combine transition
operators such that the resulting composite operator preserves all the
properties we’re interested in for MCMC — the stationary distribution,
reversibility, and Markov property. See Geyer,
2005 for details here, but
the crux is that we can establish the following simple grammar for transition
operators:
transition ::= primitive <transition>
| concat transition transition
| sample transition transition
A transition is either some primitive operator, a deterministic concatenation
of operators (via ‘concat’), or a probabilistic concatenation of operators (via
‘sample’). A deterministic concatenation works by just transitioning through
two operators one after the other; a probabilistic concatenation works by
randomly choosing one transition operator or the other to use on any given
transition. These kinds of concatenation preserve all the properties we’re
interested in.
We can trivially generalize this further by adding a term that concatenates n
transition operators together deterministically, or another for
probabilistically concatenating a bunch of operators according to some desired
probability distribution.
The idea here is that there are tradeoffs involved in different transition
operators. Some may be more computationally expensive than others (perhaps
requiring a gradient evaluation, or evaluation of some inner loop) but have
better ability to make ‘good’ transitions in certain situations. Other
operators are cheap, but can be inefficient (taking a long time to visit
certain regions of the space).
By employing deterministic or probabilistic concatenation, one can concoct a
Markov chain that uses a varied range of tuning parameters, for example. Or
only occasionally employs a computationally expensive transition, otherwise
preferring some cheaper, reliable operator.
Usage
The declarative library implements this simple language for transition
operators, and the mighty-metropolis, speedy-slice, and hasty-hamiltonian
libraries provide some primitive transitions that you can combine as needed.
The Metropolis and slice sampling transitions are cheap and require little
information, whereas Hamiltonian Monte Carlo exploits information about the
target’s gradient and also involves evaluation of an inner loop (the length of
which is determined by a tuning parameter). Feel free to use one that suits
your problem, or combine them together using the combinators supplied in
declarative to build a custom solution.
As an example, the Rosenbrock
density is a great test
dummy as it’s simple, low-dimensional, and can be easily visualized, but it
still exhibits a pathological anisotropic structure that makes it somewhat
tricky to sample from.
Getting started via declarative is pretty simple:
You’ll want to supply a target to sample over, and if you want to use an
algorithm like Hamiltonian Monte Carlo you’ll also need to provide a gradient.
If you can’t be bothered to calculate gradients by hand, you can always turn to
your friend automatic
differentiation:
The Rosenbrock log-density and its gradient can then be written as follows:
target :: Num a => [a] -> a
target [x0, x1] = negate (100 * (x1 — x0 ^ 2) ^ 2 + (1 — x0) ^ 2)
gTarget :: Num a => [a] -> [a]
gTarget = grad target
All you need to do here is provide a function proportional to a
log-probability density. The logarithmic scale is important; various internals
expect to be passed (something proportional to) a log-probability density.
To package these guys up together we can wrap them in a Target
. Note that we
don’t always care about including a gradient, so that part is optional:
rosenbrock :: Target [Double]
rosenbrock = Target target (Just gTarget)
The Target
type is parameterized over the shape of the parameter space. You
could similarly have a Target (Seq Double)
, Target (Map String Double)
, and
so on. Your target may be implemented using a boxed vector for efficiency, for
example. Or using a Map or HashMap with string/text keys such that parameter
names are preserved. They should work just fine.
Given a target, we can sample from it a bunch of times using a simple
Metropolis transition via the mcmc function. Aside from the target and a
PRNG, provide it with the desired number of transitions, a starting point, and
the transition operator to use:
> -- haskell
> prng <- create
> mcmc 10000 [0, 0] (metropolis 1) rosenbrock prng
In return you’ll get the desired trace of the chain dumped to stdout:
8.136972300105949e-2,0.273896953404261
0.4657348148676972,0.17462596647788464
-0.48609414127836326,9.465052854751566e-2
-0.49781488399832785,0.42092910345708523
-0.3019713424699155,0.39135350029173566
0.12058426470979189,0.12485407390388925
..
The intent is for the chain to be processed elsewhere — if you’re me, that will
usually be in R. Libraries like
coda have a ton of
functionality useful for working with Markov chain traces, and
ggplot2 as a library for static statistical graphics
can’t really be beat:
> # r
> d = read.csv(‘rosenbrock-trace.dat’, header = F)
> names(d) = c(‘x’, ‘y’)
> require(ggplot2)
> ggplot(d, aes(x, y)) + geom_point(colour = ‘darkblue’, alpha = 0.2)
You get the following trace over the Rosenbrock density, taken for 10k
iterations. This is using a Metropolis transition with variance 1:
If you do want to work with chains in memory in Haskell you can do that by
writing your own handling code around the supplied transition operators. I’ll
probably make this a little easier in later versions.
The implementations are reasonably quick and don’t leak memory — the traces are
streamed to stdout as the chains are traversed. Compiling the above with ‘-O2’
and running it for 100k iterations yields the following performance
characteristics on my mid-2011 model MacBook Air:
$ ./test/Rosenbrock +RTS -s > /dev/null
3,837,201,632 bytes allocated in the heap
8,453,696 bytes copied during GC
89,600 bytes maximum residency (2 sample(s))
23,288 bytes maximum slop
1 MB total memory in use (0 MB lost due to fragmentation)
INIT time 0.000s ( 0.000s elapsed)
MUT time 3.539s ( 3.598s elapsed)
GC time 0.049s ( 0.058s elapsed)
EXIT time 0.000s ( 0.000s elapsed)
Total time 3.591s ( 3.656s elapsed)
%GC time 1.4% (1.6% elapsed)
Alloc rate 1,084,200,280 bytes per MUT second
Productivity 98.6% of total user, 96.8% of total elapsed
The beauty is that rather than running a chain solely on something like the
simple Metropolis operator used above, you can sort of ‘hedge your sampling
risk’ and use a composite operator that proposes transitions using a multitude
of ways. Consider this guy, for example:
transition =
concatT
(sampleT (metropolis 0.5) (metropolis 1.0))
(sampleT (slice 2.0) (slice 3.0))
Here concatT
and sampleT
correspond to the concat
and sample
terms in
the BNF description in the previous section. This operator performs two
transitions back-to-back; the first is randomly a Metropolis transition with
standard deviation 0.5 or 1 respectively, and the second is a slice sampling
transition using a step size of 2 or 3, randomly.
Running it for 5000 iterations (to keep the total computation approximately
constant), we see a chain that has traversed the space a little better:
> mcmc 5000 [0, 0] transition rosenbrock prng
It’s worth noting that I didn’t put any work into coming up with this composite
transition: this was just the first example I thought up, and a lot of the
benefits here probably come primarily from including the eminently-reliable
slice sampling transition. But from informal experimentation, it does seem that
chains driven by composite transitions involving numerous operators and tuning
parameter settings often seem to perform better on average than a given chain
driven by a single (poorly-selected) transition.
I know exactly how meticulous proofs and benchmarks must be so I haven’t
rigorously established any properties around this, but hey: it ‘seems to be the
case’, and intuitively, including varied transition operators surely hedges
your bets when compared to using a single one.
Try it out and see how your mileage varies, and be sure to let me know if you
find some killer apps where composite transitions really seem to win.
Implementation Notes
If you’re just interested in using the libraries you can skip the following
section, but I just want to point out how easy this is to implement.
The implementations are defined using a small set of types living in
mcmc-types:
type Transition m a = StateT a (Prob m) ()
data Chain a b = Chain {
chainTarget :: Target a
, chainScore :: Double
, chainPosition :: a
, chainTunables :: Maybe b
}
data Target a = Target {
lTarget :: a -> Double
, glTarget :: Maybe (a -> a)
}
Most important here is the Transition
type, which is just a state transformer
over a probability monad (itself defined in mwc-probability). The probability
monad is the source of randomness used to define transition operators useful
for MCMC, and values with type Transition
are the transition operators in
question.
The Chain
type is the state of the Markov chain at any given iteration. All
that’s really required here is the chainPosition
field, which represents the
location of the chain in parameter space. But adding some additional
information here is convenient; chainScore
caches the most recent score of
the chain (which is typically used in internal calculations, and caching avoids
recomputing things needlessly) and chainTunables
is an optional record
intended to be used for stateful tuning parameters (used by adaptive algorithms
or in burn-in phases and the like). Additionally the target being sampled from
itself — chainTarget
— is included in the state.
Undisciplined use of chainTarget
and chainTunables
can have all sorts of
nasty consequences — you can use them to change the stationary distribution
you’re sampling from or invalidate the Markov property — but keeping them
around is useful for implementing some desirable features. Tweaking
chainTarget
, for example, allows one to easily implement annealing, which can
be very useful for sampling from annoying multi-modal densities.
Setting everything up like this makes it trivial to mix-and-match transition
operators as required — the state and probability monad stack provides
everything we need. Deterministic concatenation is implemented as follows, for
example:
and a generalized version of probabilistic concatenation just requires a coin
flip:
bernoulliT p t0 t1 = do
heads <- lift (MWC.bernoulli p)
if heads then t0 else t1
A uniform probabilistic concatenation over two operators, implemented in
sampleT
, is then just bernoulliT 0.5
.
The difficulty of implementing primitive operators just depends on the operator
itself; the surrounding framework is extremely lightweight. Here’s the
Metropolis transition, for example (with type signatures omitted to keep the
noise down):
metropolis radial = do
Chain {..} <- get
proposal <- lift (propose radial chainPosition)
let proposalScore = lTarget chainTarget proposal
acceptProb = whenNaN 0
(exp (min 0 (proposalScore - chainScore)))
accept <- lift (MWC.bernoulli acceptProb)
when accept
(put (Chain chainTarget proposalScore proposal chainTunables))
propose radial = traverse perturb where
perturb m = MWC.normal m radial
And the excellent pipes library is
used to generate a Markov chain:
chain radial = loop where
loop state prng = do
next <- lift
(MWC.sample (execStateT (metropolis radial) state) prng)
yield next
loop next prng
The mcmc
functions are also implemented using pipes. Take the first n
iterations of a chain and print them to stdout. That simple.
Future Work
In the near term I plan on updating some old MCMC implementations I have
kicking around on Github (flat-mcmc,
lazy-langevin,
hnuts) and releasing them within this
framework. Additionally I’ve got some code for building annealed operators that
I want to release — it has been useful in some situations when sampling from
things like the Himmelblau
density, which has a
few disparate clumps of probability that make it tricky to sample from with
conventional algorithms.
This framework is also useful as an inference backend to languages for working
with directed graphical models (think BUGS/Stan). The idea here is that you
don’t need to specify your target function (typically a posterior density)
explicitly: just describe your model and I’ll give you samples from the
posterior distribution. A similar version has been put to use around the
BayesHive project.
Longer term — I’ll have to see what’s up in terms of demand. There are
performance improvements and straightforward extensions to things like
parallel tempering, but I’m
growing more interested in ‘online’ methods like particle
MCMC
and friends that are proving useful for inference in more general probabilistic
programs (think those expressible by Church and its
ilk).
Let me know if you get any use out of these things, or please file an issue if
there’s some particular feature you’d like to see supported.
Thanks to Niffe Hermansson for review and helpful comments.
06 Sep 2015
Recursion schemes are elegant and useful patterns for expressing general
computation. In particular, they allow you to ‘factor recursion out’ of
whatever semantics you may be trying to express when interpreting programs,
keeping your interpreters concise, your concerns separated, and your code more
maintainable.
What’s more, formulating programs in terms of recursion schemes seems to help
suss out particular similarities in structure between what might be seen as
disparate problems in
other domains. So aside from being a practical computational tool, they seem to
be of some use when it comes to ‘hacking understanding’ in varied areas.
Unfortunately, they come with a pretty forbidding barrier to entry. While there
are a
few
nice
resources
out there for learning about recursion schemes and how they work, most
literature around them is quite
academic and
awash in some astoundingly technical jargon (more on this later). Fortunately,
the accessible resources out there do a great job of explaining what recursion
schemes are and how you might use them, so they go through some effort to build
up the required machinery from scratch.
In this article I want to avoid building up the machinery meticulously and
instead concentrate mostly on understanding and using Edward Kmett’s
recursion-schemes
library, which, while lacking in documentation, is very well put together and
implements all the background plumbing one needs to get started.
In particular, to feel comfortable using recursion-schemes I found that there
were a few key patterns worth understanding:
- Factoring recursion out of your data types using pattern functors and a
fixed-point wrapper.
- Using the ‘Foldable’ & ‘Unfoldable’ classes, plus navigating the ‘Base’ type
family.
- How to use some of the more common recursion schemes out there for everyday
tasks.
The Basics
If you’re following along in GHCi, I’m going to first bring in some
imports and add a useful pragma. I’ll dump a gist at the bottom; note that this
article targets GHC 7.10.2 and recursion-schemes-4.1.2, plus I’ll also require
data-ordlist-0.4.7.0 for an example later. Here’s the requisite boilerplate:
{-# LANGUAGE DeriveFunctor #-}
import Data.Functor.Foldable
import Data.List.Ordered (merge)
import Prelude hiding (Foldable, succ)
So, let’s get started.
Recursion schemes are applicable to data types that have a suitable recursive
structure. Lists, trees, and natural numbers are illustrative candidates.
Being so dead-simple, let’s take the natural numbers as an illustrative/toy
example. We can define them recursively as follows:
data Natural =
Zero
| Succ Natural
This is a fine definition, but many such recursive structures can also be
defined in a different way: we can first ‘factor out’ the recursion by defining
some base structure, and then ‘add it back in’ by using a recursive wrapper
type.
The price of this abstraction is a slightly more involved type definition, but
it unlocks some nice benefits — namely, the ability to reason about recursion
and base structures separate from each other. This turns out to be a very
useful pattern for getting up and running with recursion schemes.
The trick is to create a different, parameterized type, in which the new
parameter takes the place of all recursive points in the original type. We can
create this kind of base structure for the natural numbers example as follows:
data NatF r =
ZeroF
| SuccF r
deriving (Show, Functor)
This type must be a functor in this new parameter, so the type is often called
a ‘pattern functor’ for some other type. I like to use the notation
‘F’ when defining constructors for pattern functors.
We can define pattern functors for lists and trees in the same way:
data ListF a r =
NilF
| ConsF a r
deriving (Show, Functor)
data TreeF a r =
EmptyF
| LeafF a
| NodeF r r
deriving (Show, Functor)
Now, to add recursion to these pattern functors we’re going to use the famous
fixed-point type, ‘Fix’, to wrap them in:
type Nat = Fix NatF
type List a = Fix (ListF a)
type Tree a = Fix (TreeF a)
‘Fix’ is a standard fixed-point type imported from the recursion-schemes
library. You can get a ton of mileage from it. It introduces the ‘Fix’
constructor everywhere, but that’s actually not much of an issue in practice.
One thing I typically like to do is add some smart constructors to get around
it:
zero :: Nat
zero = Fix ZeroF
succ :: Nat -> Nat
succ = Fix . SuccF
nil :: List a
nil = Fix NilF
cons :: a -> List a -> List a
cons x xs = Fix (ConsF x xs)
Then you can write expressions like ‘succ (succ (succ zero))’ without having to
deal with the ‘Fix’ constructor explicitly. Note also that these expressions
are Showable à la carte, for example in GHCi:
> succ (succ (succ zero))
Fix (SuccF (Fix (SuccF (Fix (SuccF (Fix ZeroF))))))
A Short Digression on ‘Fix’
The ‘Fix’ type is brought into scope from ‘Data.Functor.Foldable’, but it’s
worth looking at it in some detail. It can be defined as follows, along with
two helpful functions for working with it:
newtype Fix f = Fix (f (Fix f))
fix :: f (Fix f) -> Fix f
fix = Fix
unfix :: Fix f -> f (Fix f)
unfix (Fix f) = f
‘Fix’ has a simple recursive structure. For a given value, you can think of
‘fix’ as adding one level of recursion to it. ‘unfix’ in turn removes one level
of recursion.
This generic recursive structure is what makes ‘Fix’ so useful: we can write
some nominally recursive type we’re interested in without actually using
recursion, but then package it up in ‘Fix’ to hijack the recursion it provides
automatically.
Understanding Some Internal Plumbing
If we wrap a pattern functor in ‘Fix’ then the underlying machinery of
recursion-schemes should ‘just work’. Here it’s worth explaining a little as to
why that’s the case.
There are two fundamental type classes involved in recursion-schemes:
‘Foldable’ and ‘Unfoldable’. These serve to tease apart the recursive structure
of something like ‘Fix’ even more: loosely, ‘Foldable’ corresponds to types
that can be ‘unfixed’, and ‘Unfoldable’ corresponds to types that can be
‘fixed’. That is, we can add more layers of recursion to instances of
‘Unfoldable’, and we can peel off layers of recursion from instances of
‘Foldable’.
In particular ‘Foldable’ and ‘Unfoldable’ contain functions called ‘project’
and ‘embed’ respectively, corresponding to more general forms of ‘unfix’ and
‘fix’. Their types are as follows:
project :: Foldable t => t -> Base t t
embed :: Unfoldable t => Base t t -> t
I’ve found it useful while using recursion-schemes to have a decent
understanding of how to interpret the type family ‘Base’. It appears frequently
in type signatures of various recursion schemes and being able to reason about
it can help a lot.
‘Base’ and Basic Type Families
Type families are type-level functions; they take types as input and return
types as output. The ‘Base’ definition in recursion-schemes looks like this:
type family Base t :: * -> *
You can interpret this as a function that takes one type ‘t’ as input and
returns some other type. An implementation of this function is called an
instance of the family. The instance for ‘Fix’, for example, looks like:
type instance Base (Fix f) = f
In particular, a type family like ‘Base’ is a synonym for instances of the
family. So using the above example: anywhere you see something like ‘Base (Fix
f)’ you can mentally replace it with ‘f’.
Instances of the ‘Base’ type family have a structure like ‘Fix’, but using
‘Base’ enables all the internal machinery of recursion-schemes to work
out-of-the-box for types other than ‘Fix’ alone. This has a typical Kmettian
flavour: first solve the most general problem, and then recover useful,
specific solutions to it automatically.
Predictably, ‘Fix f’ is an instance of ‘Base’, ‘Foldable’, and ‘Unfoldable’ for
some functor ‘f’, so if you use it, you can freely use all of
recursion-schemes’s innards without needing to manually write any instances for
your own data types. But as mentioned above, it’s worth noting that you can
exploit the various typeclass & type family machinery to get by without using
‘Fix’ at all: see e.g. Danny Gratzer’s recursion-schemes post for an example of
this.
Some Useful Schemes
So, with some discussion of the internals out of the way, we can look at some
of the more common and useful recursion schemes. I’ll concentrate on the
following four, as they’re the ones I’ve found the most use for:
- catamorphisms, implemented via ‘cata’, are generalized folds.
- anamorphisms, implemented via ‘ana’, are generalized unfolds.
- hylomorphisms, implemented via ‘hylo’, are anamorphisms followed by catamorphisms (corecursive production followed by recursive consumption).
- paramorphisms, implemented via ‘para’, are generalized folds with access to the input argument corresponding to the most recent state of the computation.
Let me digress slightly on nomenclature.
Yes, the names of these things are celebrations of the ridiculous. There’s no
getting around it; they look like self-parody to almost anyone not
pre-acquainted with categorical concepts. They have been accused — probably
correctly — of being off-putting.
That said, they communicate important technical details and are actually not so
bad when you get used to them. It’s perfectly fine and even encouraged to
arm-wave about folds or unfolds when speaking informally, but the moment
someone distinguishes one particular style of fold from another via a prefix
like e.g. para, I know exactly the relevant technical distinctions required to
understand the discussion. The names might be silly, but they have their place.
Anyway.
There are many other more exotic schemes that I’m sure are quite useful (see
Tim Williams’s recursion schemes talk, for example), but I haven’t made use of
any outside of these four just yet. The recursion-schemes library contains a
plethora of unfamiliar schemes just waiting to be grokked, but in the interim
even cata and ana alone will get you plenty far.
Now let’s use the motley crew of schemes to do some useful computation on our
example data types.
Catamorphisms
Take our natural numbers type, ‘Nat’. To start, we can use a catamorphism to
represent a ‘Nat’ as an ‘Int’ by summing it up.
natsum :: Nat -> Int
natsum = cata alg where
alg ZeroF = 0
alg (SuccF n) = n + 1
Here ‘alg’ refers to ‘algebra’, which is the function that we use to define our
reducing semantics. Notice that the semantics are not defined recursively! The
recursion present in ‘Nat’ has been decoupled and is handled for us by ‘cata’.
And as a plus, we still don’t have to deal with the ‘Fix’ constructor anywhere.
As a brief aside: I like to write my recursion schemes in this way, but your
mileage may vary. If you’d like to enable the ‘LambdaCase’ extension, then
another option is to elide mentioning the algebra altogether using a very
simple case statement:
{-# LANGUAGE LambdaCase #-}
natsum :: Nat -> Int
natsum = cata $ \case ->
ZeroF -> 0
SuccF n -> n + 1
Some people find this more readable.
To understand how we used ‘cata’ to build this function, take a look at its
type:
cata :: Foldable t => (Base t a -> a) -> t -> a
The ‘Base t a -> a’ term is the algebra; ‘t’ is our recursive datatype (i.e.
‘Nat’), and ‘a’ is whatever type we’re reducing a value to.
Historically I’ve found ‘Base’ here to be confusing, but here’s a neat trick to
help reason through it.
Remember that ‘Base’ is a type family, so for some appropriate ‘t’ and ‘a’,
‘Base t a’ is going to be a synonym for some other type. To figure out what
‘Base t a’ corresponds to for some concrete ‘t’ and ‘a’, we can ask GHCi via
this lesser-known command that evaluates type families:
> :kind! Base Nat Int
Base Nat Int :: *
= NatF Int
So in the ‘natsum’ example the algebra used with ‘cata’ must have type ‘NatF
Int -> Int’. This is pretty obvious for ‘cata’, but I initially found that
figuring out what type should be replaced for ‘Base’ exactly could be confusing
for some of the more exotic schemes.
As another example, we can use a catamorphism to implement ‘filter’ for our list type:
filterL :: (a -> Bool) -> List a -> List a
filterL p = cata alg where
alg NilF = nil
alg (ConsF x xs)
| p x = cons x xs
| otherwise = xs
It follows the same simple pattern: we define our semantics by interpreting
recursion-less constructors through an algebra, then pump it through ‘cata’.
Anamorphisms
These running examples are toys, but even here it’s really annoying to have to
type ‘succ (succ (succ (succ (succ (succ zero)))))’ to get a natural number
corresponding to six for debugging or what have you.
We can use an anamorphism to build a ‘Nat’ value from an ‘Int’:
nat :: Int -> Nat
nat = ana coalg where
coalg n
| n <= 0 = ZeroF
| otherwise = SuccF (n - 1)
Just as a small detail: to be descriptive, here I’ve used ‘coalg’ as the
argument to ‘ana’, for ‘coalgebra’.
Now the expression ‘nat 6’ will do the same for us as the more verbose example
above. As always, recursion is not part of the semantics; to have the integer
‘n’ we pass in correspond to the correct natural number, we use the successor
value of ‘n — 1’.
Paramorphisms
As an example, try to express a factorial on a natural number in terms of
‘cata’. It’s (apparently) doable, but an implementation is not immediately
clear.
A paramorphism will operate on an algebra that provides access to the input
argument corresponding to the running state of the recursion. Check out the
type of ‘para’ below:
para :: Foldable t => (Base t (t, a) -> a) -> t -> a
If we’re implementing a factorial on ‘Nat’ values then ‘t’ is going to
correspond to ‘Nat’ and ‘a’ is going to correspond to (say) ‘Integer’. Here it
might help to use the ‘:kind!’ trick to help reason through the requirements of
the algebra. We can ask GHCi to help us out:
> :kind! Base Nat (Nat, Int)
Base Nat (Nat, Int) :: *
= NatF (Nat, Int)
Side note: after doing this trick a few times you’ll probably find it much
easier to reason about type families sans-GHCi. In any case, we can now
implement an algebra corresponding to the required type:
natfac :: Nat -> Int
natfac = para alg where
alg ZeroF = 1
alg (SuccF (n, f)) = natsum (succ n) * f
Here there are some details to point out.
The type of our algebra is ‘NatF (Nat, Int) -> Int’; the value with the ‘Nat’
type, ‘n’, holds the most recent input argument used to compute the state of
the computation, ‘f’.
If you picture a factorial defined as
0! = 1
(k + 1)! = (k + 1) * k!
Then ‘n’ corresponds to ‘k’ and ‘f’ corresponds to ‘k!’. To compute the
factorial of the successor to ‘n’, we just convert ‘succ n’ to an integer (via
‘natsum’) and multiply it by ‘f’.
Paramorphisms tend to be pretty useful for a lot of mundane tasks. We can
easily implement ‘pred’ on natural numbers via ‘para’:
natpred :: Nat -> Nat
natpred = para alg where
alg ZeroF = zero
alg (SuccF (n, _)) = n
We can also implement ‘tail’ on lists. To check the type of the required
algebra we can again get some help from GHCi; here I’ll evaluate a general type
family, for illustration:
> :set -XRankNTypes
> :kind! forall a b. Base (List a) (List a, b)
forall a b. Base (List a) (List a, b) :: *
= forall a b. ListF a (Fix (ListF a), b)
Providing an algebra of the correct structure lets ‘tailL’ fall out as follows:
tailL :: List a -> List a
tailL = para alg where
alg NilF = nil
alg (ConsF _ (l, _)) = l
You can check that ‘tailL’ indeed returns the tail of its argument.
Hylomorphisms
Hylomorphisms can express general computation — corecursive production followed
by recursive consumption. Compared to the other type signatures in
recursion-schemes, ‘hylo’ is quite simple:
hylo :: Functor f => (f b -> b) -> (a -> f a) -> a -> b
It doesn’t even require the full structure built up for e.g. ‘cata’ and ‘ana’;
just very simple F-{co}algebras.
My favourite example hylomorphism is an absolutely beautiful implementation of
mergesort. I think it helps illustrate how recursion schemes can help tease out
incredibly simple structure in what could otherwise be a more involved problem.
Our input will be a Haskell list containing some orderable type. We’ll use it
to build a balanced binary tree via an anamorphism and then tear it down with a
catamorphism, merging lists together and sorting them as we go.
The resulting code looks like this:
mergeSort :: Ord a => [a] -> [a]
mergeSort = hylo alg coalg where
alg EmptyF = []
alg (LeafF c) = [c]
alg (NodeF l r) = merge l r
coalg [] = EmptyF
coalg [x] = LeafF x
coalg xs = NodeF l r where
(l, r) = splitAt (length xs `div` 2) xs
What’s more, the fusion achieved via this
technique is
really quite lovely.
Wrapping Up
Hopefully this article helps fuel any ‘programming via structured
recursion’ trend that might be ever-so-slowly growing.
When programming in a language like Haskell, a very natural pattern is to write
little embedded languages and mini-interpreters or compilers to accomplish
tasks. Typically these tiny embedded languages have a recursive structure, and
when you’re interpreting a recursive structure, you have use all these lovely
off-the-shelf strategies for recursion available to you to keep your programs
concise, modular, and efficient. The recursion-schemes library really has all
the built-in machinery you need to start using these things for real.
If you’re interested about hearing about using recursion schemes ‘for real’ I
recommend Tim Williams’s Exotic Tools For Exotic
Trades talk (for a motivating
example for the use of recursion schemes in production) or his talk on
recursion schemes from the London
Haskell User’s Group a few years ago.
So happy recursing! I’ve dropped the code from this article into a
gist.
Thanks to Maciej Woś for review and helpful comments.