An Introduction to Approximate Bayesian Computation

8 minute read

Introduction

With the term Markov Chain Monte Carlo, notably referred as MCMC, we refer to statistical methods that allow to generate a dependent sample from a probability distribution. MCMC utilises a Monte Carlo method based upon Markov Chains, and its main application is primarily in Bayesian statistics to provide samples from the posterior distribution. MCMC can be traced back to Metropolis et al. (1953) and Hastings (1970), but it became very popular after the publication of the Gibbs Sampler by Geman and Geman (1984).

An important aspect of the applicability of this family of methods is that the likelihood function must be somehow tractable. However, there are many situations where we are not able to write it down explicitly: in these situations, we therefore need to rely on different methods.

Approximate Bayesian Computation to the Resque!

Among many, Approximate Bayesian Computation (ABC) has the key benefit over MCMC of not requiring a tractable likelihood. It was introduced in the context of statistical analysis of population genetics by Tavare et al. (1997), and it became very popular thanks to Christian Robert and colleagues - for a general introduction, I strongly encourage to read the following paper.

The aim of this post is to expose you to the plain vanilla ABC algorithm. I will provide a practical example - written in Python - to give you a better understanding of the proposed methodology. In particular, we will try to replicate the example provided by Dempster et al. (1977) under the ABC framework.

The Framework

Let $x^{\ast}$ denote the observed data which is assumed to be generated by a generic model with parameters $\theta$. From Bayes’ Theorem, we all know that the posterior distribution is proportional to the likelihood times the prior, namely

\[\pi\left(\theta \vert x^{\ast} \right) = \frac{\pi(x^{\ast} \vert \theta) \pi(\theta)}{\pi(x^{\ast})} \propto \pi(x^{\ast} \vert \theta) \pi(\theta)\]

Under this framework, we are able to choose the prior distribution so that everything is tractable and easy to work with - namely, we are able to estimate the sufficient statistic. In particular, using a rejection algorithm, we can produce iid samples from $\pi\left(\theta \vert x^{\ast}\right)$ without the need to evaluate $\pi\left(x^{\ast} \vert \theta\right)$.

Standard statistical inference theory also tells us that to estimate the posterior distribution of $\theta$ all we need to know is the sufficient statistic of the data. Let T(x) denote a sufficient summary statistic of the data x.

The algorithm

Typically, when ABC is used it is not always possible to identify sufficient statistics, since the likelihood can not be written down.

Notably, however, to generate samples from the posterior distribution, we need to approximate it. Let $d(\cdot, \cdot)$ denote a distance metric. Then for $\epsilon \geq 0$, define the $\epsilon$ approximate posterior distribution, $\pi_{\epsilon}\left(\theta \vert x^{\ast}\right)$, as follows: $\pi_{\epsilon}\left(\theta \vert x^{\ast}\right) \propto \pi(\theta) \int_{\Lambda} \pi(T(x) \vert \theta) d\theta$

where $\Lambda := \left( x: x \in d(T(x), T(x^{\ast}) \leq \epsilon)\right)$.

To obtain a sample from $\pi_{\epsilon}\left(\theta \vert x^{\ast}\right)$, we use the following algorithm, that is defined here as plain-vanilla ABC Algorithm:

Simulate $\theta$ from $\pi\left(\theta\right)$, the prior on $\theta$.
Simulate data x using the model and parameters $\theta$.
If $d(T(x), T(x∗)) \leq \epsilon$, accept $\theta$ as an observation from $\pi_{\epsilon}\left(\theta \vert x^{\ast}\right)$. Otherwise reject $\theta$.
Repeat steps 1-3 until a sample of the required size has been obtained from $\pi_{\epsilon}\left(\theta \vert x^{\ast}\right)$ or a fixed number of iterations.

Note that in the case $\epsilon = 0$, the ABC algorithm reduces to a standard rejection algorithm, i.e. a Gibbs Sampler. Thus we accept not only those simulations which result in a summary statistic which agrees with $T(x^{\ast})$ but all those simulations within a distance $\epsilon$. There is a compromise in the choice of $\epsilon$: the larger $\epsilon$, the more efficient the algorithm is in terms of the acceptance probability but the greater the bias (the approximation in the posterior distribution). Choosing $\epsilon$ small is liable to have a large Monte-Carlo error because a small acceptance probability leads to a large variance of the corresponding estimator. This is because the number of accepted values, the sample used for estimating the parameter is small.

The Algorithm in Practice

A data set contains the genetic linkage of 197 animals, in which the animals are distributed into 4 categories, labeled 1 through to 4. Let $y_i$ denote the total number of animals in category i, and let $p_i$ be the probability that an animal belongs to category i, with $i=1,2,3,4$.

Suppose that $y = (y_1, y_2, y_3, y_4) = (125, 18, 20, 34)$, with category probabilities

\[(p1,p2,p3,p4) = \left(\left(\frac{1}{2} + \frac{\theta}{4}\right), \left(\frac{(1 - \theta)}{4}\right),\left(\frac{(1 - \theta)}{4}\right), \frac{\theta}{4} \right)\]

We are interested in the posterior distribution of the parameter $\theta$. We assume $\theta \sim U\left(0,1\right)$. It is possible to show, with simple algebra, that the likelihood is

\[L\left( \theta \vert y \right) \propto \left(2+\theta\right)^{y_1}\left(1−\theta\right)^{y_2+y_3}\theta^{y_4}\]

Therefore $(y_1, y_2 + y_3, y_4)$ are sufficient summary statistics. Note that it suffices to know $y_2 + y_3$ rather than the individual values $y_2$ and $y_3$.

Assume $d(x, y) = \sqrt{\sum_{i=1}^3(x_i − y_i)^2}$. Fix $\epsilon \geq 0$.

Simulate $\theta$ from $U(0,1)$.
Simulate data $x$ from Multinomial(197, p) where $p = (p1,p2,p3,p4)$ as defined above. Note that we have combined the probabilities for categories 2 and 3 for the previous argument on sufficient statistics.
If $d(x, x^{\ast}) \leq \epsilon$, accept $\theta$ as an observation from $\pi_{\epsilon}\left(\theta \vert x^{\ast}\right)$. Otherwise reject $\theta$.
Repeat steps 1-3 N times.

To perform this simulation using python, I have written a custom method called genetic_linkage that implements the ABC algorithm above. Please refer to the source file for details. The following table show the results for the first five iterations of the algorithm out of $N=100000$.

	theta	distance
0	0.771028	22.627417
1	0.492547	12.328828
2	0.248526	50.219518
3	0.087905	62.032250
4	0.205806	74.712783

The whole simulated $\theta$ and distance values are shown in the next plot:

In 61 of the 100000 simulations we simulated data with the same sufficient statistic as the observed data. If we set $\epsilon = 3$, 1043 simulations are accepted. For $\epsilon = 3$, the mean and standard deviation of the (approximate) posterior distribution are 0.62 and 0.051, respectively. These results are in close agreement with the results we could obtained using the Gibbs sampler - if you are curious, try as an exercise.

Conclusions

In this post, we have understood the importance of ABC algorithms. The example we have used is however is extremely simple since the likelihood is tractable.

It should be noted that is not always possible to identify sufficient summary statistics, and it is prohibitive to use $x^{\ast}$. In such cases a second level of approximation is often used, where a summary statistic say $S(x)$ is used in place of $T(x)$. Generally, $S(x)$ is chosen to capture key features observed in the data and hopefully T(x) is close to being sufficient. However, unless $S(x)$ is a sufficient statistic there will be an associated error. We will see in a future post the use of such an approximate summary statistics with a genetic mutation model. If you have enjoyed this post, and you want to get more, please feel free to subscribe here!

References:

Dempster et al (1977).Maximum Likelihood from Incomplete Data via the EM Algorithm
Geman and Geman (1984). Stochastic Relaxation, Gibbs Distributions, and the Bayesian Restoration of Images.
Hasting (1970). Monte Carlo sampling methods using Markov chains and their application.
Tavare et al (1997). Inferring coalescence times from DNA sequence data.

Share on

Twitter Facebook Google+ LinkedIn

Andrea Giussani, PhD

An Introduction to Approximate Bayesian Computation

Introduction

Approximate Bayesian Computation to the Resque!

The Framework

The algorithm

The Algorithm in Practice

Conclusions

References:

Share on

You may also enjoy

Normalizer and TF-IDF in NLP

Fixing Common Issues When Setting Up Jekyll & GitHub Pages on macOS

Video Games Sales Data Analysis

The Importance of Preprocessing in a Machine Learning Pipeline