Markov Chain Sampling
Motivation
Sampling is an important part of statistics - it’s used in simulation (to verify algorithms), stochastic integration, etc.
It’s quite common that we want to generate some random variables that follow a specific distribution. For example, we may wish to generate a uniform random variable or generate Bernoulli random variables or even generate a path of a stochastic process.
Sampling from popular distributions (e.g. normal, binomial, uniform, etc.) is easy - we can simply use the built-in functions for these.
But what if the distribution that we want to sample from is not a standard one? e.g., generating a sample path of a simple random walk conditioned on some event.
In such cases, we can use other methods: rejection sampling, transformation (inverse transform) methods, MCMC sampling, etc.
Here, we’re only going to focus on MCMC sampling. MCMC stands for “Markov Chain Monte Carlo”.
MCMC
Consider a simple example: suppose we wish to sample from the Poisson($\lambda$) distribution and further suppose that we do not have any built-in function to do so. The probability mass function is $\pi_i = e^{-\lambda}\frac{\lambda^i}{i!}$ for $i = 0, 1, 2, \dots$
We will try to generate samples from such a distribution, using what we’ve learnt about Markov Chains → this is why it’s called Markov Chain Monte Carlo sampling.
We can do this as such:
- Define a MC by specifying the state space, time index, and transition probability matrix $P$
- If the MC is “good” (i.e., useful for our sampling purposes), then $X_n$ converges to the limiting distribution $\pi$.
- Then, we can take any $X_n$ for a large $n$, which should (approximately) follow $\pi$
- If we can ensure that $\pi$ is exactly the distribution we want to sample from, then we’re done 🎉
Now, the question is this: how do we come up with a MC given a limiting distribution we wish to achieve (in this case, Poisson($\lambda$))?
- The state space should be exactly the same as the support of the distribution we wish to generate. In this case, $S = \{0, 1, 2, \dots\}$
- The time index should be $\{0, 1, 2, \dots\}$ (discrete time)
The difficult part is the transition probability matrix $P$. We need this to be true (this is our goal): $\pi = \pi P$,
because under the limiting distribution, $\pi = \pi P$ must hold. We already know all the $\pi_i$’s because they are the goals we want to reach → but we still need to find the values of $P_{ij}$ for all $i, j$ (these are the unknowns).
Notice that for a chain with $n$ states we have $n^2$ unknowns, but only $n$ equations! This means it’s “too flexible” - there are too many possible ways to set the variables up 😟
We want a convenient way to set up the equations and solve for the unknowns, that can be generalized to any sampling distribution.
For this, we first need to digress to learn about global and local balanced equations.
Global Balanced Equations
Recall the equations for the limiting distribution are given by $\pi = \pi P$.
Element-wise, this means that: $\pi_j = \sum_i \pi_i P_{ij}$ for every state $j$.
Note that $\pi_j = P(X_{n+1} = j)$ and $\pi_i P_{ij} = P(X_n = i, X_{n+1} = j)$ (under the limiting distribution), and so the above equation can be interpreted as: the probability of being at state $j$ in the next step is given by taking the sum, over all states $i$, of the probability that the process will jump from state $i$ to state $j$ in the next step. This is necessary for the “equilibrium” to be stable, i.e., the limiting distribution to be invariant/stationary.
Another way to look at it is this:
- The LHS gives the total probability of jumping out of state $j$, since $\pi_j = \pi_j \sum_i P_{ji} = \sum_i \pi_j P_{ji}$ because each row of $P$ is a probability distribution. Note that the jump may return to $j$ itself but it still counts as a jump.
- The RHS gives us the total probability of entering state $j$ (as explained above).
So, we have that the total leaving probability from any state must be equal to the total entering probability of that state (from all states). This always needs to be true for a limiting distribution to exist.
These are called the global balanced equations. “Global” because they deal with all the states → every state depends on every other state (in these equations).
This is very similar to the “flow conservation” constraint in Max-Flow algorithms.
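As a quick sanity check, here’s a small numerical verification of the global balanced equations (the two-state chain below is made up purely for illustration, it is not an example from these notes):

```python
import numpy as np

# a made-up 2-state transition matrix
P = np.array([[0.7, 0.3],
              [0.4, 0.6]])

# its limiting distribution solves pi = pi P with the entries summing to 1;
# for this chain, pi = (4/7, 3/7)
pi = np.array([4 / 7, 3 / 7])

print(np.allclose(pi @ P, pi))  # global balance: pi = pi P holds

# total flow out of state 0 equals total flow into state 0 (self-loops count on both sides)
print(np.isclose(pi[0] * P[0].sum(), (pi * P[:, 0]).sum()))
```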
Local Balanced Equations
But what if we not only consider the total probability of entering/leaving one state, but also keep track of which state it jumps to/from?
The probability of jumping from $i$ to $j$ is: $\pi_i P_{ij}$
The probability of jumping from $j$ to $i$ is: $\pi_j P_{ji}$
We hope that: $\pi_i P_{ij} = \pi_j P_{ji}$ for every pair of states $i, j$.
Such a system of equations is called the local balanced equations. “Local” because we’re considering the pairwise transitions between states, instead of looking at the entering probability from all the states together.
Here, we want the leaving probability from $i$ to $j$ to be the same as that from $j$ to $i$, so they effectively “cancel” each other out and the overall distribution remains intact.
Hence, we’re imposing a stricter constraint than the global balanced equations. We not only care that the total leaving probability is equal to the total entering probability (as we did for the global balanced equations), we also care where the process comes from and goes to - and we want to match these probabilities.
The global balanced equations require that the following sums must be equal: $\sum_i \pi_j P_{ji} = \sum_i \pi_i P_{ij}$.
But the local balanced equations require that every corresponding term of the sum (on both sides) must be equal: $\pi_j P_{ji} = \pi_i P_{ij}$ for each $i$.
Clearly, if all the individual corresponding terms are equal, the sums on both sides must also be equal.
Local balanced equations are also used to discuss the time-reversibility of stochastic processes.
Local Equations vs. Global Equations
For a MC with $n$ states:
- We have $n$ global balanced equations and we have $\binom{n}{2} = \frac{n(n-1)}{2}$ local balanced equations
- If there is a solution that satisfies local balanced equations, then it is also a solution of the global balanced equations
- For an ergodic MC, there always exists a solution to the global balanced equations, but not necessarily to the local balanced equations (because the additional pairwise constraints are imposed by us, they are our assumptions)
Coming back to sampling, now we can set up $P$ by requiring it to satisfy the local balanced equations as well, and now we have $\frac{n(n-1)}{2}$ equations instead of $n$, which is much better: each local equation involves only a single pair of entries of $P$, so they can be handled one pair at a time.
As a summary until now, our aim has been to sample from a distribution $\pi$. So, we try to set up a MC such that the limiting distribution is exactly $\pi$.
- $S$, the state space of this MC, should be exactly the “support” of $\pi$, which refers to all the possible values of the distribution we wish to sample from
- $\{0, 1, 2, \dots\}$ is the index set
- $P$ should be set up in order to satisfy the local balanced equations
Transition Probability Matrix Set Up
Let’s now look into how to set up the transition probability matrix.
Approach 1
We can start with an arbitrary assignment to set up all the $P_{ij}$’s (satisfying the local balanced equations), and then standardize each row to be a probability distribution with sum $1$.
This might be difficult because we will have a lot of equations to solve simultaneously. Moreover, if we have infinitely many states, this is not feasible. The same goes if we have a continuous set $S$, i.e., $S$ is not discrete at all (so we can’t even talk about a “state” in this context).
Approach 2
Begin with a general transition probability matrix $Q$ on the state space, and then “adjust” $Q$ to get $P$. How do we “adjust” it?
First, we introduce the thinning idea: consider a simple example where $X \sim \text{Bernoulli}(p)$ and, independently, $B \sim \text{Bernoulli}(q)$, and define $Y = BX$. Then, clearly $P(Y = 1) = pq \le p$ and $P(Y = 0) = 1 - pq \ge 1 - p$. We notice that compared to $X$, $Y$ has a larger probability of being zero. We call this a thinning of the distribution of $X$.
Another example: suppose $X \sim F$ where $F$ is some arbitrary discrete distribution, and $B \sim \text{Bernoulli}(q)$ is independent of $X$. We further define $Y = BX$. Then, what is the distribution of $Y$?
When $y \ne 0$, then: $P(Y = y) = q\,P(X = y)$
And when $y = 0$, $P(Y = 0) = (1 - q) + q\,P(X = 0)$
Clearly, $P(Y = y) \le P(X = y)$ for all $y \ne 0$ (and $P(Y = 0) \ge P(X = 0)$). Again, we see the “thinning” effect.
Notice that this thinning construction does not place any restriction on the range of $X$ → it works even when $X$ can take negative values.
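Here’s a minimal simulation of the thinning effect (the distribution of $X$ and the Bernoulli parameter below are arbitrary choices for illustration):

```python
import random

N = 100_000
xs = [random.randint(0, 5) for _ in range(N)]          # X uniform on {0, ..., 5}
ys = [x if random.random() < 0.5 else 0 for x in xs]   # Y = B * X with B ~ Bernoulli(1/2)

p_x0 = sum(x == 0 for x in xs) / N   # around 1/6
p_y0 = sum(y == 0 for y in ys) / N   # around 1/2 + (1/2)(1/6): larger - the thinning effect
print(p_x0, p_y0)
```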
We use this “thinning” idea to adjust $Q$ towards $P$.
Given $X_n = i$, sample a new random variable $T$ from the $i$th row of $Q$ (the distribution of $T$ is given by that row). Note that $T$ is not $X_{n+1}$ (i.e., it is not guaranteed that we will make the jump to $T$ → we will “thin” this probability of jumping to ensure we satisfy the local balanced equations).
Currently, the local balanced equations are: $\pi_i Q_{ij} = \pi_j Q_{ji}$,
which usually do not hold because we’ve chosen $Q$ arbitrarily.
We want to add a new term $\alpha(i, j)$ so that: $\pi_i Q_{ij}\,\alpha(i, j) = \pi_j Q_{ji}\,\alpha(j, i)$
Then, we can set $P_{ij} = Q_{ij}\,\alpha(i, j)$ for $i \ne j$, and the local balanced equations $\pi_i P_{ij} = \pi_j P_{ji}$ hold.
This is easier to solve because the $\alpha(i, j)$’s don’t have to sum to one. They can be any arbitrarily chosen numbers as long as they’re between 0 (exclusive) and 1 (inclusive)!
Now, see that $\pi_i Q_{ij}\,\alpha(i, j)$ is the probability of jumping from $i$ to $j$, and $\pi_j Q_{ji}\,\alpha(j, i)$ is the probability of jumping from $j$ to $i$. We want these two flows between the states to be equal.
We can easily get the 2 flows to be equal by choosing appropriate values of $\alpha$.
We can first find the minimum of the two flows, i.e., $\min(\pi_i Q_{ij}, \pi_j Q_{ji})$. We want to reduce the bigger one to be equal to the smaller one, by multiplying the bigger one by some number (less than 1) and multiplying the other term by 1 (keeping it unchanged).
Mathematically, we can write this as such: we desire $\pi_i Q_{ij}\,\alpha(i, j) = \pi_j Q_{ji}\,\alpha(j, i) = \min(\pi_i Q_{ij}, \pi_j Q_{ji})$. So, we set: $\alpha(i, j) = \min\left(1, \dfrac{\pi_j Q_{ji}}{\pi_i Q_{ij}}\right)$
By setting up in such a manner, we can ensure the equality of incoming/outgoing flow for any pair of states.
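A tiny numerical check (with made-up values of $\pi$ and $Q$) that this choice of $\alpha$ restores the local balanced equation:

```python
# made-up target and proposal for two states 0 and 1
pi = {0: 0.2, 1: 0.8}
Q = {(0, 1): 0.5, (1, 0): 0.5}   # chosen without regard to pi

def alpha(i, j):
    return min(1.0, pi[j] * Q[(j, i)] / (pi[i] * Q[(i, j)]))

flow_01 = pi[0] * Q[(0, 1)] * alpha(0, 1)   # 0.2 * 0.5 * 1    = 0.10
flow_10 = pi[1] * Q[(1, 0)] * alpha(1, 0)   # 0.8 * 0.5 * 0.25 = 0.10
assert abs(flow_01 - flow_10) < 1e-12       # the two flows now match
```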
We can summarize approach 2 in this way (and we call this the Hastings-Metropolis Algorithm):
Set up $Q$ so that the MC with transition probability matrix $Q$ is irreducible (i.e., for any $i, j$ there is some $n$ so that $Q^{(n)}_{ij} > 0$)
Define $\alpha(i, j)$ as $\alpha(i, j) = \min\left(1, \dfrac{\pi_j Q_{ji}}{\pi_i Q_{ij}}\right)$
(You can see that this is equivalent to our above equation)
With $Q$ and $\alpha$, we generate $P$ as $P_{ij} = Q_{ij}\,\alpha(i, j)$ for $i \ne j$, and $P_{ii} = 1 - \sum_{j \ne i} P_{ij}$
In other words, we redirect all the remaining flow (so that the row sum remains one) to the same state. Observe that there is a thinning effect here: we’re reducing the flow out from every state to other states, and increasing the flow from the state to itself (to make up for the reduction to other states). Basically, we’re exploiting the fact that we can assign any probability to $P_{ii}$ and the local balanced equation for $i = j$ will trivially hold - so we focus on the rest of the states first, and then we assign $P_{ii} = 1 - \sum_{j \ne i} P_{ij}$
Here, flow is equivalent to the probability.
Remarks
- $Q$ is set up so that the MC is irreducible and simple to sample from (basically we should be able to use built-in functions to sample each row of $Q$, otherwise you might end up needing to do MCMC sampling for $Q$ as well - recursively! 😲).
- Approach 1: $Q$ is set so that all the rows are the same (so the proposal distribution is the same for every current state $i$)
- Approach 2: $Q$ is set so that only the neighbors of the current state have a non-zero probability. This is common when we have infinitely many (or even just too many) states. Here, we define neighbours however we want - depending on the “closeness” of the states in $S$. Notice that we don’t have the MC yet - we’re creating the MC, so we can also decide which states to consider as neighbours or not. Example: if $S = \{0, 1, 2, \dots\}$ then one way to define the neighbors of $i$ is $i - 1$ and $i + 1$ (for all $i \ge 1$). We could’ve also chosen a wider window like $\{i - 2, \dots, i + 2\}$ - it’s really up to us, and depends on the problem. This approach is common because we only deal with a small group of states at every step. The downside of this is that from one state, we can only jump to the nearby states - it’ll take many more steps to converge to the limiting distribution (because the “step size” is small).
- The coefficient $\alpha(i, j)$ can be viewed as the “thinning” of the jumping probability from $i$ to $j$. We don’t want $\alpha$ to be too small, otherwise it’ll take very long to converge to the limiting distribution - so, we want to encourage jumps. This is why for every pair of states, at least one of $\alpha(i, j) = 1$ or $\alpha(j, i) = 1$ is true. It is definitely possible to scale the $\alpha$’s even smaller (so the flow between $i$ and $j$ is even lower than $\min(\pi_i Q_{ij}, \pi_j Q_{ji})$) but that means we’re more likely to stay at $i$ (and $j$) instead of jumping (recall that the total flow out of a state, since it is a probability, needs to be $1$ and we just dump the “remaining” outgoing flow to the state itself). So, we want to maximize the $\alpha$’s while ensuring the local balanced equations hold.
- We don’t actually have to generate $P$ beforehand → we can do this “on the fly” by using $Q$ and $\alpha$ directly (we’ll talk more about this soon).
This entire sampling and thinning strategy is called the Hastings-Metropolis Algorithm.
Example
Let’s come back to our example. We want to sample from Poisson($\lambda$). We know that $\pi_i = e^{-\lambda}\frac{\lambda^i}{i!}$ for $i = 0, 1, 2, \dots$
- Define $S = \{0, 1, 2, \dots\}$
- Given $X_n = i$, we sample $T \sim \text{Binomial}(\max(2i, 2), 1/2)$ → notice that this means we can only travel up to state $\max(2i, 2)$ from any current state $i$. Also, we have to use $\max(2i, 2)$ so that zero does not become an absorbing state. Note that this is just one of the ways we can set up $Q$. We have chosen it such that the process can only go to its neighbouring states (e.g. from $i$ we can only go to a state in the interval $[0, \max(2i, 2)]$, where the states are all non-negative integers). Clearly, the sum of all outgoing edges is 1 because the binomial distribution is also a valid probability distribution.
- Hence, calling the states $\{0, 1, \dots, \max(2i, 2)\}$ the “neighbors” of state $i$, the matrix $Q$ follows: $Q_{ij} = \binom{\max(2i, 2)}{j}\left(\frac{1}{2}\right)^{\max(2i, 2)}$ for $0 \le j \le \max(2i, 2)$, and $Q_{ij} = 0$ otherwise.
- Notice that we don’t have to explicitly write out $Q$ (and in this case, $Q$ is infinite so we definitely cannot, but that’s not an issue for the algorithm). We can just define every row of $Q$ to be a distribution that we choose, that defines how we will jump from the current state to every other state. We only need to ensure that the values in the same row sum to 1.
- Define $\alpha(i, j) = \min\left(1, \frac{\pi_j Q_{ji}}{\pi_i Q_{ij}}\right)$. Note that if $Q_{ij} = 0$, then the flow $\pi_i Q_{ij}\,\alpha(i, j) = 0$, so we need to make sure that $\alpha(j, i) = 0$ even though $Q_{ji} > 0$ → if the flow in one direction is zero, we need to force the flow in the other direction to also be zero. Observe that even though we have infinitely many states in this MC, it is not a problem! We only generate the values that we need, based on the current state. See that there would be infinitely many values of $\alpha(i, j)$ too, but we only compute a finite set of them, every time we need it. This is very useful! (And this is only because we set up $Q$ cleverly to only depend on the neighbouring states.)
Then, we simulate the MC for $n$ steps (where $n$ is very large, say 1000), and take $X_n$ as the sample. How exactly do we simulate it? To simulate it, do we need to set up $P$? What if $P$ is infinitely large?
We don’t need to explicitly generate $P$ at all! We can sample from $Q$ only (in fact, we only use one row of $Q$ at any step), then we use rejection sampling to decide whether to jump or not. We compare a $\text{Uniform}(0, 1)$ random variable with $\alpha$.
- We start from any state (we don’t care because we’re only interested in the limiting distribution)
- Repeat 1000 times:
- Given the current state $X_n = i$, generate $T \sim \text{Binomial}(\max(2i, 2), 1/2)$. (Again note that $T$ is not $X_{n+1}$ - we have not made the jump yet)
- Based on the realization $t$ of $T$ (now a fixed constant for this iteration), obtain the value of $\alpha(i, t)$.
- Now, we will accept this jump from $i$ to $t$ with probability $\alpha(i, t)$. So, we generate a uniform random variable $U$ in the range $(0, 1)$, and accept the jump if $U < \alpha(i, t)$. If not, we reject the jump and we stay at $i$ (i.e., $X_{n+1} = i$).
Here’s a rough implementation sketch in Python (the Poisson rate $\lambda$ below is an assumed value, purely for illustration):

import math
import random
from scipy.stats import binom

LAM = 4             # assumed Poisson rate
TOTAL_STEPS = 5000  # large enough to ensure convergence
BURN_IN = 1000      # initial steps to discard

def q(i, j):
    # proposal Q_ij: probability that T = j when T ~ Binomial(max(2i, 2), 1/2)
    return binom.pmf(j, max(2 * i, 2), 0.5)

def kernel(i):
    # unnormalised Poisson pmf lambda^i / i! (the constant e^{-lambda} cancels in the ratio)
    return LAM ** i / math.factorial(i)

process = []  # track the path of the process
x = 1         # initial state

for step in range(TOTAL_STEPS):
    t = int(binom.rvs(max(2 * x, 2), 0.5))  # obtain t from T ~ Binom(max(2x, 2), 1/2)
    alpha = min(1.0, kernel(t) * q(t, x) / (kernel(x) * q(x, t)))  # alpha(x, t)
    u = random.random()                      # u from U ~ uniform(0, 1)
    if u < alpha:
        x = t        # accept the jump, i.e. X_{n+1} = t
    # else: no jump (thinning), X_{n+1} = x
    process.append(x)

# now we have X_n for all n up to TOTAL_STEPS;
# cut off the first BURN_IN steps
process = process[BURN_IN:]
# process now contains a (dependent) sample from the Poisson distribution - done!
We cannot take the early stages of the simulation as part of the sample because they do not follow a Poisson distribution. Recall that only the limiting distribution is Poisson. So, we need to wait until the MC reaches convergence (approximately, at least) and then we can start sampling from the MC based on the values of $X_n$, and these will follow a Poisson distribution. This is why we “cut off” the first portion of the process’ steps - so they don’t contaminate our Poisson sample.
Note that if we want an independent sample (i.e., we want the sampled values to be independent Poisson random variables), we cannot simply run the MC one time for many iterations and then start taking consecutive values from the path. This is because $X_n$ and $X_{n+1}$ are not independent! (since $X_{n+1}$ is generated based on $X_n$) More formally, given $X_n = i$, the conditional distribution of $X_{n+1}$ does not follow $\pi$ - in fact, it follows the $i$th row of $P$ (the thinned proposal).
If we want a truly independent sample, we can run the MC $k$ times (where $k$ is our desired sample size) and then take the final value from each of the runs.
Alternatively, to make it less computationally expensive, we can leave some gaps in between. So, we sample $X_{1000}, X_{1100}, X_{1200}, \dots$ from a single run, and we assume that observations 100 steps apart have become “independent” enough.
If we’re just interested in the mean of the distribution, we don’t need to ensure independence of the RVs in the sample - the sample mean still converges to the true mean whether or not the observations are independent. But if we want to estimate the variance (or quantify the accuracy of our estimate), we need to ensure independence among the observations in our sample.
Another good property of this MCMC sampling is that we don’t have to know the exact values of $\pi_i$ and $\pi_j$ to calculate $\alpha(i, j)$. We only need the ratio of the two!
Recall that in Bayesian analysis, if $p(\theta)$ is the prior distribution, then the posterior distribution is: $p(\theta \mid x) = \dfrac{p(x \mid \theta)\,p(\theta)}{p(x)}$
where $p(x)$ is a constant that only depends on the realization of $x$ (the observed data). In particular, the denominator does NOT depend on $\theta$. So, to sample from $p(\theta \mid x)$, we do not need the denominator. The denominator is just a normalising factor to ensure that the sum of the conditional probabilities (for a given $x$) is $1$. When we’re sampling, we’re not interested in the actual probabilities but their relative proportion (aka ratio).
Similarly, for MCMC, we have: $\alpha(i, j) = \min\left(1, \dfrac{\pi_j Q_{ji}}{\pi_i Q_{ij}}\right)$
To find $\alpha(i, j)$, we only need the ratio $\pi_j / \pi_i$, not the exact value of $\pi_k$ for each $k$.
Say that we have a vector $q = (q_i)_{i \in S}$ on $S$ (here, a vector is just an ordered collection of numbers - or you can think of it as a vector in an $|S|$-dimensional space, and we’re only interested in the direction it points in, not really the exact lengths, since the direction tells us the relative ratios already) and a constant $c$ such that $\pi_i = c\,q_i$ gives our desired distribution that we wish to sample from. Then, $\alpha(i, j) = \min\left(1, \dfrac{q_j Q_{ji}}{q_i Q_{ij}}\right)$.
In our example where $\pi_i = e^{-\lambda}\frac{\lambda^i}{i!}$, we observe that the constant $e^{-\lambda}$ is present in all the $\pi_i$’s. So, we can ignore it (since it cancels out anyway) and work with $q_i = \frac{\lambda^i}{i!}$.
$q$ is called a kernel function, and $c$ is a normalising constant. The purpose of the kernel function is to tell us the relative ratios of the probabilities of the values of the distribution, and the normalizing constant ensures that the sum of all the values is $1$ (making it a valid probability distribution).
Obviously, $c = \left(\sum_{i \in S} q_i\right)^{-1}$ because we want $\sum_i \pi_i = 1$.
In summary, to sample using MCMC, we do not need $\pi$ itself, we only need the kernel $q$.
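For instance, with the Poisson kernel $q_i = \frac{\lambda^i}{i!}$ from above, the acceptance probability only ever needs the ratio
$$\frac{q_j}{q_i} = \frac{\lambda^j / j!}{\lambda^i / i!} = \lambda^{\,j - i}\,\frac{i!}{j!}, \qquad \alpha(i, j) = \min\left(1,\; \lambda^{\,j - i}\,\frac{i!}{j!}\cdot\frac{Q_{ji}}{Q_{ij}}\right),$$
so the normalising constant $e^{-\lambda}$ never has to be computed.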
Summary of Hastings-Metropolis Algorithm
Goal: sample a random variable $X$ taking values in $S$, where $P(X = i) = \pi_i = c\,q_i$ and only the kernel $q$ is known (not the constant $c$).
- Set up $Q$ on $S$ where $Q$ is irreducible (and every row of $Q$ should be easy to sample from - we should not have to compute every row of $Q$ manually beforehand - so we should use common distributions as the rows of $Q$).
- Define $\alpha(i, j) = \min\left(1, \dfrac{q_j Q_{ji}}{q_i Q_{ij}}\right)$
- Set up $P$ such that $P_{ij} = Q_{ij}\,\alpha(i, j)$ for $i \ne j$ and $P_{ii} = 1 - \sum_{j \ne i} P_{ij}$
- Since $S$ can be large (even infinite), we do this “on the fly” → sample from $Q$ directly and then use rejection sampling to decide whether to jump or not (by comparing a $\text{Uniform}(0, 1)$ variable against $\alpha$). The two formulations are identical in their probabilistic structure.
- Simulate this MC for a long time, until convergence
- Truncate some beginning steps to make sure we only sample from the limiting distribution
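Putting the summary together, here’s a minimal generic sketch in Python (the helper names `kernel`, `propose`, and `q` are placeholders assumed for this sketch, not notation from the notes):

```python
import random

def metropolis_hastings(kernel, propose, q, x0, n_steps, burn_in):
    # kernel(i) : unnormalised target pi_i (up to a constant)
    # propose(i): draws a candidate state from the i-th row of Q
    # q(i, j)   : proposal probability Q_ij
    x = x0
    path = []
    for _ in range(n_steps):
        t = propose(x)
        # acceptance probability from the local balanced equations
        alpha = min(1.0, kernel(t) * q(t, x) / (kernel(x) * q(x, t)))
        if random.random() < alpha:
            x = t              # accept the jump
        path.append(x)         # on rejection we stay at x (thinning)
    return path[burn_in:]
```

The Poisson example earlier corresponds to `kernel(i)` returning $\lambda^i/i!$, `propose(i)` drawing from Binomial($\max(2i, 2)$, 1/2), and `q(i, j)` being the corresponding binomial pmf.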
Additional Remarks
For the Hastings-Metropolis algorithm, there are many choices of $Q$. In general, we want to choose $Q$ so that it is easy to generate from (using built-in functions), and the resulting jump probabilities $Q_{ij}\,\alpha(i, j)$ are large so that the convergence is faster.
When the $q_i$’s are very small, it might be difficult to calculate the ratio $\frac{q_j Q_{ji}}{q_i Q_{ij}}$ directly. Then, we can work with the logarithms first and then exponentiate to turn the result back into the fraction.
If $|S|$ is small, we can precompute $P$ itself, instead of relying on $Q$ and performing rejection sampling in every iteration. This way, we only compute $\alpha$ once for each entry - for a total of $|S|^2$ computations, which can be smaller than the total number of iterations we run the MC for.
The Hastings-Metropolis algorithm is not the only way to set up $P$; we can use other algorithms too.
MCMC in Bayesian Statistics
Suppose we have many bulbs from an unknown factory (all bulbs come from the same factory). There are 3 possible factories. Factories A, B, C produce bulbs with lifetime distributions $f_A$, $f_B$, $f_C$ respectively (all known).
Originally, we assume that each factory is equally likely (a uniform prior).
We examine 10 bulbs, and find their lifetimes to be $x_1, x_2, \dots, x_{10}$.
With the observed data, can we improve our guess about which factory produces the bulbs?
We can use Bayes’ theorem to write down the probability of each factory as: $P(F \mid x_1, \dots, x_{10}) = \dfrac{P(F)\prod_{k=1}^{10} f_F(x_k)}{\sum_{G \in \{A, B, C\}} P(G)\prod_{k=1}^{10} f_G(x_k)}$ for $F \in \{A, B, C\}$.
This gives us the posterior distribution (and is a more “accurate” guess since it incorporates more information about the data).
Right now, calculating the denominator is not hard because we only have 3 factories. But what if we had infinitely many factories?
Actually, we don’t even need to calculate it exactly. We only need an approximation of the probability. If we can generate enough samples, then the empirical distribution will give us that approximation.
So, we use MCMC to sample the distribution.
Here:
- State Space = $\{A, B, C\}$
- Set every row of $Q$ to be the uniform distribution on $\{A, B, C\}$. This is how we will sample the proposal $T$.
- Calculate the acceptance probability to be: $\alpha(i, j) = \min\left(1, \dfrac{\pi_j Q_{ji}}{\pi_i Q_{ij}}\right) = \min\left(1, \dfrac{\pi_j}{\pi_i}\right)$ since every $Q_{ij} = \frac{1}{3}$. Observe that we only need the ratio $\frac{\pi_j}{\pi_i}$. So, if $\pi_F \propto P(F)\prod_{k=1}^{10} f_F(x_k)$, then: $\alpha(i, j) = \min\left(1, \dfrac{P(j)\prod_k f_j(x_k)}{P(i)\prod_k f_i(x_k)}\right)$. Moreover, we can calculate this on the fly → as and when needed, instead of calculating it for every pair of factories in advance. Observe that we no longer need the normalizing denominator.
- Then, we simulate the MC for a large enough number of iterations, and obtain an empirical sample → since the limiting distribution is the proportion of time the process stays at a given state, in this example, it is equivalent to the probability of each factory given the data.
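A runnable sketch of this factory example, assuming (purely for illustration) exponential lifetime distributions with the means below and made-up observed lifetimes - none of these numbers come from the notes:

```python
import math
import random

MEANS = {"A": 1.0, "B": 2.0, "C": 3.0}    # assumed mean lifetimes for f_A, f_B, f_C
DATA = [1.2, 0.7, 2.5, 1.9, 0.4, 3.1, 2.2, 0.9, 1.5, 2.8]  # made-up lifetimes of 10 bulbs
FACTORIES = list(MEANS)

def log_kernel(f):
    # log of prior * likelihood; the uniform prior 1/3 cancels in the ratio anyway
    mu = MEANS[f]
    return sum(-math.log(mu) - x / mu for x in DATA)

x = "A"                                    # initial state
counts = {f: 0 for f in FACTORIES}
for step in range(50_000):
    t = random.choice(FACTORIES)           # row of Q: uniform over {A, B, C}
    alpha = min(1.0, math.exp(log_kernel(t) - log_kernel(x)))
    if random.random() < alpha:
        x = t
    if step >= 5_000:                      # discard the burn-in period
        counts[x] += 1

total = sum(counts.values())
print({f: counts[f] / total for f in FACTORIES})   # empirical posterior over the factories
```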
Example: Complicated Distribution
Consider a sampling question. We have a vector $(a_1, a_2, \dots, a_n)$. We want to sample a permutation $(x_1, \dots, x_n)$ of it, uniformly among the permutations that satisfy:
$\sum_{j=1}^{n} j\,x_j > c$ for a given constant $c$.
For this distribution, we want to know $E\left[\sum_{j=1}^{n} j\,X_j\right]$.
Approach 1: if $n$ is small
- Generate all $n!$ permutations (infeasible if $n$ is large, which is why we will use MCMC)
- Filter out the permutations that do not satisfy the condition.
- For each permutation that satisfies the condition, compute the term $\sum_j j\,x_j$.
- Find the average of this term over all these permutations to give you $E\left[\sum_j j\,X_j\right]$.
Approach 2: if $n$ is large
We know how to draw uniformly random permutations. We can keep only the draws that satisfy the condition. However, if $c$ is large, we may spend a long time to find an accepted sample. So, plain rejection sampling is also not feasible here 😔
We want to use a MC to improve the efficiency.
Here’s the problem (stated formally):
- The outcome space $B$ is the set of all possible permutations that satisfy $\sum_j j\,x_j > c$
- Let $N = |B|$ → clearly, $N$ can be very large and is unknown to us beforehand.
- Probability mass function: $\pi_x = \frac{1}{N}$ for every $x \in B$, where $x$ is just a permutation (i.e., the distribution is uniform over $B$).
- Observe that here, the constant function $1$ is the kernel and $\frac{1}{N}$ is the normalizing constant.
- We don’t know $N$ (it would take going through all $n!$ permutations to find it) → but we actually don’t need to know $N$ since we’re only interested in the ratio $\pi_y / \pi_x$ (this is the benefit of MCMC!)
- Goal: to find $E\left[\sum_j j\,X_j\right]$ where $X$ is uniform over $B$
We can set up MCMC by the Hastings-Metropolis algorithm:
- State space $S = B$, the set of all possible permutations such that $\sum_j j\,x_j > c$
- The time index is $\{0, 1, 2, \dots\}$
- The initial state is the permutation with the entries arranged in ascending order, and so, it must be in $B$ (this is clearly the maximizer of $\sum_j j\,x_j$ by the rearrangement inequality, and so, if this doesn’t satisfy the constraint, no other permutation can)
- Define $Q$:
- (Naive) Approach 1: What if we just set every row of $Q$ to be the uniform distribution over all permutations? We may have to reject many samples before we find one that belongs to $B$ (i.e., satisfies the constraint).
- (Better) Approach 2: Consider the neighbors of a permutation $x$ to be the permutations obtained from $x$ by an exchange of any two of its positions (keeping only those that stay in $B$). Clearly, with this definition of $Q$, the chain is irreducible (we can reach every other permutation through a series of exchanges) and aperiodic (we can return to the original state in 2 steps, and the chain can also stay at the same state for 1 step whenever a proposed swap is rejected → the period must be 1) with finite states ($n!$ is finite, so any number less than it is also finite). With this definition of $Q$, it is easier to calculate the one-step transition probability, and the probability of staying in $B$ is larger than with the uniform distribution (since we’re not completely randomizing the permutation).
- Define $\alpha$:
- Let $N(x)$ be the number of (valid) neighbors of state $x$
- $\alpha(x, y) = \min\left(1, \dfrac{\pi_y Q_{yx}}{\pi_x Q_{xy}}\right) = \min\left(1, \dfrac{(1/N)(1/N(y))}{(1/N)(1/N(x))}\right) = \min\left(1, \dfrac{N(x)}{N(y)}\right)$ (since both $x$ and $y$ are valid permutations in $B$, the indicator terms are $1$; also, $Q_{xy} = \frac{1}{N(x)}$ and $Q_{yx} = \frac{1}{N(y)}$)
Then, we can execute the following (pseudo-)code:
Set up the number of steps to run, say $m$. Let $\text{values}$ be a zero-vector of length $m$. This will store the path of $\sum_j j\,X_j$ throughout our MCMC sampling process.
Initialize $x$ to be the ascending permutation (the initial state) and set the step counter $k = 1$.
While $k \le m$:
- Calculate $N(x)$, the number of (valid!) neighbors of $x$. (this takes $O(n^2)$ time because we can swap any pair of positions, and for each pair we can calculate the new sum smartly in $O(1)$ (just subtract the original terms, and add the new ones) instead of recomputing the entire sum from scratch)
- Randomly select 2 (valid!) positions of $x$, i.e., a pair whose exchange keeps us in $B$. Interchange them to get $y$.
- Calculate $N(y)$, the number of neighbors of $y$.
- Generate $U \sim \text{Uniform}(0, 1)$ and obtain a realization $u$ of $U$.
- If $u < \min\left(1, \frac{N(x)}{N(y)}\right)$, set $x = y$
- Set $\text{values}[k] = \sum_j j\,x_j$
- Increment $k$ by $1$
Truncate the first (say) $b$ samples of $\text{values}$ (since the distribution at the initial steps is not yet the limiting one, including them would bias the estimate). Then, our estimator of $E\left[\sum_j j\,X_j\right]$ is: $\hat{\mu} = \frac{1}{m - b}\sum_{k = b + 1}^{m} \text{values}[k]$.
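Here’s a runnable sketch of this procedure (the instance size $n$, the threshold $c$, and the step counts below are assumptions chosen for illustration; the vector being permuted is taken to be $(1, \dots, n)$):

```python
import random

n, c = 20, 2500                  # assumed instance: permute (1, ..., n) with sum_j j*x_j > c
m, burn_in = 20_000, 2_000

def weighted_sum(x):
    # sum_j j * x_j with positions j = 1, ..., n
    return sum((j + 1) * v for j, v in enumerate(x))

def valid_neighbors(x, s):
    # all position pairs whose swap keeps the weighted sum above c; the new sum
    # is obtained with an O(1) delta per pair instead of being recomputed
    return [(i, j)
            for i in range(n) for j in range(i + 1, n)
            if s + (x[j] - x[i]) * (i - j) > c]

x = list(range(1, n + 1))        # ascending order maximises the weighted sum
s = weighted_sum(x)
assert s > c, "even the maximiser violates the constraint, so B is empty"

values = []
for _ in range(m):
    nbrs_x = valid_neighbors(x, s)
    i, j = random.choice(nbrs_x)                 # propose a uniformly chosen valid swap
    y = x.copy()
    y[i], y[j] = y[j], y[i]
    s_y = s + (x[j] - x[i]) * (i - j)
    alpha = min(1.0, len(nbrs_x) / len(valid_neighbors(y, s_y)))
    if random.random() < alpha:                  # accept with probability min(1, N(x)/N(y))
        x, s = y, s_y
    values.append(s)

estimate = sum(values[burn_in:]) / (m - burn_in)   # estimator of E[sum_j j * X_j]
print(estimate)
```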
MCMC on Continuous Distribution
So far, we’ve discussed MCMC sampling for discrete random variables. But what if we want to sample from a continuous distribution? Specifically, we are considering discrete-time, continuous-state stochastic processes.
Can we still use MCMC sampling?
It turns out yes! Such processes with the Markovian property are also Markov chains. Hence, the limiting distribution, global balanced equations, and local balanced equations still work (with sums replaced by integrals, and probabilities by densities).
Hence, we can generalize MCMC sampling to continuous distributions as well.
For such a MC
- The time index does not change
- The state space is a continuous set
- The Markovian property can be represented by a conditional density (not probability, since the probability of any particular state is zero): the density of $X_{n+1}$ given the entire past depends only on $X_n$. So, the one-step transition “probability” is the conditional density $p(y \mid x)$. Here $x$ and $y$ are taken from a continuous set.
- Recall that for discrete MCs, we have $P_{ij} = P(X_{n+1} = j \mid X_n = i)$. Analogously, for the continuous case, we have: $p(y \mid x) = f_{X_{n+1} \mid X_n}(y \mid x)$.
- If $X_n$ has density $g$, we can find the density of the next state being $y$, for all $y \in S$, to be $\int_S g(x)\,p(y \mid x)\,dx$ (very similar to the discrete case, with the sum replaced by an integral)
- Let $\pi$ be the limiting distribution (density) of such a MC (if it exists). We want the limiting distribution to be stationary - it should not change after a transition. Then, for every state $y$: $\pi(y) = \int_S \pi(x)\,p(y \mid x)\,dx$ (given the distribution currently follows $\pi$, we want it to be stationary, i.e., even the distribution in the next step should be $\pi$). This is the continuous counterpart to our (familiar) $\pi = \pi P$ equation.
- Local Balanced Equations: $\pi(x)\,p(y \mid x) = \pi(y)\,p(x \mid y)$ for all $x, y$, where $\pi$ is the limiting distribution and $p(\cdot \mid \cdot)$ is the conditional (transition) density. This is the continuous counterpart of $\pi_i P_{ij} = \pi_j P_{ji}$. So, the $p$ that we generate is also a continuous bivariate function which gives us the transition densities. Note that now, we cannot think of $p$ as a matrix - we would have an uncountably infinite number of rows and columns! 😲 What would the $x$th row, $y$th cell even mean?? That’s why matrices are for discrete (aka countable) stuff only.
- Using the local balanced equations, we can define $\alpha(x, y)$ such that $\pi(x)\,q(y \mid x)\,\alpha(x, y) = \pi(y)\,q(x \mid y)\,\alpha(y, x)$, where $q(\cdot \mid x)$ is a proposal density that we choose (the analogue of $Q$). So, we can set $\alpha(x, y) = \min\left(1, \dfrac{\pi(y)\,q(x \mid y)}{\pi(x)\,q(y \mid x)}\right)$.
Then, we can set up the Hastings-Metropolis Algorithm analogously.
Note the notation used below: the (desired) limiting distribution is now written as a density $f$ with kernel $k$ and normalising constant $c$, and $q(\cdot \mid x)$ is the proposal density, playing the role of $Q$ from the discrete case.
Suppose we wanted to sample $X \sim f$ where $f(x) = c\,k(x)$ (i.e., $c$ is the normalising constant and $k$ is the kernel function). Then, we can:
- Set up the number of steps $m$. Set $\text{values}$ to be a vector of length $m$ that will record the path.
- Set up the one-step proposal densities $q(\cdot \mid x)$ for the transitions.
- Initialize $n = 1$ and $X_0$ (any starting point)
- While $n \le m$:
- Generate $T \sim q(\cdot \mid X_{n-1})$
- Generate $U \sim \text{Uniform}(0, 1)$
- If $U < \alpha(X_{n-1}, T) = \min\left(1, \dfrac{k(T)\,q(X_{n-1} \mid T)}{k(X_{n-1})\,q(T \mid X_{n-1})}\right)$, let $X_n = T$. Else, $X_n = X_{n-1}$
- Set $n = n + 1$
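Finally, a minimal continuous sketch, assuming a made-up kernel $k(x) = e^{-x^4}$ and a Normal random-walk proposal (both are illustrative choices, not from the notes):

```python
import math
import random

def k(x):
    # unnormalised target kernel (assumed for illustration)
    return math.exp(-x ** 4)

SIGMA = 1.0
def q_pdf(y, x):
    # proposal density q(y | x): Normal(x, SIGMA^2) random walk
    return math.exp(-((y - x) ** 2) / (2 * SIGMA ** 2)) / math.sqrt(2 * math.pi * SIGMA ** 2)

def q_sample(x):
    return random.gauss(x, SIGMA)

m, burn_in = 50_000, 5_000
x = 0.0
path = []
for n in range(m):
    t = q_sample(x)                                   # propose T ~ q(. | x)
    alpha = min(1.0, k(t) * q_pdf(x, t) / (k(x) * q_pdf(t, x)))
    if random.random() < alpha:                       # accept with probability alpha
        x = t
    path.append(x)

sample = path[burn_in:]   # approximately follows the density proportional to k
```

Since this Normal random walk is symmetric, the $q$ ratio in $\alpha$ is actually 1 and cancels; it is kept explicit here only to mirror the general formula.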