# What is MCMC and when should I use it?

MCMC is simply an algorithm for sampling from a distribution.

It is only one of many algorithms for doing so. The term stands for "Markov Chain Monte Carlo", because it is a "Monte Carlo" (i.e. random) method that uses "Markov chains" (which we will discuss later). MCMC is just one type of Monte Carlo method, although many other commonly used methods can be regarded as simple special cases of MCMC.

# Why should I sample from the distribution?

Taking samples from the distribution is the easiest way to solve some problems.

Perhaps the most common use of MCMC is to draw samples from the posterior probability distribution of a model in Bayesian inference. With these samples, you can then ask questions such as "What are the mean and credible interval of a parameter?".

If these samples are independent samples from the distribution, the estimated mean will converge on the true mean.

Suppose our target distribution is a normal distribution with mean m and standard deviation s.

As an example, consider estimating the mean of this normal distribution from samples (here, I will use the parameters corresponding to the standard normal distribution).
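The code in this section refers to m and s without showing where they are defined. A minimal setup, assuming the standard normal parameters mentioned above (mean 0, standard deviation 1), would be:

```r
# Assumed parameter values: the standard normal distribution
m <- 0  # mean
s <- 1  # standard deviation
```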

We can easily sample from this distribution using the rnorm function:

```r
set.seed(1)  # fixed seed so that the run is reproducible
samples <- rnorm(10000, m, s)
```

The mean of the samples is very close to the true mean (zero):

```r
mean(samples)

## [1] -0.006537
```

In fact, in this case the expected variance of the estimate from $n$ samples is $1/n$, so we expect most estimates to lie within $\pm 2/\sqrt{n} = 0.02$ of the true mean for 10,000 samples.

```r
# Repeat the estimate many times and summarise the spread of the means
summary(replicate(1000, mean(rnorm(10000, m, s))))

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## -0.03250 -0.00580  0.00046  0.00042  0.00673  0.03550
```
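As a rough numerical check of the $1/n$ claim, the standard deviation of the repeated means should be close to $s/\sqrt{n} = 0.01$. This is only a sketch under the same assumptions as above (m = 0, s = 1); the names n, n.rep and means are introduced here for illustration.

```r
set.seed(1)
m <- 0
s <- 1
n <- 10000      # samples per estimate
n.rep <- 1000   # number of repeated estimates (assumed)

means <- replicate(n.rep, mean(rnorm(n, m, s)))

sd(means)       # empirical standard error of the mean
s / sqrt(n)     # theoretical standard error

# Fraction of estimates within 2 / sqrt(n) of the true mean
mean(abs(means - m) < 2 * s / sqrt(n))
```

If the claim holds, the first two numbers should be close and the last should be roughly 95%.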

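This function computes the cumulative mean (that is, for element k, the sum of elements 1 to k divided by k).

```r
cummean <- function(x)
    cumsum(x) / seq_along(x)

# Running estimate of the mean, with the true mean as a red reference line
plot(cummean(samples), type="l", xlab="Sample", ylab="Cumulative mean",
     panel.first=abline(h=0, col="red"), las=1)
```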

The same plot can be drawn with the x-axis on a log scale and with another 30 random runs overlaid.
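The plotting code for that figure is not shown here. A possible sketch, assuming the same m, s and cummean as above (repeated so that the block runs on its own) and semi-transparent random colours for the extra runs:

```r
set.seed(1)
m <- 0
s <- 1
cummean <- function(x) cumsum(x) / seq_along(x)

samples <- rnorm(10000, m, s)

# First run, with the sample index on a log scale
plot(cummean(samples), type="l", log="x", xlab="Sample",
     ylab="Cumulative mean", las=1)
abline(h=m, col="red")

# Overlay 30 further independent runs
for (i in seq_len(30))
    lines(cummean(rnorm(10000, m, s)),
          col=rgb(runif(1), runif(1), runif(1), 0.5))
```

The log scale makes the early, noisy part of each run visible; all runs should funnel towards the true mean as the number of samples grows.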

You can also estimate quantiles from your series of sampled points.

This is the analytically computed point below which 2.5% of the probability density lies:

```r
p <- 0.025
a.true <- qnorm(p, m, s)
a.true

## [1] -1.96
```

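In this case we can also estimate it by direct numerical integration:

```r
f <- function(x)
    dnorm(x, m, s)
# Cumulative probability below a
g <- function(a)
    integrate(f, -Inf, a)$value
# Find the point a at which the cumulative probability equals p
a.int <- uniroot(function(x) g(x) - p, c(-10, 0))$root
a.int

## [1] -1.96
```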

And we can estimate the same point by Monte Carlo integration:

```r
a.mc <- unname(quantile(samples, p))
a.mc

## [1] -2.023

a.true - a.mc

## [1] 0.06329
```

The estimate is slightly off, but it converges in the limit as the sample size tends to infinity. In addition, it is possible to make statements about the nature of the error; if we repeat the sampling process 100 times, we get a series of estimates whose errors are of similar magnitude to those around the mean:

```r
a.mc <- replicate(100, quantile(rnorm(10000, m, s), p))
summary(a.true - a.mc)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
## -0.05840 -0.01640 -0.00572 -0.00024  0.01400  0.07880
```

This kind of thing is really common. In most Bayesian inference, the posterior distribution is a function of some (possibly large) vector of parameters, and you want to make inferences about a subset of these parameters.

In a hierarchical model, you may have a large number of random effect terms being fitted, but you mostly want to make inferences about a single parameter.

In a Bayesian framework, you would compute the marginal distribution of the parameter you are interested in, integrating over all the other parameters (this is what we were doing above).

# Why does "traditional statistics" not use Monte Carlo methods?

For many problems in traditionally taught statistics, rather than sampling from a distribution you maximize or minimize a function. So we take some function that describes the likelihood and maximize it (maximum likelihood inference), or some function that computes the sum of squares and minimize it.

However, Monte Carlo methods play the same role in Bayesian statistics that optimization routines play in frequentist statistics: they are simply the algorithm used to carry out the inference. So, once you basically know what MCMC is doing, you can treat it as a black box, the way most people treat their optimization routines.

# Markov Chain Monte Carlo

Suppose we want to sample from some target distribution, but we cannot draw independent samples from it as we did before. There is a solution for doing this using Markov Chain Monte Carlo (MCMC). First, we have to define some things so that the next sentence makes sense: what we are going to do is construct a Markov chain that has the target distribution as its stationary distribution.

# Definitions

Suppose we have a three-state Markov process. Let P be the transition probability matrix of the chain:

```r
P <- rbind(c(.5,  .25, .25),
           c(.2,  .1,  .7),
           c(.25, .25, .5))
P

##      [,1] [,2] [,3]
## [1,] 0.50 0.25 0.25
## [2,] 0.20 0.10 0.70
## [3,] 0.25 0.25 0.50

rowSums(P)

## [1] 1 1 1
```

P[i,j] gives the probability of moving from state i to state j.

Please note that unlike rows, columns do not necessarily sum to 1:

```r
colSums(P)

## [1] 0.95 0.60 1.45
```

This function takes a state vector x (where x[i] is the probability of being in state i) and iterates it by multiplying by the transition matrix P, advancing the system n steps.

```r
iterate.P <- function(x, P, n) {
    res <- matrix(NA, n + 1, length(x))
    res[1, ] <- x                     # row 1 holds the starting state
    for (i in seq_len(n))
        res[i + 1, ] <- x <- x %*% P  # advance one step and store it
    res
}
```

Start with the system in state 1 (so x is the vector [1,0,0], meaning a 100% probability of being in state 1 and no chance of being in any other state).
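The code for this first starting state is not shown here. A minimal sketch, assuming n = 10 steps (a value chosen for illustration; it matches the x position at which points are drawn in a later plot) and repeating P and iterate.P so that the block runs on its own:

```r
P <- rbind(c(.5,  .25, .25),
           c(.2,  .1,  .7),
           c(.25, .25, .5))

iterate.P <- function(x, P, n) {
    res <- matrix(NA, n + 1, length(x))
    res[1, ] <- x
    for (i in seq_len(n))
        res[i + 1, ] <- x <- x %*% P
    res
}

n <- 10                             # assumed number of steps
y1 <- iterate.P(c(1, 0, 0), P, n)   # start in state 1 with certainty
y1[1:3, ]                           # state probabilities at steps 0, 1 and 2
```

Each row of y1 is a probability vector, so every row sums to 1, and the second row is simply the first row of P.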

Similarly, for the other two possible starting states:

```r
y2 <- iterate.P(c(0, 1, 0), P, n)
y3 <- iterate.P(c(0, 0, 1), P, n)
```

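This shows the convergence on the stationary distribution.

```r
# One line per state; line type distinguishes the three starting states
matplot(0:n, y1, type="l", lty=1, xlab="Step", ylab="y", las=1)
matlines(0:n, y2, lty=2)
matlines(0:n, y3, lty=3)
```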

We can use R's eigen function to extract the leading eigenvector of the system (the t() here transposes the matrix so that we get the left eigenvector).

```r
v <- eigen(t(P))$vectors[, 1]
v <- v / sum(v)  # normalize the eigenvector so that it sums to 1
```
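As a quick check, a stationary distribution should be left unchanged by one step of the chain, so multiplying v by P should reproduce v. A sketch, with P repeated so that the block is self-contained; Re() is used defensively here in case eigen returns a complex type, and the name e is introduced for illustration:

```r
P <- rbind(c(.5,  .25, .25),
           c(.2,  .1,  .7),
           c(.25, .25, .5))

e <- eigen(t(P))
v <- Re(e$vectors[, 1])   # eigenvector belonging to the leading eigenvalue
v <- v / sum(v)           # normalize so that the entries sum to 1

e$values[1]               # leading eigenvalue; should be 1 up to rounding
v
drop(v %*% P) - v         # should be zero up to rounding
```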

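Then add points to the previous figure to show how close we are to convergence:

```r
matplot(0:n, y1, type="l", lty=1, xlab="Step", ylab="y", las=1)
matlines(0:n, y2, lty=2)
matlines(0:n, y3, lty=3)
# Stationary probabilities, drawn at the final step
points(rep(10, 3), v, col=1:3)
```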

The procedure above iterated the overall probabilities of the different states, not the actual transitions through the system. So, let's iterate the system itself rather than the probability vector.

```r
run <- function(i, P, n) {
    res <- integer(n)
    for (t in seq_len(n))
        # Draw the next state using row i of P as the transition probabilities
        res[[t]] <- i <- sample(nrow(P), 1, prob=P[i, ])
    res
}
```

Here is the chain run for 100 steps:

```r
samples <- run(1, P, 100)
plot(samples, type="s", xlab="Step", ylab="State", las=1)
```

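Rather than the state itself, we can plot the fraction of time spent in each state as it changes over time:

```r
plot(cummean(samples == 1), type="l", ylim=c(0, 1),
     xlab="Step", ylab="y", las=1)
lines(cummean(samples == 2), col=2)
lines(cummean(samples == 3), col=3)
```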

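Run it for a little longer (5,000 steps):

```r
n <- 5000
set.seed(1)
samples <- run(1, P, n)
plot(cummean(samples == 1), type="l", ylim=c(0, 1),
     xlab="Step", ylab="y", las=1)
lines(cummean(samples == 2), col=2)
lines(cummean(samples == 3), col=3)
# Dashed lines mark the stationary probabilities
abline(h=v, lty=2, col=1:3)
```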

So the key point here is that Markov chains have some nice properties. A chain like this one has a stationary distribution, and if we run it for long enough we can simply look at where the chain spends its time to get a reasonable estimate of that stationary distribution.
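To make that concrete, the fraction of steps spent in each state can be compared directly with the stationary distribution. A sketch, repeating P and run so that the block is self-contained; the name freq is introduced here for illustration:

```r
set.seed(1)

P <- rbind(c(.5,  .25, .25),
           c(.2,  .1,  .7),
           c(.25, .25, .5))

run <- function(i, P, n) {
    res <- integer(n)
    for (t in seq_len(n))
        res[[t]] <- i <- sample(nrow(P), 1, prob=P[i, ])
    res
}

# Stationary distribution from the leading left eigenvector
v <- Re(eigen(t(P))$vectors[, 1])
v <- v / sum(v)

# Fraction of 5000 simulated steps spent in each state
samples <- run(1, P, 5000)
freq <- tabulate(samples, nbins=nrow(P)) / length(samples)

rbind(empirical=freq, stationary=v)
```

The two rows should agree to within sampling noise, and the agreement should improve as the chain is run for longer.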

# Metropolis algorithm

This is the simplest MCMC algorithm.

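# MCMC sampling in 1d (single parameter problem)

Here is a target distribution to sample from: the weighted sum of two normal distributions. This distribution is quite simple to sample from directly, but here we will draw samples from it with MCMC.

Here are the parameters and the definition of the target density.

```r
p <- 0.4          # weight of the first component
mu <- c(-1, 2)    # component means
sd <- c(.5, 2)    # component standard deviations
f <- function(x)
    p     * dnorm(x, mu[1], sd[1]) +
    (1-p) * dnorm(x, mu[2], sd[2])
```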

The probability density can then be plotted.
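The plotting call itself is not shown here. One way to draw the density, assuming the parameter values defined above (repeated so that the block runs on its own) and a plotting range of -4 to 8 chosen for illustration:

```r
p <- 0.4
mu <- c(-1, 2)
sd <- c(.5, 2)
f <- function(x)
    p * dnorm(x, mu[1], sd[1]) + (1 - p) * dnorm(x, mu[2], sd[2])

# Draw the target density over an assumed range
curve(f(x), from=-4, to=8, n=301, col="red", las=1,
      xlab="x", ylab="Probability density")

# Sanity check: a weighted sum of normal densities integrates to 1
integrate(f, -Inf, Inf)$value
```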

Let's define a very simple proposal algorithm that samples from a normal distribution centered on the current point with a standard deviation of 4.
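The proposal function and the accept/reject step that uses it are not shown here. The following is a minimal sketch assuming the standard Metropolis rule for a symmetric proposal: accept a proposed point xp with probability min(1, f(xp)/f(x)), otherwise stay at x. The names q and step are chosen to match the calls made by the run function below; the bodies are assumptions, and f repeats the target density so that the block runs on its own.

```r
# Target density: weighted sum of two normals (repeated from above)
p <- 0.4
mu <- c(-1, 2)
sd <- c(.5, 2)
f <- function(x)
    p * dnorm(x, mu[1], sd[1]) + (1 - p) * dnorm(x, mu[2], sd[2])

# Proposal: normal centered on the current point, standard deviation 4
q <- function(x) rnorm(1, x, 4)

# One Metropolis step starting from x (assumed form)
step <- function(x, f, q) {
    xp <- q(x)                      # propose a new point
    alpha <- min(1, f(xp) / f(x))   # acceptance probability
    if (runif(1) < alpha)           # accept with probability alpha
        x <- xp
    x                               # the new point, or the old one if rejected
}

set.seed(1)
step(0, f, q)
```

Because the normal proposal is symmetric, the proposal densities cancel and only the ratio of target densities appears in the acceptance probability. The target also only needs to be known up to a constant, since any normalizing constant cancels in that ratio.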

This function just takes care of running the MCMC for a number of steps. Starting from point x, it returns a matrix with nsteps rows and as many columns as x has elements. If run on a scalar x, it returns a vector.

```r
run <- function(x, f, q, nsteps) {
    res <- matrix(NA, nsteps, length(x))
    for (i in seq_len(nsteps))
        res[i, ] <- x <- step(x, f, q)  # take one step and record it
    drop(res)                           # a one-column matrix becomes a vector
}
```

Here are the first 1000 steps of the Markov chain, with the target density on the right:

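```r
# First 1000 steps, starting from x = -10
res <- run(-10, f, q, 1000)

# Trace on the left, target density on the right, sharing the y-axis
layout(matrix(c(1, 2), 1, 2), widths=c(4, 1))
par(mar=c(4.1, .5, .5, .5), oma=c(0, 4.1, 0, 0))
plot(res, type="s", xpd=NA, ylab="Parameter", xlab="Sample", las=1)
usr <- par("usr")
xx <- seq(usr[3], usr[4], length=301)
plot(f(xx), xx, type="l", yaxs="i", axes=FALSE, xlab="")
```

```r
hist(res, 50, freq=FALSE, main="", ylim=c(0, .4), las=1,
     xlab="x", ylab="Probability density")
```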

Run longer, and the results start to look better:

```r
res.long <- run(-10, f, q, 50000)
hist(res.long, 100, freq=FALSE, main="", ylim=c(0, .4), las=1,
     xlab="x", ylab="Probability density")
```

Now run with different proposal mechanisms: one with a very large standard deviation (33) and the other with a very small standard deviation (0.3).

```r
res.fast <- run(-10, f, function(x) rnorm(1, x, 33), 1000)
res.slow <- run(-10, f, function(x) rnorm(1, x, .3), 1000)
```

Note the different ways in which the three traces move around.

In contrast to the original trace, the red trace (large proposal steps) proposes mostly poor locations and rejects most of them, so it tends to stay put for long stretches.

The blue trace proposes small moves that tend to be accepted, but it moves as a random walk for most of the trajectory. It takes hundreds of iterations even to reach the bulk of the probability density.
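One way to quantify these differences is the acceptance rate, the fraction of transitions on which the chain actually moves. The sketch below re-creates three chains under assumed definitions of f, step and run (a standard Metropolis rule, which may differ in detail from the code used to draw the traces); accept.rate, widths and rates are hypothetical names introduced here.

```r
set.seed(1)

# Assumed target density: weighted sum of two normals
p <- 0.4
mu <- c(-1, 2)
sd <- c(.5, 2)
f <- function(x)
    p * dnorm(x, mu[1], sd[1]) + (1 - p) * dnorm(x, mu[2], sd[2])

# Assumed Metropolis step and driver
step <- function(x, f, q) {
    xp <- q(x)
    if (runif(1) < min(1, f(xp) / f(x)))
        x <- xp
    x
}
run <- function(x, f, q, nsteps) {
    res <- matrix(NA, nsteps, length(x))
    for (i in seq_len(nsteps))
        res[i, ] <- x <- step(x, f, q)
    drop(res)
}

# Fraction of recorded transitions where the chain moved
accept.rate <- function(x) mean(diff(x) != 0)

# Proposal standard deviations for the three chains
widths <- c(small=0.3, intermediate=4, large=33)
rates <- sapply(widths, function(w)
    accept.rate(run(-10, f, function(x) rnorm(1, x, w), 1000)))
rates
```

Small steps should be accepted most often and large steps least often. A high acceptance rate is not a goal in itself, though: the small-step chain accepts nearly everything yet explores the distribution slowly.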

You can see the effect of the different proposal steps in the autocorrelation among successive samples: these plots show how the autocorrelation coefficient decays with increasing lag, with the blue lines indicating statistical independence.

```r
par(mfrow=c(1, 3))
acf(res.slow, las=1, main="Small steps")
acf(res, las=1, main="Intermediate")
acf(res.fast, las=1, main="Large steps")
```

From this, the effective number of independent samples can be calculated:

```r
coda::effectiveSize(res)

## var1
##  187

coda::effectiveSize(res.fast)

##  var1
## 33.19

coda::effectiveSize(res.slow)

##  var1
## 5.378
```

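This shows more clearly what happens as the chains are run for longer:

```r
n <- 10^(2:5)
samples <- lapply(n, function(n) run(-10, f, q, n))
# Common histogram breaks across all chain lengths
xlim <- range(sapply(samples, range))
br <- seq(xlim[1], xlim[2], length=100)
hh <- lapply(samples, function(x)
    hist(x, br, plot=FALSE))
ylim <- c(0, max(f(xx)))
```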

Display 100, 1,000, 10,000 and 100,000 steps:

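```r
par(mfrow=c(2, 2))
for (h in hh) {
    plot(h, main="", freq=FALSE, ylim=range(h$density, ylim))
    # Overlay the target density
    curve(f(x), add=TRUE, col="red", n=300)
}
```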

# MCMC in two dimensions

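Here is a function that makes a multivariate normal density, given a mean vector (the center of the distribution) and a variance-covariance matrix.

```r
make.mvn <- function(mean, vcv) {
    # Precompute the parts of the density that do not depend on x
    logdet <- as.numeric(determinant(vcv, TRUE)$modulus)
    tmp <- length(mean) * log(2 * pi) + logdet
    vcv.i <- solve(vcv)

    function(x) {
        dx <- x - mean
        exp(-(tmp + rowSums((dx %*% vcv.i) * dx)) / 2)
    }
}
```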

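As above, define the target density to be the sum of two mvns (unweighted this time):

```r
mu1 <- c(-1, 1)
mu2 <- c(2, -2)
vcv1 <- matrix(c(1, .25, .25, 1.5), 2, 2)
vcv2 <- matrix(c(2, -.5, -.5, 2), 2, 2)
f1 <- make.mvn(mu1, vcv1)
f2 <- make.mvn(mu2, vcv2)
f <- function(x)
    f1(x) + f2(x)

# Evaluate the target density on a grid and plot it
x <- seq(-5, 6, length=71)
y <- seq(-7, 6, length=61)
xy <- expand.grid(x=x, y=y)
z <- matrix(apply(as.matrix(xy), 1, f), length(x), length(y))
image(x, y, z, las=1)
contour(x, y, z, add=TRUE)
```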

Sampling from a multivariate normal distribution is also fairly simple, but we will use MCMC to draw samples from this target.

There are a few different strategies here: we can propose moves in both dimensions at once, or we can sample along each axis independently. Both strategies will work, although how quickly they mix will differ.

Assuming that we don't actually know how to sample from an mvn, let's make a proposal distribution that is uniform in two dimensions, sampling from the square with width d on each side.

Compare the sampling distribution with the known distribution:
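The proposal and sampling code for this step is not shown here. The sketch below is one way to do it, assuming a standard Metropolis step, a default square width of d = 8, a starting point of (-4, -4) and 10,000 steps (all values chosen for illustration); the target is rebuilt from make.mvn so that the block runs on its own.

```r
make.mvn <- function(mean, vcv) {
    logdet <- as.numeric(determinant(vcv, TRUE)$modulus)
    tmp <- length(mean) * log(2 * pi) + logdet
    vcv.i <- solve(vcv)
    function(x) {
        dx <- x - mean
        exp(-(tmp + rowSums((dx %*% vcv.i) * dx)) / 2)
    }
}
f1 <- make.mvn(c(-1, 1), matrix(c(1, .25, .25, 1.5), 2, 2))
f2 <- make.mvn(c(2, -2), matrix(c(2, -.5, -.5, 2), 2, 2))
f <- function(x) f1(x) + f2(x)

# Assumed Metropolis step and driver
step <- function(x, f, q) {
    xp <- q(x)
    if (runif(1) < min(1, f(xp) / f(x)))
        x <- xp
    x
}
run <- function(x, f, q, nsteps) {
    res <- matrix(NA, nsteps, length(x))
    for (i in seq_len(nsteps))
        res[i, ] <- x <- step(x, f, q)
    drop(res)
}

# Uniform proposal on a square of width d centered on the current point
q <- function(x, d=8)
    x + runif(length(x), -d/2, d/2)

set.seed(1)
samples <- run(c(-4, -4), f, q, 10000)

# Sampled points with contours of the known density drawn on top
x <- seq(-5, 6, length=71)
y <- seq(-7, 6, length=61)
z <- matrix(apply(as.matrix(expand.grid(x=x, y=y)), 1, f),
            length(x), length(y))
plot(samples, pch=".", xlab="x", ylab="y", las=1)
contour(x, y, z, add=TRUE, col="red")

colMeans(samples)   # compare with the midpoint of the two means
```

Since the two components carry equal weight, the mean of the target is the midpoint of the two mean vectors, (0.5, -0.5), which gives a simple numerical check alongside the visual one.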

For example, what is the marginal distribution of parameter 1?

```r
hist(samples[, 1], freq=FALSE, main="", xlab="x",
     ylab="Probability density")
```

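To compute this, we need to integrate over all possible values of the second parameter for each value of the first. Then, because the target function is not itself normalized, we have to divide through by the integral over the first dimension (the total area under the curve).

```r
# Unnormalized marginal density of parameter 1 at x1
m <- function(x1) {
    g <- Vectorize(function(x2) f(c(x1, x2)))
    integrate(g, -Inf, Inf)$value
}
xx <- seq(min(samples[, 1]), max(samples[, 1]), length=201)
yy <- sapply(xx, m)
# Normalizing constant: area under the marginal over the sampled range
z <- integrate(splinefun(xx, yy), min(xx), max(xx))$value
hist(samples[, 1], freq=FALSE, main="", las=1, xlab="x",
     ylab="Probability density", ylim=c(0, 0.25))
lines(xx, yy/z, col="red")
```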
