R language MCMC: Metropolis-Hastings sampling for Bayesian estimation of regression

Original link: tecdat.cn/?p=19664 

Original Source: Topend Data Tribe


MCMC is a general technique for sampling from complex probability models.

  1. Monte Carlo
  2. Markov chain
  3. Metropolis-Hastings algorithm


Suppose you need to calculate the mean or expected value of a function f(θ) of a random variable θ with a complex posterior pdf p(θ | y).


You may also need to find the maximum of the posterior probability distribution p(θ | y).


One way to obtain the expected value is to draw N random samples from p(θ | y). When N is large enough, we can approximate the expected value by

E[f(θ)] ≈ (1/N) Σ f(θ_i)


The same strategy applies to finding argmax p(θ | y): sample from p(θ | y) and take the sample with the largest posterior density in the set.
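As a minimal sketch of both ideas (for illustration only, a Gamma(2.5, 2.6) density stands in for the posterior p(θ | y), with f taken to be the identity):

```r
set.seed(1)
N     <- 100000
theta <- rgamma(N, shape = 2.5, rate = 2.6)             # N draws from p(theta | y)

post_mean <- mean(theta)                                # (1/N) * sum of f(theta_i)
post_mode <- theta[which.max(dgamma(theta, 2.5, 2.6))]  # draw with highest density

post_mean  # close to the true mean 2.5/2.6
post_mode  # close to the true mode (2.5 - 1)/2.6
```

Here direct sampling is possible; MCMC is for cases where it is not, but the averaging step is identical.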


1.1 Direct simulation

1.2 Inverse CDF

1.3 Reject/accept sampling
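A quick illustrative sketch of 1.2 and 1.3 (the Beta(2, 2) target, uniform proposal, and bound M are chosen arbitrarily for the example):

```r
set.seed(1)

# 1.2 Inverse CDF: push uniform draws through the quantile function
u        <- runif(10000)
x_invcdf <- qbeta(u, 2, 2)

# 1.3 Reject/accept: propose from Uniform(0, 1), accept with prob p(x) / (M * q(x))
M     <- 1.5                                   # dbeta(x, 2, 2) <= 1.5 on [0, 1]
cand  <- runif(10000)
x_rej <- cand[runif(10000) < dbeta(cand, 2, 2) / M]

mean(x_invcdf)  # both close to the Beta(2, 2) mean of 0.5
mean(x_rej)
```

Both require knowing the normalized density (and, for rejection sampling, a usable bound M), which is exactly what MCMC lets us avoid.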

If we know the pdf only up to a normalizing constant, or it is too complex for these direct methods, MCMC comes in handy.

Markov chain

In order to simulate a Markov chain, we must formulate a transition kernel T(xi, xj): the probability of moving from state xi to state xj.

The convergence of a Markov chain means that it has a stationary distribution: once the statistical distribution of the chain reaches it, the distribution no longer changes over time.
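A tiny sketch of this (the two-state transition matrix below is made up for illustration): repeatedly applying the transition kernel drives any starting distribution to the stationary one.

```r
# Two-state Markov chain: row i of K gives the transition probabilities from state i
K <- matrix(c(0.9, 0.1,
              0.5, 0.5), nrow = 2, byrow = TRUE)

p <- c(1, 0)                     # start entirely in state 1
for (i in 1:100) p <- p %*% K    # evolve the distribution

p          # ~ (0.8333, 0.1667): the stationary distribution
p %*% K    # unchanged: the distribution no longer moves over time
```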

Metropolis algorithm

A Markov chain is stationary with respect to P(x) when it satisfies detailed balance, which basically means

P(x) T(x → x') = P(x') T(x' → x)

The probability of being in state x and transitioning to x' must equal the probability of being in state x' and transitioning to x.


The method is to split the transition into two sub-steps: candidate (proposal) and accept/reject.

Let q(x' | x) denote the candidate density; we can adjust q with an acceptance probability A(x' | x).

The candidate distribution q(x' | x) is the conditional probability of proposing state x' given the current state x, and the acceptance distribution A(x' | x) is the conditional probability of accepting the proposed move from x to x'. We design the acceptance probability function to satisfy detailed balance.

The transition probability can then be written as:

T(x → x') = q(x' | x) A(x' | x)

Inserting this into the detailed balance equation above, we have

A(x' | x) / A(x | x') = P(x') q(x | x') / (P(x) q(x' | x))

Metropolis-Hastings algorithm 

The choice of A follows this logic. Suppose the ratio on the right-hand side is less than 1: under q, transitions from x to x' happen too frequently. We should then always accept moves in the opposite direction, i.e. choose A(x | x') = 1. To satisfy detailed balance, we must have

A(x' | x) = P(x') q(x | x') / (P(x) q(x' | x))
The next step is to choose an acceptance probability that meets the above condition in either direction. The Metropolis-Hastings choice is a common one:

A(x' | x) = min(1, P(x') q(x | x') / (P(x) q(x' | x)))
That is, when the ratio is greater than 1 we always accept, and when it is less than 1 we accept with that probability and reject otherwise. The Metropolis-Hastings algorithm therefore consists of the following:

  1. Initialization: randomly choose an initial state x;
  2. Randomly propose a new state x' according to q(x' | x);
  3. Accept the proposal with probability A(x' | x). If it is rejected, no transition occurs, so nothing needs updating; otherwise move to x';
  4. Go to step 2 until T candidate states have been generated;
  5. Save state x and go to step 2.

In principle, the saved states are draws from the distribution P(x), because step 4 (saving only every T-th state) guarantees they are approximately uncorrelated. The value of T must be chosen according to factors such as the candidate distribution. Importantly, there is no general answer to which distribution q(x' | x) should be used; it must be tuned to the specific problem at hand.
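The five steps above can be sketched as a generic R function (the target density, starting point, and proposal sd below are placeholders chosen for illustration; for simplicity every state is saved, i.e. T = 1):

```r
metropolis_hastings <- function(p, x0, n_iter, prop_sd = 1) {
  x    <- numeric(n_iter)
  x[1] <- x0                                      # step 1: initial state
  for (i in 2:n_iter) {
    cand <- rnorm(1, x[i - 1], prop_sd)           # step 2: propose from q(x'|x)
    A    <- min(1, p(cand) / p(x[i - 1]))         # symmetric q: the q-ratio cancels
    x[i] <- if (runif(1) < A) cand else x[i - 1]  # step 3: accept or stay put
  }
  x
}

set.seed(1)
draws <- metropolis_hastings(function(x) dnorm(x, 2, 1), x0 = 0, n_iter = 20000)
mean(draws[-(1:2000)])  # after discarding burn-in, close to the target mean 2
```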


An interesting feature of the Metropolis-Hastings algorithm is that it depends only on the ratio

a = a1 × a2

where

a1 = P(x') / P(xt)

is the probability ratio between the candidate sample x' and the previous sample xt, and

a2 = q(xt | x') / q(x' | xt)

is the ratio of the candidate densities in the two directions (from xt to x' and vice versa). If the candidate density is symmetric, a2 equals 1.

The Markov chain is started from an arbitrary initial value x0, and the algorithm is run for many iterations until this initial state is "forgotten". The early samples that are discarded are known as burn-in. The remaining set of accepted values of x represents a sample from the distribution P(x).


Metropolis sampling

A simple Metropolis-Hastings sampling

Let us look at simulating from a gamma distribution with arbitrary shape and rate parameters, using a Metropolis-Hastings sampling algorithm.

The Metropolis-Hastings sampler is given below. The chain is initialized at the gamma mean a/b, and N(a/b, a/b²) candidates are proposed at each stage.

This is Metropolis-Hastings independence sampling, based on a normal proposal with the same mean and variance as the gamma:

  1. Start from some state xt (x in the code).
  2. Propose a new candidate state x' (can in the code).
  3. Calculate the acceptance probability A (aprob in the code).
  4. Draw a uniformly distributed random number u from [0, 1]; if u < A, accept the point and set xt+1 = x'. Otherwise, reject it and set xt+1 = xt.


MH visualization


set.seed(123)
# assumes a <- shape, b <- rate, n <- nrep (see the parameter settings below),
# with proposal mean mu <- a/b and sd sig <- sqrt(a/(b*b))
x      <- a/b              # initial state: the gamma mean
vec    <- numeric(n)       # storage for the chain
vec[1] <- x
for (i in 2:n) {
  can   <- rnorm(1, mu, sig)                       # independent normal candidate
  aprob <- min(1, (dgamma(can, a, b)/dgamma(x, a, b)) /
                  (dnorm(can, mu, sig)/dnorm(x, mu, sig)))
  u <- runif(1)
  if (u < aprob) x <- can
  vec[i] <- x
}


Setting parameters.


nrep   <- 54000   # total number of draws
burnin <- 4000    # burn-in length
shape  <- 2.5     # gamma shape (a)
rate   <- 2.6     # gamma rate (b)

Modify the diagram to include only the chain after the burn-in period


vec <- vec[-(1:burnin)]   # drop the burn-in draws
# alternatively: vec <- vec[burnin:length(vec)]


par(mfrow = c(2, 1))   # two plots per frame
plot(ts(vec), xlab = "Chain", ylab = "Draws")
abline(h = mean(vec), lwd = 2, col = "red")


    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
0.007013 0.435600 0.724800 0.843300 1.133000 3.149000 


var(vec[-(1:burnin)])
[1] 0.2976507

Initial value

The first sample, vec[1], is the initial/starting value of our chain. We can change it to see whether convergence changes.


x      <- 3*a/b   # start the chain at three times the gamma mean
vec[1] <- x


The algorithm works best if the candidate density matches the shape of the target distribution P(x), that is, q(x' | xt) ≈ P(x'). If a normal candidate density q is used, its variance parameter σ² must be tuned during the burn-in period.

This is usually done by calculating the acceptance rate: the proportion of candidate samples accepted within a window of the last N samples.

If σ² is too large, the acceptance rate will be very low: candidates will often land in regions of much lower probability density, so a1 will be very small and the chain will converge very slowly.
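A small sketch of this bookkeeping (acceptance_rate is a hypothetical helper): with a continuous proposal, a repeated value can only come from a rejection, which is also the trick the regression example below uses via duplicated().

```r
# Acceptance rate = share of steps where the chain actually moved
acceptance_rate <- function(chain) {
  mean(chain[-1] != chain[-length(chain)])
}

# A toy chain in which two of four proposals were rejected:
acceptance_rate(c(0.2, 0.7, 0.7, 1.1, 1.1))  # 0.5
```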

Example 2: Bayesian estimation of regression

Metropolis-Hastings sampling is used to estimate a Bayesian regression model.

Setting parameters


DGP and graphs


# Create independent x values, centered around zero
x <- (-(Size-1)/2):((Size-1)/2)
# Create dependent values according to a*x + b + N(0, sd)
y <- trueA * x + trueB + rnorm(n = Size, mean = 0, sd = trueSd)

Normal distribution


pred <- a*x + b
singlelikelihoods <- dnorm(y, mean = pred, sd = sd, log = TRUE)
sumll <- sum(singlelikelihoods)

Why use logarithm

We work with the logarithm of the probabilities in the likelihood function, which is why I sum the log-probabilities of all data points (the logarithm of a product equals the sum of the logarithms).

Why do we do this? It is strongly recommended, because the product of many small probabilities becomes extremely small; at some stage the program runs into numerical rounding or underflow problems.

So whenever you write likelihoods, always use logarithms.
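A small demonstration of the underflow problem (simulated data, with the sample size chosen to force it):

```r
set.seed(1)
y <- rnorm(2000)             # 2000 standard normal observations

prod(dnorm(y))               # product of densities underflows to exactly 0
sum(dnorm(y, log = TRUE))    # log-likelihood stays a finite (large negative) number
```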

Example: Draw a likelihood curve with slope a


# Example: plot the likelihood curve for the slope a
plot(seq(3, 7, by = .05), slopelikelihoods, type = "l")

Prior distribution

We use uniform and normal distributions as priors for the three parameters.


# Prior distribution
# log = TRUE, so these are all log densities
aprior  <- dunif(a, min = 0, max = 10, log = TRUE)
bprior  <- dnorm(b, sd = 2, log = TRUE)
sdprior <- dunif(sd, min = 0, max = 30, log = TRUE)


The product of the prior and the likelihood is the actual quantity the MCMC will work with; this function is called the posterior. Again, we use a sum here because we work with logarithms.


posterior <- function(param){
  return(likelihood(param) + prior(param))
}

Metropolis algorithm

This algorithm is one of the most common in applied Bayesian statistics for sampling from a posterior density.

We use the posterior defined above.

  1. Start with random parameter values.
  2. Choose new parameter values close to the old ones according to a candidate density.
  3. Jump to this new point with probability p(new)/p(old), where p is the target (posterior) density; a ratio greater than 1 also means jump.
  4. Note that we have a symmetric jump/candidate distribution q(x' | x).

The proposal standard deviation is fixed.

Since the proposal is symmetric and we work on the log scale, the acceptance probability equals

A = min(1, exp(log posterior(new) - log posterior(old)))


######## Metropolis algorithm ################
chain <- array(dim = c(iterations + 1, 3))
chain[1, ] <- startvalue
for (i in 1:iterations){
  # propose candidate parameter values (symmetric normal proposal; sds illustrative)
  proposal <- rnorm(3, mean = chain[i, ], sd = c(0.1, 0.5, 0.3))
  probab <- exp(posterior(proposal) - posterior(chain[i, ]))
  if (runif(1) < probab){
    chain[i + 1, ] <- proposal
  } else {
    chain[i + 1, ] <- chain[i, ]
  }
}



(E) Output the accepted values and interpret them.


chain  <- metrMCMC(startvalue, 5500)
burnIn <- 5000
accep  <- 1 - mean(duplicated(chain[-(1:burnIn), ]))

The first steps of the algorithm may be biased by the initial values, so they are usually discarded for the analysis (burn-in period). An interesting output to look at is the acceptance rate: how often was a candidate accepted or rejected by the algorithm? The candidate function affects the acceptance rate: generally, the closer together the candidates are, the higher it is. A very high acceptance rate is usually unhelpful, however: it means the algorithm "stays" near the same point, which leads to poor exploration of the parameter space (poor mixing).

We can also change the initial values to see whether the result/convergence changes.


startvalue <- c(4, 0, 10)



       V1               V2                V3        
 Min.   :4.068   Min.   :-6.7072   Min.   : 6.787  
 1st Qu.:4.913   1st Qu.:-2.6973   1st Qu.: 9.323  
 Median :5.052   Median :-1.7551   Median :10.178  
 Mean   :5.052   Mean   :-1.7377   Mean   :10.385  
 3rd Qu.:5.193   3rd Qu.:-0.8134   3rd Qu.:11.166  
 Max.   :5.989   Max.   : 4.8425   Max.   :19.223  


# Compare:
summary(lm(y ~ x))

Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max 
-22.259  -6.032  -1.718   6.955  19.892 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  -3.1756     1.7566  -1.808    0.081 .  
x             5.0469     0.1964  25.697   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.78 on 29 degrees of freedom
Multiple R-squared:  0.9579, Adjusted R-squared:  0.9565 
F-statistic: 660.4 on 1 and 29 DF,  p-value: < 2.2e-16


summary(lm(y ~ x))$sigma
[1] 9.780494


coefficients(lm(y ~ x))[1]
(Intercept) 
  -3.175555 


coefficients(lm(y ~ x))[2]
       x 
5.046873 



### Summary: #######################
par(mfrow = c(2, 3))
hist(chain[-(1:burnIn), 1], prob = TRUE, nclass = 30, col = "109")
abline(v = mean(chain[-(1:burnIn), 1]), lwd = 2)

