# R language MCMC: Metropolis-Hastings sampling for Bayesian estimation of regression

## Original Source: Topend Data Tribe

MCMC is a general technique for sampling from complex probability models.

1. Monte Carlo
2. Markov chain
3. Metropolis-Hastings algorithm

## Problem

Suppose we need to calculate the average, i.e. the expected value, of a function f(θ) of a random variable θ with a complex posterior pdf p(θ | y).

We may also need to find the maximum of the posterior distribution p(θ | y).

One way to obtain the expected value is to draw N random samples θ1, ..., θN from p(θ | y). When N is large enough, we can approximate the expectation by

E[f(θ)] ≈ (1/N) Σi f(θi)

The same strategy applies to finding argmaxθ p(θ | y): sample from p(θ | y) and take the sample with the largest posterior value.

### Solution

1.1 Direct simulation

1.2 Inverse CDF

1.3 Reject/accept sampling

If the pdf is only known up to a normalizing constant, or is too complex for these direct methods, MCMC comes in handy.
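As a reminder of how accept/reject sampling (method 1.3) works before moving to MCMC, here is a minimal sketch in R; the Beta(2, 2) target and the Uniform(0, 1) proposal are my own illustrative choices, not from the original article:

```r
set.seed(1)

# Target: Beta(2, 2) density on [0, 1]; proposal: Uniform(0, 1).
# The envelope constant M must satisfy p(x) <= M * q(x) everywhere;
# the Beta(2, 2) density peaks at 1.5, so M = 1.5 works.
M <- 1.5
n_accepted <- 0
samples <- numeric(10000)
while (n_accepted < 10000) {
    x <- runif(1)                        # draw from the proposal q
    u <- runif(1)                        # uniform draw for the accept test
    if (u < dbeta(x, 2, 2) / (M * 1)) {  # accept with probability p(x)/(M*q(x))
        n_accepted <- n_accepted + 1
        samples[n_accepted] <- x
    }
}
mean(samples)   # should be close to the Beta(2, 2) mean of 0.5
```

The acceptance rate of this scheme is 1/M, so a tighter envelope wastes fewer proposals.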

### Markov chain

In order to simulate a Markov chain, we must formulate a transition kernel T(xi, xj). The transition kernel is the probability of moving from state xi to state xj.

The convergence of a Markov chain means that it has a stationary distribution π: the statistical distribution of the chain is stable, i.e. it does not change over time.

# Metropolis algorithm

For a Markov chain to have π as its stationary distribution, it is sufficient that it satisfies detailed balance:

P(x' | x) P(x) = P(x | x') P(x')

That is, the probability of being in state x and transitioning to state x' must equal the probability of being in state x' and transitioning to state x,

or equivalently

P(x' | x) / P(x | x') = P(x') / P(x)

The approach is to split each transition into two sub-steps: a proposal and an accept/reject decision.

Let q(x' | x) denote the proposal (candidate) density; we can then use an acceptance probability A(x' | x) to correct q.

The proposal distribution q(x' | x) is the conditional probability of proposing state x' given the current state x, and A(x' | x) is the conditional probability of accepting the proposed state x'. We design the acceptance probability function so that detailed balance is satisfied.

The transition probability can then be written as

P(x' | x) = q(x' | x) A(x' | x)

Inserting this into the detailed balance equation, we have

A(x' | x) / A(x | x') = P(x') q(x | x') / ( P(x) q(x' | x) )

# Metropolis-Hastings algorithm

The choice of A follows this logic: suppose the ratio above is less than 1, i.e. the transition from x to x' under q is too frequent relative to the reverse move. We should then choose A(x | x') = 1, and to satisfy detailed balance we must have

A(x' | x) = P(x') q(x | x') / ( P(x) q(x' | x) )

The next step is to choose an acceptance probability that meets these conditions. The Metropolis-Hastings choice is

A(x' | x) = min( 1, P(x') q(x | x') / ( P(x) q(x' | x) ) )

That is, when the ratio is greater than or equal to 1 we always accept, and when it is less than 1 we accept with that probability. The Metropolis-Hastings algorithm therefore proceeds as follows:

1. Initialization: randomly pick an initial state x;
2. Propose a new state x' according to q(x' | x);
3. Accept or reject according to A(x' | x). If rejected, no move occurs and nothing needs to be updated; otherwise, move to x';
4. Go to 2 until T candidate states have been generated;
5. Save state x and go to 2.

In principle, the saved states are draws from the distribution P(x), because step 4 guarantees that successive saved states are approximately uncorrelated. The value of T must be chosen according to factors such as the proposal distribution. Importantly, there is no general rule for which proposal q(x' | x) to use; it must be tuned to the specific problem at hand.
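The steps above can be sketched directly in R. In this minimal sketch the target P(x) is a standard normal and the proposal q is a normal random walk; both are illustrative choices, not from the original article:

```r
set.seed(42)

target <- function(x) dnorm(x)   # P(x): standard normal, for illustration
n_iter <- 20000
chain <- numeric(n_iter)
chain[1] <- 0                    # step 1: initial state

for (t in 1:(n_iter - 1)) {
    proposal <- rnorm(1, mean = chain[t], sd = 1)  # step 2: draw x' from q(x'|x)
    # step 3: since the random walk q is symmetric, the q ratio cancels
    accept_prob <- min(1, target(proposal) / target(chain[t]))
    if (runif(1) < accept_prob) {
        chain[t + 1] <- proposal  # accept: move to x'
    } else {
        chain[t + 1] <- chain[t]  # reject: stay at x
    }
}
c(mean(chain), sd(chain))  # should approach 0 and 1, the target's moments
```

Here every state is saved (T = 1); thinning would simply keep only every T-th value of `chain`.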

### Properties

An interesting feature of the Metropolis-Hastings algorithm is that it only depends on the ratio

α = a1 a2

where

a1 = P(x') / P(xt)

is the probability ratio between the proposed sample x' and the previous sample xt, and

a2 = q(xt | x') / q(x' | xt)

is the ratio of the proposal densities in the two directions (from xt to x' and vice versa). If the proposal density is symmetric, a2 equals 1.

The Markov chain starts from an arbitrary initial value x0, and the algorithm runs for many iterations until this initial state is "forgotten". The samples discarded during this phase are known as the burn-in. The remaining set of accepted values of x represents samples from the distribution P(x).

# Metropolis sampling

### A simple Metropolis-Hastings sampling

Let us look at sampling from a gamma distribution with arbitrary shape and rate parameters, using the Metropolis-Hastings sampling algorithm.

The Metropolis-Hastings sampler is given below. The chain is initialized, and at each step a candidate is proposed from N(a/b, a/b²).

This is Metropolis-Hastings independence sampling: the proposal is a normal distribution with the same mean and variance as the target gamma.

1. Start from some state xt (x in the code).

2. Propose a new candidate state x' (can in the code).

3. Calculate the acceptance probability α = min(1, [p(x')/p(xt)] / [q(x')/q(xt)]) (aprob in the code), where p is the gamma target and q is the normal proposal.

4. Draw a uniformly distributed random number u from [0, 1]; if u < α, accept this point and set xt+1 = x'. Otherwise, reject it and set xt+1 = xt.

# MH visualization

```
set.seed(123)

# sampler settings (shape/rate values match the parameter section below)
n <- 54000          # number of iterations
a <- 2.5            # gamma shape
b <- 2.6            # gamma rate
mu <- a/b           # proposal mean = gamma mean
sig <- sqrt(a)/b    # proposal sd   = gamma sd
x <- a/b            # initial state; start at the gamma mean
vec <- numeric(n)
vec[1] <- x

for (i in 2:n) {
    can <- rnorm(1, mu, sig)    # propose a candidate
    aprob <- min(1, (dgamma(can, a, b)/dgamma(x, a, b)) /
                    (dnorm(can, mu, sig)/dnorm(x, mu, sig)))
    u <- runif(1)
    if (u < aprob)
        x <- can                # accept the candidate
    vec[i] <- x                 # store the current state
}
```

### Drawing

Setting parameters.

```
nrep   <- 54000   # total number of draws
burnin <- 4000    # draws to discard
shape  <- 2.5
rate   <- 2.6
```

Trim the chain so that the plot includes only draws after the burn-in period:

```
vec = vec[-(1:burnin)]
# vec = vec[burnin:length(vec)]   # equivalent alternative
```

```
par(mfrow = c(2, 1))  # split the plotting frame: two plots per frame
plot(ts(vec), xlab = "Chain", ylab = "Draws")
abline(h = mean(vec), lwd = 2, col = "red")
```

```
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.
0.007013 0.435600 0.724800 0.843300 1.133000 3.149000
```

```
var(vec[-(1:burnin)])
```

```
0.2976507
```

### Initial value

The first sample in `vec` is the initial/starting value of our chain. We can change it to see whether the convergence changes.

```
x <- 3*a/b
vec <- x
```

### Options

The algorithm works best if the proposal density matches the shape of the target distribution P(x), that is, q(x' | xt) ≈ P(x'). If a normal proposal density q is used, its variance parameter σ² must be tuned during the burn-in period.

This is usually done by calculating the acceptance rate, i.e. the proportion of proposed samples accepted within a window of the last N samples.

If σ² is too large, the acceptance rate will be very low: the candidates are likely to land in regions of much lower probability density, so a1 will be very small and the chain will converge very slowly. If σ² is too small, the opposite happens: almost every candidate is accepted, but the steps are tiny and the chain explores the space slowly.
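The effect of the proposal scale σ on the acceptance rate is easy to check empirically. This sketch reuses the random-walk Metropolis idea on a standard normal target; the target and the two σ values are my own illustrative choices:

```r
set.seed(7)

# Run a random-walk Metropolis chain on a N(0, 1) target and
# return the proportion of proposals that were accepted.
acceptance_rate <- function(prop_sd, n_iter = 5000) {
    x <- 0
    accepted <- 0
    for (t in 1:n_iter) {
        can <- rnorm(1, x, prop_sd)
        if (runif(1) < min(1, dnorm(can) / dnorm(x))) {
            x <- can
            accepted <- accepted + 1
        }
    }
    accepted / n_iter
}

acceptance_rate(0.1)   # tiny steps: almost always accepted, but slow mixing
acceptance_rate(50)    # huge steps: mostly rejected, the chain gets stuck
```

Neither extreme is good; in practice σ is tuned during burn-in until the acceptance rate sits in a moderate range.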

# Example 2: Bayesian estimation of regression

Metropolis-Hastings sampling applied to Bayesian estimation of a regression model.

### DGP and graphs

```
# True parameter values and sample size
# (values inferred from the regression output later in the article)
trueA  <- 5
trueB  <- 0
trueSd <- 10
Size   <- 31

# Create independent x values centered around zero
x <- (-(Size-1)/2):((Size-1)/2)
# Create dependent values according to a*x + b + N(0, sd)
y <- trueA*x + trueB + rnorm(n = Size, mean = 0, sd = trueSd)
```

### Normal distribution

```
likelihood <- function(param){
    a <- param[1]; b <- param[2]; sd <- param[3]
    pred = a*x + b
    singlelikelihoods = dnorm(y, mean = pred, sd = sd, log = T)
    sumll = sum(singlelikelihoods)
    return(sumll)
}
```

### Why use logarithms

Note that the likelihood function returns the logarithm of the probabilities, which is why we sum the probabilities over all data points (the logarithm of a product equals the sum of the logarithms).

Why do we do this? It is strongly recommended because the product of many small probabilities becomes extremely small; at some point the computer program runs into numerical rounding or underflow problems.

So when you code likelihoods, always work with logarithms.
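The underflow problem is easy to demonstrate: the product of a couple of thousand small densities rounds to exactly zero in double precision, while the sum of the log densities stays finite. A minimal illustration (not from the original article):

```r
set.seed(1)
y <- rnorm(2000, mean = 0, sd = 1)

# Multiplying raw densities underflows to exactly 0 ...
prod(dnorm(y, mean = 0, sd = 1))

# ... while summing log densities remains a usable finite number.
sum(dnorm(y, mean = 0, sd = 1, log = TRUE))
```

With 2000 points the log-likelihood is around -2800, far below log of the smallest representable double (about -745), so the plain product cannot be represented.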

# Example: Draw a likelihood curve with slope a

```
# Example: plot the likelihood profile of the slope a,
# holding b and sd fixed at their true values
slopelikelihoods <- sapply(seq(3, 7, by = .05),
    function(a) likelihood(c(a, trueB, trueSd)))
plot(seq(3, 7, by = .05), slopelikelihoods, type = "l")
```

### Prior distribution

Uniform and normal prior distributions for the three parameters:

```
# Prior distribution
prior <- function(param){
    a <- param[1]; b <- param[2]; sd <- param[3]
    # log = T, so these are all log densities
    aprior = dunif(a, min = 0, max = 10, log = T)
    bprior = dnorm(b, sd = 2, log = T)
    sdprior = dunif(sd, min = 0, max = 30, log = T)
    return(aprior + bprior + sdprior)
}
```

### Posterior

The product of the prior and the likelihood is the actual quantity the MCMC works with. This function is called the posterior. Again, we use sums here because we work with logarithms.

```
posterior <- function(param){
    return (likelihood(param) + prior(param))
}
```

# Metropolis algorithm

This algorithm is one of the most common tools in applied Bayesian statistics for sampling from a posterior density.

1. Start at some initial parameter value, with the posterior defined above as the target density.
2. Choose a new parameter value close to the old one according to the candidate (proposal) density.
3. Jump to this new point with probability p(new)/p(old), where p is the target function; a ratio greater than 1 also means jump.

Note that we use a symmetric jump/candidate distribution q(x' | x): a normal centered at the current value with fixed standard deviation, so the q ratio cancels.

The acceptance probability is therefore equal to

min( 1, exp( log p(new) − log p(old) ) )

```
######## Metropolis Algorithm ################

metrMCMC <- function(startvalue, iterations){
    chain = array(dim = c(iterations + 1, 3))
    chain[1,] = startvalue
    for (i in 1:iterations){
        # symmetric normal proposal; the step sizes are illustrative choices
        proposal = rnorm(3, mean = chain[i,], sd = c(0.1, 0.5, 0.3))
        probab = exp(posterior(proposal) - posterior(chain[i,]))
        if (runif(1) < probab){
            chain[i+1,] = proposal
        }else{
            chain[i+1,] = chain[i,]
        }
    }
    return(chain)
}
```

### Implementation

Run the chain, output the acceptance rate, and interpret it.

```
chain = metrMCMC(startvalue, 5500)

burnIn = 5000
accep = 1 - mean(duplicated(chain[-(1:burnIn),]))
```

The first steps of the algorithm may be biased by the initial value, so they are usually discarded from further analysis (the burn-in period). An interesting output to look at is the acceptance rate: how often was a candidate accepted by the algorithm? The candidate function affects the acceptance rate: generally, the closer the candidates are to the current value, the higher the acceptance rate. However, a very high acceptance rate is usually unhelpful: it means the algorithm "stays" near the same point, which leads to poor exploration of the parameter space (poor mixing).

We can also change the initial value to see whether it changes the results or the convergence.

```
startvalue = c(4, 0, 10)
```

### Summary

```
       V1               V2                V3
 Min.   :4.068   Min.   :-6.7072   Min.   : 6.787
 1st Qu.:4.913   1st Qu.:-2.6973   1st Qu.: 9.323
 Median :5.052   Median :-1.7551   Median :10.178
 Mean   :5.052   Mean   :-1.7377   Mean   :10.385
 3rd Qu.:5.193   3rd Qu.:-0.8134   3rd Qu.:11.166
 Max.   :5.989   Max.   : 4.8425   Max.   :19.223
```

```
# Compare:
summary(lm(y~x))
```
```
Call:
lm(formula = y ~ x)

Residuals:
    Min      1Q  Median      3Q     Max
-22.259  -6.032  -1.718   6.955  19.892

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -3.1756     1.7566  -1.808    0.081 .
x             5.0469     0.1964  25.697   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 9.78 on 29 degrees of freedom
Multiple R-squared:  0.9579,  Adjusted R-squared:  0.9565
F-statistic: 660.4 on 1 and 29 DF,  p-value: < 2.2e-16
```

```
summary(lm(y ~ x))$sigma
```

```
9.780494
```

```
coefficients(lm(y~x))
```

```
(Intercept)           x
  -3.175555    5.046873
```

### Summary

```
### Summary #######################

par(mfrow = c(2, 3))
hist(chain[-(1:burnIn),1], prob = TRUE, nclass = 30, col = "109")
abline(v = mean(chain[-(1:burnIn),1]), lwd = 2)
```
