Video: Stan Probabilistic Programming MCMC Sampling Bayesian Model in R

Original link: tecdat.cn/?p=11161

Original source: Tuoduan Data Tribe Official Account

 

Probabilistic programming allows us to implement statistical models without worrying about technical details. This is particularly useful for Bayesian models based on MCMC sampling. 

RStan Bayesian hierarchical model analysis example in R language

Introduction to Stan

Stan is a C++ library for Bayesian inference. It is based on the No-U-Turn sampler (NUTS), which is used to estimate the posterior distribution based on the model and data specified by the user. Performing an analysis using Stan involves the following steps:

  1. Specify the statistical model using the Stan modeling language. This is done in a dedicated .stan file.
  2. Prepare the data to be provided to the model.
  3. Sample from the posterior distribution using the stan function.
  4. Analyze the results.

In this article, I will show the usage of Stan through two hierarchical models. I will use the first model to discuss the basic functions of Stan, and use the second example to demonstrate more advanced applications.

The school data set

The first data set we will use is the school data set. It measures the effect of coaching programs on a college entrance exam (the SAT test used in the United States). The data are as follows (estimated coaching effect y and its standard error sigma for each of the eight schools):

School   y  sigma
1       28     15
2        8     10
3       -3     16
4        7     11
5       -1      9
6        1     11
7       18     10
8       12     18

As we can see, for most of the eight schools the short-term coaching program seems to improve SAT scores. For this data set, we are interested in estimating the true coaching effect associated with each school. There are two simple alternative approaches. First, we could assume that all schools are independent of each other. However, the resulting estimates would be hard to interpret because the schools' posterior intervals would largely overlap due to the high standard errors. Second, we could pool the data of all schools, assuming that the true effect is the same in every school. This, however, is also unreasonable because the program should have school-specific effects (for example, different teachers and students should lead to different effects).

Therefore, another model is needed. The advantage of a hierarchical model is that it can pool information from all eight schools without assuming that they share a common true effect. We can specify the hierarchical Bayesian model as follows:

y_j ~ Normal(θ_j, σ_j²),  j = 1, …, 8
θ_j ~ Normal(μ, τ²),      j = 1, …, 8

According to this model, the observed coaching effect y_j follows a normal distribution whose mean is the true effect θ_j and whose standard deviation is σ_j (known from the data). The true effect θ_j in turn follows a normal distribution with parameters μ and τ.

Define Stan model file

After specifying the model to be used, we can now discuss how to specify this model in Stan. Before defining the Stan program for the above model, let us look at the structure of the Stan modeling language.

Variables

In Stan, variables can be defined in the following ways:

int<lower=0> n;         // lower bound is 0
int<upper=5> n;         // upper bound is 5
int<lower=0,upper=5> n; // n is in the range [0,5]

Note that the lower and upper bounds of a variable should be specified whenever they are known a priori.

Multidimensional data can be specified by square brackets:

vector[n] numbers; // vector of length n
real numbers[n];   // floating-point array of length n
matrix[n,n] mat;   // n x n matrix

Program blocks

A Stan program is structured into the following blocks (a skeleton follows the list):

  • data : specifies the data that is conditioned upon in Bayes' rule
  • transformed data : preprocesses the data
  • parameters (required): specifies the parameters of the model
  • transformed parameters : computes transformations of the parameters before the posterior is evaluated
  • model (required): specifies the model
  • generated quantities : post-processes the results
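As a sketch, a .stan file stitches these blocks together in the fixed order shown below; in this article, every block except parameters and model may be omitted:

data {
  // declarations of the observed quantities
}
transformed data {
  // preprocessing of the data
}
parameters {
  // declarations of the sampled parameters
}
transformed parameters {
  // deterministic transformations of the parameters
}
model {
  // priors and likelihood
}
generated quantities {
  // post-processing, e.g. posterior predictive draws
}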

For the model block, distributions can be specified in two equivalent ways. The first uses the familiar statistical notation:

y ~ normal(mu, sigma); // y is normally distributed

The second method uses a procedural notation based on the log probability density function (lpdf):

target += normal_lpdf(y | mu, sigma); // increment the log density by the normal lpdf

Stan supports a large number of probability distributions. When specifying a model, the lookup function comes in handy: it provides a mapping from R functions to Stan functions. Consider the following example:

library(rstan) # load the rstan package
lookup(rnorm)

##     StanFunction             Arguments ReturnType Page
## 355   normal_rng (real mu, real sigma)       real  494

Here we see that R's rnorm function is equivalent to Stan's normal_rng.

The model

Now that we understand the basics of the Stan modeling language, we can define the model and store it in a file called schools.stan:
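The contents of the file are not reproduced in this version of the post. A sketch that is consistent with the data list used below (n, y, sigma) and with the parameters appearing in the output (mu, tau, eta, theta) looks like this:

data {
  int<lower=0> n;          // number of schools
  real y[n];               // estimated coaching effects
  real<lower=0> sigma[n];  // standard errors of the effects
}
parameters {
  real mu;                 // population mean of the true effects
  real<lower=0> tau;       // population standard deviation
  vector[n] eta;           // standardized school-level effects
}
transformed parameters {
  vector[n] theta;         // true school-level effects
  theta = mu + tau * eta;
}
model {
  eta ~ normal(0, 1);      // implies theta ~ normal(mu, tau)
  y ~ normal(theta, sigma);
}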

Note that θ never appears in the parameters block. This is because we do not model θ explicitly; instead, we model η, the standardized effect of each school. θ is then constructed from μ, τ, and η in the transformed parameters block. This (non-centered) parameterization makes the sampler more efficient.

 

Prepare data for modeling

Before fitting the model, we need to encode the input data as a list whose entries correspond to the declarations in the data block of the Stan model. For the school data, this looks as follows:

schools.data <- list(
  n = 8,
  y = c(28, 8, -3, 7, -1, 1, 18, 12),
  sigma = c(15, 10, 16, 11, 9, 11, 10, 18)
)

Sampling from the posterior distribution

We can sample from the posterior distribution using the stan function, which performs the following three steps:

  1. It translates the model specification into C++ code.
  2. It compiles the C++ code into a shared object.
  3. It samples from the posterior distribution according to the specified model, data, and settings.

If rstan_options(auto_write = TRUE) has been set, subsequent calls of the same model will be much faster than the first call because the stan function then skips the first two steps (translation and compilation). In addition, we will set the number of cores to be used:

options(mc.cores = parallel::detectCores()) # parallelize
rstan_options(auto_write = TRUE)            # store the compiled Stan model

Now we can compile the model and sample from the posterior.
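The fitting call itself is not shown in this version of the post; given the pieces defined above, it would be along the following lines (stan's defaults of 4 chains with 2000 iterations each match the output below):

fit1 <- stan(file = "schools.stan", data = schools.data)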

Model interpretation

We will first give a basic interpretation of the model and then examine the MCMC procedure.

Basic model explanation

To perform inference using the fitted model, we can use the print function:

print(fit1) # optional arguments: pars, probs
## Inference for Stan model: schools.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
##            mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
## mu         7.67    0.15 5.14  -2.69   4.42   7.83  10.93  17.87  1185    1
## tau        6.54    0.16 5.40   0.31   2.52   5.28   9.05  20.30  1157    1
## eta[1]     0.42    0.01 0.92  -1.47  -0.18   0.44   1.03   2.18  4000    1
## eta[2]     0.03    0.01 0.87  -1.74  -0.54   0.03   0.58   1.72  4000    1
## eta[3]    -0.18    0.02 0.92  -1.95  -0.81  -0.20   0.45   1.65  3690    1
## eta[4]    -0.03    0.01 0.92  -1.85  -0.64  -0.02   0.57   1.81  4000    1
## eta[5]    -0.33    0.01 0.86  -2.05  -0.89  -0.34   0.22   1.43  3318    1
## eta[6]    -0.20    0.01 0.87  -1.91  -0.80  -0.21   0.36   1.51  4000    1
## eta[7]     0.37    0.02 0.87  -1.37  -0.23   0.37   0.96   2.02  3017    1
## eta[8]     0.05    0.01 0.92  -1.77  -0.55   0.05   0.69   1.88  4000    1
## theta[1]  11.39    0.15 8.09  -2.21   6.14  10.30  15.56  30.22  2759    1
## theta[2]   7.92    0.10 6.25  -4.75   4.04   8.03  11.83  20.05  4000    1
## theta[3]   6.22    0.14 7.83 -11.41   2.03   6.64  10.80  20.97  3043    1
## theta[4]   7.58    0.10 6.54  -5.93   3.54   7.60  11.66  20.90  4000    1
## theta[5]   5.14    0.10 6.30  -8.68   1.40   5.63   9.50  16.12  4000    1
## theta[6]   6.08    0.10 6.62  -8.06   2.21   6.45  10.35  18.53  4000    1
## theta[7]  10.60    0.11 6.70  -0.94   6.15  10.01  14.48  25.75  4000    1
## theta[8]   8.19    0.14 8.18  -8.13   3.59   8.01  12.48  25.84  3361    1
## lp__     -39.47    0.07 2.58 -45.21 -41.01 -39.28 -37.70 -34.99  1251    1
##
## Samples were drawn using NUTS(diag_e) at Thu Nov 29 11:17:50 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).

Here, the row names indicate the estimated parameters: mu is the mean and tau the standard deviation of the distribution of true effects, while the entries of eta and theta contain the estimates for the vectors η and θ, respectively. The columns contain summary statistics of the posterior draws; the percentile columns delimit credible intervals. For example, according to the output above, the 95% credible interval for the overall effect μ of the coaching program is [-2.69, 17.87]. Because we are quite uncertain about the mean, the 95% credible intervals for the individual θ_j are also very wide; for the first school, for example, the 95% credible interval is [-2.21, 30.22].

We can visualize the uncertainty in the estimates with the plot function:
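The call is simply the plot method of the fitted object; by default, rstan draws 80% and 95% intervals:

plot(fit1)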

The black lines indicate the 95% intervals, the red lines the 80% intervals, and the circles the estimates of the means.

We can obtain the generated samples with the extract function:

# get the generated samples
samples <- extract(fit1, permuted = TRUE) # 1000 samples per parameter
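The result is a named list with one entry per parameter. For example, the posterior mean of mu can be computed from the samples like this:

mean(samples$mu) # posterior mean of the overall effect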

MCMC diagnostics

By plotting the trajectory of the sampling process, we can identify whether anything went wrong during sampling. Problems arise, for example, when a chain stays in one place for too long or makes too many steps in one direction. We can plot the trajectories of the four chains used in our model with the traceplot function:

# diagnostics: trace plots, e.g. for mu and tau
traceplot(fit1, pars = c("mu", "tau"))

 

To obtain the samples from the individual Markov chains, we can use the extract function again:
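The exact call is not reproduced here; one way to obtain per-chain output like that shown below (here, the first post-warmup draw of each chain) would be:

chain.samples <- extract(fit1, permuted = FALSE) # iterations x chains x parameters
chain.samples[1, , ]                             # first draw of each chain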

##          parameters
## chains           mu       tau     eta[1]     eta[2]     eta[3]     eta[4]
##   chain:1  1.111120  2.729124 -0.1581242 -0.8498898  0.5025965 -1.9874554
##   chain:2  3.633421  2.588945  1.2058772 -1.1173221  1.4830778  0.4838649
##   chain:3 13.793056  3.144159  0.6023924 -1.1188243 -1.2393491 -0.6118482
##   chain:4  3.673380 13.889267 -0.0869434  1.1900236 -0.0378830 -0.2687284
##          parameters
## chains        eta[5]     eta[6]     eta[7]      eta[8]   theta[1]
##   chain:1  0.3367602 -1.1940843  0.5834020 -0.08371249  0.6795797
##   chain:2 -1.8057252  0.7429594  0.9517675  0.55907356  6.7553706
##   chain:3 -1.5867789  0.6334288 -0.4613463 -1.44533007 15.6870727
##   chain:4  0.1028605  0.3481214  0.9264762  0.45331024  2.4657999
##          parameters
## chains     theta[2] theta[3]    theta[4]  theta[5]  theta[6]  theta[7]
##   chain:1 -1.208335 2.482769 -4.31289292  2.030181 -2.147684  2.703297
##   chain:2  0.740736 7.473028  4.88612054 -1.041502  5.556902  6.097494
##   chain:3 10.275294 9.896345 11.86930758  8.803971 15.784656 12.342510
##   chain:4 20.201935 3.147213 -0.05906019  5.102037  8.508530 16.541455
##          parameters
## chains     theta[8]      lp__
##   chain:1 0.8826584 -41.21499
##   chain:2 5.0808317 -41.17178
##   chain:3 9.2487083 -40.35351
##   chain:4 9.9695268 -36.34043

For a more advanced analysis of the sampling process, we can use the shinystan package, which lets us launch a Shiny application for analyzing the fitted model:

library(shinystan)
launch_shinystan(fit1)

Hierarchical regression

Now that we have a basic understanding of Stan, we can delve into a more advanced application: let's try hierarchical regression. In conventional regression, we model relationships of the form

Y = α + Xβ

This notation assumes that all samples come from the same distribution. If the data actually fall into groups, we run into problems because potential differences within and between the groups are ignored.

An alternative would be to fit a separate regression model for each group. In this case, however, small group sizes would make it hard to estimate the individual models reliably.

Hierarchical regression is a compromise between the two extremes: the model assumes that the groups are similar, but still allows for differences.

Assume that every sample belongs to one of K groups. Hierarchical regression is then specified as follows:

Y_k = α_k + X_k β^(k),  k = 1, …, K

where Y_k is the outcome for the k-th group, α_k is its intercept, X_k are its features, and β^(k) are its weights. The hierarchical model differs from fitting a separate model to each group in that the parameters α_k and β^(k) are assumed to originate from a common distribution.

The rats data set

The classic example for hierarchical regression is the rats data set. It contains the body weights of rats, measured over 5 weeks. Let's load the data:

##   day8 day15 day22 day29 day36
## 1  151   199   246   283   320
## 2  145   199   249   293   354
## 3  147   214   263   312   328
## 4  155   200   237   272   297
## 5  135   188   230   280   323
## 6  159   210   252   298   331

Let's investigate the data:
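The plotting code below expects a long-format data frame ddf with the columns Group (the rat), variable (the day), and value (the weight). The reshaping step is not shown in this version of the post; one way to construct it, assuming the reshape2 package and row numbers as rat identifiers, is:

library(reshape2)
# add a rat identifier and melt the weight table into long format
ddf <- melt(cbind(df, Group = factor(seq_len(nrow(df)))), id.vars = "Group")

With ddf in place, we can plot the growth curves: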

library(ggplot2)
ggplot(ddf, aes(x = variable, y = value, group = Group)) +
  geom_line() + geom_point()

The data show that the linear growth trend is quite similar across rats. However, we also see that the rats start at different initial body weights, requiring different intercepts, and grow at different rates, requiring different slopes. A hierarchical model therefore seems appropriate.

Specification of Hierarchical Regression Model

The model can be specified as follows:

y_ij ~ Normal(α_i + β_i (x_j − x̄), σ_y²)
α_i ~ Normal(μ_α, σ_α²)
β_i ~ Normal(μ_β, σ_β²)

Here, α_i denotes the intercept of the i-th rat and β_i its slope. The measurement times are centered around x̄ = 22, the median measurement day of the time series (day 22).

Now we can specify the model and store it in a file named rats.stan:
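The contents of the file are again not reproduced here. A sketch based on the classic rats example from the Stan repository, consistent with the data list prepared below (N, T, x, y, xbar), looks like this:

data {
  int<lower=0> N;   // number of rats
  int<lower=0> T;   // number of measurement days
  real x[T];        // measurement days
  real y[N,T];      // body weights
  real xbar;        // median measurement day
}
parameters {
  real alpha[N];                // intercept of each rat
  real beta[N];                 // slope of each rat
  real mu_alpha;                // population mean of the intercepts
  real mu_beta;                 // population mean of the slopes
  real<lower=0> sigmasq_y;      // variances rather than standard deviations
  real<lower=0> sigmasq_alpha;
  real<lower=0> sigmasq_beta;
}
transformed parameters {
  real<lower=0> sigma_y;
  real<lower=0> sigma_alpha;
  real<lower=0> sigma_beta;
  sigma_y = sqrt(sigmasq_y);
  sigma_alpha = sqrt(sigmasq_alpha);
  sigma_beta = sqrt(sigmasq_beta);
}
model {
  mu_alpha ~ normal(0, 100);
  mu_beta ~ normal(0, 100);
  sigmasq_y ~ inv_gamma(0.001, 0.001);
  sigmasq_alpha ~ inv_gamma(0.001, 0.001);
  sigmasq_beta ~ inv_gamma(0.001, 0.001);
  alpha ~ normal(mu_alpha, sigma_alpha); // intercepts share a common distribution
  beta ~ normal(mu_beta, sigma_beta);    // slopes share a common distribution
  for (n in 1:N)
    for (t in 1:T)
      y[n,t] ~ normal(alpha[n] + beta[n] * (x[t] - xbar), sigma_y);
}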

Please note that the model code estimates the variances (the sigmasq variables) rather than the standard deviations.

Data preparation

To prepare the data for the model, we first extract the measurement days as numeric values and then encode everything in a list structure:
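The extraction step itself is not shown in this version. Since the column names are day8 through day36, one way to obtain the numeric days (an assumption about the original code) is:

# strip the "day" prefix from the column names
days <- as.numeric(gsub("day", "", colnames(df)))

The list can then be assembled: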

data <- list(N = nrow(df), T = ncol(df), x = days, y = df, xbar = median(days))

Fit regression model

Now we can fit the Bayesian hierarchical regression model to the rat weight data set:

# the fit contains estimates of the intercepts (alpha) and slopes (beta)
rats.fit <- stan(file = "rats.stan", data = data)

Hierarchical regression model prediction

Having determined α_i and β_i for each rat, we can now estimate the weight of an individual rat at an arbitrary point in time. Here, we predict the weights of the rats from day 0 to day 100.
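The construction of pred.df and sel.rats is not reproduced in this version of the post. A minimal sketch using the posterior means (the names e, new.days, and the selection of rats are assumptions) could be:

e <- extract(rats.fit)
alpha <- colMeans(e$alpha) # posterior mean intercept per rat
beta <- colMeans(e$beta)   # posterior mean slope per rat
new.days <- 0:100
pred.df <- do.call(rbind, lapply(seq_along(alpha), function(i) {
  data.frame(Rat = i, Day = new.days,
             Weight = alpha[i] + beta[i] * (new.days - 22)) # xbar = 22
}))
sel.rats <- 1:4 # arbitrary subset of rats to display

With pred.df and sel.rats defined, we can plot the predicted weights: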

 

ggplot(pred.df[pred.df$Rat %in% sel.rats, ],
       aes(x = Day, y = Weight, group = Rat)) +
  geom_line()

Compared with the original data, the model estimates are smooth because every curve follows the linear model. Examining the credible intervals shown in the last plot, we can see that the variance estimates are reasonable: we are confident about the weights of the rats during the sampled period (days 8 to 36), but the uncertainty increases the further we move away from the sampled region.

