Original link: tecdat.cn/?p=11161
Original source: Tuoduan Data Tribe Official Account
Probabilistic programming allows us to implement statistical models without worrying about technical details. This is particularly useful for Bayesian models based on MCMC sampling.
RStan Bayesian hierarchical model analysis example in R language
Introduction to Stan
Stan is a C++ library for Bayesian inference. It is based on the No-U-Turn sampler (NUTS), which estimates the posterior distribution from a user-specified model and data. Performing an analysis with Stan involves the following steps:
- Use the Stan modeling language to specify the statistical model. This is done through a dedicated .stan file.
- Prepare the data to be provided to the model.
- Use the `stan` function to sample from the posterior distribution.
- Analyze the results.
In this article, I will show the usage of Stan through two hierarchical models. I will use the first model to discuss the basic functions of Stan, and use the second example to demonstrate more advanced applications.
School data set
The first data set we will use is the school data set. It measures the effect of short-term coaching programs on college entrance exam scores (the SAT test used in the United States) at eight schools. For each school, the data consist of an estimated coaching effect y and its standard error sigma:

| School | Effect (y) | Std. error (sigma) |
|--------|-----------|--------------------|
| A      | 28        | 15                 |
| B      | 8         | 10                 |
| C      | -3        | 16                 |
| D      | 7         | 11                 |
| E      | -1        | 9                  |
| F      | 1         | 11                 |
| G      | 18        | 10                 |
| H      | 12        | 18                 |
As we can see, the short-term coaching program improved SAT scores in most of the eight schools. For this data set, we are interested in estimating the true coaching effect associated with each school. Two alternative approaches come to mind. 1. We could assume that all schools are independent of each other; however, the resulting estimates would be hard to interpret because the schools' posterior intervals would overlap to a large extent due to the high standard errors. 2. We could pool the data of all schools, assuming that the true effect is the same in every school; this, however, is also unreasonable because the effect should differ by school (for example, different teachers and students should lead to different effects).
Therefore, another model is needed. The advantage of a hierarchical model is that it pools information from all eight schools without assuming that they share a common true effect. We can specify the hierarchical Bayesian model as follows:

y_j ∼ Normal(θ_j, σ_j), j = 1, …, 8
θ_j ∼ Normal(μ, τ), j = 1, …, 8

According to this model, the observed coaching effect y_j follows a normal distribution whose mean is the true effect θ_j and whose standard deviation is σ_j (known from the data). The true effect θ_j in turn follows a normal distribution with parameters μ and τ.
Define Stan model file
After specifying the model to be used, we can now discuss how to specify this model in Stan. Before defining the Stan program for the above model, let us look at the structure of the Stan modeling language.
In Stan, variables can be defined in the following ways:
```stan
int<lower=0> n;         // the lower bound is 0
int<upper=5> n;         // the upper bound is 5
int<lower=0,upper=5> n; // the range of n is [0,5]
```
Note that the upper and lower bounds of a variable should be specified whenever they are known a priori.
Multidimensional data can be specified by square brackets:
```stan
vector[n] numbers; // vector of length n
real numbers[n];   // floating-point array of length n
matrix[n,n] m;     // n by n matrix
```
A Stan program is structured into the following blocks:
- data: used to declare the data the model is conditioned on
- transformed data: used to preprocess the data
- parameters (required): used to declare the parameters of the model
- transformed parameters: used to compute derived parameters before the posterior is evaluated
- model (required): used to specify the model
- generated quantities: used to post-process the results
In the model block, distributions can be specified in two equivalent ways. The first uses sampling-statement notation:
```stan
y ~ normal(mu, sigma); // y is normally distributed
```
The second method uses a procedural notation based on the log probability density function (lpdf):
```stan
target += normal_lpdf(y | mu, sigma); // increment the log density by the normal lpdf
```
Stan supports a large number of probability distributions. When specifying a model, the `lookup` function of rstan can be used to find the Stan function that corresponds to an R distribution function:
```r
library(rstan) # load the rstan package
lookup(rnorm)  # find the Stan equivalent of rnorm
```
```
##     StanFunction Arguments             ReturnType Page
## 355 normal_rng   (real mu, real sigma) real        494
```
Here we see that the Stan equivalent of R's `rnorm` is `normal_rng`.
Now that we understand the basics of the Stan modeling language, we can define the model and store it in a file called `schools.stan`.
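A minimal sketch of such a model file, assuming the data names n, y, and sigma used in the data list below and the standardized-effect parameterization discussed next:

```stan
// schools.stan -- a sketch of the hierarchical school model
data {
  int<lower=0> n;          // number of schools
  real y[n];               // estimated coaching effects
  real<lower=0> sigma[n];  // standard errors of the effects
}
parameters {
  real mu;                 // population mean effect
  real<lower=0> tau;       // population standard deviation
  vector[n] eta;           // standardized school-level effects
}
transformed parameters {
  vector[n] theta = mu + tau * eta;  // actual school effects
}
model {
  eta ~ normal(0, 1);
  y ~ normal(theta, sigma);
}
```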
Note that θ never appears in the parameters block. This is because we do not model θ directly; instead, we model η, the standardized effect of each school. θ is then constructed in the transformed parameters block from μ, τ, and η. This parameterization makes the sampler more efficient.
Prepare data for modeling
Before fitting the model, we need to encode the input data as a list whose elements correspond to the entries of the data block of the Stan model. For the school data, the list looks as follows:
```r
schools.data <- list(
  n = 8,
  y = c(28, 8, -3, 7, -1, 1, 18, 12),
  sigma = c(15, 10, 16, 11, 9, 11, 10, 18)
)
```
Sampling from the posterior distribution
We can use the `stan` function to draw samples from the model. It performs three steps:
- It converts model specifications into C++ code.
- It compiles C++ code into shared objects.
- It samples from the posterior distribution according to the specified model, data and settings.
```r
options(mc.cores = parallel::detectCores()) # parallelize
rstan_options(auto_write = TRUE)            # store the compiled stan model
```
Now we can compile the model and sample from the posterior.
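A minimal fitting call would look like the following sketch; the file name `schools.stan` is an assumption consistent with the model name shown in the output below, and the resulting object `fit1` is the one used throughout the rest of the article:

```r
library(rstan)

# Compile the model and sample from the posterior. By default, Stan runs
# 4 chains with 2000 iterations each, of which the first 1000 are warmup.
fit1 <- stan(file = "schools.stan", data = schools.data)
```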
We will first give a basic explanation of the model, and then study the MCMC program.
Basic model explanation
To perform inference using the fitted model, we can use the `print` function, which summarizes the posterior of each parameter:
```r
print(fit1) # optional parameters: pars, probs
```
```
## Inference for Stan model: schools.
## 4 chains, each with iter=2000; warmup=1000; thin=1;
## post-warmup draws per chain=1000, total post-warmup draws=4000.
##
##            mean se_mean   sd   2.5%    25%    50%    75%  97.5% n_eff Rhat
## mu         7.67    0.15 5.14  -2.69   4.42   7.83  10.93  17.87  1185    1
## tau        6.54    0.16 5.40   0.31   2.52   5.28   9.05  20.30  1157    1
## eta[1]     0.42    0.01 0.92  -1.47  -0.18   0.44   1.03   2.18  4000    1
## eta[2]     0.03    0.01 0.87  -1.74  -0.54   0.03   0.58   1.72  4000    1
## eta[3]    -0.18    0.02 0.92  -1.95  -0.81  -0.20   0.45   1.65  3690    1
## eta[4]    -0.03    0.01 0.92  -1.85  -0.64  -0.02   0.57   1.81  4000    1
## eta[5]    -0.33    0.01 0.86  -2.05  -0.89  -0.34   0.22   1.43  3318    1
## eta[6]    -0.20    0.01 0.87  -1.91  -0.80  -0.21   0.36   1.51  4000    1
## eta[7]     0.37    0.02 0.87  -1.37  -0.23   0.37   0.96   2.02  3017    1
## eta[8]     0.05    0.01 0.92  -1.77  -0.55   0.05   0.69   1.88  4000    1
## theta[1]  11.39    0.15 8.09  -2.21   6.14  10.30  15.56  30.22  2759    1
## theta[2]   7.92    0.10 6.25  -4.75   4.04   8.03  11.83  20.05  4000    1
## theta[3]   6.22    0.14 7.83 -11.41   2.03   6.64  10.80  20.97  3043    1
## theta[4]   7.58    0.10 6.54  -5.93   3.54   7.60  11.66  20.90  4000    1
## theta[5]   5.14    0.10 6.30  -8.68   1.40   5.63   9.50  16.12  4000    1
## theta[6]   6.08    0.10 6.62  -8.06   2.21   6.45  10.35  18.53  4000    1
## theta[7]  10.60    0.11 6.70  -0.94   6.15  10.01  14.48  25.75  4000    1
## theta[8]   8.19    0.14 8.18  -8.13   3.59   8.01  12.48  25.84  3361    1
## lp__     -39.47    0.07 2.58 -45.21 -41.01 -39.28 -37.70 -34.99  1251    1
##
## Samples were drawn using NUTS(diag_e) at Thu Nov 29 11:17:50 2018.
## For each parameter, n_eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor on split chains (at
## convergence, Rhat=1).
```
Here, the row names indicate the estimated parameters: mu and tau are the mean and standard deviation of the population distribution, and the eta and theta entries are the elements of the vectors η and θ. The columns give posterior summaries, and the percentiles define credible intervals. For example, the 95% credible interval for the overall effect μ of the coaching program is [-2.69, 17.87]. Since we are uncertain about the mean, the 95% intervals for the θ_j are also wide; for the first school, the 95% credible interval is [-2.21, 30.22].
We can visualize the estimates and their uncertainty with the `plot` function:
The black line represents the 95% interval, and the red line represents the 80% interval. The circle represents the estimate of the average.
We can obtain the generated samples with the `extract` function:
```r
# get the samples
samples <- extract(fit1, permuted = TRUE) # 1000 samples per parameter
```
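For instance (a small illustration, assuming the `samples` object obtained above), the extracted samples can be used to summarize the posterior of the overall effect:

```r
mean(samples$mu)                      # posterior mean of the overall effect
quantile(samples$mu, c(0.025, 0.975)) # 95% credible interval
```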
By plotting the trajectories of the sampling process, we can identify problems that occurred during sampling, for example when a chain stays in one place for too long or moves in one direction for too many steps. We can plot the trajectories of the four chains using the `traceplot` function:

```r
# diagnostics:
traceplot(fit1)
```
To get the samples from each Markov chain individually, we can use the `as.array` function on the fit object:
```
##          parameters
## chains           mu       tau     eta[1]     eta[2]     eta[3]     eta[4]
##   chain:1  1.111120  2.729124 -0.1581242 -0.8498898  0.5025965 -1.9874554
##   chain:2  3.633421  2.588945  1.2058772 -1.1173221  1.4830778  0.4838649
##   chain:3 13.793056  3.144159  0.6023924 -1.1188243 -1.2393491 -0.6118482
##   chain:4  3.673380 13.889267 -0.0869434  1.1900236 -0.0378830 -0.2687284
##          parameters
## chains        eta[5]     eta[6]     eta[7]      eta[8]   theta[1]
##   chain:1  0.3367602 -1.1940843  0.5834020 -0.08371249  0.6795797
##   chain:2 -1.8057252  0.7429594  0.9517675  0.55907356  6.7553706
##   chain:3 -1.5867789  0.6334288 -0.4613463 -1.44533007 15.6870727
##   chain:4  0.1028605  0.3481214  0.9264762  0.45331024  2.4657999
##          parameters
## chains     theta[2]  theta[3]    theta[4]  theta[5]  theta[6]  theta[7]
##   chain:1 -1.208335  2.482769 -4.31289292  2.030181 -2.147684  2.703297
##   chain:2  0.740736  7.473028  4.88612054 -1.041502  5.556902  6.097494
##   chain:3 10.275294  9.896345 11.86930758  8.803971 15.784656 12.342510
##   chain:4 20.201935  3.147213 -0.05906019  5.102037  8.508530 16.541455
##          parameters
## chains     theta[8]      lp__
##   chain:1 0.8826584 -41.21499
##   chain:2 5.0808317 -41.17178
##   chain:3 9.2487083 -40.35351
##   chain:4 9.9695268 -36.34043
```
To perform a more advanced analysis of the sampling process, we can use the shinystan package:
```r
library(shinystan)
launch_shinystan(fit1)
```
Now that we have a basic understanding of Stan, we can delve into a more advanced application: hierarchical regression. In conventional regression, we model a relationship of the form

Y = α + Xβ + ε
This notation assumes that all samples share the same distribution. If the samples stem from different groups, this becomes a problem, because potential differences within and between groups are ignored.
Another option is to fit a separate regression model for each group. In this case, however, small group sizes can make the individual models hard to estimate.
Hierarchical regression is a compromise between the two extremes: the model assumes that the groups are similar but allows for differences between them.
Assume that each sample belongs to one of K groups. The hierarchical regression is then specified as follows:

Y_k = α_k + X_k β_k + ε_k, k = 1, …, K
α_k ∼ Normal(μ_α, σ_α), β_k ∼ Normal(μ_β, σ_β)

where Y_k is the outcome of the k-th group, α_k its intercept, X_k its features, and β_k its weights. The hierarchical model differs from fitting each group's Y_k separately in that the parameters α_k and β_k are assumed to originate from a common distribution.
The classic example of hierarchical regression is the rats data set, which contains rat body weights measured weekly over five weeks (on days 8, 15, 22, 29, and 36). Let's load the data:
```
##   day8 day15 day22 day29 day36
## 1  151   199   246   283   320
## 2  145   199   249   293   354
## 3  147   214   263   312   328
## 4  155   200   237   272   297
## 5  135   188   230   280   323
## 6  159   210   252   298   331
```
Let's investigate the data:
```r
library(ggplot2)
ggplot(ddf, aes(x = variable, y = value, group = Group)) +
  geom_line() + geom_point()
```
The data show a very similar linear growth trend across rats. However, the rats differ in their initial weights, which calls for different intercepts, and in their growth rates, which calls for different slopes. A hierarchical model therefore seems appropriate.
Specification of Hierarchical Regression Model
The model can be specified as follows:

y_ij ∼ Normal(α_i + β_i (x_j − x̄), σ_y)
α_i ∼ Normal(μ_α, σ_α), β_i ∼ Normal(μ_β, σ_β)

The intercept of the i-th rat is α_i and its slope is β_i. Note that the measurement times are centered at x̄ = 22, the median measurement day of the time series (day 22).
Now we can specify the model and store it in its own .stan file.
Please note that the model code estimates the variance (the sigmasq variable) rather than the standard deviation.
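A sketch of such a model file, assuming the data names N, T, x, y, and xbar used in the data list below (the file name `rats.stan` is hypothetical):

```stan
// rats.stan (hypothetical file name) -- a sketch of the hierarchical
// regression model for the rat weights
data {
  int<lower=0> N;  // number of rats
  int<lower=0> T;  // number of measurement days
  real x[T];       // measurement days
  real y[N, T];    // weight of rat n on day t
  real xbar;       // median measurement day (22)
}
parameters {
  real alpha[N];   // per-rat intercepts
  real beta[N];    // per-rat slopes
  real mu_alpha;   // population mean of the intercepts
  real mu_beta;    // population mean of the slopes
  real<lower=0> sigmasq_y;
  real<lower=0> sigmasq_alpha;
  real<lower=0> sigmasq_beta;
}
transformed parameters {
  // the model estimates variances; take square roots for the sampling statements
  real<lower=0> sigma_y = sqrt(sigmasq_y);
  real<lower=0> sigma_alpha = sqrt(sigmasq_alpha);
  real<lower=0> sigma_beta = sqrt(sigmasq_beta);
}
model {
  mu_alpha ~ normal(0, 100);
  mu_beta ~ normal(0, 100);
  sigmasq_y ~ inv_gamma(0.001, 0.001);
  sigmasq_alpha ~ inv_gamma(0.001, 0.001);
  sigmasq_beta ~ inv_gamma(0.001, 0.001);
  alpha ~ normal(mu_alpha, sigma_alpha);
  beta ~ normal(mu_beta, sigma_beta);
  for (n in 1:N)
    for (t in 1:T)
      y[n, t] ~ normal(alpha[n] + beta[n] * (x[t] - xbar), sigma_y);
}
```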
To prepare the data for the model, we first extract the measurement days as numeric values and then encode everything as a list:
```r
data <- list(
  N = nrow(df),
  T = ncol(df),
  x = days,
  y = df,
  xbar = median(days)
)
```
Fit regression model
Now we can fit the Bayesian hierarchical regression model to the rat weight data set:
```r
# The model contains estimates of the intercepts (alpha) and slopes (beta)
```
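Analogously to the first example, the fitting call would look like the following sketch; the file name `rats.stan` and the object name `fit2` are assumptions:

```r
# Fit the hierarchical regression model; the result contains the per-rat
# intercepts (alpha) and slopes (beta).
fit2 <- stan(file = "rats.stan", data = data)
```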
Hierarchical regression model prediction
Having determined α_i and β_i for each rat, we can now estimate the weight of an individual rat at any point in time. Here, we predict the rat weights from day 0 to day 100.
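One way to construct such predictions is from the posterior means of the intercepts and slopes. The following is a sketch: the fitted stanfit object is assumed to be stored in a variable `fit2` (a hypothetical name), while `pred.df` and `sel.rats` match the names used in the plotting code below.

```r
# Posterior samples of the per-rat parameters
samples2 <- extract(fit2, permuted = TRUE)
alpha.mean <- apply(samples2$alpha, 2, mean) # posterior mean intercept per rat
beta.mean  <- apply(samples2$beta, 2, mean)  # posterior mean slope per rat

# Predicted weight for each rat on days 0 to 100,
# using the linear model centered at day 22 (xbar)
pred.days <- 0:100
pred.df <- do.call(rbind, lapply(seq_along(alpha.mean), function(i) {
  data.frame(Rat = i, Day = pred.days,
             Weight = alpha.mean[i] + beta.mean[i] * (pred.days - 22))
}))
sel.rats <- c(1, 5, 10) # select a few rats for plotting
```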
```r
ggplot(pred.df[pred.df$Rat %in% sel.rats, ],
       aes(x = Day, y = Weight, group = Rat)) +
  geom_line()
```
Compared with the original data, the model's estimates are smooth because each curve follows the linear model. Studying the credible intervals shown in the last figure, we can see that the variance estimates are reasonable: we are confident about the weights of the rats during the sampled period (days 8 to 36), and the uncertainty grows the further we move away from it.