Original link: tecdat.cn/?p=21641
Original source: Tuoduan Data Tribe Official Account
Salary model
In the field of labor economics, the study of income and wages provides insights into issues ranging from gender discrimination to higher education. In this article, we will analyze crosssectional wage data in order to use Bayesian methods such as BIC and Bayesian models to build wage forecasting models in practice.
Load package
In this experiment, we will use the dplyr package to explore the data, and use the ggplot2 package for data visualization. We can also use the MASS package in one of the exercises to implement stepwise linear regression.
We will use BAS.LM in this software package later in the lab to implement the Bayesian model.
data
The data that this laboratory will use is randomly selected from 935 interviewees across the country.
variable  description 

wage  Weekly income 
hours  Average working hours per week 
IQ  IQ score 
kww  Work knowledge score 
educ  Years of education 
exper  work experience 
tenure  Have worked for the current employer for many years 
age  age 
married  = 1, if married 
black  = 1 (if black) 
south  = 1, if you live in the south 
urban  = 1, if you live in a city 
sibs  Number of siblings 
brthord  Birth order 
meduc  Mother's education 
feduc  Father's education 
lwage  Natural logarithm of wages 
Is this an observational study or an experiment?
 Observation Research
Explore the data
As with any new data set, standard exploratory data analysis is a good start. We will start with the wage variable because it is the dependent variable in our model.

Regarding wages, which of the following statements is wrong?
 7 respondents earn less than 300 yuan a week
summary(wage) Copy code
## Min. 1st Qu. Median Mean 3rd Qu. Max. ## 115.0 669.0 905.0 957.9 1160.0 3078.0 Copy code
Since salary is our dependent variable, we want to explore the relationship between other variables as a forecast.
Exercise: Exclude wages and length of service, and choose two other variables that you think are good predictors of wages. Use appropriate diagrams to visualize their relationship with wages.
Education level and working hours seem to be good predictors of workers wages.
ggplot(data = wage, aes(y=wage, x=exper))+geom_point() Copy code
ggplot (data = wage, aes ( y = wage, x = educ)) + geom_point () to copy the code
Simple linear regression
One possible explanation for the wage differences we see in the data is that smarter people make more money. The figure below shows a scatter plot between weekly salary and IQ score.
ggplot(data = wage, aes(x = iq, y = wage)) + geom_point() Copy code
This picture is quite messy. Although there may be a slight positive linear relationship between IQ scores and wages, IQ is at best a rough predictor of wages. We can quantify this by fitting a simple linear regression.
m_wage_iq$coefficientsCopy code
## (Intercept) iq ## 116.991565 8.303064 Copy code
## [1] 384.7667Copy code
Recall that under the model
If you use and the reference prior, then the Bayesian posterior mean and standard deviation are equal to the frequency estimate and standard deviation, respectively.
The Bayesian model specification assumes that the errors are normally distributed and the variance is constant. Like the frequency method, we test this hypothesis by checking the residual distribution of the model. If the residual is highly nonnormal or skewed, the assumption is violated and any subsequent inferences are invalid.

Check the residual of m_wage_iq. Is the assumption of normal distribution error valid?
 No, because the residual distribution of the model is skewed to the right.
qqnorm(m_wage_iq$residuals) qqline(m_wage_iq$residuals) Copy code
Exercise: Readjust the model, this time using educ (education) as the independent variable. Did your answer to the previous exercise change?
## (Intercept) educ ## 146.95244 60.21428 Copy code
summary (m_wage_educ) $ sigma copy the code
## [1] 382.3203Copy code
The same conclusion is that the residual of the linear model is approximately normally distributed with i N(0, 2), so further inferences can be made on the basis of the linear model.
Variable conversion
One way to adapt to the right skew of the data is to (naturally) transform the dependent variable. Note that this is only possible when the variable is strictly positive, because the logarithm of the negative value is not defined, and log(0)= . We try to fit a linear model with logarithmic wages as the dependent variable. Question 4 will be based on this logarithmic transformation model.
m_lwage_iq = lm (lwage ~ iq, data = wage) copying the code
Exercise: Check the residuals of the model. Is the residual of the assumption of a normal distribution reasonable?
Based on the above residual plot, it can be assumed that the logarithmic wage linear model and the normal distribution of iq.
Recall that the posterior distributions of and for a given 2 are normal, but slightly follow a t distribution with n p 1 degrees of freedom. In this case, p=1, because IQ is the only predictor of logarithmic wages in our model. Therefore, the posterior probabilities of and both follow the t distribution of 933 degrees of freedom. Because df is very large, these distributions are actually approximately normal.

Under the reference prior p( , , 2) 1/2, the 95% posterior confidence interval of , namely the IQ coefficient, is given.
 (0.00709, 0.01050)
# Extract the coefficient value from the linear model m_lwage_iq qnorm(c(0.025, 0.975), mean = iq_mean_estimate, sd=iq_sd) Copy code
## [1] 0.007103173 0.010511141Copy code
Practice: The IQ coefficient is very small, which is expected, because an increase in IQ scores by one point can hardly have a high multiplier effect on wages. One way to make the coefficients easier to interpret is to normalize the IQ before putting it into the model. From the perspective of this new model, an IQ increase of 1 standard deviation (15 points) is estimated by what percentage of the salary increase?
IQ is standardized by the scale function. A 15point increase in IQ will cause an increase in wages.
coef(summary(m_lwage_scaled_iq))["siq", "Estimate"]*15+coef(summary(m_lwage_scaled_iq))["(Intercept)", "Estimate"] Copy code
## [1] 8.767568Copy code
Multiple linear regression
Obviously, wages can be explained by many predictive factors, such as experience, education and IQ. We can include all relevant covariates in the regression model, trying to explain as many wage changes as possible.
The use of. in lm tells R to include all covariates in the model, and then use wage to further modify it, and then exclude the wage variable from the model.
By default, the lm function performs a complete case analysis, so it deletes observations with missing (NA) values in one or more predictor variables.
Because of these missing values, we must make an additional assumption so that our inferences are valid. In other words, our data must be missing at random. For example, if all the first born children do not report their birth order, the data will not be randomly lost. In the absence of any additional information, we will assume that this is reasonable and use 663 complete observations (as opposed to the original 935) to fit the model. Both Bayesian and frequentist methods exist on data sets that deal with missing data, but they are beyond the scope of this course.

From this model, who earns more: married blacks or single nonblacks?
 Married black
Compared with a single nonblack man, all other equal, married blacks will receive the following multipliers.
married_black < married_coef*1+black_coef*1 married_black Copy code
## [1] 0.09561888Copy code
From the quick summary of the linear model, it can be seen that many coefficients of the independent variables are not statistically significant. You can select variables based on adjusted R2. This article introduces Bayesian Information Criterion (BIC), which is a metric that can be used for model selection. BIC is based on model fitting and penalizes the number of parameters proportionally based on the sample size. We can use the following command to calculate the BIC of the fully linear model:
BIC(m_lwage_full) Copy code
## [1] 586.3732Copy code
We can compare the BIC of the complete model and the simplified model. Let's try to remove the birth order from the model. To ensure that the observations remain the same, you can specify the data set as na.omit(wage), which only contains observations with no missing values.
m_lwage_nobrthord = lm (lwage ~. wage brthord, data = na.omit (wage)) copying the code
## [1] 582.4815Copy code
As you can see, removing the birth order from the regression reduces BIC, and we tried to minimize BIC by choosing a model.

Which variable is eliminated from the full model to get the lowest BIC?
 feduc
BIC(m_lwage_sibs) Copy code
## [1] 581.4031Copy code
BIC(m_lwage_feduc) Copy code
## [1] 580.9743Copy code
BIC(m_lwage_meduc) Copy code
## [1] 582.3722Copy code
Exercise: R has a function stepAIC, which will run backwards in the model space, deleting variables until the BIC no longer decreases. It takes a complete model and a penalty parameter k as input. Find the best model according to BIC (in this case k=log(n)k=log(n)).
#For AIC, the penalty factor is a contact value k. For step BIC, we will use the stepAIC function and set k to log(n) step(m_lwage_full1, direction = "backward", k=log(n)) Copy code
## Residuals: ## Min 1Q Median 3Q Max ## 172.57 63.43 35.43 23.39 1065.78 ## ## Coefficients: ## Estimate Std. Error t value Pr(>t) ## (Intercept) 5546.2382 84.7839 65.416 <2e16 *** ## hours 1.9072 0.6548 2.913 0.0037 ** ## tenure 4.1285 0.9372 4.405 1.23e05 *** ## lwage 951.0113 11.5041 82.667 <2e16 *** ##  ## Signif. codes: 0'***' 0.001'**' 0.01'*' 0.05'.' 0.1 '' 1 ## ## Residual standard error: 120.1 on 659 degrees of freedom ## Multiple Rsquared: 0.9131, Adjusted Rsquared: 0.9127 ## Fstatistic: 2307 on 3 and 659 DF, pvalue: <2.2e16 Copy code
Bayesian model average
Usually, several models are equally credible, and choosing only one model ignores the inherent uncertainty involved in choosing the variables included in the model. One way to solve this problem is to implement Bayesian model averaging (BMA), that is, to average multiple models to obtain the posterior and predicted values of coefficients from new data. We can use it to implement BMA or select models. We first apply BMA to salary data.
bma(lwage ~. wage, data = wage_no_na, prior = "BIC", modelprior = uniform()) Copy code
## ## Marginal Posterior Inclusion Probabilities: ## Intercept hours iq kww educ exper ## 1.00000 0.85540 0.89732 0.34790 0.99887 0.70999 ## tenure age married1 black1 south1 urban1 ## 0.70389 0.52468 0.99894 0.34636 0.32029 1.00000 ## sibs brthord meduc feduc ## 0.04152 0.12241 0.57339 0.23274 Copy code
summary(bma_lwage) Copy code
## P(B != 0  Y) model 1 model 2 model 3 ## Intercept 1.00000000 1.0000 1.0000000 1.0000000 ## hours 0.85540453 1.0000 1.0000000 1.0000000 ## iq 0.89732383 1.0000 1.0000000 1.0000000 ## kww 0.34789688 0.0000 0.0000000 0.0000000 ## educ 0.99887165 1.0000 1.0000000 1.0000000 ## exper 0.70999255 0.0000 1.0000000 1.0000000 ## tenure 0.70388781 1.0000 1.0000000 1.0000000 ## age 0.52467710 1.0000 1.0000000 0.0000000 ## married1 0.99894488 1.0000 1.0000000 1.0000000 ## black1 0.34636467 0.0000 0.0000000 0.0000000 ## south1 0.32028825 0.0000 0.0000000 0.0000000 ## urban1 0.99999983 1.0000 1.0000000 1.0000000 ## sibs 0.04152242 0.0000 0.0000000 0.0000000 ## brthord 0.12241286 0.0000 0.0000000 0.0000000 ## meduc 0.57339302 1.0000 1.0000000 1.0000000 ## feduc 0.23274084 0.0000 0.0000000 0.0000000 ## BF NA 1.0000 0.5219483 0.5182769 ## PostProbs NA 0.0455 0.0237000 0.0236000 ## R2 NA 0.2710 0.2767000 0.2696000 ## dim NA 9.0000 10.0000000 9.0000000 ## logmarg NA 1490.0530 1490.7032349 1490.7102938 ## model 4 model 5 ## Intercept 1.0000000 1.0000000 ## hours 1.0000000 1.0000000 ## iq 1.0000000 1.0000000 ## kww 1.0000000 0.0000000 ## educ 1.0000000 1.0000000 ## exper 1.0000000 0.0000000 ## tenure 1.0000000 1.0000000 ## age 0.0000000 1.0000000 ## married1 1.0000000 1.0000000 ## black1 0.0000000 1.0000000 ## south1 0.0000000 0.0000000 ## urban1 1.0000000 1.0000000 ## sibs 0.0000000 0.0000000 ## brthord 0.0000000 0.0000000 ## meduc 1.0000000 1.0000000 ## feduc 0.0000000 0.0000000 ## BF 0.4414346 0.4126565 ## PostProbs 0.0201000 0.0188000 ## R2 0.2763000 0.2762000 ## dim 10.0000000 10.0000000 ## logmarg 1490.8707736 1490.9381880 Copy code
The output model object and the summary command provide us with the posterior model inclusion probability and the most probable model for each variable. For example, the posterior probability of including hours in the model is 0.855. In addition, the most likely model with a posterior probability of 0.0455 includes intercept, working hours, IQ, education level, age, marital status, urban living status, and mother s education level. Although the posterior probability of 0.0455 sounds small, it is much larger than the unified prior probability assigned to it because there are 216 possible models.
Under the model average method, the posterior distribution of the coefficients can also be visualized. We draw the posterior distribution of the IQ coefficient as follows.
plot(coef_lwage, subset = c(3,13), ask=FALSE) Copy code
We can also provide 95% confidence intervals for these coefficients:
## 2.5% 97.5% beta ## Intercept 6.787648e+00 6.841901e+00 6.8142970694 ## hours 9.213871e03 0.000000e+00 0.0053079979 ## iq 0.000000e+00 6.259498e03 0.0037983313 ## kww 0.000000e+00 8.290187e03 0.0019605787 ## educ 2.247901e02 6.512858e02 0.0440707549 ## exper 0.000000e+00 2.100097e02 0.0100264057 ## tenure 0.000000e+00 1.288251e02 0.0059357058 ## age 0.000000e+00 2.541561e02 0.0089659753 ## married1 1.173088e01 3.015797e01 0.2092940731 ## black1 1.900452e01 0.000000e+00 0.0441863361 ## south1 1.021332e01 1.296992e05 0.0221757978 ## urban1 1.374767e01 2.621311e01 0.1981221313 ## sibs 0.000000e+00 0.000000e+00 0.0000218455 ## brthord 2.072393e02 0.000000e+00 0.0019470674 ## meduc 0.000000e+00 2.279918e02 0.0086717156 ## feduc 7.996880e06 1.558576e02 0.0025125930 ## attr(,"Probability") ## [1] 0.95 ## attr(,"class") ## [1] "confint.bas" Copy code
For questions 78, we will use a simplified data set that does not include the number of siblings, birth order, and parent education.
wage_red = wage%>% dplyr :: select (sibs, brthord, meduc, feduc) copy the code

Based on this simplified data set, according to Bayesian model averaging, which of the following variables has the lowest marginal posterior inclusion probability?
 age
## ## Call: ## ## ## Marginal Posterior Inclusion Probabilities: ## Intercept hours iq kww educ exper ## 1.0000 0.8692 0.9172 0.3217 1.0000 0.9335 ## tenure age married1 black1 south1 urban1 ## 0.9980 0.1786 0.9999 0.9761 0.8149 1.0000 Copy code
## P(B != 0  Y) model 1 model 2 model 3 ## Intercept 1.0000000 1.0000 1.0000000 1.0000000 ## hours 0.8691891 1.0000 1.0000000 1.0000000 ## iq 0.9171607 1.0000 1.0000000 1.0000000 ## kww 0.3216992 0.0000 1.0000000 0.0000000 ## educ 1.0000000 1.0000 1.0000000 1.0000000 ## exper 0.9334844 1.0000 1.0000000 1.0000000 ## tenure 0.9980015 1.0000 1.0000000 1.0000000 ## age 0.1786252 0.0000 0.0000000 0.0000000 ## married1 0.9999368 1.0000 1.0000000 1.0000000 ## black1 0.9761347 1.0000 1.0000000 1.0000000 ## south1 0.8148861 1.0000 1.0000000 0.0000000 ## urban1 1.0000000 1.0000 1.0000000 1.0000000 ## BF NA 1.0000 0.5089209 0.2629789 ## PostProbs NA 0.3311 0.1685000 0.0871000 ## R2 NA 0.2708 0.2751000 0.2634000 ## dim NA 10.0000 11.0000000 9.0000000 ## logmarg NA 2275.4209 2276.0963811 2276.7565998 ## model 4 model 5 ## Intercept 1.0000000 1.0000000 ## hours 1.0000000 0.0000000 ## iq 1.0000000 1.0000000 ## kww 0.0000000 0.0000000 ## educ 1.0000000 1.0000000 ## exper 1.0000000 1.0000000 ## tenure 1.0000000 1.0000000 ## age 1.0000000 0.0000000 ## married1 1.0000000 1.0000000 ## black1 1.0000000 1.0000000 ## south1 1.0000000 1.0000000 ## urban1 1.0000000 1.0000000 ## BF 0.2032217 0.1823138 ## PostProbs 0.0673000 0.0604000 ## R2 0.2737000 0.2628000 ## dim 11.0000000 9.0000000 ## logmarg 2277.0143763 2277.1229445 Copy code

Judgment: The posterior probability of the naive model including all variables is greater than 0.5. (The coefficient uses ZellnerSiow zero prior, and the model uses binomial (1,1) prior)
 Really
bma_lwage_fullCopy code
## ## Call: ## ## ## Marginal Posterior Inclusion Probabilities: ## Intercept hours iq kww educ exper ## 1.0000 0.9792 0.9505 0.6671 0.9998 0.8951 ## tenure age married1 black1 south1 urban1 ## 0.9040 0.7093 0.9998 0.7160 0.6904 1.0000 ## sibs brthord meduc feduc ## 0.3939 0.5329 0.7575 0.5360 Copy code
## P(B != 0  Y) model 1 model 2 model 3 model 4 ## Intercept 1.0000000 1.00000000 1.00000000 1.00000000 1.0000 ## hours 0.9791805 1.00000000 1.00000000 1.00000000 1.0000 ## iq 0.9504649 1.00000000 1.00000000 1.00000000 1.0000 ## kww 0.6670580 1.00000000 1.00000000 1.00000000 1.0000 ## educ 0.9998424 1.00000000 1.00000000 1.00000000 1.0000 ## exper 0.8950911 1.00000000 1.00000000 1.00000000 1.0000 ## tenure 0.9040156 1.00000000 1.00000000 1.00000000 1.0000 ## age 0.7092839 1.00000000 1.00000000 1.00000000 1.0000 ## married1 0.9997881 1.00000000 1.00000000 1.00000000 1.0000 ## black1 0.7160065 1.00000000 1.00000000 1.00000000 1.0000 ## south1 0.6903763 1.00000000 1.00000000 1.00000000 1.0000 ## urban1 1.0000000 1.00000000 1.00000000 1.00000000 1.0000 ## sibs 0.3938833 1.00000000 1.00000000 0.00000000 0.0000 ## brthord 0.5329258 1.00000000 1.00000000 1.00000000 0.0000 ## meduc 0.7575462 1.00000000 1.00000000 1.00000000 1.0000 ## feduc 0.5359832 1.00000000 0.00000000 1.00000000 0.0000 ## BF NA 0.01282537 0.06040366 0.04899546 1.0000 ## PostProbs NA 0.07380000 0.02320000 0.01880000 0.0126 ## R2 NA 0.29250000 0.29140000 0.29090000 0.2882 ## dim NA 16.00000000 15.00000000 15.00000000 13.0000 ## logmarg NA 76.00726935 77.55689364 77.34757164 80.3636 ## model 5 ## Intercept 1.000000 ## hours 1.000000 ## iq 1.000000 ## kww 1.000000 ## educ 1.000000 ## exper 1.000000 ## tenure 1.000000 ## age 1.000000 ## married1 1.000000 ## black1 1.000000 ## south1 1.000000 ## urban1 1.000000 ## sibs 0.000000 ## brthord 1.000000 ## meduc 1.000000 ## feduc 0.000000 ## BF 0.227823 ## PostProbs 0.012500 ## R2 0.289600 ## dim 14.000000 ## logmarg 78.884413 Copy code
Exercise: Use a data set to plot the posterior distribution of the age coefficient.
plot(coef_bma_wage_red, ask = FALSE) Copy code
prediction
A major advantage of Bayesian statistics is the probabilistic interpretation of predictions and predictions. Most Bayesian predictions are done using simulation techniques. This is usually used in regression modeling, although we will analyze it through an example that contains only the intercept term.
Suppose you observe four numerical observations of y, namely 2, 2, 0, and 0, the sample mean y =1, and the sample variance s2=4/3. Assuming y N( , 2), under the reference prior p( , 2) 1/2, our posterior probability becomes
Centered on sample mean
Where a=(n1)/2 and b=s2(n1)/2=2.
In order to get the predicted distribution of y5, we can first simulate from the posterior point of 2, and then simulate y5 from . Our forecast for y5 years will come from the posterior forecast distribution of a new observation. The following example extracts 100,000 times from the posterior prediction distribution of y5.
set.seed(314) N = 100000 y_5 = rnorm(N, mu, sqrt(sigma2)) Copy code
We can view the estimated value of the predicted distribution by observing the smoothed version of the histogram of the simulated data.
The 95% central confidence interval of the new observation is that in this case, L is the 0.025 quantile and U is the 0.975 quantile. We can use the quantile function to obtain these values to find the sample quantiles of 0.025 and 0.975 for tracy5.

Estimate a new observation y595% confidence interval
 (3.11, 5.13)
quantile(y_5, probs = c(0.025, 0.975)) Copy code
## 2.5% 97.5% ## 3.109585 5.132511 Copy code
Exercise: In the simple example above, you can use integral to analyze and calculate the posterior prediction value. In this case, it is a t distribution with 3 degrees of freedom (n 1). Plot the empirical density of y and the actual density of the t distribution. What is the comparison between them?
plot(den_of_y) Copy code
BAS forecast
In BAS, the Bayesian model averaging method is used to construct the prediction interval through simulation. In the case of model selection, it is often feasible to use the prediction interval for precise reasoning.
Returning to the salary data set, let us find the predicted value under the best prediction model, that is, the model with the predicted value closest to the BMA and the corresponding posterior standard deviation.
predict (bma_lwage, estimator = "BPM ") copy the code
## [1] "Intercept" "hours" "iq" "kww" "educ" ## [6] "exper" "tenure" "age" "married1" "urban1" ## [11] "meduc" Copy code
We can compare it with the highest probability model and median probability model (MPM) we have previously discovered
predict (bma_lwage, estimator = "MPM ") copy the code
## [1] "Intercept" "hours" "iq" "educ" "exper" ## [6] "tenure" "age" "married1" "urban1" "meduc" Copy code
In addition to all the variables in HPM, MPM also contains exper, while BPM also contains kwh in addition to all variables in MPM.
Exercise: Using simplified data, what covariates are included in the best predictive model, median probability model, and highest posterior probability model?
Let's take a look at which features in the BPM model affect the maximum salary.
## [,1] ## wage "1586" ## hours "40" ## iq "127" ## kww "48" ## educ "16" ## exper "16" ## tenure "12" ## age "37" ## married "1" ## black "0" ## south "0" ## urban "1" ## sibs "4" ## brthord "4" ## meduc "16" ## feduc "16" ## lwage "7.36897" Copy code
A 95% confidence interval for predicting logarithmic wages can be obtained
## 2.5% 97.5% pred ## 6.661869 8.056452 7.359160 Copy code
Converted to wages, we can index the interval.
## 2.5% 97.5% pred ## 782.0108 3154.0780 1570.5169 Copy code
Get a salary within a 95% prediction interval.
If BMA is used, the interval is
## 2.5% 97.5% pred ## 733.3446 2989.2076 1494.9899 Copy code
Exercise: Using the simplified data, construct a 95% prediction interval for the individual with the highest predicted salary under BPM.
references
Wooldridge, Jeffrey. 2000. Introductory Econometrics A Modern Approach . SouthWestern College Publishing.
Most popular insights
1. Matlab uses Bayesian optimized deep learning
2. Matlab Bayesian Hidden Markov hmm model implementation
3. Bayesian simple linear regression simulation of Gibbs sampling in R language
4. Block Gibbs Gibbs sampling Bayesian multiple linear regression in R language
5. The Bayesian model of Stan probabilistic programming MCMC sampling in R language
6. Python uses PyMC3 to implement Bayesian linear regression model
7. R language uses Bayesian hierarchical model for spatial data analysis