R language Bayesian linear regression and multiple linear regression to construct a salary forecasting model


Original link: tecdat.cn/?p=21641 

Original source: Tuoduan Data Tribe Official Account

 

Salary model

In the field of labor economics, the study of income and wages provides insight into issues ranging from gender discrimination to higher education. In this article, we analyze cross-sectional wage data in order to practice building wage forecasting models with Bayesian methods, including BIC-based model selection and Bayesian model averaging.

Load packages

In this lab, we will use the dplyr package to explore the data and the ggplot2 package for data visualization. We can also use the MASS package in one of the exercises to implement stepwise linear regression.

Later in the lab we will use the bas.lm function from the BAS package to fit the Bayesian models.
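A minimal sketch of the corresponding package loading (assuming, as above, that bas.lm comes from the BAS package):

library(dplyr)    # data exploration
library(ggplot2)  # visualization
library(MASS)     # stepAIC for stepwise selection
library(BAS)      # bas.lm for Bayesian model averaging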

Data

The data used in this lab are a random sample of 935 respondents from across the country.

Variable   Description
wage       Weekly income
hours      Average hours worked per week
iq         IQ score
kww        Work knowledge score
educ       Years of education
exper      Years of work experience
tenure     Years with current employer
age        Age
married    = 1 if married
black      = 1 if black
south      = 1 if living in the south
urban      = 1 if living in a city
sibs       Number of siblings
brthord    Birth order
meduc      Mother's education
feduc      Father's education
lwage      Natural logarithm of wage

Is this an observational study or an experiment?

  1. Observational study

 

Explore the data

As with any new data set, standard exploratory data analysis is a good start. We will start with the wage variable because it is the dependent variable in our model.

  1. Which of the following statements about wage is false?

    1. 7 respondents earn less than $300 a week

summary(wage$wage)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   115.0   669.0   905.0   957.9  1160.0  3078.0

Since wage is our dependent variable, we want to explore which other variables might be good predictors of it.
Exercise: Excluding wage and lwage, choose two other variables that you think are good predictors of wage. Visualize their relationship with wage using appropriate plots.
Education and work experience appear to be good predictors of a worker's wage.

ggplot(data = wage, aes(y = wage, x = exper)) + geom_point()

ggplot(data = wage, aes(y = wage, x = educ)) + geom_point()

Simple linear regression

One possible explanation for the wage differences we see in the data is that smarter people make more money. The figure below shows a scatter plot between weekly salary and IQ score.

ggplot(data = wage, aes(x = iq, y = wage)) + geom_point()

This plot is quite noisy. Although there may be a slight positive linear relationship between IQ score and wage, IQ is at best a rough predictor of wage. We can quantify this by fitting a simple linear regression.

m_wage_iq <- lm(wage ~ iq, data = wage)
m_wage_iq$coefficients
## (Intercept)          iq 
##  116.991565    8.303064 
summary(m_wage_iq)$sigma
## [1] 384.7667

Recall that under the model

y_i = α + β·iq_i + ε_i,   ε_i ~ N(0, σ²),

if we use the reference prior p(α, β, σ²) ∝ 1/σ², then the Bayesian posterior means and standard deviations are equal to the frequentist estimates and standard errors, respectively.

The Bayesian model specification assumes that the errors are normally distributed with constant variance. As in the frequentist approach, we check this assumption by examining the distribution of the residuals of the model. If the residuals are highly non-normal or skewed, the assumption is violated and any subsequent inference is not valid.

  1. Examine the residuals of m_wage_iq. Is the assumption of normally distributed errors valid?

    1. No, because the residuals of the model are right-skewed.

qqnorm(m_wage_iq$residuals)
qqline(m_wage_iq$residuals)

Exercise: Refit the model, this time using educ (years of education) as the independent variable. Does your answer to the previous exercise change?

m_wage_educ <- lm(wage ~ educ, data = wage)
m_wage_educ$coefficients
## (Intercept)        educ 
##   146.95244    60.21428 
summary(m_wage_educ)$sigma
## [1] 382.3203

The conclusion is the same: the residuals of the linear model are approximately normally distributed, ε_i ~ N(0, σ²), so further inference can be made on the basis of the linear model.

Variable transformation

One way to accommodate the right-skewness in the data is to (natural-)log transform the dependent variable. Note that this is only possible when the variable is strictly positive, because the logarithm of a negative value is not defined and log(0) = −∞. We try to fit a linear model with log wage as the dependent variable. Question 4 will be based on this log-transformed model.

m_lwage_iq = lm(lwage ~ iq, data = wage)

Exercise: Examine the residuals of this model. Is the assumption of normally distributed residuals reasonable?

Based on the residual plot (a residual check is sketched below), the normality assumption seems reasonable for the linear model of log wage on iq.
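A minimal sketch of such a residual check, assuming the m_lwage_iq fit above:

qqnorm(m_lwage_iq$residuals)
qqline(m_lwage_iq$residuals)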

Recall that the posterior distributions of α and β given σ² are normal, but marginally they follow a t distribution with n − p − 1 degrees of freedom. In this case p = 1, because iq is the only predictor of log wage in our model. Therefore the posteriors of α and β both follow a t distribution with 933 degrees of freedom; because the degrees of freedom are so large, these distributions are in fact approximately normal.

  1. Under the reference prior p(α, β, σ²) ∝ 1/σ², give a 95% posterior credible interval for β, the iq coefficient.

    1. (0.00709, 0.01050)

# Extract the coefficient estimate and standard error from the linear model m_lwage_iq
qnorm(c(0.025, 0.975), mean = iq_mean_estimate, sd = iq_sd)
## [1] 0.007103173 0.010511141
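The quantities iq_mean_estimate and iq_sd used above are not defined in the excerpt; one plausible way to obtain them from the frequentist fit (which, under the reference prior, matches the posterior mean and standard deviation) is:

iq_coef_table    <- coef(summary(m_lwage_iq))
iq_mean_estimate <- iq_coef_table["iq", "Estimate"]    # posterior mean of the iq coefficient
iq_sd            <- iq_coef_table["iq", "Std. Error"]  # posterior standard deviation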

Exercise: The iq coefficient is very small, which is to be expected, since a one-point increase in IQ score can hardly have a large multiplicative effect on wages. One way to make the coefficient easier to interpret is to standardize iq before putting it into the model. Based on this new model, what increase in wages is estimated to be associated with an IQ increase of one standard deviation (15 points)?

iq is standardized with the scale function before the model is fit; the estimated effect of a 15-point increase in IQ on wages is computed below.
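The scaled-IQ model itself is not shown in the excerpt; a minimal sketch consistent with the code below, assuming the standardized variable is named siq:

wage$siq <- scale(wage$iq)                        # standardize IQ (mean 0, sd 1)
m_lwage_scaled_iq <- lm(lwage ~ siq, data = wage)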

coef(summary(m_lwage_scaled_iq))["siq", "Estimate"] * 15 +
  coef(summary(m_lwage_scaled_iq))["(Intercept)", "Estimate"]
## [1] 8.767568

Multiple linear regression

Obviously, wage can be explained by many predictors, such as experience, education, and IQ. We can include all the relevant covariates in a regression model and try to explain as much of the variation in wages as possible.

Using . in the lm formula tells R to include all covariates in the model; adding -wage then excludes the wage variable from the model.
By default, the lm function performs a complete-case analysis, so it drops any observation with a missing (NA) value in one or more of the predictor variables.
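The full model fit itself is not shown in the excerpt; a minimal sketch consistent with the description (the object name m_lwage_full is taken from the BIC code further below):

m_lwage_full <- lm(lwage ~ . - wage, data = na.omit(wage))
summary(m_lwage_full)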

Because of these missing values, we must make an additional assumption for our inference to be valid: the data must be missing at random. For example, if all first-born children failed to report their birth order, the data would not be missing at random. In the absence of any additional information, we will assume this is reasonable and fit the model using the 663 complete observations (as opposed to the original 935). Both Bayesian and frequentist methods exist for handling data sets with missing data, but they are beyond the scope of this course.

  1. According to this model, who earns more: a married black person or a single non-black person?

    1. Married black

All else being equal, the difference in predicted log wage between a married black worker and a single non-black worker is the sum of the married and black coefficients:

married_black <- married_coef * 1 + black_coef * 1
married_black
## [1] 0.09561888
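The objects married_coef and black_coef are not defined in the excerpt; a plausible extraction from the full model fit (assuming the dummy coefficients are named married1 and black1, as in the BMA output later):

married_coef <- coef(m_lwage_full)["married1"]
black_coef   <- coef(m_lwage_full)["black1"]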

A quick summary of the full linear model shows that many coefficients of the independent variables are not statistically significant. Variables can be selected based on adjusted R²; this article instead introduces the Bayesian Information Criterion (BIC), a metric that can be used for model selection. BIC rewards model fit and penalizes the number of parameters, with a penalty that depends on the sample size. We can compute the BIC of the full linear model with the following command:

BIC(m_lwage_full)
## [1] 586.3732

We can compare the BIC of the full model with that of a reduced model. Let's try removing birth order from the model. To ensure that the set of observations stays the same, we specify the data set as na.omit(wage), which contains only the observations with no missing values.

m_lwage_nobrthord = lm(lwage ~ . - wage - brthord, data = na.omit(wage))
BIC(m_lwage_nobrthord)
## [1] 582.4815

As you can see, removing birth order from the regression reduces the BIC, which we aim to minimize when choosing a model.

  1. Which variable is eliminated from the full model to get the lowest BIC?

    1. feduc
BIC(m_lwage_sibs)
## [1] 581.4031
BIC(m_lwage_feduc)
## [1] 580.9743
BIC(m_lwage_meduc)
## [1] 582.3722
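The models being compared above are not defined in the excerpt; presumably each one drops a single variable from the full model, for example:

m_lwage_sibs  <- lm(lwage ~ . - wage - sibs,  data = na.omit(wage))
m_lwage_feduc <- lm(lwage ~ . - wage - feduc, data = na.omit(wage))
m_lwage_meduc <- lm(lwage ~ . - wage - meduc, data = na.omit(wage))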

Exercise: R has a function stepAIC that works backwards through the model space, removing variables until the information criterion no longer decreases. It takes a full model and a penalty parameter k as input. Find the best model according to BIC (in this case, k = log(n)).

# For AIC the penalty factor is a constant k; for BIC we set k = log(n)
n <- nrow(na.omit(wage))   # number of complete cases
step(m_lwage_full1, direction = "backward", k = log(n))
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -172.57  -63.43  -35.43   23.39 1065.78 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -5546.2382    84.7839 -65.416  < 2e-16 ***
## hours           1.9072     0.6548   2.913   0.0037 ** 
## tenure         -4.1285     0.9372  -4.405 1.23e-05 ***
## lwage         951.0113    11.5041  82.667  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 120.1 on 659 degrees of freedom
## Multiple R-squared:  0.9131, Adjusted R-squared:  0.9127 
## F-statistic:  2307 on 3 and 659 DF,  p-value: < 2.2e-16

Bayesian model averaging

Often several models are almost equally credible, and choosing only one of them ignores the inherent uncertainty involved in selecting the variables to include in the model. One way to address this is Bayesian model averaging (BMA), which averages over many models to obtain posterior distributions of the coefficients and predictions for new data. The BAS package can be used both to implement BMA and to select models. We first apply BMA to the wage data.

wage_no_na = na.omit(wage)
bma_lwage = bas.lm(lwage ~ . - wage, data = wage_no_na, prior = "BIC", modelprior = uniform())
## 
## Marginal Posterior Inclusion Probabilities: 
## Intercept hours iq kww educ exper 
## 1.00000 0.85540 0.89732 0.34790 0.99887 0.70999 
## tenure age married1 black1 south1 urban1 
## 0.70389 0.52468 0.99894 0.34636 0.32029 1.00000 
## sibs brthord meduc feduc 
## 0.04152 0.12241 0.57339 0.23274 

summary(bma_lwage)
## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.00000000 1.0000 1.0000000 1.0000000
## hours 0.85540453 1.0000 1.0000000 1.0000000
## iq 0.89732383 1.0000 1.0000000 1.0000000
## kww 0.34789688 0.0000 0.0000000 0.0000000
## educ 0.99887165 1.0000 1.0000000 1.0000000
## exper 0.70999255 0.0000 1.0000000 1.0000000
## tenure 0.70388781 1.0000 1.0000000 1.0000000
## age 0.52467710 1.0000 1.0000000 0.0000000
## married1 0.99894488 1.0000 1.0000000 1.0000000
## black1 0.34636467 0.0000 0.0000000 0.0000000
## south1 0.32028825 0.0000 0.0000000 0.0000000
## urban1 0.99999983 1.0000 1.0000000 1.0000000
## sibs 0.04152242 0.0000 0.0000000 0.0000000
## brthord 0.12241286 0.0000 0.0000000 0.0000000
## meduc 0.57339302 1.0000 1.0000000 1.0000000
## feduc 0.23274084 0.0000 0.0000000 0.0000000
## BF NA 1.0000 0.5219483 0.5182769
## PostProbs NA 0.0455 0.0237000 0.0236000
## R2 NA 0.2710 0.2767000 0.2696000
## dim NA 9.0000 10.0000000 9.0000000
## logmarg NA -1490.0530 -1490.7032349 -1490.7102938
## model 4 model 5
## Intercept 1.0000000 1.0000000
## hours 1.0000000 1.0000000
## iq 1.0000000 1.0000000
## kww 1.0000000 0.0000000
## educ 1.0000000 1.0000000
## exper 1.0000000 0.0000000
## tenure 1.0000000 1.0000000
## age 0.0000000 1.0000000
## married1 1.0000000 1.0000000
## black1 0.0000000 1.0000000
## south1 0.0000000 0.0000000
## urban1 1.0000000 1.0000000
## sibs 0.0000000 0.0000000
## brthord 0.0000000 0.0000000
## meduc 1.0000000 1.0000000
## feduc 0.0000000 0.0000000
## BF 0.4414346 0.4126565
## PostProbs 0.0201000 0.0188000
## R2 0.2763000 0.2762000
## dim 10.0000000 10.0000000
## logmarg -1490.8707736 -1490.9381880

The printed model object and the summary command give us the posterior inclusion probability of each variable and the most probable models. For example, the posterior probability that hours is included in the model is 0.855. Furthermore, the most likely model, which has posterior probability 0.0455, includes the intercept, hours, iq, educ, tenure, age, marital status, urban living status, and mother's education. While a posterior probability of 0.0455 sounds small, it is much larger than the uniform prior probability assigned to it, since there are 2^16 possible models.
Under the model averaging approach, the posterior distributions of the coefficients can also be visualized. Below we plot the posterior distribution of the iq coefficient.
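The coefficient object coef_lwage used below is not defined in the excerpt; it is presumably extracted from the BMA fit, for example:

coef_lwage <- coef(bma_lwage)   # posterior distributions of the coefficients under BMA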

plot(coef_lwage, subset = c(3, 13), ask = FALSE)

We can also obtain 95% credible intervals for these coefficients:
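A likely call producing the output below, assuming coef_lwage as defined above:

confint(coef_lwage)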

## 2.5% 97.5% beta
## Intercept 6.787648e+00 6.841901e+00 6.8142970694
## hours -9.213871e-03 0.000000e+00 -0.0053079979
## iq 0.000000e+00 6.259498e-03 0.0037983313
## kww 0.000000e+00 8.290187e-03 0.0019605787
## educ 2.247901e-02 6.512858e-02 0.0440707549
## exper 0.000000e+00 2.100097e-02 0.0100264057
## tenure 0.000000e+00 1.288251e-02 0.0059357058
## age 0.000000e+00 2.541561e-02 0.0089659753
## married1 1.173088e-01 3.015797e-01 0.2092940731
## black1 -1.900452e-01 0.000000e+00 -0.0441863361
## south1 -1.021332e-01 1.296992e-05 -0.0221757978
## urban1 1.374767e-01 2.621311e-01 0.1981221313
## sibs 0.000000e+00 0.000000e+00 0.0000218455
## brthord -2.072393e-02 0.000000e+00 -0.0019470674
## meduc 0.000000e+00 2.279918e-02 0.0086717156
## feduc -7.996880e-06 1.558576e-02 0.0025125930
## attr(,"Probability")
## [1] 0.95
## attr(,"class")
## [1] "confint.bas"

For Questions 7-8, we will use a reduced data set that excludes the number of siblings, birth order, and parents' education.

wage_red = wage %>% dplyr::select(-sibs, -brthord, -meduc, -feduc)
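The BMA fit on this reduced data set is not shown in the excerpt; a plausible call mirroring the earlier one (the object name bma_lwage_red is an assumption):

bma_lwage_red <- bas.lm(lwage ~ . - wage, data = na.omit(wage_red),
                        prior = "BIC", modelprior = uniform())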
  1. Based on this simplified data set, according to Bayesian model averaging, which of the following variables has the lowest marginal posterior inclusion probability?

    1. age
## 
## Call:
## 
## 
## Marginal Posterior Inclusion Probabilities: 
## Intercept hours iq kww educ exper 
## 1.0000 0.8692 0.9172 0.3217 1.0000 0.9335 
## tenure age married1 black1 south1 urban1 
## 0.9980 0.1786 0.9999 0.9761 0.8149 1.0000 

## P(B != 0 | Y) model 1 model 2 model 3
## Intercept 1.0000000 1.0000 1.0000000 1.0000000
## hours 0.8691891 1.0000 1.0000000 1.0000000
## iq 0.9171607 1.0000 1.0000000 1.0000000
## kww 0.3216992 0.0000 1.0000000 0.0000000
## educ 1.0000000 1.0000 1.0000000 1.0000000
## exper 0.9334844 1.0000 1.0000000 1.0000000
## tenure 0.9980015 1.0000 1.0000000 1.0000000
## age 0.1786252 0.0000 0.0000000 0.0000000
## married1 0.9999368 1.0000 1.0000000 1.0000000
## black1 0.9761347 1.0000 1.0000000 1.0000000
## south1 0.8148861 1.0000 1.0000000 0.0000000
## urban1 1.0000000 1.0000 1.0000000 1.0000000
## BF NA 1.0000 0.5089209 0.2629789
## PostProbs NA 0.3311 0.1685000 0.0871000
## R2 NA 0.2708 0.2751000 0.2634000
## dim NA 10.0000 11.0000000 9.0000000
## logmarg NA -2275.4209 -2276.0963811 -2276.7565998
## model 4 model 5
## Intercept 1.0000000 1.0000000
## hours 1.0000000 0.0000000
## iq 1.0000000 1.0000000
## kww 0.0000000 0.0000000
## educ 1.0000000 1.0000000
## exper 1.0000000 1.0000000
## tenure 1.0000000 1.0000000
## age 1.0000000 0.0000000
## married1 1.0000000 1.0000000
## black1 1.0000000 1.0000000
## south1 1.0000000 1.0000000
## urban1 1.0000000 1.0000000
## BF 0.2032217 0.1823138
## PostProbs 0.0673000 0.0604000
## R2 0.2737000 0.2628000
## dim 11.0000000 9.0000000
## logmarg -2277.0143763 -2277.1229445
  1. True or False: the posterior probability of the naive model that includes all variables is greater than 0.5. (Use a Zellner-Siow null prior for the coefficients and a Beta-Binomial(1,1) prior for the models.)

    1. True
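The fit behind the output below is not shown; a plausible call matching the stated priors (Zellner-Siow null prior on the coefficients, Beta-Binomial(1,1) prior on the model space):

bma_lwage_full <- bas.lm(lwage ~ . - wage, data = na.omit(wage),
                         prior = "ZS-null", modelprior = beta.binomial(1, 1))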
bma_lwage_full
## 
## Call:
## 
## 
## Marginal Posterior Inclusion Probabilities: 
## Intercept hours iq kww educ exper 
## 1.0000 0.9792 0.9505 0.6671 0.9998 0.8951 
## tenure age married1 black1 south1 urban1 
## 0.9040 0.7093 0.9998 0.7160 0.6904 1.0000 
## sibs brthord meduc feduc 
## 0.3939 0.5329 0.7575 0.5360 

## P(B != 0 | Y) model 1 model 2 model 3 model 4
## Intercept 1.0000000 1.00000000 1.00000000 1.00000000 1.0000
## hours 0.9791805 1.00000000 1.00000000 1.00000000 1.0000
## iq 0.9504649 1.00000000 1.00000000 1.00000000 1.0000
## kww 0.6670580 1.00000000 1.00000000 1.00000000 1.0000
## educ 0.9998424 1.00000000 1.00000000 1.00000000 1.0000
## exper 0.8950911 1.00000000 1.00000000 1.00000000 1.0000
## tenure 0.9040156 1.00000000 1.00000000 1.00000000 1.0000
## age 0.7092839 1.00000000 1.00000000 1.00000000 1.0000
## married1 0.9997881 1.00000000 1.00000000 1.00000000 1.0000
## black1 0.7160065 1.00000000 1.00000000 1.00000000 1.0000
## south1 0.6903763 1.00000000 1.00000000 1.00000000 1.0000
## urban1 1.0000000 1.00000000 1.00000000 1.00000000 1.0000
## sibs 0.3938833 1.00000000 1.00000000 0.00000000 0.0000
## brthord 0.5329258 1.00000000 1.00000000 1.00000000 0.0000
## meduc 0.7575462 1.00000000 1.00000000 1.00000000 1.0000
## feduc 0.5359832 1.00000000 0.00000000 1.00000000 0.0000
## BF NA 0.01282537 0.06040366 0.04899546 1.0000
## PostProbs NA 0.07380000 0.02320000 0.01880000 0.0126
## R2 NA 0.29250000 0.29140000 0.29090000 0.2882
## dim NA 16.00000000 15.00000000 15.00000000 13.0000
## logmarg NA 76.00726935 77.55689364 77.34757164 80.3636
## model 5
## Intercept 1.000000
## hours 1.000000
## iq 1.000000
## kww 1.000000
## educ 1.000000
## exper 1.000000
## tenure 1.000000
## age 1.000000
## married1 1.000000
## black1 1.000000
## south1 1.000000
## urban1 1.000000
## sibs 0.000000
## brthord 1.000000
## meduc 1.000000
## feduc 0.000000
## BF 0.227823
## PostProbs 0.012500
## R2 0.289600
## dim 14.000000
## logmarg 78.884413

Exercise: Using the reduced data set, plot the posterior distribution of the age coefficient.
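The object coef_bma_wage_red used below is not defined in the excerpt; presumably it comes from the reduced-data fit sketched above, for example:

coef_bma_wage_red <- coef(bma_lwage_red)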

plot(coef_bma_wage_red, ask = FALSE)

Prediction

A major advantage of Bayesian statistics is the probabilistic interpretation of its predictions, and most Bayesian prediction is carried out using simulation techniques. This is typically applied in regression modelling, although we will illustrate it with an example that contains only an intercept term.

Suppose we observe four numerical observations of y, namely 2, 2, 0, and 0, with sample mean ȳ = 1 and sample variance s² = 4/3. Assuming y_i ~ N(μ, σ²), under the reference prior p(μ, σ²) ∝ 1/σ² the posterior becomes

μ | σ², data ~ N(ȳ, σ²/n),    1/σ² | data ~ Gamma(a, b),

centered on the sample mean, where a = (n − 1)/2 and b = s²(n − 1)/2 = 2.

To obtain the predictive distribution of y_5, we can first simulate σ² from its posterior, then μ given σ², and finally y_5 from N(μ, σ²). Our prediction for y_5 thus comes from the posterior predictive distribution of a new observation. The following example draws 100,000 samples from the posterior predictive distribution of y_5.
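The vectors mu and sigma2 used in the next chunk are not defined in the excerpt; a minimal sketch of how they could be drawn from the posterior derived above:

n <- 4; ybar <- 1; s2 <- 4/3
a <- (n - 1) / 2                                 # shape = 1.5
b <- s2 * (n - 1) / 2                            # rate = 2
phi    <- rgamma(100000, shape = a, rate = b)    # posterior draws of 1/sigma^2
sigma2 <- 1 / phi                                # posterior draws of sigma^2
mu     <- rnorm(100000, mean = ybar, sd = sqrt(sigma2 / n))  # mu | sigma^2, data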

set.seed(314)
N = 100000
y_5 = rnorm(N, mu, sqrt(sigma2))

We can view an estimate of the predictive distribution by looking at a smoothed version of the histogram of the simulated data.
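For example, a kernel density estimate of the simulated draws (a sketch):

plot(density(y_5), main = "Simulated predictive distribution of y_5")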

A 95% central credible interval for the new observation is (L, U), where L is the 0.025 quantile and U is the 0.975 quantile. We can obtain these values with the quantile function, by finding the 0.025 and 0.975 sample quantiles of y_5.

  1. Estimate a 95% credible interval for a new observation y_5.

    1. (-3.11, 5.13)

quantile(y_5, probs = c(0.025, 0.975))
##      2.5%     97.5% 
## -3.109585  5.132511

Exercise: In the simple example above, the posterior predictive distribution can also be worked out analytically by integration; in this case it is a (scaled and shifted) t distribution with n − 1 = 3 degrees of freedom. Plot the empirical density of y_5 against the exact t density. How do they compare?

plot(den_of_y)
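The object den_of_y is not defined in the excerpt; presumably den_of_y <- density(y_5). A sketch that also overlays the exact Student-t predictive density, y_5 | data ~ ybar + s·sqrt(1 + 1/n)·t_{n-1}, assuming ybar, s2, and n from the sketch above:

den_of_y <- density(y_5)
plot(den_of_y, main = "Empirical vs. exact predictive density of y_5")
grid <- seq(min(y_5), max(y_5), length.out = 500)
sc   <- sqrt(s2 * (1 + 1 / n))                       # predictive scale
lines(grid, dt((grid - ybar) / sc, df = n - 1) / sc, lty = 2, col = "red")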

Prediction with BAS

In BAS, prediction intervals under Bayesian model averaging are constructed by simulation, whereas in the case of model selection exact inference for prediction intervals is often feasible.

Returning to the wage data set, let us find the predicted values under the best predictive model (BPM), i.e. the model whose predictions are closest to those of BMA, together with the corresponding posterior standard deviations.

predict (bma_lwage, estimator = "BPM ") copy the code
## [1] "Intercept" "hours" "iq" "kww" "educ" ## [6] "exper" "tenure" "age" "married1" "urban1" ## [11] "meduc" Copy code

We can compare this with the highest probability model (HPM) and the median probability model (MPM) found earlier.

predict (bma_lwage, estimator = "MPM ") copy the code
## [1] "Intercept" "hours" "iq" "educ" "exper" ## [6] "tenure" "age" "married1" "urban1" "meduc" Copy code

The MPM contains exper in addition to all the variables in the HPM, while the BPM contains kww in addition to all the variables in the MPM.
Exercise: Using the simplified data set, which covariates are included in the best predictive model, the median probability model, and the highest posterior probability model?
Let's take a look at the characteristics of the individual with the highest predicted wage under the BPM.

## [,1] ## wage "1586" ## hours "40" ## iq "127" ## kww "48" ## educ "16" ## exper "16" ## tenure "12" ## age "37" ## married "1" ## black "0" ## south "0" ## urban "1" ## sibs "4" ## brthord "4" ## meduc "16" ## feduc "16" ## lwage "7.36897" Copy code

A 95% prediction interval for this individual's log wage can be obtained:

##     2.5%    97.5%     pred 
## 6.661869 8.056452 7.359160

To convert back to wages, we can exponentiate the endpoints of the interval.
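For instance, exponentiating the interval endpoints reported above (a quick check; the lab presumably computes this from the prediction object directly):

exp(c(6.661869, 8.056452, 7.359160))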

##     2.5%     97.5%      pred 
## 782.0108 3154.0780 1570.5169

This gives a 95% prediction interval for this individual's wage.
If BMA is used instead, the interval is:

##     2.5%     97.5%      pred 
## 733.3446 2989.2076 1494.9899

Exercise: Using the simplified data set, construct a 95% prediction interval for the individual with the highest predicted wage under the BPM.

References

Wooldridge, Jeffrey. 2000. Introductory Econometrics: A Modern Approach. South-Western College Publishing.

