R language: polynomial regression and spline regression analysis of the ISLR wage data


Original link: tecdat.cn/?p=8531

Original source: Tuoduan Data Tribe Official Account

 

  1. Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree for the polynomial. What degree is selected, and how does this compare with the results of hypothesis testing using ANOVA? Plot the resulting polynomial fit to the data.

We load the wage data set and keep an array of all cross-validation errors, performing 10-fold cross-validation (K = 10).

rm(list = ls())
set.seed(1)
library(ISLR)   # Wage data
library(boot)   # cv.glm()

# Test error for each polynomial degree of age
cv.MSE <- NA
for (i in 1:15) {
  glm.fit <- glm(wage ~ poly(age, i), data = Wage)
  # 10-fold cross-validation with cv.glm(); keep the CV error
  cv.MSE[i] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}
# Inspect the result object
cv.MSE

 

## [1] 1675.837 1601.012 1598.801 1594.217 1594.625 1594.888 1595.500
## [8] 1595.436 1596.335 1595.835 1595.970 1597.971 1598.713 1599.253
## [15] 1595.332

We plot the relationship with type = "b", which draws both points and lines, to illustrate the result.

# Illustrate the result with a line-and-point plot
plot(x = 1:15, y = cv.MSE, xlab = "power of age", ylab = "CV error",
     type = "b", pch = 19, lwd = 2, bty = "n",
     ylim = c(min(cv.MSE) - sd(cv.MSE), max(cv.MSE) + sd(cv.MSE)))
# Horizontal one-standard-deviation line
abline(h = min(cv.MSE) + sd(cv.MSE), lty = "dotted")
# Mark the minimum
points(x = which.min(cv.MSE), y = min(cv.MSE), col = "red", pch = "X", cex = 1.5)

We now fit models with increasing powers of age and compare them with an analysis of variance; a sketch of the fitting code is shown below.
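The fitting code is not shown in the original; a minimal sketch of one way to produce the deviance table below (the loop variable a and the do.call() construction are assumptions):

fits <- list()
for (a in 1:15) {
  # Nested models: wage regressed on polynomials of age of degree 1 to 15
  fits[[a]] <- glm(wage ~ poly(age, a), data = Wage)
}
# Compare all nested fits with F tests; this reproduces the table below
do.call(anova, c(fits, test = "F"))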

## Analysis of Deviance Table
##
## Model  1: wage ~ poly(age, a)
## Model  2: wage ~ poly(age, a)
## Model  3: wage ~ poly(age, a)
## Model  4: wage ~ poly(age, a)
## Model  5: wage ~ poly(age, a)
## Model  6: wage ~ poly(age, a)
## Model  7: wage ~ poly(age, a)
## Model  8: wage ~ poly(age, a)
## Model  9: wage ~ poly(age, a)
## Model 10: wage ~ poly(age, a)
## Model 11: wage ~ poly(age, a)
## Model 12: wage ~ poly(age, a)
## Model 13: wage ~ poly(age, a)
## Model 14: wage ~ poly(age, a)
## Model 15: wage ~ poly(age, a)
##    Resid. Df Resid. Dev Df Deviance        F    Pr(>F)
## 1       2998    5022216
## 2       2997    4793430  1   228786 143.5637 < 2.2e-16 ***
## 3       2996    4777674  1    15756   9.8867  0.001681 **
## 4       2995    4771604  1     6070   3.8090  0.051070 .
## 5       2994    4770322  1     1283   0.8048  0.369731
## 6       2993    4766389  1     3932   2.4675  0.116329
## 7       2992    4763834  1     2555   1.6034  0.205515
## 8       2991    4763707  1      127   0.0795  0.778016
## 9       2990    4756703  1     7004   4.3952  0.036124 *
## 10      2989    4756701  1        3   0.0017  0.967552
## 11      2988    4756597  1      103   0.0648  0.799144
## 12      2987    4756591  1        7   0.0043  0.947923
## 13      2986    4756401  1      190   0.1189  0.730224
## 14      2985    4756158  1      243   0.1522  0.696488
## 15      2984    4755364  1      795   0.4986  0.480151
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

According to the F tests, we should choose the model with age raised to the third power, which agrees with the choice suggested by cross-validation.

Now, we plot the results of the polynomial fit.

plot(wage ~ age, data = Wage, col = "darkgrey", bty = "n")
...
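The rest of the plotting code is elided above; a minimal sketch of how the degree-3 fit could be overlaid (the grid and prediction object names are assumptions):

age.grid <- seq(from = min(Wage$age), to = max(Wage$age))
fit.3 <- lm(wage ~ poly(age, 3), data = Wage)
lm.pred <- predict(fit.3, newdata = data.frame(age = age.grid), se.fit = TRUE)
plot(wage ~ age, data = Wage, col = "darkgrey", bty = "n")
# Fitted curve with approximate 95% confidence bands
lines(age.grid, lm.pred$fit, col = "red", lwd = 2)
matlines(age.grid, cbind(lm.pred$fit + 2 * lm.pred$se.fit,
                         lm.pred$fit - 2 * lm.pred$se.fit),
         col = "red", lty = "dashed")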

  1. Fit a step function to predict wage using age, and use cross-validation to choose the number of cuts. Plot the fit obtained.
cv.error <- NA
...
# Highlight the minimum
points(x = which.min(cv.error), y = min(cv.error, na.rm = TRUE), ...)
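The cross-validation loop itself is elided above; a minimal sketch of one way to write it (the range of 2 to 10 cuts and the use of 10-fold CV are assumptions):

cv.error <- NA
for (k in 2:10) {
  # Pre-compute the cut so every fold uses the same break points
  Wage$age.cut <- cut(Wage$age, k)
  step.fit <- glm(wage ~ age.cut, data = Wage)
  cv.error[k] <- cv.glm(Wage, step.fit, K = 10)$delta[1]
}
plot(2:10, cv.error[2:10], type = "b", pch = 19, bty = "n",
     xlab = "number of cuts", ylab = "CV error")
points(x = which.min(cv.error), y = min(cv.error, na.rm = TRUE),
       col = "red", pch = "X", cex = 1.5)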

Cross-validation shows that the test error is smallest at k = 8. The most parsimonious model within one standard deviation of the minimum has k = 4, so we split age into four regions with cut(age, 4).


lm.fit <- glm(wage ~ cut(age, 4), data = Wage)
# lm.pred holds predictions with standard errors over age.grid
# (the scatter plot and fitted step line are drawn by code not shown here)
lm.pred <- predict(lm.fit, data.frame(age = age.grid), se.fit = TRUE)
matlines(age.grid, cbind(lm.pred$fit + 2 * lm.pred$se.fit,
                         lm.pred$fit - 2 * lm.pred$se.fit),
         col = "red", lty = "dashed")

 

The Wage data set contains a number of other variables, such as marital status (maritl), job class (jobclass), and others. Explore the relationships between some of these other predictors and wage, and use non-linear fitting techniques to fit flexible models to the data.

...
summary(Wage[, c("maritl", "jobclass")])

 

##               maritl               jobclass
##  1. Never Married: 648  1. Industrial :1544
##  2. Married      :2074  2. Information:1456
##  3. Widowed      :  19
##  4. Divorced     : 204
##  5. Separated    :  55
# Box plots of wage against the categorical predictors
par(mfrow = c(1, 2))
plot(Wage$maritl, Wage$wage, frame.plot = FALSE)
plot(Wage$jobclass, Wage$wage, frame.plot = FALSE)

 

Married individuals appear to earn more on average than the other groups, and wages in information jobs are higher on average than in industrial jobs.

Polynomial and step function

m1 <- lm(wage ~ maritl, data = Wage)
deviance(m1)  # model fit (RSS for a linear model; -2 * logLik in general)
## [1] 4858941
m2 <- lm(wage ~ jobclass, data = Wage)
deviance(m2)
## [1] 4998547
m3 <- lm(wage ~ maritl + jobclass, data = Wage)
deviance(m3)
## [1] 4654752

As expected, the most complex model achieves the best in-sample fit (the lowest deviance).

 

We cannot fit a spline to a factor, but we can fit a spline to the continuous variable age and add the categorical predictors to the model.

library(gam)
# The gam() formula is elided in the original; it can be read off the
# anova() output below
m4 <- gam(wage ~ maritl + jobclass + s(age, 4), data = Wage)
deviance(m4)
## [1] 4476501
anova(m1, m2, m3, m4)
## Analysis of Variance Table
##
## Model 1: wage ~ maritl
## Model 2: wage ~ jobclass
## Model 3: wage ~ maritl + jobclass
## Model 4: wage ~ maritl + jobclass + s(age, 4)
##   Res.Df     RSS      Df Sum of Sq      F    Pr(>F)
## 1   2995 4858941
## 2   2998 4998547 -3.0000   -139606 31.082 < 2.2e-16 ***
## 3   2994 4654752  4.0000    343795 57.408 < 2.2e-16 ***
## 4   2990 4476501  4.0002    178252 29.764 < 2.2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The F tests show that each step from model 1 to model 4 is a statistically significant improvement, so wage is related to maritl, jobclass, and age.

 

 

Boston data regression

The variables used in this question are dis (the weighted mean of the distances to five Boston employment centres) and nox (nitrogen oxide concentration, in parts per 10 million). We use dis as the predictor and nox as the response.

rm(list = ls())
set.seed(1)
library(MASS)
attach(Boston)
## The following objects are masked from Boston (pos = 14):
##
##     age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio,
##     rad, rm, tax, zn
  1. Use the poly() function to fit a cubic polynomial regression to predict nox using dis. Report the regression output, and plot the data together with the polynomial fit.
m1 <- lm(nox ~ poly(dis, 3))
summary(m1)
##
## Call:
## lm(formula = nox ~ poly(dis, 3))
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.121130 -0.040619 -0.009738  0.023385  0.194904
##
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)
## (Intercept)    0.554695   0.002759 201.021  < 2e-16 ***
## poly(dis, 3)1 -2.003096   0.062071 -32.271  < 2e-16 ***
## poly(dis, 3)2  0.856330   0.062071  13.796  < 2e-16 ***
## poly(dis, 3)3 -0.318049   0.062071  -5.124 4.27e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared:  0.7148, Adjusted R-squared:  0.7131
## F-statistic: 419.3 on 3 and 502 DF,  p-value: < 2.2e-16
dislim <- range(dis)
...
lines(x = dis.grid, y = lm.pred$fit, col = "red", lwd = 2)
matlines(x = dis.grid,
         y = cbind(lm.pred$fit + 2 * lm.pred$se.fit,
                   lm.pred$fit - 2 * lm.pred$se.fit),
         col = "red", lwd = 1.5, lty = "dashed")
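The elided middle of that block builds the prediction grid and the base scatter plot; a minimal sketch of what it might contain (the step size of the grid is an assumption):

dis.grid <- seq(from = dislim[1], to = dislim[2], by = 0.1)
# Predictions of the cubic fit with standard errors over the grid
lm.pred <- predict(m1, newdata = list(dis = dis.grid), se.fit = TRUE)
plot(nox ~ dis, col = "darkgrey", bty = "n")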

 

The summary shows that all polynomial terms of dis are significant when predicting nox. The figure shows a smooth curve that fits the data well.

  1. Plot the polynomial fits for a range of polynomial degrees (say, from 1 to 10), and report the associated residual sums of squares.

We fit polynomials of degree 1 to 10 and save the training RSS.

train.rss <- NA
...
# Show the model fit on the training set
train.rss
## [1] 2.768563 2.035262 1.934107 1.932981 1.915290 1.878257 1.849484
## [8] 1.835630 1.833331 1.832171
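The loop that fills train.rss is elided above; a minimal sketch of one way it could look (variable names are assumptions):

train.rss <- NA
for (i in 1:10) {
  poly.fit <- lm(nox ~ poly(dis, i))
  # Training residual sum of squares for a degree-i polynomial
  train.rss[i] <- sum(poly.fit$residuals^2)
}
train.rss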

As expected, RSS decreases monotonically with the degree of the polynomial.

  1. Perform cross-validation or another approach to select the optimal degree for the polynomial, and explain your results.

We perform leave-one-out cross-validation (LOOCV), coding it by hand:

cv.error <- matrix(NA, nrow = nrow(Boston), ncol = 10)
...
names(result) <- paste("^", 1:10, sep = "")
result
##          ^1          ^2          ^3          ^4          ^5          ^6
## 0.005471468 0.004022257 0.003822345 0.003820121 0.003785158 0.003711971
##          ^7          ^8          ^9         ^10
## 0.003655106 0.003627727 0.003623183 0.003620892
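The hand-coded LOOCV loop is elided above; a minimal sketch of one way to fill cv.error and result (the inner structure of the loop is an assumption):

cv.error <- matrix(NA, nrow = nrow(Boston), ncol = 10)
for (i in 1:nrow(Boston)) {       # leave out observation i
  for (d in 1:10) {               # polynomial degree
    loo.fit <- lm(nox ~ poly(dis, d), data = Boston[-i, ])
    pred <- predict(loo.fit, newdata = Boston[i, ])
    cv.error[i, d] <- (Boston$nox[i] - pred)^2
  }
}
# LOOCV error for each degree
result <- colMeans(cv.error)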
plot(result ~ seq(1, 10), type = "b", pch = 19, bty = "n",
     xlab = "powers of dist to empl. center", ylab = "cv error")
abline(h = min(cv.error) + sd(cv.error), col = "red", lty = "dashed")

Based on cross-validation, we would choose the second-degree polynomial in dis.

  1. Use the bs() function to fit a regression spline that predicts nox using dis.

We split the range of dis into four roughly equal intervals, placing knots at [3, 6, 9].

library(splines)
m4 <- lm(nox ~ bs(dis, knots = c(3, 6, 9)))
summary(m4)
##
## Call:
## lm(formula = nox ~ bs(dis, knots = c(3, 6, 9)))
##
## Residuals:
##       Min        1Q    Median        3Q       Max
## -0.132134 -0.039466 -0.009042  0.025344  0.187258
##
## Coefficients:
##                               Estimate Std. Error t value Pr(>|t|)
## (Intercept)                   0.709144   0.016099  44.049  < 2e-16 ***
## bs(dis, knots = c(3, 6, 9))1  0.006631   0.025467   0.260    0.795
## bs(dis, knots = c(3, 6, 9))2 -0.258296   0.017759 -14.544  < 2e-16 ***
## bs(dis, knots = c(3, 6, 9))3 -0.233326   0.027248  -8.563  < 2e-16 ***
## bs(dis, knots = c(3, 6, 9))4 -0.336530   0.032140 -10.471  < 2e-16 ***
## bs(dis, knots = c(3, 6, 9))5 -0.269575   0.058799  -4.585 5.75e-06 ***
## bs(dis, knots = c(3, 6, 9))6 -0.303386   0.062631  -4.844 1.70e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0612 on 499 degrees of freedom
## Multiple R-squared:  0.7244, Adjusted R-squared:  0.7211
## F-statistic: 218.6 on 6 and 499 DF,  p-value: < 2.2e-16
# Plot the results
...
matlines(dis.grid, ...,
         col = "black", lwd = 2, lty = c("solid", "dashed", "dashed"))
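A minimal sketch of what the elided plotting code might look like (the prediction object sp.pred and the reuse of the dis.grid built earlier are assumptions):

sp.pred <- predict(m4, newdata = list(dis = dis.grid), se.fit = TRUE)
plot(nox ~ dis, col = "darkgrey", bty = "n")
# Fitted spline with approximate 95% confidence bands
matlines(dis.grid,
         cbind(sp.pred$fit,
               sp.pred$fit + 2 * sp.pred$se.fit,
               sp.pred$fit - 2 * sp.pred$se.fit),
         col = "black", lwd = 2, lty = c("solid", "dashed", "dashed"))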

 

  1. Now fit a regression spline for a range of degrees of freedom, plot the resulting fits, and report the resulting RSS. Describe the results obtained.

We use dfs between 3 and 16 to fit regression splines.

box <- NA
for (i in 3:16) {
  ...
}
box[-c(1, 2)]
## [1] 1.934107 1.922775 1.840173 1.833966 1.829884 1.816995 1.825653
## [8] 1.792535 1.796992 1.788999 1.782350 1.781838 1.782798 1.783546
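The body of the loop is elided above; a minimal sketch, assuming box stores the training RSS of a spline with df = i:

box <- NA
for (i in 3:16) {
  spline.fit <- lm(nox ~ bs(dis, df = i))
  # Training RSS for a spline with i degrees of freedom
  box[i] <- sum(spline.fit$residuals^2)
}
box[-c(1, 2)]   # drop the unused entries for df = 1 and 2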

 

This question uses the College data set from the ISLR package.

  1. Divide the data into a training set and a test set. Using out-of-state tuition (Outstate) as the response and the other variables as predictors, perform forward stepwise selection on the training set to identify a satisfactory model that uses only a subset of the predictors.
rm(list = ls())
set.seed(1)
library(leaps)
attach(College)
## The following objects are masked from College (pos = 14):
##
##     Accept, Apps, Books, Enroll, Expend, F.Undergrad, Grad.Rate,
##     Outstate, P.Undergrad, perc.alumni, Personal, PhD, Private,
##     Room.Board, S.F.Ratio, Terminal, Top10perc, Top25perc
# Indices for the train/test split
train <- sample(length(Outstate), length(Outstate) / 2)
test <- -train
...
abline(h = max.adjr2 - std.adjr2, col = "red", lty = 2)
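The forward stepwise selection itself is elided above; a minimal sketch of one way to do it with leaps::regsubsets() (the object names m5, reg.summary, College.train, and College.test are assumptions, chosen to match the coef() call further below):

College.train <- College[train, ]
College.test <- College[test, ]
# Forward stepwise selection on the training set
m5 <- regsubsets(Outstate ~ ., data = College.train,
                 nvmax = 17, method = "forward")
reg.summary <- summary(m5)
# Adjusted R^2 against subset size, with a one-standard-error reference line
max.adjr2 <- max(reg.summary$adjr2)
std.adjr2 <- sd(reg.summary$adjr2)
plot(reg.summary$adjr2, type = "b", bty = "n",
     xlab = "number of predictors", ylab = "adjusted R-squared")
abline(h = max.adjr2 - std.adjr2, col = "red", lty = 2)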

 

All of the Cp, BIC, and adjusted R-squared scores indicate that 6 is the smallest subset size at which the scores level off. However, following the one-standard-error rule, we choose the model with 4 predictors.

...
coefi <- coef(m5, id = 4)
names(coefi)
## [1] "(Intercept) " "PrivateYes" "Room.Board" "perc.alumni" "Expend" copy the code
  1. Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as predictors. Plot the results and explain your findings.
library(gam)
...
plot(gam.fit, se = TRUE, col = "blue")
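The gam() call is elided above; its formula can be read off the summary() output shown later in this section, so a sketch might look like this (the 2-by-3 plotting layout is an assumption):

gam.fit <- gam(Outstate ~ Private + s(Room.Board, df = 2) + s(PhD, df = 2) +
                 s(perc.alumni, df = 2) + s(Expend, df = 5) +
                 s(Grad.Rate, df = 2),
               data = College.train)
par(mfrow = c(2, 3))   # one panel per term
plot(gam.fit, se = TRUE, col = "blue")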

 

  1. Evaluate the model obtained on the test set and explain the results obtained.
gam.pred <- predict(gam.fit, College.test)
gam.err <- mean((College.test$Outstate - gam.pred)^2)
gam.err
## [1] 3745460
gam.tss <- mean((College.test$Outstate - mean(College.test$Outstate))^2)
# Proportion of variance explained on the test set (test R-squared)
test.rss <- 1 - gam.err / gam.tss
test.rss
## [1] 0.7696916

 

  1. For which variables, if any, is there evidence of a non-linear relationship with the response?
summary(gam.fit)
##
## Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 2) + s(PhD,
##     df = 2) + s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate,
##     df = 2), data = College.train)
## Deviance Residuals:
##      Min       1Q   Median       3Q      Max
## -4977.74 -1184.52    58.33  1220.04  7688.30
##
## (Dispersion Parameter for gaussian family taken to be 3300711)
##
##     Null Deviance: 6221998532 on 387 degrees of freedom
## Residual Deviance: 1231165118 on 373 degrees of freedom
## AIC: 6941.542
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
##                         Df     Sum Sq    Mean Sq F value    Pr(>F)
## Private                  1 1779433688 1779433688 539.106 < 2.2e-16 ***
## s(Room.Board, df = 2)    1 1221825562 1221825562 370.171 < 2.2e-16 ***
## s(PhD, df = 2)           1  382472137  382472137 115.876 < 2.2e-16 ***
## s(perc.alumni, df = 2)   1  328493313  328493313  99.522 < 2.2e-16 ***
## s(Expend, df = 5)        1  416585875  416585875 126.211 < 2.2e-16 ***
## s(Grad.Rate, df = 2)     1   55284580   55284580  16.749 5.232e-05 ***
## Residuals              373 1231165118    3300711
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
##                        Npar Df  Npar F     Pr(F)
## (Intercept)
## Private
## s(Room.Board, df = 2)        1  3.5562   0.06010 .
## s(PhD, df = 2)               1  4.3421   0.03786 *
## s(perc.alumni, df = 2)       1  1.9158   0.16715
## s(Expend, df = 5)            4 16.8636 1.016e-12 ***
## s(Grad.Rate, df = 2)         1  3.7208   0.05450 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The ANOVA for non-parametric effects shows strong evidence of a non-linear relationship between the response and Expend, and moderately strong evidence of non-linear relationships with Grad.Rate and PhD (using a p-value threshold of 0.05).

