# Original source: Tuoduan Data Tribe Official Account

1. Perform polynomial regression to predict wage using age. Use cross-validation to select the optimal degree for the polynomial. What degree is selected, and how does this compare with the results of hypothesis testing using ANOVA? Plot the resulting polynomial fit.

Load the Wage data set and keep a vector of all cross-validation errors. We perform 10-fold cross-validation (K = 10).

```
rm(list = ls())
set.seed(1)

library(ISLR)  # Wage data
library(boot)  # cv.glm()

# Test error
cv.MSE <- NA

# Loop over polynomial degrees of age
for (i in 1:15) {
  glm.fit <- glm(wage ~ poly(age, i), data = Wage)
  # 10-fold cross-validation with cv.glm; keep the CV error
  cv.MSE[i] <- cv.glm(Wage, glm.fit, K = 10)$delta[1]
}
# Check the result object
cv.MSE
```

```
##  1675.837 1601.012 1598.801 1594.217 1594.625 1594.888 1595.500
##  1595.436 1596.335 1595.835 1595.970 1597.971 1598.713 1599.253
##  1595.332
```

We illustrate the result with a plot of points connected by lines (type = "b").

```
# Illustrate the result with a line graph
plot(x = 1:15, y = cv.MSE, xlab = "power of age", ylab = "CV error",
     type = "b", pch = 19, lwd = 2, bty = "n",
     ylim = c(min(cv.MSE) - sd(cv.MSE), max(cv.MSE) + sd(cv.MSE)))

# Horizontal line one standard deviation above the minimum
abline(h = min(cv.MSE) + sd(cv.MSE), lty = "dotted")

# Mark the minimum
points(x = which.min(cv.MSE), y = min(cv.MSE), col = "red", pch = "X", cex = 1.5)
```

We now fit models of increasing polynomial degree in age and compare them with an analysis of variance.

```
## Analysis of Deviance Table
##
## Model 1: wage ~ poly(age, 1)
## Model 2: wage ~ poly(age, 2)
## Model 3: wage ~ poly(age, 3)
## Model 4: wage ~ poly(age, 4)
## Model 5: wage ~ poly(age, 5)
## Model 6: wage ~ poly(age, 6)
## Model 7: wage ~ poly(age, 7)
## Model 8: wage ~ poly(age, 8)
## Model 9: wage ~ poly(age, 9)
## Model 10: wage ~ poly(age, 10)
## Model 11: wage ~ poly(age, 11)
## Model 12: wage ~ poly(age, 12)
## Model 13: wage ~ poly(age, 13)
## Model 14: wage ~ poly(age, 14)
## Model 15: wage ~ poly(age, 15)
## Resid. Df Resid. Dev Df Deviance F Pr(>F)
## 1 2998 5022216
## 2 2997 4793430 1 228786 143.5637 <2.2e-16 ***
## 3 2996 4777674 1 15756 9.8867 0.001681 **
## 4 2995 4771604 1 6070 3.8090 0.051070 .
## 5 2994 4770322 1 1283 0.8048 0.369731
## 6 2993 4766389 1 3932 2.4675 0.116329
## 7 2992 4763834 1 2555 1.6034 0.205515
## 8 2991 4763707 1 127 0.0795 0.778016
## 9 2990 4756703 1 7004 4.3952 0.036124 *
## 10 2989 4756701 1 3 0.0017 0.967552
## 11 2988 4756597 1 103 0.0648 0.799144
## 12 2987 4756591 1 7 0.0043 0.947923
## 13 2986 4756401 1 190 0.1189 0.730224
## 14 2985 4756158 1 243 0.1522 0.696488
## 15 2984 4755364 1 795 0.4986 0.480151
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

According to the F tests, we should choose a polynomial of degree 3 in age, which is broadly consistent with the cross-validation results.
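The comparison behind a table like the one above can be sketched in miniature on synthetic data (all variables here are made up for illustration; the post's own table compares 15 nested fits on Wage):

```r
# Minimal sketch of comparing nested polynomial fits with sequential F tests
set.seed(1)
age  <- runif(300, 18, 80)
wage <- 50 + 2 * age - 0.02 * age^2 + rnorm(300, sd = 10)

# Fit wage ~ poly(age, d) for d = 1..5, then anova() runs the F tests
fits <- lapply(1:5, function(d) lm(wage ~ poly(age, d)))
do.call(anova, fits)
```

Each row of the resulting table tests whether raising the degree by one significantly reduces the residual sum of squares, exactly as in the Wage table above.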

Now, we plot the results of the polynomial fit.

```plot(wage ~ age, data = Wage, col = "darkgrey" , bty = "n" )
...
```

1. Fit a step function to predict wage using age, and plot the obtained fit.
```
cv.error <- NA
...

# Highlight the minimum
points(x = which.min(cv.error), y = min(cv.error, na.rm = TRUE),
```

Cross-validation shows that the test error is smallest at k = 8. The most parsimonious model within one standard deviation of the minimum has k = 4, so the data are divided into 5 regions.
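The one-standard-deviation rule used here can be sketched as follows (cv.error below is a made-up vector of CV errors indexed by the number of cuts k, not the post's actual results):

```r
# Hypothetical CV errors for k = 1..10 (illustrative values only)
cv.error <- c(1700, 1640, 1620, 1601, 1603, 1600, 1602, 1599, 1604, 1605)

# Threshold: minimum CV error plus one standard deviation
threshold <- min(cv.error, na.rm = TRUE) + sd(cv.error, na.rm = TRUE)

# Most parsimonious k whose CV error falls below the threshold
k.1se <- min(which(cv.error <= threshold))
k.1se
```

The idea is to prefer the simplest model whose estimated test error is statistically indistinguishable from the best one.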


```
lm.fit <- glm(wage ~ cut(age, 4), data = Wage)

matlines(age.grid, cbind(lm.pred$fit + 2 * lm.pred$se.fit,
                         lm.pred$fit - 2 * lm.pred$se.fit),
         col = "red", lty = "dashed")
```

The Wage data set contains a number of other variables, such as marital status (maritl), job class (jobclass), and more. Explore the relationships between some of these other predictors and wage, and use non-linear fitting techniques to fit models to the data.

```
...
summary(Wage[, c("maritl", "jobclass")])
```

```
## maritl jobclass
## 1. Never Married: 648 1. Industrial :1544
## 2. Married :2074 2. Information:1456
## 3. Widowed: 19
## 4. Divorced: 204
## 5. Separated: 55
```
```
# Box plots of the relationships
par(mfrow = c(1, 2))
plot(Wage$maritl, Wage$wage, frame.plot = FALSE)
plot(Wage$jobclass, Wage$wage, frame.plot = FALSE)
```

On average, married people appear to earn more than the other groups, and wages for information jobs are higher than those for industrial jobs.

# Polynomial and step function

```
m1 <- lm(wage ~ maritl, data = Wage)
deviance(m1)  # model fit: the RSS for a linear model (-2 * logLik up to a constant)
```
```
##  4858941
```
```m2 <- lm(wage ~ jobclass, data = Wage)
deviance(m2)
```
```
##  4998547
```
```m3 <- lm(wage ~ maritl + jobclass, data = Wage)
deviance(m3)
```
```
##  4654752
```

As expected, the most complex model achieves the best in-sample fit (the lowest RSS).
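That the in-sample RSS can only fall as predictors are added is easy to demonstrate on synthetic data (a sketch; x2 here is deliberately pure noise):

```r
# Adding a predictor never increases the in-sample RSS, even when the
# extra predictor is unrelated to the response.
set.seed(1)
x1 <- rnorm(200)
x2 <- rnorm(200)          # pure noise, unrelated to y
y  <- 1 + x1 + rnorm(200)

rss1 <- deviance(lm(y ~ x1))
rss2 <- deviance(lm(y ~ x1 + x2))
rss2 <= rss1              # TRUE: the larger model fits at least as well
```

This is why in-sample fit alone cannot be used to choose among nested models; the F tests and cross-validation above exist precisely to penalize this.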

We cannot fit a spline to a factor, but we can fit a spline to a continuous variable and add the other predictors to the model.
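A self-contained sketch of this additive structure (a spline in the continuous predictor plus factor terms) using splines::ns inside lm(); the post itself uses gam::gam with s(age, 4), and all the data below are synthetic stand-ins:

```r
library(splines)
set.seed(1)
# Synthetic stand-ins for age, maritl, jobclass, and wage
age      <- runif(500, 18, 80)
maritl   <- factor(sample(c("Never", "Married"), 500, replace = TRUE))
jobclass <- factor(sample(c("Industrial", "Information"), 500, replace = TRUE))
wage     <- 40 + 10 * (maritl == "Married") + 5 * (jobclass == "Information") +
            30 * sin(age / 25) + rnorm(500, sd = 5)

# Factors enter linearly; the continuous predictor gets a spline basis
m <- lm(wage ~ maritl + jobclass + ns(age, df = 4))
deviance(m)   # RSS of the combined additive fit
```

The design choice is the same either way: categorical variables contribute one coefficient per level, while the spline basis lets the continuous variable enter non-linearly.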

```
library(gam)
m4 <- gam(...)
deviance(m4)
```
```
##  4476501
```
```
anova(m1, m2, m3, m4)
```
```## Analysis of Variance Table
##
## Model 1: wage ~ maritl
## Model 2: wage ~ jobclass
## Model 3: wage ~ maritl + jobclass
## Model 4: wage ~ maritl + jobclass + s(age, 4)
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 2995 4858941
## 2 2998 4998547 -3.0000 -139606 31.082 <2.2e-16 ***
## 3 2994 4654752 4.0000 343795 57.408 <2.2e-16 ***
## 4 2990 4476501 4.0002 178252 29.764 <2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The F tests show that moving from model one to model four yields statistically significant improvements: maritl, jobclass, and a smooth function of age all help explain wage.

# Boston data regression

The variables used in this question are dis (the weighted mean of the distances to five Boston employment centers) and nox (the nitrogen oxides concentration, in parts per 10 million). We use dis as the predictor and nox as the response.

```
rm(list = ls())
set.seed(1)
library(MASS)
attach(Boston)
```
```## The following objects are masked from Boston (pos = 14):
##
## age, black, chas, crim, dis, indus, lstat, medv, nox, ptratio,
## rad, rm, tax, zn
```
1. Use the poly() function to fit a cubic polynomial regression predicting nox from dis. Report the regression output, and plot the data and the resulting polynomial fit.
```
m1 <- lm(nox ~ poly(dis, 3))
summary(m1)
```
```##
## Call:
## lm(formula = nox ~ poly(dis, 3))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.121130 -0.040619 -0.009738 0.023385 0.194904
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.554695 0.002759 201.021 <2e-16 ***
## poly(dis, 3)1 -2.003096 0.062071 -32.271 <2e-16 ***
## poly(dis, 3)2 0.856330 0.062071 13.796 <2e-16 ***
## poly(dis, 3)3 -0.318049 0.062071 -5.124 4.27e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.06207 on 502 degrees of freedom
## Multiple R-squared: 0.7148, Adjusted R-squared: 0.7131
## F-statistic: 419.3 on 3 and 502 DF, p-value: <2.2e-16
```
```
dislim <- range(dis)
...
lines(x = dis.grid, y = lm.pred$fit, col = "red", lwd = 2)
matlines(x = dis.grid, y = cbind(lm.pred$fit + 2 * lm.pred$se.fit,
                                 lm.pred$fit - 2 * lm.pred$se.fit),
         col = "red", lwd = 1.5, lty = "dashed")
```

The summary shows that all polynomial terms of dis are significant when predicting nox. The figure shows a smooth curve that fits the data well.

1. Plot the polynomial fits for a range of polynomial degrees (say, 1 to 10), and report the associated residual sums of squares.

We fit polynomials of degrees 1 to 10 and save the RSS.

```
...
# Show model fit on the training set
```
```##  2.768563 2.035262 1.934107 1.932981 1.915290 1.878257 1.849484
##  1.835630 1.833331 1.832171
```

As expected, RSS decreases monotonically with the degree of the polynomial.

1. Perform cross-validation or another method to choose the best polynomial degree, and explain your results.

We perform LOOCV, coded by hand:

```
cv.error <- matrix(NA, nrow = nrow(Boston), ncol = 10)

...
names(result) <- paste("^", 1:10, sep = "")
result
```
```## ^1 ^2 ^3 ^4 ^5 ^6
## 0.005471468 0.004022257 0.003822345 0.003820121 0.003785158 0.003711971
## ^7 ^8 ^9 ^10
## 0.003655106 0.003627727 0.003623183 0.003620892
```
```
plot(result ~ seq(1, 10), type = "b", pch = 19, bty = "n",
     xlab = "powers of dist to empl. center", ylab = "cv error")
abline(h = min(result) + sd(result), col = "red", lty = "dashed")
```
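The hand-coded LOOCV loop follows the pattern sketched below (synthetic data and hypothetical names; the post applies the same loop to Boston across polynomial degrees 1 to 10):

```r
set.seed(1)
d <- data.frame(x = runif(100))
d$y <- sin(2 * pi * d$x) + rnorm(100, sd = 0.2)

# Leave-one-out CV: fit on all rows except i, score the held-out row
err <- numeric(nrow(d))
for (i in seq_len(nrow(d))) {
  fit    <- lm(y ~ poly(x, 3), data = d[-i, ])
  pred   <- predict(fit, newdata = d[i, , drop = FALSE])
  err[i] <- (d$y[i] - pred)^2
}
mean(err)   # the LOOCV estimate of test MSE
```

Averaging the n held-out squared errors gives the LOOCV estimate; repeating this for each candidate degree fills one column of the cv.error matrix above.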

Based on cross-validation, we choose the quadratic in dis.

1. Use the bs() function to fit a regression spline predicting nox from dis.

We divide the range of dis into four roughly equal intervals with knots at 3, 6, and 9.
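As a quick check on the basis this produces: a cubic B-spline with three interior knots has 3 + 3 = 6 basis columns, matching the six bs(...) coefficients in the summary below (a sketch on a synthetic grid standing in for dis):

```r
library(splines)
x <- seq(1, 12, length.out = 100)   # synthetic stand-in for dis
B <- bs(x, knots = c(3, 6, 9))      # cubic by default (degree = 3)
dim(B)   # 100 rows, 6 basis columns (degree + number of knots)
```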

```
library(splines)
m4 <- lm(nox ~ bs(dis, knots = c(3, 6, 9)))
summary(m4)
```
```##
## Call:
## lm(formula = nox ~ bs(dis, knots = c(3, 6, 9)))
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.132134 -0.039466 -0.009042 0.025344 0.187258
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.709144 0.016099 44.049 <2e-16 ***
## bs(dis, knots = c(3, 6, 9))1 0.006631 0.025467 0.260 0.795
## bs(dis, knots = c(3, 6, 9))2 -0.258296 0.017759 -14.544 <2e-16 ***
## bs(dis, knots = c(3, 6, 9))3 -0.233326 0.027248 -8.563 <2e-16 ***
## bs(dis, knots = c(3, 6, 9))4 -0.336530 0.032140 -10.471 <2e-16 ***
## bs(dis, knots = c(3, 6, 9))5 -0.269575 0.058799 -4.585 5.75e-06 ***
## bs(dis, knots = c(3, 6, 9))6 -0.303386 0.062631 -4.844 1.70e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.0612 on 499 degrees of freedom
## Multiple R-squared: 0.7244, Adjusted R-squared: 0.7211
## F-statistic: 218.6 on 6 and 499 DF, p-value: <2.2e-16
```
```
# Plot the results
...
#
matlines(dis.grid,
...
         col = "black", lwd = 2, lty = c("solid", "dashed", "dashed"))
```

1. Now fit regression splines for a range of degrees of freedom, plot the resulting fits, and report the resulting RSS. Describe the results obtained.

We use dfs between 3 and 16 to fit regression splines.

```
box <- NA

for (i in 3:16) {
...
}

box[-c(1, 2)]
```
```##  1.934107 1.922775 1.840173 1.833966 1.829884 1.816995 1.825653
##  1.792535 1.796992 1.788999 1.782350 1.781838 1.782798 1.783546
```

This question uses the College data set in the ISLR package.

1. Split the data into a training set and a test set. Using out-of-state tuition (Outstate) as the response and the other variables as predictors, perform forward stepwise selection on the training set to identify a satisfactory model that uses just a subset of the predictors.
```
rm(list = ls())
set.seed(1)
library(ISLR)   # College data
library(leaps)
attach(College)
```
```## The following objects are masked from College (pos = 14):
##
## Accept, Apps, Books, Enroll, Expend, F.Undergrad, Grad.Rate,
## Outstate, P.Undergrad, perc.alumni, Personal, PhD, Private,
## Room.Board, S.F.Ratio, Terminal, Top10perc, Top25perc
```
```
# Index numbers for the train/test split
train <- sample(length(Outstate), length(Outstate) / 2)
test <- -train
...
abline(h = max.adjr2 - std.adjr2, col = "red", lty = 2)
```

The Cp, BIC, and adjusted R-squared scores all indicate that 6 is the best subset size. However, applying the one-standard-error rule, we choose the model with 4 predictors.

```
...
coefi <- coef(m5, id = 4)
names(coefi)
```
```
## "(Intercept)"  "PrivateYes"   "Room.Board"   "perc.alumni"  "Expend"
```
1. Fit a GAM on the training data, using out-of-state tuition as the response and the features selected in the previous step as the predictors. Plot the results and explain your findings.
```
library(gam)
...
plot(gam.fit, se = TRUE, col = "blue")
```

1. Evaluate the model on the test set and explain the results.
```
gam.pred <- predict(gam.fit, College.test)
gam.err <- mean((College.test$Outstate - gam.pred)^2)
gam.err
```
```
##  3745460
```
```
gam.tss <- mean((College.test$Outstate - mean(College.test$Outstate))^2)
1 - gam.err / gam.tss  # test R-squared
```
```
##  0.7696916
```

1. For which variables, if any, is there evidence of a non-linear relationship with the response?
```summary(gam.fit)
```
```##
## Call: gam(formula = Outstate ~ Private + s(Room.Board, df = 2) + s(PhD,
## df = 2) + s(perc.alumni, df = 2) + s(Expend, df = 5) + s(Grad.Rate,
## df = 2), data = College.train)
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -4977.74 -1184.52 58.33 1220.04 7688.30
##
## (Dispersion Parameter for gaussian family taken to be 3300711)
##
## Null Deviance: 6221998532 on 387 degrees of freedom
## Residual Deviance: 1231165118 on 373 degrees of freedom
## AIC: 6941.542
##
## Number of Local Scoring Iterations: 2
##
## Anova for Parametric Effects
## Df Sum Sq Mean Sq F value Pr(>F)
## Private 1 1779433688 1779433688 539.106 <2.2e-16 ***
## s(Room.Board, df = 2) 1 1221825562 1221825562 370.171 <2.2e-16 ***
## s(PhD, df = 2) 1 382472137 382472137 115.876 <2.2e-16 ***
## s(perc.alumni, df = 2) 1 328493313 328493313 99.522 <2.2e-16 ***
## s(Expend, df = 5) 1 416585875 416585875 126.211 <2.2e-16 ***
## s(Grad.Rate, df = 2) 1 55284580 55284580 16.749 5.232e-05 ***
## Residuals 373 1231165118 3300711
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Anova for Nonparametric Effects
## Npar Df Npar F Pr(F)
## (Intercept)
## Private
## s(Room.Board, df = 2) 1 3.5562 0.06010 .
## s(PhD, df = 2) 1 4.3421 0.03786 *
## s(perc.alumni, df = 2) 1 1.9158 0.16715
## s(Expend, df = 5) 4 16.8636 1.016e-12 ***
## s(Grad.Rate, df = 2) 1 3.7208 0.05450 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```

The non-parametric ANOVA tests show strong evidence of a non-linear relationship between the response and Expend, and moderately strong evidence of non-linearity for PhD and Grad.Rate (judging at a p-value of 0.05).
