Ridge regression, lasso regression, and principal component regression in R: linear model selection and regularization


Original link: tecdat.cn/?p=9913

Original source: Tuoduan Data Tribe Official Account

 


 

Overview and definition

In this article, we consider some alternative fitting procedures for linear models beyond the usual ordinary least squares method. These alternatives can sometimes provide better prediction accuracy and better model interpretability.

  • Prediction accuracy: provided the true relationship is approximately linear, the ordinary least squares estimates have low bias. OLS also performs well when n >> p. However, if n is not much larger than p, there can be a lot of variability in the fit, resulting in overfitting and/or poor predictions. And if p > n, there is no longer a unique least squares estimate, so the method cannot be used at all.

This issue is another aspect of the curse of dimensionality. As p increases, a given observation x tends to lie closer to the boundary of the sample space than to its neighboring observations, which poses a major problem for prediction. In addition, with many predictors the training samples become sparse, making it difficult to identify trends and make predictions.

By constraining and shrinking the estimated coefficients, we can often reduce the variance substantially at the cost of a negligible increase in bias, which usually leads to a significant improvement in accuracy.

  • Interpretability of the model: irrelevant variables add unnecessary complexity to the resulting model. By removing them (setting their coefficients to zero), we get a model that is easier to interpret. However, OLS is extremely unlikely to produce coefficients that are exactly zero.

    • Subset selection: identify a subset of the p predictors believed to be related to the response, and fit a least squares model on this reduced set of features.
    • Shrinkage (regularization): fit a model with all p predictors, but shrink the estimated coefficients toward zero, which reduces variance (ridge regression and the lasso).
    • Dimension reduction: project the p predictors onto an M-dimensional subspace with M < p, and use the M projections as predictors in a least squares model (principal component regression and partial least squares).

Although we discussed the application of these techniques in linear models, they are also applicable to other methods, such as classification.

Detailed method

Subset selection

Best subset selection

Here, we fit a separate OLS regression for each possible combination of the p predictors and then examine the resulting fits. The problem with this approach is that the best model is hidden among 2^p possibilities. The algorithm has two stages: (1) for each k = 1, ..., p, fit all models containing exactly k predictors and keep the best one (smallest RSS, or equivalently largest R^2); (2) select a single model from these candidates using cross-validated prediction error, or a criterion such as Cp, AIC, or BIC, discussed below.

This approach carries over to other types of models, such as logistic regression; only the score used to compare fits changes. For logistic regression we would use the deviance instead of RSS and R^2.

 

Choose the best model

Each of these subset selection algorithms requires us to decide which of the candidate models is best. As mentioned earlier, judged by training error, the model with the most predictors always has the smallest RSS and the largest R^2. In order to select the model with the smallest test error instead, we need to estimate the test error. There are two ways to do so.

  1.  Indirectly estimate the test error by adjusting the training error to account for the bias due to overfitting (for example Cp, AIC, BIC, or adjusted R^2).
  2.  Use validation sets or cross-validation methods to directly estimate test errors.

 

Validation and cross validation

In general, cross-validation gives a more direct estimate of the test error and makes fewer assumptions about the underlying model. It can also be used with a wider range of model types.

 

Ridge regression

Ridge regression is very similar to least squares, except that the coefficients are estimated by minimizing a slightly different quantity: the RSS plus a shrinkage penalty, lambda * sum(beta_j^2). Like OLS, ridge regression seeks coefficient estimates that make the RSS small, but the penalty grows as the coefficients move away from zero, so its effect is to shrink the estimates toward zero. The tuning parameter lambda controls the amount of shrinkage: with lambda = 0 the fit is exactly the same as OLS regression. Choosing a good value of lambda is therefore important and should be done with cross-validation. Ridge regression also requires the predictors X to be centered to mean = 0, so the data must be standardized before fitting. A minimal sketch of a ridge fit is shown below.
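As a minimal sketch (not the code used in the example section below, which relies on caret), a ridge fit with the glmnet package might look like the following; cv.glmnet standardizes the predictors by default and chooses lambda by cross-validation. The object names here are illustrative.

library(glmnet)
library(ISLR)

hitters <- na.omit(Hitters)                      # drop rows with missing Salary
x <- model.matrix(Salary ~ ., hitters)[, -1]     # predictor matrix, intercept column removed
y <- hitters$Salary

cv.ridge <- cv.glmnet(x, y, alpha = 0)           # alpha = 0 gives the ridge penalty
cv.ridge$lambda.min                              # lambda with the smallest CV error
coef(cv.ridge, s = "lambda.min")                 # shrunken, but generally nonzero, coefficients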

 

Why is ridge regression better than least squares?

The advantage shows up in the bias-variance trade-off. As lambda increases, the flexibility of the ridge fit decreases: the variance goes down while the bias increases only slightly. Plain OLS regression is unbiased but can have high variance. Because the lowest test MSE occurs where variance and squared bias balance, trading a little bias for less variance by choosing lambda appropriately can give a lower test MSE.

Ridge regression is most effective when the least squares estimates have high variance. It is also far more computationally efficient than the subset methods, because for any given lambda only a single model is fitted, and the fits for all values of lambda can be computed essentially simultaneously.

Lasso

Ridge regression has one clear disadvantage: it includes all p predictors in the final model. The penalty shrinks many coefficients toward zero, but never sets them exactly to zero. This is usually not a problem for prediction accuracy, but it makes the model harder to interpret. The lasso overcomes this shortcoming: writing its penalty as a constraint with budget s, making s small enough forces some coefficients to be exactly zero. s = 1 corresponds to ordinary OLS regression, and as s approaches 0 the coefficients shrink toward zero. The lasso therefore also performs variable selection. A lasso sketch follows below.
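Continuing the hypothetical glmnet sketch above, the lasso corresponds to alpha = 1, and some of the resulting coefficients are exactly zero:

cv.lasso <- cv.glmnet(x, y, alpha = 1)           # alpha = 1 gives the lasso penalty
coef(cv.lasso, s = "lambda.min")                 # several coefficients come out exactly 0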

 

Dimensionality reduction method

So far, the methods we have discussed have controlled variance either by using a subset of the original variables or by shrinking their coefficients toward zero. We now explore a class of methods that transform the predictors and then fit a least squares model using the transformed variables. Dimension reduction turns the problem of estimating p + 1 coefficients into the simpler problem of estimating M + 1 coefficients, where M < p. Two approaches for this task are principal component regression and partial least squares.

Principal component regression (PCR)

PCA can be described as a method of deriving low-dimensional feature sets from a large number of variables.

In PCR, we construct the first M principal components and then use them as the predictors in a least squares linear regression. Compared with ordinary least squares on all p predictors, this often yields a better model because it reduces the risk of overfitting.

 

Partial Least Squares

The PCR approach described above identifies the linear combinations (directions) of X that best represent the predictors, and it does so without using the response.

PLS achieves this by assigning higher weight to the variables that are most closely related to the dependent variable.

In practice, PLS often performs no better than ridge regression or PCR: although PLS can reduce bias, it can also increase variance, so there is no clear net gain.

 

Interpret high-dimensional results

We must always be cautious about the way we report the model results obtained, especially in high-dimensional settings. In this case, the problem of multicollinearity is very serious, because any variable in the model can be written as a linear combination of all other variables in the model.

 

Example

Subset selection method

Best subset selection

We want to predict a baseball player's Salary on the basis of various statistics from the previous year, using the Hitters data set from the ISLR package.

library(ISLR)
attach(Hitters)
names(Hitters)
## [1] "AtBat" "Hits" "HmRun" "Runs" "RBI" ## [6] "Walks" "Years" "CAtBat" "CHits" "CHmRun" ## [11] "CRuns" "CRBI" "CWalks" "League" "Division" ## [16] "PutOuts" "Assists" "Errors" "Salary" "NewLeague" Copy code
dim(Hitters)
## [1] 322  20
str(Hitters)
## 'data.frame': 322 obs. of 20 variables:
## $ AtBat    : int 293 315 479 496 321 594 185 298 323 401 ...
## $ Hits     : int 66 81 130 141 87 169 37 73 81 92 ...
## $ HmRun    : int 1 7 18 20 10 4 1 0 6 17 ...
## $ Runs     : int 30 24 66 65 39 74 23 24 26 49 ...
## $ RBI      : int 29 38 72 78 42 51 8 24 32 66 ...
## $ Walks    : int 14 39 76 37 30 35 21 7 8 65 ...
## $ Years    : int 1 14 3 11 2 11 2 3 2 13 ...
## $ CAtBat   : int 293 3449 1624 5628 396 4408 214 509 341 5206 ...
## $ CHits    : int 66 835 457 1575 101 1133 42 108 86 1332 ...
## $ CHmRun   : int 1 69 63 225 12 19 1 0 6 253 ...
## $ CRuns    : int 30 321 224 828 48 501 30 41 32 784 ...
## $ CRBI     : int 29 414 266 838 46 336 9 37 34 890 ...
## $ CWalks   : int 14 375 263 354 33 194 24 12 8 866 ...
## $ League   : Factor w/ 2 levels "A","N": 1 2 1 2 2 1 2 1 2 1 ...
## $ Division : Factor w/ 2 levels "E","W": 1 2 2 1 1 2 1 2 2 1 ...
## $ PutOuts  : int 446 632 880 200 805 282 76 121 143 0 ...
## $ Assists  : int 33 43 82 11 40 421 127 283 290 0 ...
## $ Errors   : int 20 10 14 3 4 25 7 9 19 0 ...
## $ Salary   : num NA 475 480 500 91.5 750 70 100 75 1100 ...
## $ NewLeague: Factor w/ 2 levels "A","N": 1 2 1 2 2 1 1 1 2 1 ...
# Check for missing values
sum(is.na(Hitters$Salary)) / length(Hitters[, 1]) * 100
## [1] 18.32

It turns out that about 18% of the Salary values are missing. We will omit those observations.

Hitters <- na.omit(Hitters)
dim(Hitters)
## [1] 263  20

Perform the best subset selection and use RSS for quantification.

library(leaps)
regfit <- regsubsets(Salary ~ ., Hitters)
summary(regfit)
## Subset selection object
## Call: regsubsets.formula(Salary ~ ., Hitters)
## 19 Variables (and intercept)
## 1 subsets of each size up to 8
## Selection Algorithm: exhaustive
## Variables in the best model of each size:
## 1 ( 1 ): CRBI
## 2 ( 1 ): Hits, CRBI
## 3 ( 1 ): Hits, CRBI, PutOuts
## 4 ( 1 ): Hits, CRBI, DivisionW, PutOuts
## 5 ( 1 ): AtBat, Hits, CRBI, DivisionW, PutOuts
## 6 ( 1 ): AtBat, Hits, Walks, CRBI, DivisionW, PutOuts
## 7 ( 1 ): Hits, Walks, CAtBat, CHits, CHmRun, DivisionW, PutOuts
## 8 ( 1 ): AtBat, Hits, Walks, CHmRun, CRuns, CWalks, DivisionW, PutOuts

In the regsubsets output, an asterisk marks each variable included in the best model of a given size; the listing above shows these selections for model sizes one through eight. For example, the best two-variable model contains Hits and CRBI.

##  [1] 0.3215 0.4252 0.4514 0.4754 0.4908 0.5087 0.5141 0.5286 0.5346 0.5405
## [11] 0.5426 0.5436 0.5445 0.5452 0.5455 0.5458 0.5460 0.5461 0.5461

These are the R^2 values for the best model of each size: R^2 increases monotonically as more variables are added, from about 0.32 for the one-variable model to about 0.55 for the full 19-variable model.

We can use built-in plotting functions to plot RSS, adjusted R^2, Cp, AIC and BIC against the number of variables.

Note: of the quantities above, Cp, AIC, BIC, and adjusted R^2 are (indirect) estimates of the test error; RSS and R^2 only describe the training fit. One way to produce these plots and pick the best model size is sketched below.
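A sketch of how such plots can be produced, assuming the full fit is repeated with nvmax = 19 as in the stepwise examples below (object names are illustrative):

regfit.full <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19)
reg.summary <- summary(regfit.full)

par(mfrow = c(2, 2))
plot(reg.summary$rss,   xlab = "Number of variables", ylab = "RSS",         type = "l")
plot(reg.summary$adjr2, xlab = "Number of variables", ylab = "Adjusted R2", type = "l")
points(which.max(reg.summary$adjr2), max(reg.summary$adjr2), col = "red", pch = 20)
plot(reg.summary$cp,    xlab = "Number of variables", ylab = "Cp",          type = "l")
points(which.min(reg.summary$cp), min(reg.summary$cp), col = "red", pch = 20)
plot(reg.summary$bic,   xlab = "Number of variables", ylab = "BIC",         type = "l")
points(which.min(reg.summary$bic), min(reg.summary$bic), col = "red", pch = 20)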

Forward and backward stepwise selection
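Forward stepwise selection starts with no predictors and adds the most useful variable one step at a time; backward stepwise selection starts with all p predictors and removes the least useful one at each step. The output below was presumably produced by calls along these lines (a sketch; the original code is not shown):

regfit.fwd <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
summary(regfit.fwd)

regfit.bwd <- regsubsets(Salary ~ ., data = Hitters, nvmax = 19, method = "backward")
summary(regfit.bwd)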

 

## Subset selection object
## Call: regsubsets.formula(Salary ~ ., data = Hitters, nvmax = 19, method = "forward")
## 19 Variables (and intercept)
## 1 subsets of each size up to 19
## Selection Algorithm: forward
## (for each model size from 1 to 19, an asterisk marks the variables included in the selected model)
## Subset selection object
## 19 Variables (and intercept)
## 1 subsets of each size up to 19
## Selection Algorithm: backward
## (for each model size from 1 to 19, an asterisk marks the variables included in the selected model)

We can see that the models with one to six variables are identical for best subset selection and forward stepwise selection.

Ridge regression and lasso

Start the cross-validation method

We will also apply cross-validation within the regularization methods.

Validation set

Instead of estimating the test error indirectly with adjusted R^2, Cp, and BIC, we can estimate it directly with a validation set or cross-validation. We must use only the training observations for all aspects of model fitting and variable selection, and then estimate the test error by applying the fitted model to the held-out test or validation data.
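The caret output below presumably comes from code along the following lines (a sketch: the seed and the roughly 50/50 split are assumptions based on the 133 training samples shown in the output, and method = "ridge" requires the elasticnet package):

library(caret)
set.seed(1)

inTrain <- createDataPartition(Hitters$Salary, p = 0.5, list = FALSE)
train   <- Hitters[inTrain, ]
test    <- Hitters[-inTrain, ]

ridge <- train(Salary ~ ., data = train,
               method     = "ridge",                         # elasticnet ridge regression
               preProcess = c("center", "scale"),
               tuneGrid   = expand.grid(lambda = c(0, 1e-04, 0.1)))
ridge                                                        # caret bootstraps by default
ridge.pred <- predict(ridge, test)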

## Ridge Regression
##
## 133 samples
##  19 predictors
##
## Pre-processing: scaled, centered
## Resampling: Bootstrapped (25 reps)
##
## Summary of sample sizes: 133, 133, 133, 133, 133, 133, ...
##
## Resampling results across tuning parameters:
##
##   lambda  RMSE  Rsquared  RMSE SD  Rsquared SD
##   0       400   0.4       40       0.09
##   1e-04   400   0.4       40       0.09
##   0.1     300   0.5       40       0.09
##
## RMSE is used to select the best model using the minimum value.
## The final value used for the model is lambda = 0.1.
mean(ridge.pred - test$Salary)^2
## [1] 30.1

k-fold cross-validation

Use k-fold cross-validation to select the best lambda.

For cross-validation, we divide the data into test and training data.
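A sketch of the corresponding call with 10-fold cross-validation, reusing the train/test split from the previous sketch:

fitControl <- trainControl(method = "cv", number = 10)       # 10-fold cross-validation

ridge <- train(Salary ~ ., data = train,
               method     = "ridge",
               trControl  = fitControl,
               preProcess = c("center", "scale"),
               tuneGrid   = expand.grid(lambda = c(0, 1e-04, 0.1)))
ridge.pred <- predict(ridge, test)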

## Ridge Regression
##
## 133 samples
##  19 predictors
##
## Pre-processing: centered, scaled
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 120, 120, 119, 120, 120, 119, ...
##
## Resampling results across tuning parameters:
##
##   lambda  RMSE  Rsquared  RMSE SD  Rsquared SD
##   0       300   0.6       70       0.1
##   1e-04   300   0.6       70       0.1
##   0.1     300   0.6       70       0.1
##
## RMSE is used to select the best model using the minimum value.
## The final value used for the model is lambda = 1e-04.
# Extract the ridge regression coefficients
predict(ridge$finalModel, type = 'coef', mode = 'norm')$coefficients[19, ]
##      AtBat       Hits      HmRun       Runs        RBI      Walks
##   -157.221    313.860    -18.996      0.000    -70.392    171.242
##      Years     CAtBat      CHits     CHmRun      CRuns       CRBI
##    -27.543      0.000      0.000     51.811    202.537    187.933
##     CWalks    LeagueN  DivisionW    PutOuts    Assists     Errors
##   -224.951     12.839    -38.595     -9.128     13.288    -18.620
## NewLeagueN
##     22.326
sqrt(mean(ridge.pred - test$Salary)^2)
## [1] 17.53

Since Salary is measured in thousands of dollars, this corresponds to an average salary error of roughly $17,500. The regression coefficients do not appear to be shrunk all the way toward zero, but that is partly because we standardized the data first.

Now we should check whether this is actually better than a regular lm() model.
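A sketch of the corresponding caret call for an ordinary linear model, again under the same cross-validation settings (object names match those used below):

lmfit <- train(Salary ~ ., data = train,
               method     = "lm",
               trControl  = fitControl,
               preProcess = c("center", "scale"))
lmfit
lmfit.pred <- predict(lmfit, test)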

## Linear Regression
##
## 133 samples
##  19 predictors
##
## Pre-processing: scaled, centered
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 120, 120, 121, 119, 119, 119, ...
##
## Resampling results
##
##   RMSE  Rsquared  RMSE SD  Rsquared SD
##   300   0.5       70       0.2
coef(lmfit$finalModel)
## (Intercept)       AtBat        Hits       HmRun        Runs         RBI
##     535.958    -327.835     591.667      73.964    -169.699    -162.024
##       Walks       Years      CAtBat       CHits      CHmRun       CRuns
##     234.093     -60.557     125.017    -529.709     -45.888     680.654
##        CRBI      CWalks     LeagueN   DivisionW     PutOuts     Assists
##     393.276    -399.506      19.118     -46.679      -4.898      41.271
##      Errors  NewLeagueN
##     -22.672      22.469
sqrt(mean(lmfit.pred - test$Salary)^2)
## [1] 17.62

As we can see, the ridge regression fit has a slightly lower test error and a higher cross-validated R^2 than the linear model.

Lasso
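The lasso output below presumably comes from a caret call along these lines (a sketch; method = "lasso" also uses the elasticnet package and tunes the fraction parameter):

lasso <- train(Salary ~ ., data = train,
               method     = "lasso",
               trControl  = fitControl,
               preProcess = c("center", "scale"),
               tuneGrid   = expand.grid(fraction = c(0.1, 0.5, 0.9)))
lasso.pred <- predict(lasso, test)

# coefficients at the chosen fraction of the full L1 norm
predict(lasso$finalModel, type = "coef", mode = "fraction", s = 0.5)$coefficients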

## The lasso
##
## 133 samples
##  19 predictors
##
## Pre-processing: scaled, centered
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 120, 121, 120, 120, 120, 119, ...
##
## Resampling results across tuning parameters:
##
##   fraction  RMSE  Rsquared  RMSE SD  Rsquared SD
##   0.1       300   0.6       70       0.2
##   0.5       300   0.6       60       0.2
##   0.9       300   0.6       70       0.2
##
## RMSE is used to select the best model using the minimum value.
## The final value used for the model is fraction = 0.5.
## $s
## [1] 0.5
##
## $fraction
##   0
## 0.5
##
## $mode
## [1] "fraction"
##
## $coefficients
##      AtBat       Hits      HmRun       Runs        RBI      Walks
##   -227.113    406.285      0.000    -48.612    -93.740    197.472
##      Years     CAtBat      CHits     CHmRun      CRuns       CRBI
##    -47.952      0.000      0.000     82.291    274.745    166.617
##     CWalks    LeagueN  DivisionW    PutOuts    Assists     Errors
##   -287.549     18.059    -41.697     -7.001     30.768    -26.407
## NewLeagueN
##     19.190
sqrt(mean(lasso.pred - test$Salary)^2)
## [1] 14.35

With the lasso, we see that some coefficients have been forced to exactly zero. Even when its RMSE is comparable to that of ridge regression, this sparsity gives the lasso an interpretability advantage over the ordinary linear regression model.

PCR and PLS

Principal component regression
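The output below presumably comes from the pls package, along these lines (a sketch, reusing the training set from the earlier split; object names are illustrative except for pcr.pred, which matches the code further down):

library(pls)
set.seed(2)

pcr.fit <- pcr(Salary ~ ., data = train, scale = TRUE, validation = "CV")
summary(pcr.fit)                              # CV RMSEP and % variance explained
validationplot(pcr.fit, val.type = "MSEP")    # plot the cross-validated MSE

pcr.pred <- predict(pcr.fit, test, ncomp = 3)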

 

## Data:    X dimension: 133 19
##          Y dimension: 133 1
## Fit method: svdpc
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           451.5    336.9    323.9    328.5    328.4    329.9    337.1
## adjCV        451.5    336.3    323.6    327.8    327.5    328.8    335.7
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       335.2    333.7    338.5     334.3     337.8     340.4     346.7
## adjCV    332.5    331.7    336.4     332.0     335.5     337.6     343.4
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        345.1     345.7     329.4     337.3     343.5     338.7
## adjCV     341.2     341.6     325.7     332.7     338.4     333.9
##
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X         36.55    60.81    71.75    80.59    85.72    89.76    92.74
## Salary    45.62    50.01    51.19    51.98    53.23    53.36    55.63
##         8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X         95.37    96.49     97.45     98.09     98.73     99.21     99.52
## Salary    56.48    56.73     58.57     58.92     59.34     59.44     62.01
##         15 comps  16 comps  17 comps  18 comps  19 comps
## X          99.77     99.90     99.97     99.99    100.00
## Salary     62.65     65.29     66.48     66.77     67.37

pcr() reports the cross-validated RMSEP for each number of components, along with the percentage of variance explained in the training data. By plotting the cross-validation MSE we can see where the lowest error occurs, here with only two or three components. This is attractive compared with ordinary least squares, because a handful of components, rather than all 19 predictors, explains most of the variance.

Execute on the test data set.

sqrt(mean((pcr.pred - test$Salary)^2))
## [1] 374.8

Note that this test RMSE is computed as a true root mean squared error, so it is not directly comparable with the earlier values, which took the square root of the squared mean residual rather than averaging the squared residuals.
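The caret output below presumably comes from a call such as the following sketch, with method = "pcr" and up to three components:

pcr.caret <- train(Salary ~ ., data = train,
                   method     = "pcr",
                   trControl  = fitControl,
                   preProcess = c("center", "scale"),
                   tuneGrid   = expand.grid(ncomp = 1:3))
pcr.pred <- predict(pcr.caret, test)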

 

## Principal Component Analysis
##
## 133 samples
##  19 predictors
##
## Pre-processing: centered, scaled
## Resampling: Cross-Validated (10 fold)
##
## Summary of sample sizes: 121, 120, 118, 119, 120, 120, ...
##
## Resampling results across tuning parameters:
##
##   ncomp  RMSE  Rsquared  RMSE SD  Rsquared SD
##   1      300   0.5       100      0.2
##   2      300   0.5       100      0.2
##   3      300   0.6       100      0.2
##
## RMSE is used to select the best model using the minimum value.
## The final value used for the model is ncomp = 3.

Here caret selects the model with three components as the best one.

sqrt(mean(pcr.pred - test$Salary)^2)
## [1] 21.86

However, the PCR results are not easy to interpret, because the components are linear combinations of all 19 predictors and no variable selection is performed.

Partial Least Squares
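A sketch of the corresponding partial least squares fit with the pls package (plsr uses the kernel algorithm by default, matching the kernelpls fit method shown in the output below):

pls.fit <- plsr(Salary ~ ., data = train, scale = TRUE, validation = "CV")
summary(pls.fit)

pls.pred <- predict(pls.fit, test, ncomp = 2)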

## Data:    X dimension: 133 19
##          Y dimension: 133 1
## Fit method: kernelpls
## Number of components considered: 19
##
## VALIDATION: RMSEP
## Cross-validated using 10 random segments.
##        (Intercept)  1 comps  2 comps  3 comps  4 comps  5 comps  6 comps
## CV           451.5    328.9    328.4    332.6    329.2    325.4    323.4
## adjCV        451.5    328.2    327.4    330.6    326.9    323.0    320.9
##        7 comps  8 comps  9 comps  10 comps  11 comps  12 comps  13 comps
## CV       318.7    318.7    316.3     317.6     316.5     317.0     319.2
## adjCV    316.2    315.5    313.5     314.9     313.6     313.9     315.9
##        14 comps  15 comps  16 comps  17 comps  18 comps  19 comps
## CV        323.0     323.8     325.4     324.5     323.6     321.4
## adjCV     319.3     320.1     321.4     320.5     319.9     317.8
##
## TRAINING: % variance explained
##         1 comps  2 comps  3 comps  4 comps  5 comps  6 comps  7 comps
## X         35.94    55.11    67.37    74.29    79.65    85.17    89.17
## Salary    51.56    54.90    57.72    59.78    61.50    62.94    63.96
##         8 comps  9 comps  10 comps  11 comps  12 comps  13 comps  14 comps
## X         90.55    93.49     95.82     97.05     97.67     98.45     98.67
## Salary    65.34    65.75     66.03     66.44     66.69     66.77     66.94
##         15 comps  16 comps  17 comps  18 comps  19 comps
## X          99.02     99.26     99.42     99.98    100.00
## Salary     67.02     67.11     67.24     67.26     67.37

We choose M = 2 components and evaluate the corresponding test error.

sqrt(mean(pls.pred - test$Salary)^2)
## [1] 14.34

Compared with PCR, here we can see an improvement in RMSE.

