# Original link: tecdat.cn/?p=3795

# Original source: Tuoduan Data Tribe Official Account

Glmnet is a package that fits generalized linear models via penalized maximum likelihood. The regularization path is computed for the lasso or elastic-net penalty at a grid of values of the regularization parameter λ. The algorithm is extremely fast and can exploit sparsity in the input matrix x.

It solves the problem

$$\min_{\beta_0,\beta} \frac{1}{N}\sum_{i=1}^{N} w_i\, l(y_i, \beta_0 + \beta^T x_i) + \lambda\Big[(1-\alpha)\|\beta\|_2^2/2 + \alpha\|\beta\|_1\Big]$$

on a grid of λ values covering the entire range. Here $l(y_i, \eta_i)$ is the negative log-likelihood contribution of observation i; for the Gaussian case, for example, it is $\tfrac{1}{2}(y_i - \eta_i)^2$. *The elastic-net* penalty is controlled by α, bridging the lasso (α = 1, the default) and ridge (α = 0). The tuning parameter λ controls the overall strength of the penalty.

It is known that the ridge penalty shrinks the coefficients of correlated predictors toward each other, while the lasso tends to pick one of them and discard the others. *The elastic net* mixes the two.

The code can handle sparse input-matrix formats as well as range constraints on the coefficients. It also includes methods for prediction and plotting, and functions for performing K-fold cross-validation.

## Quick start

First, we load the glmnet package:

```r
library(glmnet)
```

The default model used in the package is a Gaussian linear model or "least squares". We load a set of pre-created data for illustration. Users can load their own data or use the data saved in the workspace.

This command loads the input matrix `x` and the response vector `y` from the saved R data file.
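A minimal sketch, assuming the QuickStartExample data bundled with glmnet (the exact data file used in the original post is not shown):

```r
# Hypothetical example: load the QuickStartExample data shipped with glmnet
data(QuickStartExample, package = "glmnet")
x <- QuickStartExample$x   # 100 x 20 input matrix (list layout in recent glmnet versions)
y <- QuickStartExample$y   # response vector
```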

We fit the model

```r
fit = glmnet(x, y)
```

The coefficient paths can be visualized by

```r
plot(fit)
```

Each curve corresponds to one variable. It shows the path of that variable's coefficient against the ℓ1-norm of the whole coefficient vector as λ varies. The axis above indicates the number of nonzero coefficients at the current λ, which is the effective degrees of freedom (*df*) of the lasso. Users may also wish to annotate the curves; this can be done by setting `label = TRUE` in the plot command. A summary of the glmnet path at each step is displayed by

```r
print(fit)
## 
## Call:  glmnet(x = x, y = y) 
## 
##       Df   %Dev  Lambda
##  [1,]  0 0.0000 1.63000
##  [2,]  2 0.0553 1.49000
##  [3,]  2 0.1460 1.35000
##  [4,]  2 0.2210 1.23000
##  [5,]  2 0.2840 1.12000
##  [6,]  2 0.3350 1.02000
##  [7,]  4 0.3900 0.93300
##  [8,]  5 0.4560 0.85000
##  [9,]  5 0.5150 0.77500
## [10,]  6 0.5740 0.70600
## [11,]  6 0.6260 0.64300
## [12,]  6 0.6690 0.58600
## [13,]  6 0.7050 0.53400
## [14,]  6 0.7340 0.48700
## [15,]  7 0.7620 0.44300
## [16,]  7 0.7860 0.40400
## [17,]  7 0.8050 0.36800
## [18,]  7 0.8220 0.33500
## [19,]  7 0.8350 0.30600
## [20,]  7 0.8460 0.27800
```

It shows, from left to right, the number of nonzero coefficients (Df), the percent of null deviance explained (%Dev), and the value of λ (Lambda).

We can obtain the model coefficients at one or more λ's within the range of the sequence:

```r
coef(fit, s = 0.1)
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                     1
## (Intercept)  0.150928
## V1           1.320597
## V2           .
## V3           0.675110
## V4           .
## V5          -0.817412
## V6           0.521437
## V7           0.004829
## V8           0.319416
## V9           .
## V10          .
## V11          0.142499
## V12          .
## V13          .
## V14         -1.059979
## V15          .
## V16          .
## V17          .
## V18          .
## V19          .
## V20         -1.021874
```

New input data can also be used to make predictions at specific values of λ:

```r
predict(fit, newx = nx, s = c(0.1, 0.05))
##             1       2
##  [1,]  4.4641  4.7001
##  [2,]  1.7509  1.8513
##  [3,]  4.5207  4.6512
##  [4,] -0.6184 -0.6764
##  [5,]  1.7302  1.8451
##  [6,]  0.3565  0.3512
##  [7,]  0.2881  0.2662
##  [8,]  2.7776  2.8209
##  [9,] -3.7016 -3.7773
## [10,]  1.1546  1.1067
```

The function `cv.glmnet` performs cross-validation to select λ. We can fit and then plot the returned object; a sketch follows.
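The cross-validation call itself is not shown in the original, but the object `cvfit` is used below; presumably it was created along these lines:

```r
cvfit <- cv.glmnet(x, y)  # 10-fold cross-validation by default
plot(cvfit)
```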

It includes the cross-validation curve (red dotted line) and upper and lower standard deviation curves along the λ sequence (error bars). The vertical dotted lines indicate two selected λ's.

We can view the selected λ and the corresponding coefficients. For example,

```r
cvfit$lambda.min
## [1] 0.08307
```

```r
coef(cvfit, s = "lambda.min")
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                    1
## (Intercept)  0.14936
## V1           1.32975
## V2           .
## V3           0.69096
## V4           .
## V5          -0.83123
## V6           0.53670
## V7           0.02005
## V8           0.33194
## V9           .
## V10          .
## V11          0.16239
## V12          .
## V13          .
## V14         -1.07081
## V15          .
## V16          .
## V17          .
## V18          .
## V19          .
## V20         -1.04341
```

Note that the coefficients are expressed in sparse matrix format. The reason is that the solution along the regularization path is usually sparse, so using the sparse format is more efficient in time and space.

Predictions can also be made from the fitted cv.glmnet object; for example, for the first five observations at the selected λ:
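A sketch of the prediction call, assuming the first five rows of x serve as new data:

```r
# Hypothetical reconstruction of the call producing the output below
predict(cvfit, newx = x[1:5, ], s = "lambda.min")
```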

```r
##            1
## [1,] -1.3647
## [2,]  2.5686
## [3,]  0.5706
## [4,]  1.9682
## [5,]  1.4964
```

## Linear regression

Linear regression here refers to two model families: one is `gaussian` (the normal *distribution*), and the other is `mgaussian` (the multivariate normal *distribution*).

### Normal distribution

Suppose we have observations $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$, $i = 1, \ldots, N$. The objective function is

$$\min_{\beta_0,\beta} \frac{1}{2N}\sum_{i=1}^{N}(y_i - \beta_0 - x_i^T\beta)^2 + \lambda\Big[(1-\alpha)\|\beta\|_2^2/2 + \alpha\|\beta\|_1\Big],$$

where λ ≥ 0 is the complexity parameter and 0 ≤ α ≤ 1 bridges ridge regression (α = 0) and the lasso (α = 1).

The coordinate descent method is used to solve this problem. Specifically, suppose we have current estimates $\tilde\beta_0$ and $\tilde\beta_\ell$ for all $\ell$. By computing the gradient at $\beta_j = \tilde\beta_j$ and some simple calculus, the update is

$$\tilde\beta_j \leftarrow \frac{S\!\Big(\frac{1}{N}\sum_{i=1}^{N} x_{ij}\big(y_i - \tilde y_i^{(j)}\big),\ \lambda\alpha\Big)}{1 + \lambda(1-\alpha)},$$

where $\tilde y_i^{(j)} = \tilde\beta_0 + \sum_{\ell \neq j} x_{i\ell}\tilde\beta_\ell$ is the fitted value excluding the contribution of $x_{ij}$, and $S(z, \gamma) = \operatorname{sign}(z)(|z| - \gamma)_+$ is the soft-thresholding operator.

`glmnet` provides various arguments for customizing the fit. Commonly used ones include (see the example after this list):

- `alpha` is the elastic-net mixing parameter α, with range [0, 1]; α = 1 is the lasso (default) and α = 0 is ridge.
- `weights` supplies observation weights; the default is 1 for each observation.
- `nlambda` is the number of λ values in the sequence; the default is 100.
- `lambda` can be supplied, but is typically not; the program constructs its own sequence. When automatically generated, the λ sequence is determined by `lambda.max` and `lambda.min.ratio`.
- `standardize` is a logical flag for standardizing the `x` variables prior to fitting the model sequence.

For example, we set α = 0.2 and assign double weight to the latter half of the observations. To avoid an overly long display here, we also set `nlambda` to 20.
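The call, reconstructed from the `Call:` line in the printed output below:

```r
fit = glmnet(x, y, alpha = 0.2, weights = c(rep(1, 50), rep(2, 50)), nlambda = 20)
```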

Then we can print the fitted object:

```r
print(fit)
## 
## Call:  glmnet(x = x, y = y, weights = c(rep(1, 50), rep(2, 50)), alpha = 0.2, nlambda = 20) 
## 
##       Df  %Dev  Lambda
##  [1,]  0 0.000 7.94000
##  [2,]  4 0.179 4.89000
##  [3,]  7 0.444 3.01000
##  [4,]  7 0.657 1.85000
##  [5,]  8 0.785 1.14000
##  [6,]  9 0.854 0.70300
##  [7,] 10 0.887 0.43300
##  [8,] 11 0.902 0.26700
##  [9,] 14 0.910 0.16400
## [10,] 17 0.914 0.10100
## [11,] 17 0.915 0.06230
## [12,] 17 0.916 0.03840
## [13,] 19 0.916 0.02360
## [14,] 20 0.916 0.01460
## [15,] 20 0.916 0.00896
## [16,] 20 0.916 0.00552
## [17,] 20 0.916 0.00340
```

This displays the call that produced the object `fit`, along with a three-column matrix of Df (number of nonzero coefficients), %Dev (percent deviance explained) and Lambda.

We can plot the fitted object as before. Here we plot against the log-λ values, labeling each curve; a sketch follows.
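A sketch of the plotting call, assuming the standard `plot.glmnet` arguments:

```r
plot(fit, xvar = "lambda", label = TRUE)
```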

We can also plot against the fraction of deviance explained on the training data (see the sketch below). Toward the end of the path this value changes very little, while the coefficients "blow up", so this view lets us focus on the interesting parts of the fit.
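Again assuming the standard `plot.glmnet` interface:

```r
plot(fit, xvar = "dev", label = TRUE)
```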

We can extract the coefficients and make predictions at specific values of λ. Two commonly used arguments are:

- `s` specifies the value(s) of λ at which to extract coefficients or predictions.
- `exact` indicates whether the exact values of the coefficients are required (refitting if `s` is not in the original path) or whether a linear interpolation suffices.

A simple example is:
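A sketch of the comparison, assuming s = 0.5 for illustration (the exact value used in the original is not shown; recent glmnet versions require `x`, `y` and the other fitting arguments to be re-supplied when `exact = TRUE`):

```r
coef.exact = coef(fit, s = 0.5, exact = TRUE, x = x, y = y,
                  weights = c(rep(1, 50), rep(2, 50)))
coef.apprx = coef(fit, s = 0.5, exact = FALSE)
cbind2(coef.exact, coef.apprx)
```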

```r
## 21 x 2 sparse Matrix of class "dgCMatrix"
##                    1         1
## (Intercept)  0.19657  0.199099
## V1           1.17496  1.174650
## V2           .        .
## V3           0.52934  0.531935
## V4           .        .
## V5          -0.76126 -0.760959
## V6           0.46627  0.468209
## V7           0.06148  0.061927
## V8           0.38049  0.380301
## V9           .        .
## V10          .        .
## V11          0.14214  0.143261
## V12          .        .
## V13          .        .
## V14         -0.91090 -0.911207
## V15          .        .
## V16          .        .
## V17          .        .
## V18          .        0.009197
## V19          .        .
## V20         -0.86099 -0.863117
```

With the sketch above, the left column holds the exact coefficients and the right the approximate ones; they agree closely, with small differences due to the linear interpolation.

Users can make predictions from the fitted object. In addition to the arguments in `coef`, the argument `type` controls what is returned:

- "link" and "response" give the fitted values; for the Gaussian family, "response" is the same as "link".
- "coefficients" computes the coefficients at the supplied values of `s`.

For example,
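A sketch of the call, grounded in the note after the output that these are the first five fitted values at λ = 0.05:

```r
predict(fit, newx = x[1:5, ], type = "response", s = 0.05)
```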

```r
##            1
## [1,] -0.9803
## [2,]  2.2992
## [3,]  0.6011
## [4,]  2.3573
## [5,]  1.7520
```

This gives the fitted values of the first 5 observations at λ = 0.05. If multiple values of `s` are supplied, a matrix of predictions is produced.

Users can customize K-fold cross-validation. In addition to all the `glmnet` arguments, `cv.glmnet` has special arguments of its own, including `type.measure`, the loss used for cross-validation:

- "Mae" uses mean absolute error

For example,

```r
cvfit = cv.glmnet(x, y, type.measure = "mse", nfolds = 20)
```

performs 20-fold cross-validation using mean squared error as the criterion.

Parallel computing is also supported by `cv.glmnet`; to use it, register a parallel backend first and pass `parallel = TRUE`.
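One way to register a backend, assuming the doParallel package is installed (the original does not show which backend was used):

```r
library(doParallel)
registerDoParallel(cores = 2)  # hypothetical core count
```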

```r
system.time(cv.glmnet(X, Y))
##    user  system elapsed 
##   3.591   0.103   3.724
```

```r
system.time(cv.glmnet(X, Y, parallel = TRUE))
##    user  system elapsed 
##   4.318   0.391   2.700
```

As the timings above suggest, parallelization can substantially speed up the computation.

- "Lambda.min": that reaches the minimum MSE.

```r
cvfit$lambda.min
## [1] 0.08307
```
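The corresponding coefficients, presumably extracted as before:

```r
coef(cvfit, s = "lambda.min")
```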

```r
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                    1
## (Intercept)  0.14936
## V1           1.32975
## V2           .
## V3           0.69096
## V4           .
## V5          -0.83123
## V6           0.53670
## V7           0.02005
## V8           0.33194
## V9           .
## V10          .
## V11          0.16239
## V12          .
## V13          .
## V14         -1.07081
## V15          .
## V16          .
## V17          .
## V18          .
## V19          .
## V20         -1.04341
```

Here, we use the same folds for each fit (via `foldid`) so that we can also choose a value for α; see the sketch below.
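A sketch following the standard approach, with three hypothetical values of α:

```r
foldid = sample(1:10, size = length(y), replace = TRUE)  # fix the fold assignments
cv1  = cv.glmnet(x, y, foldid = foldid, alpha = 1)    # lasso
cv.5 = cv.glmnet(x, y, foldid = foldid, alpha = 0.5)  # elastic net
cv0  = cv.glmnet(x, y, foldid = foldid, alpha = 0)    # ridge
```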

Place them all on the same plot:
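A plotting sketch, assuming the objects from the previous block:

```r
plot(log(cv1$lambda), cv1$cvm, pch = 19, col = "red",
     xlab = "log(Lambda)", ylab = cv1$name)
points(log(cv.5$lambda), cv.5$cvm, pch = 19, col = "grey")
points(log(cv0$lambda), cv0$cvm, pch = 19, col = "blue")
legend("topleft", legend = c("alpha = 1", "alpha = 0.5", "alpha = 0"),
       pch = 19, col = c("red", "grey", "blue"))
```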

We see that the lasso (α = 1) does best here. We also see that the range of λ's used differs with α.

## Upper and lower bounds on coefficients

Suppose we want to fit our model but constrain the coefficients to be greater than -0.7 and less than 0.5. This can be done via the `upper.limits` and `lower.limits` arguments; see the sketch below.
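A sketch using the bounds from the text:

```r
tfit = glmnet(x, y, lower.limits = -0.7, upper.limits = 0.5)
plot(tfit)
```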

Often we want the coefficients to be positive; then we only need to set `lower.limits = 0` (the upper limit defaults to infinity).

## Penalty factors

The `penalty.factor` argument allows users to apply a separate penalty factor to each coefficient. The default is 1 for each parameter, but other values can be specified. In particular, any variable with `penalty.factor` equal to zero is not penalized at all.

In many cases, certain variables may be so important that we want to keep them at all times. This can be achieved by setting the corresponding penalty factors to 0, as in the sketch below:
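A sketch; which three variables had zero penalty in the original is not shown, so the indices 1, 3 and 5 here are hypothetical:

```r
p.fac = rep(1, 20)
p.fac[c(1, 3, 5)] = 0            # hypothetical choice of unpenalized variables
pfit = glmnet(x, y, penalty.factor = p.fac)
plot(pfit, label = TRUE)
```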

We see from the labels that the three variables with penalty factor 0 stay in the model throughout, while the other variables follow typical regularization paths and eventually shrink to 0.

## Customized plots

Sometimes, especially when the number of variables is small, we want to add variable labels to the graph.

We first generate some data with 10 variables, then fit a glmnet model and make the standard plot.

We want to label the curves with the variable names, placing each label at the coefficient's value at the end of the path; see the sketch below.
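A self-contained sketch (the data-generating seed and the label text are hypothetical):

```r
set.seed(101)                        # hypothetical seed
x <- matrix(rnorm(100 * 10), 100, 10)
y <- rnorm(100)
fit <- glmnet(x, y)
plot(fit, xvar = "lambda")

# place variable labels at the end of the path (smallest lambda)
beta.end <- coef(fit)[-1, ncol(coef(fit))]   # drop the intercept row
text(min(log(fit$lambda)), beta.end,
     labels = paste0("var", 1:10), cex = 0.7)
```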

### Multivariate normal

Use `family = "mgaussian"` for the multivariate normal model. As the name suggests, y is not a vector but a matrix, and as a result the coefficients at each value of λ also form a matrix.

Here, we solve the following problem:

$$\min_{\beta_0,\beta} \frac{1}{2N}\sum_{i=1}^{N}\|y_i - \beta_0 - \beta^T x_i\|_2^2 + \lambda\Big[(1-\alpha)\|\beta\|_F^2/2 + \alpha\sum_{j=1}^{p}\|\beta_j\|_2\Big].$$

Here $\beta_j$ is the jth row of the p×K coefficient matrix β. For a single predictor $x_j$, we replace the absolute penalty on each single coefficient by a group-lasso penalty $\|\beta_j\|_2$ on the K-vector of coefficients for that variable.

We use a set of pre-generated data for illustration.

We fit the data and return the object "mfit".

```r
mfit = glmnet(x, y, family = "mgaussian")
```

A special option here is `standardize.response`: if it is TRUE, the response variables are standardized (the default is FALSE).

To visualize the coefficients, we use the `plot` function; see the sketch below. Note that we set `type.coef = "2norm"`, so a single curve is plotted per variable, with value equal to the ℓ2-norm of that variable's coefficient vector (the default, `type.coef = "coef"`, makes a separate coefficient plot for each response).
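A sketch of the plotting call:

```r
plot(mfit, xvar = "lambda", label = TRUE, type.coef = "2norm")
```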

We can extract the coefficients and make predictions at requested values of λ with the `coef` and `predict` functions, as before. For example:
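A sketch of the prediction producing the two slices below; the λ values 0.1 and 0.01 are assumptions:

```r
predict(mfit, newx = x[1:5, ], s = c(0.1, 0.01))
```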

```r
## , , 1
## 
##           y1      y2      y3    y4
## [1,] -4.7106 -1.1635  0.6028 3.741
## [2,]  4.1302 -3.0508 -1.2123 4.970
## [3,]  3.1595 -0.5760  0.2608 2.054
## [4,]  0.6459  2.1206 -0.2252 3.146
## [5,] -1.1792  0.1056 -7.3353 3.248
## 
## , , 2
## 
##           y1      y2      y3    y4
## [1,] -4.6415 -1.2290  0.6118 3.780
## [2,]  4.4713 -3.2530 -1.2573 5.266
## [3,]  3.4735 -0.6929  0.4684 2.056
## [4,]  0.7353  2.2965 -0.2190 2.989
## [5,] -1.2760  0.2893 -7.8259 3.205
```

The prediction results are stored in a three-dimensional array: the first two dimensions form the prediction matrix (observations × response variables), and the third dimension indexes the values of `s`.

We can also perform k-fold cross-validation and plot the result; a sketch follows.
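Assuming the same x and y:

```r
cvmfit = cv.glmnet(x, y, family = "mgaussian")
plot(cvmfit)
```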

We display the selected optimal values of λ:

```r
cvmfit$lambda.min
## [1] 0.04732
```

```r
cvmfit$lambda.1se
## [1] 0.1317
```

## Logistic regression

Logistic regression is another widely used model when the response is categorical. If there are two possible outcomes, we use the binomial distribution; otherwise we use the multinomial.

### Binomial model

For the binomial model, suppose the response takes values in G = {1, 2}, and denote $y_i = I(g_i = 1)$. We model

$$\Pr(G = 1 \mid X = x) = \frac{e^{\beta_0 + \beta^T x}}{1 + e^{\beta_0 + \beta^T x}},$$

which can be written in the form

$$\log\frac{\Pr(G = 1 \mid X = x)}{\Pr(G = 2 \mid X = x)} = \beta_0 + \beta^T x,$$

the log-odds (logit) transformation. The objective function for penalized logistic regression uses the negative binomial log-likelihood:

$$\min_{\beta_0,\beta} -\Big[\frac{1}{N}\sum_{i=1}^{N} y_i(\beta_0 + x_i^T\beta) - \log\big(1 + e^{\beta_0 + x_i^T\beta}\big)\Big] + \lambda\Big[(1-\alpha)\|\beta\|_2^2/2 + \alpha\|\beta\|_1\Big].$$

Our algorithm uses a quadratic approximation to the log-likelihood and then applies coordinate descent to the resulting penalized weighted least-squares problem. These constitute the inner and outer loops.

For illustration purposes, we load a pre-generated input matrix `x` and response `y` from a data file.

For binomial logistic regression, the dependent variable y can be a two-level factor, or a two-column matrix of counts or proportions.

```r
fit = glmnet(x, y, family = "binomial")
```

As before, we can print and plot the fitted object, extract the coefficients at specific λ's, and make predictions.

Prediction for logistic regression is slightly different, mainly in the choice of `type`:

- "response" gives the fitted probabilities.
- "class" produces the class label corresponding to the maximum probability.
- "coefficients" computes the coefficients at the supplied values of `s`.

In the following example, we predict the class labels at λ = 0.05 and 0.01.
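A sketch of the call, assuming the first five rows of x as new data:

```r
predict(fit, newx = x[1:5, ], type = "class", s = c(0.05, 0.01))
```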

## 1 2 ## [1,] "0" "0" ## [2,] "1" "1" ## [3,] "1" "1" ## [4,] "0" "0" ## [5,] "1" "1" Copy code

For logistic regression, `cv.glmnet` has similar arguments and usage; its special options for `type.measure` include:

- "Deviation" uses the actual deviation.
- "Mae" uses the mean absolute error.
- "Class" gives the wrong classification error.
- "Auc" (only for two types of logistic regression) gives the area under the ROC curve.

For example,
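Reconstructing the call from the description that follows:

```r
cvfit = cv.glmnet(x, y, family = "binomial", type.measure = "class")
```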

This uses misclassification error as the criterion for 10-fold cross-validation.

We plot the object and view the optimal values of λ.

```r
cvfit$lambda.min
## [1] 0.01476
```

```r
cvfit$lambda.1se
## [1] 0.02579
```
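The coefficients at `lambda.min`, presumably extracted as before:

```r
coef(cvfit, s = "lambda.min")
```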

```r
## 31 x 1 sparse Matrix of class "dgCMatrix"
##                    1
## (Intercept)  0.24371
## V1           0.06897
## V2           0.66252
## V3          -0.54275
## V4          -1.13693
## V5          -0.19143
## V6          -0.95852
## V7           .
## V8          -0.56529
## V9           0.77454
## V10         -1.45079
## V11         -0.04363
## V12         -0.06894
## V13          .
## V14          .
## V15          .
## V16          0.36685
## V17          .
## V18         -0.04014
## V19          .
## V20          .
## V21          .
## V22          0.20882
## V23          0.34014
## V24          .
## V25          0.66310
## V26         -0.33696
## V27         -0.10570
## V28          0.24318
## V29         -0.22445
## V30          0.11091
```

As mentioned earlier, the results returned here are relative to the second level of the factor response.
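A sketch of the class prediction producing the output below:

```r
predict(cvfit, newx = x[1:10, ], s = "lambda.min", type = "class")
```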

## 1 ## [1,] "0" ## [21" ## [3,] "1" ## [4,] "0" ## [5,] "1" ## [6,] "0" ## [7,] "0" ## [8,] "0" ## [9,] "1" ## [10,] "1" Copy code

### Multinomial model

For the multinomial model, suppose the response has K levels, G = {1, 2, …, K}. Here we model

$$\Pr(G = k \mid X = x) = \frac{e^{\beta_{0k} + \beta_k^T x}}{\sum_{\ell=1}^{K} e^{\beta_{0\ell} + \beta_\ell^T x}}.$$

Let Y be the N×K indicator response matrix with elements $y_{i\ell} = I(g_i = \ell)$. Then the negative log-likelihood with the elastic-net penalty becomes

$$-\Big[\frac{1}{N}\sum_{i=1}^{N}\Big(\sum_{k=1}^{K} y_{ik}(\beta_{0k} + x_i^T\beta_k) - \log\sum_{k=1}^{K} e^{\beta_{0k} + x_i^T\beta_k}\Big)\Big] + \lambda\Big[(1-\alpha)\|\beta\|_F^2/2 + \alpha\sum_{j=1}^{p}\|\beta_j\|_q\Big].$$

Here β is a p×K matrix of coefficients; $\beta_k$ refers to the kth column (for outcome category k), and $\beta_j$ to the jth row (the vector of K coefficients for variable j). The last penalty term is $\|\beta_j\|_q$, and we have two choices for q: q ∈ {1, 2}. When q = 1, this is a lasso penalty on each single parameter. When q = 2, this is a grouped-lasso penalty on all K coefficients of a particular variable, which makes them all zero or nonzero together.

For the multinomial case, the usage is similar to that for logistic regression; we load a set of generated data.

A special option for multinomial regression is `type.multinomial`, which allows a grouped-lasso penalty (q = 2) when set to "grouped"; the default is "ungrouped" (q = 1).

We fit the model and plot the result; a sketch follows.
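A sketch assuming the grouped penalty just described:

```r
fit = glmnet(x, y, family = "multinomial", type.multinomial = "grouped")
plot(fit, xvar = "lambda", label = TRUE, type.coef = "2norm")
```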

We can also perform cross-validation and draw the returned objects.

Predict the classes at the best selected λ:
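A sketch of the cross-validation and prediction calls, assuming the first ten rows of x as new data:

```r
cvfit = cv.glmnet(x, y, family = "multinomial", type.multinomial = "grouped")
plot(cvfit)
predict(cvfit, newx = x[1:10, ], s = "lambda.min", type = "class")
```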

## 1 ## [1,] "3" ## [22" ## [3,] "2" ## [4,] "1" ## [5,] "1" ## [6,] "3" ## [7,] "3" ## [8,] "1" ## [9,] "1" ## [10,] "2" Copy code

## Poisson model

Poisson regression is used to model count data under the assumption of Poisson error, or non-negative data where the mean and variance are proportional. Poisson is also a member of the exponential family of distributions. We usually model it on the log scale: $\log \mu(x) = \beta_0 + \beta^T x$.

The log-likelihood of a given observation is

$$l(y_i \mid x_i) = y_i(\beta_0 + x_i^T\beta) - e^{\beta_0 + x_i^T\beta} - \log(y_i!).$$

As before, we optimize the penalized log-likelihood:

$$\min_{\beta_0,\beta} -\frac{1}{N}\sum_{i=1}^{N} l(y_i \mid x_i) + \lambda\Big[(1-\alpha)\|\beta\|_2^2/2 + \alpha\|\beta\|_1\Big].$$

Glmnet uses an outer Newton loop and an inner weighted least-squares loop (as in logistic regression) to optimize this criterion.

First, we load a set of Poisson data.

Again, we fit the model and plot the coefficients; a sketch follows.
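A sketch of the fit and plot:

```r
fit = glmnet(x, y, family = "poisson")
plot(fit)
```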

As before, we can use `coef` and `predict` to extract the coefficients and make predictions, respectively. For example:
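A sketch of the two calls producing the outputs below; the choices s = 1 and s = c(0.1, 1) are assumptions:

```r
coef(fit, s = 1)
predict(fit, newx = x[1:5, ], type = "response", s = c(0.1, 1))
```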

```r
## 21 x 1 sparse Matrix of class "dgCMatrix"
##                    1
## (Intercept)  0.61123
## V1           0.45820
## V2          -0.77061
## V3           1.34015
## V4           0.04350
## V5          -0.20326
## V6           .
## V7           .
## V8           .
## V9           .
## V10          .
## V11          .
## V12          0.01816
## V13          .
## V14          .
## V15          .
## V16          .
## V17          .
## V18          .
## V19          .
## V20          .
```

```r
##            1       2
## [1,]  2.4944  4.4263
## [2,] 10.3513 11.0586
## [3,]  0.1180  0.1782
## [4,]  0.9713  1.6829
## [5,]  1.1133  1.9935
```

We can also use cross-validation to find the optimal λ for inference.

The options are almost the same as for the Gaussian family, except for the available choices of `type.measure`.

We can fit and plot the cross-validated object; a sketch follows.
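Assuming the same data:

```r
cvfit = cv.glmnet(x, y, family = "poisson")
plot(cvfit)
```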

We can also display the optimal λ's and the corresponding coefficients; see the sketch below.
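A sketch extracting the coefficients at both selected λ's, matching the two-column output below:

```r
opt.lam = c(cvfit$lambda.min, cvfit$lambda.1se)
coef(cvfit, s = opt.lam)
```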

```r
## 21 x 2 sparse Matrix of class "dgCMatrix"
##                     1        2
## (Intercept)  0.031263  0.18570
## V1           0.619053  0.57537
## V2          -0.984550 -0.93212
## V3           1.525234  1.47057
## V4           0.231591  0.19692
## V5          -0.336659 -0.30469
## V6           0.001026  .
## V7          -0.012830  .
## V8           .         .
## V9           .         .
## V10          0.015983  .
## V11          .         .
## V12          0.030867  0.02585
## V13         -0.027971  .
## V14          0.032750  .
## V15         -0.005933  .
## V16          0.017506  .
## V17          .         .
## V18          0.004026  .
## V19         -0.033579  .
## V20          0.012049  0.00993
```

## Cox model

The Cox proportional hazards model is usually used to study the relationship between predictor variables and survival time.

The Cox proportional hazards regression model does not examine the relationship between survival time and X directly; instead it models the hazard function as the dependent quantity. The basic form of the model is

$$h(t \mid X) = h_0(t)\, \exp(\beta_1 X_1 + \cdots + \beta_p X_p),$$

where the β's are the partial regression coefficients of the independent variables, parameters that must be estimated from the sample data, and $h_0(t)$ is the baseline hazard when the X vector is 0, also a quantity to be estimated from the sample data. This is referred to as the **Cox regression model**.

Since the Cox model makes no assumptions about the form of $h_0(t)$, it has great flexibility in dealing with problems; on the other hand, in many cases we only need to estimate the parameters β (as in factor analysis, etc.), and even when $h_0(t)$ is unknown, the β's can still be estimated. That is to say, the Cox regression model is not a fully parametric model, because it contains the unspecified $h_0(t)$, yet its parameters can still be estimated; it is therefore a **semi-parametric model**.

The model can equivalently be written as

$$\log\frac{h(t \mid X)}{h_0(t)} = \beta_1 X_1 + \cdots + \beta_p X_p.$$

We use a set of pre-generated sample data; users can load their own data and follow a similar procedure. In this case, x must be an n×p matrix of covariate values, with each row corresponding to one patient and each column to one covariate. y is an n×2 matrix, with columns "time" (the survival time) and "status" (the censoring indicator: 1 = event, 0 = censored).

```r
##         time status
## [1,] 1.76878      1
## [2,] 0.54528      1
## [3,] 0.04486      0
## [4,] 0.85032      0
## [5,] 0.61488      1
```

We calculate the solution path under the default settings.

Plot the coefficients; a sketch of both calls follows.
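A sketch of the fit and plot:

```r
fit = glmnet(x, y, family = "cox")
plot(fit)
```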

Extract the coefficients at a particular value of λ:
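Assuming, for illustration, λ = 0.05:

```r
coef(fit, s = 0.05)
```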

```r
## 30 x 1 sparse Matrix of class "dgCMatrix"
##            1
## V1   0.37694
## V2  -0.09548
## V3  -0.13596
## V4   0.09814
## V5  -0.11438
## V6  -0.38899
## V7   0.24291
## V8   0.03648
## V9   0.34740
## V10  0.03865
## V11  .
## V12  .
## V13  .
## V14  .
## V15  .
## V16  .
## V17  .
## V18  .
## V19  .
## V20  .
## V21  .
## V22  .
## V23  .
## V24  .
## V25  .
## V26  .
## V27  .
## V28  .
## V29  .
## V30  .
```

The function `cv.glmnet` can be used to compute K-fold cross-validation for the Cox model; a sketch follows.
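Assuming the default settings:

```r
cvfit = cv.glmnet(x, y, family = "cox")
plot(cvfit)
```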

After fitting, we can view the best lambda value and the cross-validated error graph to help evaluate our model.

As mentioned earlier, the left vertical line in the figure shows where the CV error curve reaches its minimum, and the right vertical line shows the most regularized model whose CV error is within 1 standard deviation of the minimum. We also extract the optimal λ's.

```r
cvfit$lambda.min
## [1] 0.01594
```

```r
cvfit$lambda.1se
## [1] 0.04869
```

We can examine the covariates selected in the model and view their coefficients; see the sketch below.
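A sketch of how the objects `coef.min` and `index.min` printed below can be constructed:

```r
coef.min   = coef(cvfit, s = "lambda.min")   # sparse coefficient vector
active.min = which(coef.min != 0)            # indices of the nonzero coefficients
index.min  = coef.min[active.min]            # their values
```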

```r
index.min
##  [1]  0.491297 -0.174601 -0.218649  0.175112 -0.186673 -0.490250  0.335197
##  [8]  0.091587  0.450169  0.115922  0.017595 -0.018365 -0.002806 -0.001423
## [15] -0.023429  0.001688 -0.008236
```

```r
coef.min
## 30 x 1 sparse Matrix of class "dgCMatrix"
##             1
## V1   0.491297
## V2  -0.174601
## V3  -0.218649
## V4   0.175112
## V5  -0.186673
## V6  -0.490250
## V7   0.335197
## V8   0.091587
## V9   0.450169
## V10  0.115922
## V11  .
## V12  .
## V13  0.017595
## V14  .
## V15  .
## V16  .
## V17 -0.018365
## V18  .
## V19  .
## V20  .
## V21 -0.002806
## V22 -0.001423
## V23  .
## V24  .
## V25 -0.023429
## V26  .
## V27  0.001688
## V28  .
## V29  .
## V30 -0.008236
```

## Sparse matrices

Our package supports sparse input matrices, which allow efficient storage and operations on large matrices that have only a few nonzero entries.

We load a set of pre-created sample data: a 100×20 sparse input matrix `x` and a response vector `y`. The input matrix is stored in sparse format:
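Presumably checked with:

```r
class(x)
```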

## [1] "dgCMatrix" ## attr(,"package") ## [1] "Matrix" Copy code

We can fit the model as before.

```r
fit = glmnet(x, y)
```

Perform cross-validation and plot the resulting object; a sketch follows.
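```r
cvfit = cv.glmnet(x, y)
plot(cvfit)
```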

Predict with new input data; for example:
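A sketch of the call; the rows used and the value of s are assumptions:

```r
predict(fit, newx = x[1:5, ], s = 0.05)  # hypothetical s
```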

```r
##            1
## [1,]  0.3826
## [2,] -0.2172
## [3,] -1.6622
## [4,] -0.4175
## [5,] -1.3941
```

## References

Jerome Friedman, Trevor Hastie and Rob Tibshirani (2008). *Regularization Paths for Generalized Linear Models via Coordinate Descent*.
