Column | Feature Engineering Manual Based on Jupyter: Data Preprocessing (1)

Authors: Yingxiang Chen & Zihan Yang | Editor: Red Stone

The importance of feature engineering in machine learning is self-evident: proper feature engineering can significantly improve the performance of a model. We have compiled a systematic feature engineering tutorial on GitHub for reference and study.

Project address:

github.com/YC-Coder-Ch...

This article covers the data preprocessing part: how to use scikit-learn to process static continuous variables, Category Encoders to process static categorical variables, and Featuretools to process common time series variables.

Table of Contents

The data preprocessing part of feature engineering is divided into three sections:

  • Static continuous variable
  • Static categorical variables
  • Time series variables

This article introduces Section 1.1, data preprocessing of static continuous variables, with detailed explanations in Jupyter using sklearn.

1.1 Static continuous variables

1.1.1 Discretization

Discretizing continuous variables can make the model more robust. For example, when predicting customer purchase behavior, a customer who has made 30 purchases may behave very similarly to one who has made 32 purchases. Such excessive precision in a feature is sometimes just noise, which is why LightGBM uses a histogram algorithm to prevent over-fitting. There are two common methods for discretizing continuous variables.
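As a quick illustration of this point (our own toy example, not part of the original notebook), `numpy.digitize` can bin hypothetical purchase counts so that 30 and 32 purchases land in the same bucket:

```python
import numpy as np

# hypothetical purchase counts for five customers
purchases = np.array([2, 11, 30, 32, 95])

# hand-picked bin edges: [0, 10), [10, 50), [50, inf)
bins = np.digitize(purchases, bins=[0, 10, 50])
# return array([1, 2, 2, 2, 3])
# the customers with 30 and 32 purchases now share bin 2, so the tiny
# difference between them no longer influences the model
```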

1.1.1.1 Binarization

Binarize numerical features.

```python
# load the sample data
from sklearn.datasets import fetch_california_housing

dataset = fetch_california_housing()
X, Y = dataset.data, dataset.target  # we will take the first column as an example later
```
```python
%matplotlib inline
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
sns.distplot(X[:, 0], hist=True, kde=True)
ax.set_title('Histogram', fontsize=12)
ax.set_xlabel('Value', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12);
# this feature has a long-tail distribution
```

```python
from sklearn.preprocessing import Binarizer

sample_columns = X[0:10, 0]  # select the top 10 samples
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])

model = Binarizer(threshold=6)  # set 6 to be the threshold
# if value <= 6, then return 0, else return 1
result = model.fit_transform(sample_columns.reshape(-1, 1)).reshape(-1)
# return array([1., 1., 1., 0., 0., 0., 0., 0., 0., 0.])
```

1.1.1.2 Binning

Bin the numerical features.

Uniform (equal-width) binning:

```python
from sklearn.preprocessing import KBinsDiscretizer

# in order to mimic the operation in real-world, we shall fit the KBinsDiscretizer
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10, 0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:, 0]

model = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='uniform')
# set 5 bins
# return ordinal bin number, set all bins to have identical widths
model.fit(train_set.reshape(-1, 1))
result = model.transform(test_set.reshape(-1, 1)).reshape(-1)
# return array([2., 2., 2., 1., 1., 1., 1., 0., 0., 1.])

bin_edge = model.bin_edges_[0]
# return array([0.4999, 3.39994, 6.29998, 9.20002, 12.10006, 15.0001]), the bin edges
```
```python
# visualize the bin edges
fig, ax = plt.subplots()
sns.distplot(train_set, hist=True, kde=True)

for edge in bin_edge:  # uniform bins
    line = plt.axvline(edge, color='b')
ax.legend([line], ['Uniform Bin Edges'], fontsize=10)
ax.set_title('Histogram', fontsize=12)
ax.set_xlabel('Value', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12);
```

Quantile binning:

```python
from sklearn.preprocessing import KBinsDiscretizer

# in order to mimic the operation in real-world, we shall fit the KBinsDiscretizer
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10, 0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:, 0]

model = KBinsDiscretizer(n_bins=5, encode='ordinal', strategy='quantile')
# set 5 bins
# return ordinal bin number, set all bins based on quantiles
model.fit(train_set.reshape(-1, 1))
result = model.transform(test_set.reshape(-1, 1)).reshape(-1)
# return array([4., 4., 4., 4., 2., 3., 2., 1., 0., 2.])

bin_edge = model.bin_edges_[0]
# return array([0.4999, 2.3523, 3.1406, 3.9667, 5.10824, 15.0001]), the bin edges
# 2.3523 is the 20% quantile, 3.1406 is the 40% quantile, etc.
```
```python
# visualize the bin edges
fig, ax = plt.subplots()
sns.distplot(train_set, hist=True, kde=True)

for edge in bin_edge:  # quantile based bins
    line = plt.axvline(edge, color='b')
ax.legend([line], ['Quantiles Bin Edges'], fontsize=10)
ax.set_title('Histogram', fontsize=12)
ax.set_xlabel('Value', fontsize=12)
ax.set_ylabel('Frequency', fontsize=12);
```

1.1.2 Scaling

It is difficult to compare features of different scales, especially in linear models such as linear regression and logistic regression. In k-means clustering or KNN models based on Euclidean distance, feature scaling is required, otherwise the distance measurement is meaningless. And for any algorithm that uses gradient descent, scaling also speeds up convergence.

Scaling requirements differ across commonly used models. Note that skewness affects the PCA model, so it is better to eliminate skewness with a power transformation first.
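As a minimal sketch of the point above about distance-based models (this pipeline is our own addition, not from the original notebook), the scaler is usually chained in front of the estimator so that statistics fitted on the training data are reused on the test data:

```python
from sklearn.datasets import fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor

X, Y = fetch_california_housing(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, Y, random_state=0)

# the scaler is fitted on the training split only, so no test-set
# statistics leak into the distance computation of the KNN model
knn_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('knn', KNeighborsRegressor(n_neighbors=10)),
])
knn_pipeline.fit(X_train, y_train)
print(knn_pipeline.score(X_test, y_test))  # R^2 on the held-out data
```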

1.1.2.1 Standard scaling (Z-score standardization)

Formula:

X' = (X − μ) / σ

where X is a variable (feature), μ is the mean of X, and σ is the standard deviation of X. This method is very sensitive to outliers, because outliers affect both μ and σ.

```python
from sklearn.preprocessing import StandardScaler

# in order to mimic the operation in real-world, we shall fit the StandardScaler
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10, 0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:, 0]

model = StandardScaler()
model.fit(train_set.reshape(-1, 1))  # fit on the train set and transform the test set

result = model.transform(test_set.reshape(-1, 1)).reshape(-1)  # top ten numbers for simplification
# return array([ 2.34539745,  2.33286782,  1.78324852,  0.93339178, -0.0125957,
#                0.08774668, -0.11109548, -0.39490751, -0.94221041, -0.09419626])
# result is the same as ((X[0:10,0] - X[10:,0].mean()) / X[10:,0].std())
```
```python
# visualize the distribution after the scaling
# fit and transform the entire first feature
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 1, figsize=(13, 9))
sns.distplot(X[:, 0], hist=True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12);
# this feature has long-tail distribution

model = StandardScaler()
model.fit(X[:, 0].reshape(-1, 1))
result = model.transform(X[:, 0].reshape(-1, 1)).reshape(-1)

# show the distribution of the entire feature
sns.distplot(result, hist=True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12);
# the distribution is the same, but scales change
fig.tight_layout()
```

1.1.2.2 MinMaxScaler (scale according to the numerical range)

Assume that the range we want to scale the feature values to is (a, b).

Formula:

X' = (X − Min) / (Max − Min) * (b − a) + a

where Min is the minimum value of X and Max is the maximum value of X. This method is also very sensitive to outliers, because outliers affect both Min and Max.

```python
from sklearn.preprocessing import MinMaxScaler

# in order to mimic the operation in real-world, we shall fit the MinMaxScaler
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10, 0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:, 0]

model = MinMaxScaler(feature_range=(0, 1))  # set the range to be (0,1)
model.fit(train_set.reshape(-1, 1))  # fit on the train set and transform the test set

result = model.transform(test_set.reshape(-1, 1)).reshape(-1)  # top ten numbers for simplification
# return array([0.53966842, 0.53802706, 0.46602805, 0.35469856, 0.23077613,
#               0.24392077, 0.21787286, 0.18069406, 0.1089985 , 0.22008662])
# result is the same as (X[0:10,0] - X[10:,0].min()) / (X[10:,0].max() - X[10:,0].min())
```
```python
# visualize the distribution after the scaling
# fit and transform the entire first feature
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 1, figsize=(13, 9))
sns.distplot(X[:, 0], hist=True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12);
# this feature has long-tail distribution

model = MinMaxScaler(feature_range=(0, 1))
model.fit(X[:, 0].reshape(-1, 1))
result = model.transform(X[:, 0].reshape(-1, 1)).reshape(-1)

# show the distribution of the entire feature
sns.distplot(result, hist=True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12);
# the distribution is the same, but the scale changes to [0, 1]
fig.tight_layout()
```

1.1.2.3 RobustScaler (outlier-robust scaling)

RobustScaler uses statistics that are robust to outliers (quantiles) to scale features. Suppose we want to scale using the feature's quantile range (a, b).

Formula:

X' = (X − median(X)) / (quantile(X, b) − quantile(X, a))

This method is more robust to outliers.

```python
import numpy as np
from sklearn.preprocessing import RobustScaler

# in order to mimic the operation in real-world, we shall fit the RobustScaler
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10, 0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:, 0]

model = RobustScaler(with_centering=True, with_scaling=True,
                     quantile_range=(25.0, 75.0))
# with_centering = True => recenter the feature by setting X' = X - X.median()
# with_scaling = True => rescale the feature by the quantile range set by the user
# set the quantile range to (25%, 75%)
model.fit(train_set.reshape(-1, 1))  # fit on the train set and transform the test set

result = model.transform(test_set.reshape(-1, 1)).reshape(-1)  # top ten numbers for simplification
# return array([ 2.19755974,  2.18664281,  1.7077657 ,  0.96729508,  0.14306683,
#                0.23049401,  0.05724508, -0.19003715, -0.66689601,  0.07196918])
# result is the same as (X[0:10,0] - np.quantile(X[10:,0], 0.5)) /
#                       (np.quantile(X[10:,0], 0.75) - np.quantile(X[10:,0], 0.25))
```
```python
# visualize the distribution after the scaling
# fit and transform the entire first feature
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 1, figsize=(13, 9))
sns.distplot(X[:, 0], hist=True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12);
# this feature has long-tail distribution

model = RobustScaler(with_centering=True, with_scaling=True,
                     quantile_range=(25.0, 75.0))
model.fit(X[:, 0].reshape(-1, 1))
result = model.transform(X[:, 0].reshape(-1, 1)).reshape(-1)

# show the distribution of the entire feature
sns.distplot(result, hist=True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12);
# the distribution is the same, but scales change
fig.tight_layout()
```

1.1.2.4 Power transformation (non-linear transformation)

All the scaling methods described above preserve the shape of the original distribution. But normality is an important assumption of many statistical models, so we can use a power transformation to map the original distribution toward a normal distribution.
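One quick way to verify that the transformation has removed the skew (our own addition, assuming scipy is available) is to compare the sample skewness before and after; Box-Cox is applicable here because the first feature is strictly positive (its minimum is about 0.5, as seen in the bin edges above):

```python
from scipy.stats import skew
from sklearn.preprocessing import PowerTransformer

feature = X[:, 0].reshape(-1, 1)  # the long-tailed first feature

print(skew(feature.ravel()))  # clearly positive for a right-skewed feature

transformed = PowerTransformer(method='box-cox',
                               standardize=True).fit_transform(feature)
print(skew(transformed.ravel()))  # should be much closer to 0 after the transform
```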

Box-Cox transformation:

The Box-Cox transformation is only applicable to positive values and takes the following form:

X'_i = (X_i^λ − 1) / λ,  if λ ≠ 0
X'_i = ln(X_i),          if λ = 0

The optimal λ, the value that best stabilizes the variance and minimizes the skewness over all values, is estimated through maximum likelihood.

```python
from sklearn.preprocessing import PowerTransformer

# in order to mimic the operation in real-world, we shall fit the PowerTransformer
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10, 0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:, 0]

model = PowerTransformer(method='box-cox', standardize=True)
# apply box-cox transformation
model.fit(train_set.reshape(-1, 1))  # fit on the train set and transform the test set

result = model.transform(test_set.reshape(-1, 1)).reshape(-1)  # top ten numbers for simplification
# return array([ 1.91669292,  1.91009687,  1.60235867,  1.0363095 ,  0.19831579,
#                0.30244247,  0.09143411, -0.24694006, -1.08558469,  0.11011933])
```
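As a small addition, the λ chosen by maximum likelihood can be inspected after fitting through the transformer's `lambdas_` attribute (one value per column):

```python
# the optimal lambda estimated on the train set; a value near 0 means the
# transform behaves roughly like a log transform
fitted_lambda = model.lambdas_[0]
```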
```python
# visualize the distribution after the scaling
# fit and transform the entire first feature
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 1, figsize=(13, 9))
sns.distplot(X[:, 0], hist=True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12);
# this feature has long-tail distribution

model = PowerTransformer(method='box-cox', standardize=True)
model.fit(X[:, 0].reshape(-1, 1))
result = model.transform(X[:, 0].reshape(-1, 1)).reshape(-1)

# show the distribution of the entire feature
sns.distplot(result, hist=True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12);
# the distribution now becomes normal
fig.tight_layout()
```

Yeo-Johnson transformation:

The Yeo-Johnson transformation applies to both positive and negative values and takes the following form:

X'_i = ((X_i + 1)^λ − 1) / λ,                if λ ≠ 0 and X_i ≥ 0
X'_i = ln(X_i + 1),                          if λ = 0 and X_i ≥ 0
X'_i = −((−X_i + 1)^(2−λ) − 1) / (2 − λ),    if λ ≠ 2 and X_i < 0
X'_i = −ln(−X_i + 1),                        if λ = 2 and X_i < 0

The optimal λ, the value that best stabilizes the variance and minimizes the skewness over all values, is estimated through maximum likelihood.

```python
from sklearn.preprocessing import PowerTransformer

# in order to mimic the operation in real-world, we shall fit the PowerTransformer
# on the trainset and transform the testset
# we take the top ten samples in the first column as test set
# take the rest samples in the first column as train set
test_set = X[0:10, 0]
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])
train_set = X[10:, 0]

model = PowerTransformer(method='yeo-johnson', standardize=True)
# apply yeo-johnson transformation
model.fit(train_set.reshape(-1, 1))  # fit on the train set and transform the test set

result = model.transform(test_set.reshape(-1, 1)).reshape(-1)  # top ten numbers for simplification
# return array([ 1.90367888,  1.89747091,  1.604735  ,  1.05166306,  0.20617221,
#                0.31245176,  0.09685566, -0.25011726, -1.10512438,  0.11598074])
```
```python
# visualize the distribution after the scaling
# fit and transform the entire first feature
import seaborn as sns
import matplotlib.pyplot as plt

fig, ax = plt.subplots(2, 1, figsize=(13, 9))
sns.distplot(X[:, 0], hist=True, kde=True, ax=ax[0])
ax[0].set_title('Histogram of the Original Distribution', fontsize=12)
ax[0].set_xlabel('Value', fontsize=12)
ax[0].set_ylabel('Frequency', fontsize=12);
# this feature has long-tail distribution

model = PowerTransformer(method='yeo-johnson', standardize=True)
model.fit(X[:, 0].reshape(-1, 1))
result = model.transform(X[:, 0].reshape(-1, 1)).reshape(-1)

# show the distribution of the entire feature
sns.distplot(result, hist=True, kde=True, ax=ax[1])
ax[1].set_title('Histogram of the Transformed Distribution', fontsize=12)
ax[1].set_xlabel('Value', fontsize=12)
ax[1].set_ylabel('Frequency', fontsize=12);
# the distribution now becomes normal
fig.tight_layout()
```

1.1.3 Normalization

All of the above scaling methods operate column-wise. Normalization, in contrast, works on each row: it "scales" each sample so that it has unit norm. Since it operates row by row, it distorts the relationships between features, so it is not commonly used. However, normalization is very useful in the context of text classification and clustering.

Suppose X[i][j] represents the value of feature j in sample i.

L1 normalization formula:

X'[i][j] = X[i][j] / Σ_k |X[i][k]|

L2 normalization formula:

X'[i][j] = X[i][j] / sqrt( Σ_k X[i][k]² )

L1 normalization:

```python
from sklearn.preprocessing import Normalizer

# Normalizer performs the operation on each row independently
# so the train set and test set are processed independently

###### for L1 norm
sample_columns = X[0:2, 0:3]  # select the first two samples, and the first three features
# return array([[ 8.3252    , 41.        ,  6.98412698],
#               [ 8.3014    , 21.        ,  6.23813708]])

model = Normalizer(norm='l1')  # use the L1 norm to normalize each sample
model.fit(sample_columns)

result = model.transform(sample_columns)  # a test set would be processed similarly
# return array([[0.14784762, 0.72812094, 0.12403144],
#               [0.23358211, 0.59089121, 0.17552668]])
# result = sample_columns / np.sum(np.abs(sample_columns), axis=1).reshape(-1, 1)
```

L2 normalization:

```python
###### for L2 norm
sample_columns = X[0:2, 0:3]  # select the first two samples, and the first three features
# return array([[ 8.3252    , 41.        ,  6.98412698],
#               [ 8.3014    , 21.        ,  6.23813708]])

model = Normalizer(norm='l2')  # use the L2 norm to normalize each sample
model.fit(sample_columns)

result = model.transform(sample_columns)
# return array([[0.19627663, 0.96662445, 0.16465922],
#               [0.35435076, 0.89639892, 0.26627902]])
# result = sample_columns / np.sqrt(np.sum(sample_columns**2, axis=1)).reshape(-1, 1)
```
```python
# visualize the difference in the distribution after Normalization
# compare it with the distribution after RobustScaling
# fit and transform the entire first & second feature
import seaborn as sns
import matplotlib.pyplot as plt

# RobustScaler
fig, ax = plt.subplots(2, 1, figsize=(13, 9))
model = RobustScaler(with_centering=True, with_scaling=True,
                     quantile_range=(25.0, 75.0))
model.fit(X[:, 0:2])
result = model.transform(X[:, 0:2])
sns.scatterplot(result[:, 0], result[:, 1], ax=ax[0])
ax[0].set_title('Scatter Plot of RobustScaling result', fontsize=12)
ax[0].set_xlabel('Feature 1', fontsize=12)
ax[0].set_ylabel('Feature 2', fontsize=12);

# Normalizer
model = Normalizer(norm='l2')
model.fit(X[:, 0:2])
result = model.transform(X[:, 0:2])
sns.scatterplot(result[:, 0], result[:, 1], ax=ax[1])
ax[1].set_title('Scatter Plot of Normalization result', fontsize=12)
ax[1].set_xlabel('Feature 1', fontsize=12)
ax[1].set_ylabel('Feature 2', fontsize=12);
fig.tight_layout()
# Normalization distorts the original distribution
```
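To connect this with the text-classification use case mentioned above, here is a minimal sketch of our own: the rows of a document-term count matrix are rescaled to unit L2 norm, so that document length no longer dominates similarity computations:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import Normalizer

docs = ["feature engineering is fun",
        "feature scaling and feature engineering and more feature engineering"]

counts = CountVectorizer().fit_transform(docs)  # sparse document-term counts
unit_rows = Normalizer(norm='l2').fit_transform(counts)
# each row of unit_rows now has unit L2 norm, so a long document and a short
# document with similar word proportions look alike to a cosine-based model
```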

1.1.4 Imputation of missing values

In practice, there may be missing values in the data set. However, a data set with missing values is incompatible with most scikit-learn models, which assume that all features are numerical and contain no missing values. So before applying a scikit-learn model, we need to impute the missing values.

But some newer models, such as XGBoost, LightGBM, and CatBoost (implemented in other packages), provide native support for missing values. When applying these models, we no longer need to impute the missing values in the data set first.
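As a hedged sketch of that point (assuming the separate `lightgbm` package is installed; this is not part of scikit-learn), a gradient-boosting model can be fitted directly on data that contains NaN, with no imputation step:

```python
import numpy as np
import lightgbm as lgb

X_missing = X.copy()
X_missing[3, 0] = np.nan  # manually punch a few holes in the data
X_missing[6, 0] = np.nan

# LightGBM learns which branch missing values should follow at each split,
# so the NaN entries do not need to be imputed beforehand
model = lgb.LGBMRegressor(n_estimators=10)
model.fit(X_missing, Y)
```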

1.1.4.1 Univariate feature imputation

Assuming there are missing values in the i-th column, we estimate them with a constant or with a statistic (mean, median, or mode) computed from the i-th column itself.

```python
from sklearn.impute import SimpleImputer

test_set = X[0:10, 0].copy()  # no missing values
# return array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])

# manually create some missing values
test_set[3] = np.nan
test_set[6] = np.nan
# now test_set becomes
# array([8.3252, 8.3014, 7.2574, nan, 3.8462, 4.0368, nan, 3.12, 2.0804, 3.6912])

# create the train samples
# in real-world, we should fit the imputer on the train set and transform the test set
train_set = X[10:, 0].copy()
train_set[3] = np.nan
train_set[6] = np.nan

imputer = SimpleImputer(missing_values=np.nan, strategy='mean')  # use mean
# we can set the strategy to 'mean', 'median', 'most_frequent', 'constant'
imputer.fit(train_set.reshape(-1, 1))
result = imputer.transform(test_set.reshape(-1, 1)).reshape(-1)
# return array([8.3252, 8.3014, 7.2574, 3.87023658, 3.8462,
#               4.0368, 3.87023658, 3.12, 2.0804, 3.6912])
# all missing values are imputed with 3.87023658
# 3.87023658 = np.nanmean(train_set),
# which is the mean of the train set ignoring missing values
```

1.1.4.2 Multivariate feature imputation

Multivariate feature imputation uses the information of the entire data set to estimate and impute missing values. In scikit-learn, it is implemented in an iterative, round-robin manner.

At each step, one feature column is designated as the output y, and the other feature columns are treated as the inputs X. A regressor is fitted on (X, y) for the samples where y is known, and is then used to predict the missing values of y. This is done for each feature in turn, and the whole process is repeated for a maximum number of imputation rounds.
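To make the loop concrete, here is a simplified, hand-rolled sketch of a single imputation round for one column (our own illustration of the idea; it is not how IterativeImputer is implemented internally):

```python
import numpy as np
from sklearn.linear_model import BayesianRidge

def impute_one_column(data, col):
    """One model-based imputation round for column `col` of a 2D array."""
    data = data.copy()

    # start from a simple mean fill so that the regressor sees complete inputs
    col_means = np.nanmean(data, axis=0)
    filled = np.where(np.isnan(data), col_means, data)

    missing_rows = np.isnan(data[:, col])
    other_cols = [c for c in range(data.shape[1]) if c != col]

    # fit on the rows where the target column is observed ...
    regressor = BayesianRidge()
    regressor.fit(filled[~missing_rows][:, other_cols], data[~missing_rows, col])

    # ... then predict the rows where it is missing
    data[missing_rows, col] = regressor.predict(filled[missing_rows][:, other_cols])
    return data

# IterativeImputer repeats this for every column and for several rounds
```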

Use a linear model (take BayesianRidge as an example):

```python
from sklearn.experimental import enable_iterative_imputer  # have to import this to enable IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.linear_model import BayesianRidge

test_set = X[0:10, :].copy()  # no missing values, select all features
# the first column is
# array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])

# manually create some missing values
test_set[3, 0] = np.nan
test_set[6, 0] = np.nan
test_set[3, 1] = np.nan
# now the first feature becomes
# array([8.3252, 8.3014, 7.2574, nan, 3.8462, 4.0368, nan, 3.12, 2.0804, 3.6912])

# create the train samples
# in real-world, we should fit the imputer on the train set and transform the test set
train_set = X[10:, :].copy()
train_set[3, 0] = np.nan
train_set[6, 0] = np.nan
train_set[3, 1] = np.nan

impute_estimator = BayesianRidge()
imputer = IterativeImputer(max_iter=10, random_state=0, estimator=impute_estimator)
imputer.fit(train_set)
result = imputer.transform(test_set)[:, 0]  # only select the first column to reveal how it works
# return array([8.3252, 8.3014, 7.2574, 4.6237195, 3.8462,
#               4.0368, 4.00258149, 3.12, 2.0804, 3.6912])
```

Use a tree-based model (take ExtraTrees as an example):

```python
from sklearn.experimental import enable_iterative_imputer  # have to import this to enable IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import ExtraTreesRegressor

test_set = X[0:10, :].copy()  # no missing values, select all features
# the first column is
# array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])

# manually create some missing values
test_set[3, 0] = np.nan
test_set[6, 0] = np.nan
test_set[3, 1] = np.nan
# now the first feature becomes
# array([8.3252, 8.3014, 7.2574, nan, 3.8462, 4.0368, nan, 3.12, 2.0804, 3.6912])

# create the train samples
# in real-world, we should fit the imputer on the train set and transform the test set
train_set = X[10:, :].copy()
train_set[3, 0] = np.nan
train_set[6, 0] = np.nan
train_set[3, 1] = np.nan

impute_estimator = ExtraTreesRegressor(n_estimators=10, random_state=0)
# parameters can be tuned in CV through the sklearn pipeline
imputer = IterativeImputer(max_iter=10, random_state=0, estimator=impute_estimator)
imputer.fit(train_set)
result = imputer.transform(test_set)[:, 0]  # only select the first column to reveal how it works
# return array([8.3252, 8.3014, 7.2574, 4.63813, 3.8462, 4.0368, 3.24721,
#               3.12, 2.0804, 3.6912])
```

Use K Nearest Neighbor (KNN):

```python
from sklearn.experimental import enable_iterative_imputer  # have to import this to enable IterativeImputer
from sklearn.impute import IterativeImputer
from sklearn.neighbors import KNeighborsRegressor

test_set = X[0:10, :].copy()  # no missing values, select all features
# the first column is
# array([8.3252, 8.3014, 7.2574, 5.6431, 3.8462, 4.0368, 3.6591, 3.12, 2.0804, 3.6912])

# manually create some missing values
test_set[3, 0] = np.nan
test_set[6, 0] = np.nan
test_set[3, 1] = np.nan
# now the first feature becomes
# array([8.3252, 8.3014, 7.2574, nan, 3.8462, 4.0368, nan, 3.12, 2.0804, 3.6912])

# create the train samples
# in real-world, we should fit the imputer on the train set and transform the test set
train_set = X[10:, :].copy()
train_set[3, 0] = np.nan
train_set[6, 0] = np.nan
train_set[3, 1] = np.nan

impute_estimator = KNeighborsRegressor(n_neighbors=10, p=1)
# set p=1 to use manhattan distance
# use manhattan distance to reduce the effect of outliers
# parameters can be tuned in CV through the sklearn pipeline
imputer = IterativeImputer(max_iter=10, random_state=0, estimator=impute_estimator)
imputer.fit(train_set)
result = imputer.transform(test_set)[:, 0]  # only select the first column to reveal how it works
# return array([8.3252, 8.3014, 7.2574, 3.6978, 3.8462, 4.0368, 4.052, 3.12,
#               2.0804, 3.6912])
```

1.1.4.3 Marking missing values

Sometimes the pattern of missing values itself may be informative. Therefore, scikit-learn also provides a way to convert a data set with missing values into a corresponding binary matrix that indicates where the missing values occur.

```python
from sklearn.impute import MissingIndicator

# illustrate this function on the train set only
# since the process is independent in the train set and test set
train_set = X[10:, :].copy()  # select all features
train_set[3, 0] = np.nan  # manually create some missing values
train_set[6, 0] = np.nan
train_set[3, 1] = np.nan

indicator = MissingIndicator(missing_values=np.nan, features='all')
# show the results on all the features
result = indicator.fit_transform(train_set)
# result has the same shape as train_set
# it contains only True & False, where True corresponds to a missing value

result[:, 0].sum()  # should return 2, the first column has two missing values
result[:, 1].sum()  # should return 1, the second column has one missing value
```

1.1.5 Feature transformation

1.1.5.1 Polynomial transformation

Sometimes we want to introduce nonlinear features into the model to increase its complexity. For a simple linear model, this can greatly increase its expressive power. But more complex models, such as tree-based models, already capture non-linear relationships through their non-parametric tree structure, so this feature transformation may not be very helpful for tree-based models.

For example, with two features X1 and X2 and the degree set to 3, the transformed feature vector has the form:

(1, X1, X2, X1², X1·X2, X2², X1³, X1²·X2, X1·X2², X2³)

```python
from sklearn.preprocessing import PolynomialFeatures

# illustrate this function on one synthesized sample
train_set = np.array([2, 3]).reshape(1, -1)  # shape (1, 2)
# return array([[2, 3]])

poly = PolynomialFeatures(degree=3, interaction_only=False)
# the highest degree is set to 3, and we want more than just interaction terms
result = poly.fit_transform(train_set)  # has shape (1, 10)
# array([[ 1.,  2.,  3.,  4.,  6.,  9.,  8., 12., 18., 27.]])
```

1.1.5.2 Custom transform

```python
from sklearn.preprocessing import FunctionTransformer

# illustrate this function on one synthesized sample
train_set = np.array([2, 3]).reshape(1, -1)  # shape (1, 2)
# return array([[2, 3]])

transformer = FunctionTransformer(func=np.log1p, validate=True)
# perform log transformation, X' = log(1 + x)
# func can be any numpy function such as np.exp
result = transformer.transform(train_set)
# return array([[1.09861229, 1.38629436]]), the same as np.log1p(train_set)
```

This concludes the introduction to data preprocessing for static continuous variables. Readers are encouraged to run the code themselves in Jupyter alongside the text.
