Support vector machines (SVM) are two classification algorithms. The so-called two classification is to divide data with multiple characteristics (attributes) into two categories. Among the current mainstream machine learning algorithms, neural networks and other machine learning models can already be used. Good completion of two classifications, multi-classifications, learning and research of SVM, understanding the rich algorithm knowledge behind SVM, will be of great benefit to the future research of other algorithms; in the process of implementing SVM, it will comprehensively use the one-dimensional search, KKT conditions and penalties introduced before Functions and other related knowledge. This article first explains the principle of SVM in detail, and then introduces how to implement SVM algorithm from scratch using python.

For ease of understanding, assuming that the sample has two attributes, the attribute values can be mapped to the x and y axes of the two-dimensional space axis, as shown in the following figure:

The samples in the example are clearly divided into two categories. The black solid dots can be named category 1, and the hollow dots can be named category 2. In practical applications, the categories will be digitized. For example, category one is represented by 1, and category two is represented by -1. , And call the digitized category a label. Each category corresponds to label 1, or -1 means that there is no hard and fast rule, you can do it according to your own preferences. It should be noted that since the SVM algorithm label also participates in mathematical operations, the category label cannot be set to 0 here.

It still corresponds to the above figure. If you can find a straight line, divide the above-mentioned solid point and hollow point into two parts. When there are other sample points next time, draw their attribute values as coordinates on the coordinate axis. The position relationship between the new sample point and the straight line can be used to determine its category. There are countless lines to satisfy such a line, SVM is to find the most suitable one: Observe the above picture, the green line is definitely not good, the classification line is already wrong without the verification set; and the blue line and the red line can be classified, and the blue line The color line is too close to the solid black dot. In comparison, the red line is more'fair'. The red line is the classification line that SVM needs to find. The mathematical language to describe the "fairness" characteristic of the red line can be expressed as: treat the black point and the hollow point as two sets, if a straight line is found, some points in the black point set will be separated The straight line is the closest, and the distance between these points and the straight line is d; a series of points can be found in the hollow point set, the closest to the straight line, and the distance is also d, then the straight line is what we want to find the linear classifier, and we call the two sets at the same time The point closest to the straight line is the support vector, and the SVM support vector machine is named after it.

Some algorithm books describe the SVM algorithm in this way to find a straight line that maximizes the separation space between the straight line and the closest point on both sides. It can also be seen from the above figure that the distance between the black point and the blue line is less than the distance to the red line (Right-angled hypotenuse). If you can find the support vector, you must find the classification line, and vice versa. The above is for two attribute values. The characteristics of the SVM algorithm can be derived by observing the two-dimensional plane. If there are many sample attributes, how to summarize the regularity of the algorithm? 1. let's talk about the separability theorem of lower convex sets. This theorem is not only the core theoretical support of SVM, but also the cornerstone of machine learning algorithms.

1. Convex set separability theorem

Let's take two-dimensional space as an example. In middle school, we have learned linear equations. For example, there is a linear equation y=2x-1, as shown in the following figure:

Write the linear equation y=2x-1 in inner product form:

The vector (-2,1) corresponds to the OA vector in the above figure, and the OA vector is changed to a unit vector, that is, the direction is the same as OA, the modulus is 1 vector OS, and the coordinate of S is, and the two sides of the straight line equation are divided by the same, we can get:

(x,y) represents any point on the straight line y=2x-1, the above formula shows that the inner product of any point on y=2x-1 and the unit vector S: is, the length of the vector OP in the figure is, and the negative sign is because The direction of the OP vector is opposite to the direction of the OS; in the above figure, the vectors v1 and v2 are projected on the OS vector as OP. This example shows: by introducing a vector OS, countless points on the line y=2x-1 can be on the vector OS Used to express, or in other words, the straight line y=2x-1 can be represented by coordinates (0,) on the vector OS. Through the inner product projection method, high-dimensional data can be turned into a real number on the vector. This is a linear functional process. In the field of mathematics, the inner product is commonly used to reduce the data dimension, and the multi-dimensional data is processed into a real number for later analysis. deal with.

When extended to any dimension, the above equation can be expressed by a set formula: S:{x | pTx= } x Rn. Corresponding to the above figure: the p vector corresponds to the vector OS, x is any point on the line, in the SVM algorithm, the set S formed by the x points is called the hyperplane. Sometimes the inner product value of the high-dimensional data set projected to the vector p is a range: such as {x | pTx>= } or {y | pTy<=- }, the high-dimensional data is projected onto the two intervals of the vector p :

Next, introduce the convex set separation theorem:

In the above figure, X and Y are convex sets in the high-dimensional space, and X Y= ; after projecting X and Y onto the vector p, two convex sets X', Y'can be obtained, on the vector p X'and Y'must be at the two ends of the vector p, that is, the two convex sets of X'and Y'are separable.

Note that there are two conditions in the theorem, one is that X and Y are convex sets, and the other is that the intersection of X and Y is an empty set. X', Y'separable means that the two can be distinguished on the vector p, and X and X', Y and Y'are all one-to-one mapping, X', Y'can be distinguished, it indirectly means X , Y can also be distinguished, the so-called'distinguishable' here is the dichotomy to be realized by SVM. The separability theorem of convex sets shows that two convex sets without intersection are linearly separable. At the same time, if the two data sets cannot be linearly separated, the data can be transformed into two convex sets by the kernel function, so the convex set separability theorem It also has a guiding significance for the generation of the kernel function.

2. SVM algorithm

2.1 Hyperplane and maximum separation

As mentioned earlier, the linear classifier that can realize the two-class classification of data is called a hyperplane. The SVM algorithm needs to request the hyperplane with the largest interval. The hyperplane can be set as S:{x|pTx= }, because pTx= , etc. Both sides of the formula can be divided by a positive number to normalize p to a unit vector. It is better to set p to be a unit vector after processing. This setting does not affect the SVM algorithm. In the general literature, the hyperplane is usually written as an implicit function:

S:{x|pTx- =0} x Rn, p Rn, ||p||=1

From the knowledge of geometry, the formula for the distance from any point x in the space to the hyperplane S is:

The distance between the support vectors belonging to the two classifications and the hyperplane is d, d>0. Since the support vectors are the closest points to the hyperplane in the respective classification data, the following inequalities are available for all data:

Divide both sides of formula (1) by d, we can get:, using the method of substitution, let:

In this way, the common form of constraints is obtained:

Next, we need to remove the absolute value symbol from the formula (3). SVM is a two-class classification problem: the classification label y=1 when Tx+b>=1; the classification label y=- when Tx+b<=-1 1. In this way, all data have inequalities:

y( Tx+b)>=1

Go back and look at the exchange element setting, =pT/d, the norm operations on both sides of the equation are: || ||=||pT||/d=1/d, we can get d=1/| | ||, the SVM algorithm needs to maximize the separation distance d on the basis of satisfying the constraints of formula 4. Assuming there are m data to be classified, implementing the SVM algorithm is equivalent to solving a nonlinear programming problem with inequality constraints:

The above problem can also be expressed as a minimum problem:

The classification effect of the hyperplane and the relationship between each parameter can be referred to the following figure:

2.2 Maximum soft interval

In practice, due to the existence of abnormal data, the hyperplane cannot completely divide the data into two parts, as shown in the following figure:

A small amount of abnormal data is mixed in the two classifications. Excluding a very small number of abnormal points, the hyperplane in the above figure can separate most of the samples. The formula can be compatible with the situation in the figure above after introducing slack variables:

**(5.1)**

** Normally it is equal to 0, which means that the sample points are listed on both sides of the support vector;** >0 means abnormal data

**:**when 0<<1, it means that the data is between the hyperplane and the support vector, as shown in the figure above 2 and point 4; and **

**>1 means that the data is in the other side's space, as shown in point 1 and point 3 in the above figure. The interval that satisfies the formula (5.1) is called the soft interval. The optimization result of the soft interval not only requires the maximum interval of the support vector, but also makes each**** as small as possible

**.**In other words

**,**after the hyperplane is determined, the number of samples defined as abnormal data should be as small as possible.

[x_train, y_train, x_test, y_test]= data();% load data % Project part 1 disp('--------Project part 1----------'); plot_alldata(x_train, y_train, x_test, y_test);% plot all data fit_data(x_train, y_train, x_test, y_test);% fitting data & plot the best model % Project part 2 disp('--------Project part 2----------'); svm(x_train,y_train,x_test,y_test);% svm model fitting & plots Copy code

### Result display

>> main_project Warning: Before creating variable names for the table, the column headers in the file were modified to make them valid MATLAB identifiers. The original column headings are stored in the VariableDescriptions property in. Set'PreserveVariableNames' to true to use the original column headings as table variable names. --------Project part 1---------- For n = 1 mse_train= 2.9834e+04 r2_train= 0.1084 mse_test= 3.2158e+04 r2_test= 0.1305 --------------- For n = 2 mse_train= 1.3195e+04 r2_train= 0.6057 mse_test= 1.5085e+04 r2_test= 0.6028 --------------- For n = 3 mse_train= 1.2732e+04 r2_train= 0.6195 mse_test= 1.4313e+04 r2_test= 0.6237 --------------- For n = 4 mse_train= 1.2687e+04 r2_train= 0.6208 mse_test= 1.4152e+04 r2_test= 0.6283 --------------- For n = 5 mse_train= 1.1969e+04 r2_train= 0.6423 mse_test= 1.3453e+04 r2_test= 0.6470 --------------- For n = 6 mse_train= 6.3150e+03 r2_train= 0.8113 mse_test= 6.8526e+03 r2_test= 0.8285 --------------- For n = 7 Warning: The condition of the polynomial is not set correctly. Please add points with different X values, reduce the degree of the polynomial, or try to center and scale as described in HELP POLYFIT. > In polyfit (line 79) In fit_data (line 5) In main_project (line 6) mse_train= 6.2697e+03 r2_train= 0.8126 mse_test= 6.8162e+03 r2_test= 0.8296 --------------- For n = 8 Warning: The condition of the polynomial is not set correctly. Please add points with different X values, reduce the degree of the polynomial, or try to center and scale as described in HELP POLYFIT. > In polyfit (line 79) In fit_data (line 5) In main_project (line 6) mse_train= 3.0802e+03 r2_train= 0.9079 mse_test= 3.3461e+03 r2_test= 0.9258 --------------- For n = 9 Warning: The condition of the polynomial is not set correctly. Please add points with different X values, reduce the degree of the polynomial, or try to center and scale as described in HELP POLYFIT. > In polyfit (line 79) In fit_data (line 5) In main_project (line 6) mse_train= 2.9532e+03 r2_train= 0.9117 mse_test= 3.3297e+03 r2_test= 0.9261 --------------- Warning: The condition of the polynomial is not set correctly. Please add points with different X values, reduce the degree of the polynomial, or try to center and scale as described in HELP POLYFIT. > In polyfit (line 79) In fit_data (line 32) In main_project (line 6) model coefficient: 0.0000 -0.0001 0.0074 -0.2230 3.8894 -39.3375 219.0309 -587.1015 589.8787 0.1253 --------Project part 2---------- -------Gaussian------- CASE1-default: mse = 1.4268e+03 r2 = 0.9771 CASE2: mse = 1.6445e+03 r2 = 0.9758 CASE3: mse = 1.4664e+03 r2 = 0.9772 OPTIMIZED: |================================================ ================================================== =================| | Iter | Eval | Objective: | Objective | BestSoFar | BestSoFar | BoxConstraint| KernelScale | Epsilon | | | result | log(1+loss) | runtime | (observed) | (estim.) | | | | |================================================ ================================================== =================| | 1 | Best | 10.431 | 0.89758 | 10.431 | 10.431 | 394.32 | 736.91 | 10.115 | | 2 | Accept | 10.437 | 0.1401 | 10.431 | 10.432 | 0.0011055 | 0.053194 | 47.555 | | 3 | Best | 10.426 | 0.071798 | 10.426 | 10.426 | 25.627 | 30.172 | 19150 | | 4 | Accept | 10.426 | 0.054275 | 10.426 | 10.426 | 1.1377 | 0.0012314 | 14020 | | 5 | Accept | 10.428 | 0.14271 | 10.426 | 10.426 | 470.09 | 836.88 | 0.41086 | | 6 | Accept | 10.426 | 0.037576 | 10.426 | 10.426 | 2.6562 | 0.0010627 | 3310.4 | | 7 | Accept | 10.426 | 0.034094 | 10.426 | 10.426 | 0.0010521 | 24.307 | 5816.8 | | 8 | Best | 10.074 | 0.065287 | 10.074 | 10.074 | 17.708 | 0.056046 | 1.6934 | | 9 | Accept | 10.426 | 0.038535 | 10.074 | 10.074 | 132.62 | 0.08944 | 3956.1 | | 10 | Best | 9.9965 | 0.06095 | 9.9965 | 9.9965 | 21.302 | 0.036971 | 0.28347 | | 11 | Best | 8.3865 | 0.07324 | 8.3865 | 8.3867 | 642.83 | 0.0044812 | 0.22288 | | 12 | Accept | 9.5704 | 0.046863 | 8.3865 | 8.3867 | 993.59 | 0.0028159 | 156.21 | | 13 | Best | 8.3864 | 0.071449 | 8.3864 | 8.3864 | 975.49 | 0.0012285 | 0.27834 | | 14 | Accept | 8.3865 | 0.064198 | 8.3864 | 8.3862 | 507.99 | 0.0010299 | 0.24398 | | 15 | Best | 8.3853 | 0.06352 | 8.3853 | 8.386 | 611.28 | 0.0019797 | 0.51237 | | 16 | Accept | 8.3865 | 0.069143 | 8.3853 | 8.3751 | 660.76 | 0.0013634 | 0.22497 | | 17 | Accept | 8.386 | 0.068376 | 8.3853 | 8.3773 | 828.03 | 0.0024918 | 0.35096 | | 18 | Accept | 8.3861 | 0.067393 | 8.3853 | 8.3799 | 732.96 | 0.0014109 | 0.32464 | | 19 | Accept | 8.3861 | 0.062025 | 8.3853 | 8.3797 | 520.66 | 0.0022738 | 0.30965 | | 20 | Accept | 8.3863 | 0.066882 | 8.3853 | 8.3797 | 668.85 | 0.0018481 | 0.24784 | |================================================ ================================================== =================| | Iter | Eval | Objective: | Objective | BestSoFar | BestSoFar | BoxConstraint| KernelScale | Epsilon | | | result | log(1+loss) | runtime | (observed) | (estim.) | | | | |================================================ ================================================== =================| | 21 | Accept | 8.3866 | 0.063036 | 8.3853 | 8.3796 | 903.96 | 0.0023935 | 0.22712 | | 22 | Accept | 8.3864 | 0.065384 | 8.3853 | 8.3807 | 988.43 | 0.0056305 | 0.26435 | | 23 | Accept | 8.3857 | 0.080765 | 8.3853 | 8.3808 | 768.95 | 0.0028004 | 0.40427 | | 24 | Accept | 10.434 | 0.047994 | 8.3853 | 8.3807 | 0.036485 | 962.56 | 0.22485 | | 25 | Accept | 10.432 | 0.091279 | 8.3853 | 8.3805 | 0.065077 | 0.001002 | 0.22826 | | 26 | Accept | 10.426 | 0.03629 | 8.3853 | 8.3804 | 0.15322 | 962.81 | 20662 | | 27 | Accept | 8.3865 | 0.063816 | 8.3853 | 8.3815 | 648.08 | 0.0022537 | 0.24789 | | 28 | Accept | 10.426 | 0.035603 | 8.3853 | 8.3814 | 0.0013648 | 0.0010192 | 21509 | | 29 | Accept | 10.434 | 0.048371 | 8.3853 | 8.3813 | 0.0010406 | 767.16 | 0.32152 | | 30 | Accept | 8.3866 | 0.079493 | 8.3853 | 8.3814 | 912.54 | 0.0035796 | 0.24067 | __________________________________________________________ The optimization is complete. Reach MaxObjectiveEvaluations 30. Total number of function calculations: 30 Total elapsed time: 26.8651 seconds. Total objective function calculation time: 2.808 Observed best feasible point: BoxConstraint KernelScale Epsilon _____________ ___________ _______ 611.28 0.0019797 0.51237 Observed objective function value = 8.3853 Estimated objective function value = 8.3814 Function calculation time = 0.06352 Estimated best feasible point (according to the model): BoxConstraint KernelScale Epsilon _____________ ___________ _______ 668.85 0.0018481 0.24784 Estimated objective function value = 8.3814 Estimated function calculation time = 0.067649 mse = 2.0132e+03 r2 = 0.9671 -------RBF------- CASE1-default: mse = 1.4268e+03 r2 = 0.9771 CASE2: mse = 1.4664e+03 r2 = 0.9772 CASE3: mse = 1.4591e+03 r2 = 0.9766 OPTIMIZED: |================================================ ================================================== =================| | Iter | Eval | Objective: | Objective | BestSoFar | BestSoFar | BoxConstraint| KernelScale | Epsilon | | | result | log(1+loss) | runtime | (observed) | (estim.) | | | | |================================================ ================================================== =================| | 1 | Best | 10.437 | 0.064096 | 10.437 | 10.437 | 0.0024199 | 0.012131 | 0.82686 | | 2 | Best | 10.424 | 0.035964 | 10.424 | 10.425 | 792.75 | 0.51879 | 2798.3 | | 3 | Accept | 10.424 | 0.033958 | 10.424 | 10.424 | 0.0015712 | 1.7285 | 12257 | | 4 | Best | 10.315 | 0.052699 | 10.315 | 10.315 | 64.709 | 60.677 | 0.43676 | | 5 | Accept | 10.43 | 0.053824 | 10.315 | 10.341 | 598.05 | 816.9 | 0.22621 | | 6 | Best | 10.315 | 0.049229 | 10.315 | 10.315 | 65.348 | 60.686 | 0.73483 | | 7 | Accept | 10.424 | 0.032769 | 10.315 | 10.315 | 0.068404 | 38.268 | 382.69 | | 8 | Best | 10.23 | 0.046858 | 10.23 | 10.23 | 163.2 | 61.687 | 38.56 | | 9 | Accept | 10.424 | 0.03518 | 10.23 | 10.23 | 4.1446 | 62.152 | 2623.5 | | 10 | Accept | 10.424 | 0.033162 | 10.23 | 10.23 | 199.56 | 62.459 | 2073.2 | | 11 | Best | 10.113 | 0.048122 | 10.113 | 10.113 | 307.74 | 60.662 | 18.094 | | 12 | Accept | 10.44 | 0.052386 | 10.113 | 10.113 | 0.0020345 | 55.312 | 10.057 | | 13 | Best | 9.5718 | 0.048005 | 9.5718 | 10.002 | 694.03 | 28.973 | 12.634 | | 14 | Best | 7.8382 | 0.057983 | 7.8382 | 7.8399 | 932.32 | 4.265 | 11.006 | | 15 | Accept | 10.433 | 0.041807 | 7.8382 | 7.8384 | 0.39209 | 4.3466 | 348.78 | | 16 | Accept | 8.111 | 0.055765 | 7.8382 | 7.8388 | 976.56 | 5.5148 | 1.4862 | | 17 | Accept | 10.424 | 0.032728 | 7.8382 | 7.8386 | 992.59 | 4.0578 | 2128.8 | | 18 | Best | 7.65 | 0.066125 | 7.65 | 7.65 | 819.88 | 2.352 | 4.6514 | | 19 | Accept | 7.7333 | 0.062206 | 7.65 | 7.6576 | 912.6 | 3.2069 | 4.5506 | | 20 | Best | 7.6008 | 0.071054 | 7.6008 | 7.6011 | 990.85 | 1.656 | 8.9775 | |================================================ ================================================== =================| | Iter | Eval | Objective: | Objective | BestSoFar | BestSoFar | BoxConstraint| KernelScale | Epsilon | | | result | log(1+loss) | runtime | (observed) | (estim.) | | | | |================================================ ================================================== =================| | 21 | Accept | 7.6387 | 0.098309 | 7.6008 | 7.6007 | 982.52 | 0.83259 | 2.7969 | | 22 | Accept | 7.6275 | 0.076077 | 7.6008 | 7.6018 | 797.68 | 0.9112 | 0.29266 | | 23 | Accept | 7.9363 | 0.0648 | 7.6008 | 7.6019 | 994.85 | 0.32932 | 0.23043 | | 24 | Accept | 7.6935 | 0.045446 | 7.6008 | 7.6018 | 29.346 | 0.967 | 0.37377 | | 25 | Accept | 7.6274 | 0.080696 | 7.6008 | 7.6017 | 910.2 | 0.9449 | 0.91925 | | 26 | Accept | 7.6121 | 0.077758 | 7.6008 | 7.603 | 943 | 1.1589 | 5.0625 | | 27 | Accept | 7.6153 | 0.062063 | 7.6008 | 7.6032 | 517.55 | 1.4588 | 0.25744 | | 28 | Accept | 7.6051 | 0.058127 | 7.6008 | 7.6033 | 254.49 | 1.155 | 0.38512 | | 29 | Accept | 7.6205 | 0.075103 | 7.6008 | 7.603 | 999.64 | 1.4418 | 1.9978 | | 30 | Accept | 7.6088 | 0.051488 | 7.6008 | 7.604 | 227.74 | 1.2748 | 1.9489 | __________________________________________________________ The optimization is complete. Reach MaxObjectiveEvaluations 30. Total number of function calculations: 30 Total elapsed time: 20.9975 seconds. Total objective function calculation time: 1.6638 Observed best feasible point: BoxConstraint KernelScale Epsilon _____________ ___________ _______ 990.85 1.656 8.9775 Observed objective function value = 7.6008 Estimated objective function value = 7.604 Function calculation time = 0.071054 Estimated best feasible point (according to the model): BoxConstraint KernelScale Epsilon _____________ ___________ _______ 254.49 1.155 0.38512 Estimated objective function value = 7.604 Estimated function calculation time = 0.057088 mse = 1.3855e+03 r2 = 0.9748 -------Polynomial------- CASE1-default: mse = 1.6898e+04 r2 = 0.6155 CASE2: mse = 1.5912e+04 r2 = 0.6422 CASE3: mse = 6.6300e+03 r2 = 0.8565 OPTIMIZED: |================================================ ================================================== =================| | Iter | Eval | Objective: | Objective | BestSoFar | BestSoFar | BoxConstraint| KernelScale | Epsilon | | | result | log(1+loss) | runtime | (observed) | (estim.) | | | | |================================================ ================================================== =================| | 1 | Best | 17.436 | 28.927 | 17.436 | 17.436 | 232.42 | 0.041278 | 0.77342 | | 2 | Best | 10.42 | 0.071368 | 10.42 | 10.775 | 21.908 | 0.010392 | 12625 | | 3 | Accept | 10.42 | 0.043085 | 10.42 | 10.42 | 0.0058184 | 18.398 | 3455.5 | | 4 | Accept | 71.286 | 28.847 | 10.42 | 11.828 | 0.0059341 | 0.0013698 | 2.1953 | | 5 | Accept | 10.42 | 0.047948 | 10.42 | 10.418 | 24.457 | 0.0030582 | 9739.2 | | 6 | Accept | 10.42 | 0.037128 | 10.42 | 10.418 | 0.10555 | 1.1212 | 22115 | | 7 | Accept | 42.845 | 28.004 | 10.42 | 10.415 | 998.17 | 0.0026635 | 83.309 | | 8 | Accept | 10.42 | 0.038793 | 10.42 | 10.412 | 0.36885 | 0.0012841 | 5198.9 | | 9 | Accept | 81.729 | 28.321 | 10.42 | 10.418 | 944.81 | 0.0043766 | 0.22306 | | 10 | Best | 9.571 | 0.40222 | 9.571 | 9.5701 | 172.69 | 0.60399 | 0.77291 | | 11 | Accept | 9.5906 | 0.17913 | 9.571 | 9.5705 | 0.021205 | 0.14439 | 154.14 | | 12 | Accept | 10.435 | 0.045818 | 9.571 | 9.5705 | 42.148 | 256.32 | 0.22485 | | 13 | Accept | 10.42 | 0.036242 | 9.571 | 9.5706 | 65.078 | 995.86 | 9943.1 | Copy code