# SVM

## 1. Basic concepts

The basic idea of the Support Vector Machine (SVM) is to find the separating hyperplane in feature space that maximizes the margin between the positive and negative samples of the training set. SVM is a supervised learning algorithm for binary classification. With the kernel method, SVM can also solve nonlinear problems.
Generally, there are three types of SVM:

• Hard-margin support vector machine (linearly separable support vector machine): when the training data is linearly separable, a linearly separable support vector machine is learned by hard-margin maximization.
• Soft-margin support vector machine: when the training data is approximately linearly separable, a linear support vector machine is learned by soft-margin maximization.
• Nonlinear support vector machine: when the training data is not linearly separable, a nonlinear support vector machine is learned by combining the kernel method with soft-margin maximization.

## 2. Hard-Margin Support Vector Machine

Consider a training set $D=\{(x_1,y_1),(x_2,y_2),\dots,(x_n,y_n)\}$ with $y_i\in\{+1,-1\}$, where $i$ indexes the $i$-th sample and $n$ is the sample size. The most basic idea of classification learning is to find, based on $D$, a hyperplane in feature space that separates the positive and negative samples; the SVM algorithm addresses how to find the best such hyperplane. A hyperplane can be described by the linear equation:

$$w^T x+b=0 \tag{1}$$

Here $w$ is the normal vector, which determines the orientation of the hyperplane, and $b$ is the offset, which determines the distance between the hyperplane and the origin.
For the training set $D$, suppose the best hyperplane $(w^{*})^T x+b^{*}=0$ has been found. The classification decision function is then defined as

$$f(x)=\operatorname{sign}\left((w^{*})^T x+b^{*}\right) \tag{2}$$

This classification decision function is also called the linearly separable support vector machine.
At test time, the distance between a sample and the separating hyperplane indicates how reliable the prediction is: the farther a sample lies from the hyperplane, the more reliable its classification, and vice versa.
So which separating hyperplane is the best one?
Of the three hyperplanes A, B, and C in Figure 1, hyperplane B should clearly be selected: a hyperplane must first of all separate the two classes of sample points.

Figure 1

Of the three hyperplanes A, B, and C in Figure 2, hyperplane C should be selected, because it best tolerates local perturbations of the training samples and therefore gives the most robust classification. For example, because the training set is limited or noisy, samples outside the training set may lie closer to the boundary between the two classes than the training samples in Figure 2, which causes classification errors. Hyperplane C is the least affected by this; that is, its classification results are the most robust and reliable, and it generalizes best to unseen samples.

Figure 2

The best hyperplane is derived below using the example in Figure 3.

Figure 3

A hyperplane can be denoted by $(w,b)$. By the point-to-plane distance formula, the distance from any point $x$ to the hyperplane $(w,b)$ can be written as:

$$r=\frac{|w^T x+b|}{\|w\|} \tag{3}$$
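
Equation (3) can be checked numerically. The following is a minimal Python sketch; the hyperplane $3x_1+4x_2-5=0$, the test points, and the function names are illustrative choices made here, not values from the figures.

```python
import math

def signed_distance(w, b, x):
    """Signed distance from x to the hyperplane w^T x + b = 0.

    The sign is positive on the side the normal vector w points to.
    """
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm_w

def distance(w, b, x):
    """Unsigned distance, as in equation (3)."""
    return abs(signed_distance(w, b, x))

# Illustrative hyperplane (an assumption for this example): 3*x1 + 4*x2 - 5 = 0
w, b = [3.0, 4.0], -5.0
print(distance(w, b, [3.0, 4.0]))         # (9 + 16 - 5) / 5 = 4.0
print(signed_distance(w, b, [0.0, 0.0]))  # -5 / 5 = -1.0
```

The signed version is what the decision function in (2) takes the sign of; the unsigned version is the reliability measure described above.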

Assume the hyperplane $(w,b)$ classifies the training samples correctly. For any positive sample $(x_i,y_i)\in D$, the projection of $x_i$ onto the normal vector $w$ should exceed some value $c$ for the sample to be predicted positive; that is, there is a value $c$ such that $w^T x_i>c$ when $y_i=+1$. Setting $b=-c$, the condition $w^T x_i>c$ can be written as $w^T x_i+b>0$. During training we impose a stricter constraint to make the final classifier more robust, and require $w^T x_i+b\ge +1$. Any other positive threshold could be used, but it can always be rescaled to 1 by scaling $w$ and $b$ by the same factor, so 1 is chosen for convenience. Similarly, for negative samples we require $w^T x_i+b\le -1$ when $y_i=-1$. In summary:

$$\begin{cases} w^T x_i+b\ge +1, & y_i=+1\\ w^T x_i+b\le -1, & y_i=-1 \end{cases} \tag{4}$$

that is:

$$y_i(w^T x_i+b)\ge +1 \tag{5}$$

As shown in Figure 3, the few training samples closest to the optimal hyperplane $w^T x+b=0$ satisfy the above inequality with equality. They are called "support vectors". Denote the distance between the hyperplanes $w^T x+b=+1$ and $w^T x+b=-1$ by $\gamma$; this distance is called the "margin". A core idea of SVM is to maximize the margin $\gamma$. Let us derive what $\gamma$ depends on.
Denote a positive sample on the hyperplane $w^T x+b=+1$ by $x_+$ and a negative sample on the hyperplane $w^T x+b=-1$ by $x_-$. By the rules of vector subtraction, the projection of $x_+-x_-$ onto the direction of the normal vector $w$ is the margin $\gamma$:

$$\gamma=\frac{(x_+-x_-)^T w}{\|w\|}=\frac{x_+^T w}{\|w\|}-\frac{x_-^T w}{\|w\|} \tag{6}$$

Since $w^T x_++b=+1$ and $w^T x_-+b=-1$, we have:

$$\begin{cases} w^T x_+=1-b\\ w^T x_-=-1-b \end{cases} \tag{7}$$

Substituting (7) into (6) gives:

$$\gamma=\frac{2}{\|w\|} \tag{8}$$

That is, the margin between the two classes depends only on the normal vector $w$ of the hyperplane.
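
The relation between the projection in (6) and the closed form in (8) can be verified on a small example. This is a sketch under assumed values: $w=(3,4)$, $b=0$, and the two points below were picked by hand so that they lie on the two margin hyperplanes.

```python
import math

def margin_width(w):
    """Margin gamma = 2 / ||w||, equation (8)."""
    return 2.0 / math.sqrt(sum(wi * wi for wi in w))

def projected_gap(w, x_pos, x_neg):
    """Projection of (x_pos - x_neg) onto the unit normal w / ||w||, equation (6)."""
    norm_w = math.sqrt(sum(wi * wi for wi in w))
    return sum((p - q) * wi for p, q, wi in zip(x_pos, x_neg, w)) / norm_w

# Illustrative values (assumptions): w = (3, 4), b = 0.
w = [3.0, 4.0]
x_pos = [3.0 / 25.0, 4.0 / 25.0]    # lies on w^T x + b = +1
x_neg = [-3.0 / 25.0, -4.0 / 25.0]  # lies on w^T x + b = -1

print(margin_width(w))                     # 2 / 5 = 0.4
print(margin_width([2 * wi for wi in w]))  # doubling w halves the margin: 0.2
print(math.isclose(projected_gap(w, x_pos, x_neg), margin_width(w)))  # True
```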
Finding the maximum-margin hyperplane means finding the parameters $w$ and $b$ that satisfy the constraints in (4) and maximize $\gamma$, namely:

$$\begin{cases}\max\limits_{w,b}\ \dfrac{2}{\|w\|}\\ \text{s.t.}\ \ y_i(w^T x_i+b)\ge +1,\quad i=1,2,\dots,n\end{cases} \tag{9}$$

Maximizing $\frac{2}{\|w\|}$ is the same as minimizing $\|w\|$, and hence $\frac{1}{2}\|w\|^2$, so (9) is equivalent to

$$\begin{cases}\min\limits_{w,b}\ \dfrac{1}{2}\|w\|^2\\ \text{s.t.}\ \ y_i(w^T x_i+b)\ge +1,\quad i=1,2,\dots,n\end{cases} \tag{10}$$

This is the basic form of SVM.
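
To make the roles of the constraint and the objective in (10) concrete, the sketch below evaluates a few hand-picked candidate hyperplanes on a toy data set. The data and the candidates are assumptions for illustration; the code only compares candidates and is not a solver.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def is_feasible(w, b, X, y):
    """True if every sample satisfies y_i (w^T x_i + b) >= 1, the constraint in (10)."""
    return all(yi * (dot(w, xi) + b) >= 1.0 for xi, yi in zip(X, y))

def primal_objective(w):
    """Objective of (10): 0.5 * ||w||^2."""
    return 0.5 * dot(w, w)

# Toy data (an assumption for illustration): two positive samples, one negative.
X = [[1.0, 1.0], [2.0, 3.0], [-1.0, -1.0]]
y = [1, 1, -1]

candidates = {
    "w=(1,1), b=0": ([1.0, 1.0], 0.0),
    "w=(0.5,0.5), b=0": ([0.5, 0.5], 0.0),
    "w=(0.25,0.25), b=0": ([0.25, 0.25], 0.0),
}
for name, (w, b) in candidates.items():
    print(name, is_feasible(w, b, X, y), primal_objective(w))
# w=(1,1), b=0 True 1.0
# w=(0.5,0.5), b=0 True 0.25
# w=(0.25,0.25), b=0 False 0.0625
```

Among the feasible candidates, the one with the smaller $\|w\|$ has the smaller objective and the wider margin; shrinking $w$ further violates the constraints.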

### 2.1 Lagrangian dual problem

From the basic form of SVM, solving for $w$ and $b$ gives the model corresponding to the best hyperplane:

$$f(x)=\operatorname{sign}(w^T x+b) \tag{11}$$

Problem (10) is itself a convex quadratic programming problem and can be solved with an off-the-shelf optimization package, but doing so does not expose the essence of SVM. Instead, we can solve it through Lagrangian duality.
Adding a Lagrange multiplier $\alpha_i\ge 0$ to each constraint of (10), the Lagrangian of the problem can be written as:

$$L(w,b,\alpha)=\frac{1}{2}\|w\|^2-\sum_{i=1}^{n}\alpha_i\left(y_i(w^T x_i+b)-1\right) \tag{12}$$

where $\alpha=(\alpha_1,\alpha_2,\dots,\alpha_n)$ are the Lagrange multipliers, one per sample.
Setting the partial derivatives of $L(w,b,\alpha)$ with respect to $w$ and $b$ to zero gives:

$$\begin{cases}w=\sum_{i=1}^{n}\alpha_i y_i x_i\\ \sum_{i=1}^{n}\alpha_i y_i=0\end{cases} \tag{13}$$

Substituting (13) into (12) eliminates $w$ and $b$ and gives the dual problem of (10):

$$\begin{cases}\max\limits_{\alpha}\ \sum_{i=1}^{n}\alpha_i-\dfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j x_i^T x_j\\ \text{s.t.}\ \ \alpha_i\ge 0,\quad i=1,2,\dots,n\\ \qquad\ \sum_{i=1}^{n}\alpha_i y_i=0\end{cases} \tag{14}$$
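
The substitution that leads to the objective of (14) can be spelled out step by step. The derivation below uses only the two stationarity conditions in (13).

```latex
\begin{aligned}
L(w,b,\alpha)
&= \tfrac{1}{2} w^T w - \sum_{i=1}^{n} \alpha_i y_i\, w^T x_i
   - b \sum_{i=1}^{n} \alpha_i y_i + \sum_{i=1}^{n} \alpha_i \\
&= \tfrac{1}{2} w^T w - w^T \Big( \sum_{i=1}^{n} \alpha_i y_i x_i \Big)
   - b \cdot 0 + \sum_{i=1}^{n} \alpha_i \\
&= \sum_{i=1}^{n} \alpha_i - \tfrac{1}{2} w^T w \\
&= \sum_{i=1}^{n} \alpha_i
   - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^T x_j
\end{aligned}
```

The second line uses $\sum_i \alpha_i y_i = 0$, and the third uses $w=\sum_i \alpha_i y_i x_i$, so that the middle term equals $w^T w$.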

As (14) shows, the dual depends on the samples only through the inner products $x_i^T x_j$ of pairs of samples, not on individual samples themselves. This is what makes the kernel method below convenient.
After solving for $\alpha$, and then for $w$ and $b$, the SVM decision model is obtained:

$$f(x)=w^T x+b=\sum_{i=1}^{n}\alpha_i y_i x_i^T x+b \tag{15}$$
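
As a small numerical check of the dual, the sketch below evaluates the objective of (14) and recovers $w$ through (13) for a two-sample toy problem. The data and the value $\alpha=(0.25,0.25)$ are assumptions chosen by hand; for this symmetric problem the dual objective reduces to $2a-4a^2$ with $\alpha_1=\alpha_2=a$, which is maximized at $a=0.25$.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def dual_objective(alpha, X, y):
    """Objective of (14): sum(alpha) - 0.5 * sum_ij alpha_i alpha_j y_i y_j x_i^T x_j."""
    n = len(X)
    quad = sum(alpha[i] * alpha[j] * y[i] * y[j] * dot(X[i], X[j])
               for i in range(n) for j in range(n))
    return sum(alpha) - 0.5 * quad

def weights_from_alpha(alpha, X, y):
    """w = sum_i alpha_i y_i x_i, the first condition in (13)."""
    return [sum(a * yi * xi[k] for a, xi, yi in zip(alpha, X, y))
            for k in range(len(X[0]))]

# Toy problem (an assumption for illustration): one sample per class.
X = [[1.0, 1.0], [-1.0, -1.0]]
y = [1, -1]
alpha = [0.25, 0.25]

w = weights_from_alpha(alpha, X, y)
print(w)                             # [0.5, 0.5]
print(dual_objective(alpha, X, y))   # 0.25
print(0.5 * dot(w, w))               # primal objective of (10): 0.25
```

The dual and primal objective values coincide at the optimum, as expected for this convex problem.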

### 2.2 KKT conditions of the SVM problem

Because (10) has inequality constraints, the solution above must satisfy the Karush-Kuhn-Tucker (KKT) conditions:

$$\begin{cases}\alpha_i\ge 0\\ y_i(w^T x_i+b)-1\ge 0,\quad i=1,2,\dots,n\\ \alpha_i\left(y_i(w^T x_i+b)-1\right)=0\end{cases} \tag{16}$$

For any sample $(x_i,y_i)$, either $\alpha_i=0$ or $y_i(w^T x_i+b)-1=0$. If $\alpha_i=0$, equation (15) shows that the sample contributes nothing to the optimal hyperplane. If $\alpha_i>0$, then $y_i(w^T x_i+b)-1=0$ must hold, so the sample lies on the maximum-margin boundary and is a support vector. This leads to an important property of SVM: after training, most of the training samples need not be retained, and the final model depends only on the support vectors.
How, then, is (14) solved? It is a quadratic programming problem and can be solved with a general-purpose quadratic programming algorithm, but the problem size grows with the number of training samples: the dual involves all $n^2$ pairwise inner products, so the cost is at least $O(n^2)$, which is too expensive in practice. To solve the problem efficiently, many algorithms that exploit its structure have been proposed; Sequential Minimal Optimization (SMO) is a commonly used one. SMO relies on the KKT conditions above. After using SMO to find $\alpha$, $w$ and $b$ follow from:

$$\begin{cases}w=\sum_{i=1}^{n}\alpha_i y_i x_i\\ y_i(w^T x_i+b)-1=0\end{cases} \tag{17}$$

where the second equation holds for any support vector $(x_i,y_i)$, that is, any sample with $\alpha_i>0$. Once $w$ and $b$ are known,

$$f(x)=\operatorname{sign}(w^T x+b) \tag{18}$$

is used to predict the class. Note that the margin of $\pm 1$ is not required at test time; the stricter requirement applies only during training.
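
The recovery step in (17) and the prediction rule in (18) can be sketched as follows. The data and the dual values are assumptions chosen by hand so that the KKT conditions (16) hold; averaging $b$ over all support vectors is a common choice made here, not something the text prescribes, and ties at $w^T x+b=0$ are assigned to the positive class by convention.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def recover_w_b(alpha, X, y, tol=1e-8):
    """Recover (w, b) from the dual variables using equation (17)."""
    w = [sum(a * yi * xi[k] for a, xi, yi in zip(alpha, X, y))
         for k in range(len(X[0]))]
    support = [i for i, a in enumerate(alpha) if a > tol]
    # For a support vector, y_s (w^T x_s + b) = 1 and y_s^2 = 1, so b = y_s - w^T x_s.
    b = sum(y[i] - dot(w, X[i]) for i in support) / len(support)
    return w, b, support

def predict(w, b, x):
    """Equation (18); a value of exactly 0 is mapped to +1 here."""
    return 1 if dot(w, x) + b >= 0 else -1

# Toy data and dual solution (assumptions for illustration).
X = [[1.0, 1.0], [-1.0, -1.0], [3.0, 3.0]]
y = [1, -1, 1]
alpha = [0.25, 0.25, 0.0]  # the third sample lies outside the margin, so alpha = 0

w, b, support = recover_w_b(alpha, X, y)
print(w, b, support)                 # [0.5, 0.5] 0.0 [0, 1]
print(predict(w, b, [2.0, 0.0]))     # 1
print(predict(w, b, [-2.0, -1.0]))   # -1
```

Only samples 0 and 1 enter the model; dropping the third sample would leave $w$ and $b$ unchanged.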

## 3. Soft-margin support vector machine

In real tasks it is difficult to find a hyperplane that completely separates the classes; that is, it is difficult to find a suitable kernel function that makes the training samples linearly separable in feature space. Moreover, even if such a kernel function is found, it is hard to tell whether the linearly separable result is caused by overfitting. One solution is to allow SVM to make mistakes on some samples. For this purpose the concept of a "soft margin" is introduced, as shown in Figure 4:

Figure 4

Specifically, the hard-margin support vector machine requires all samples to be classified correctly by the best hyperplane, while the soft-margin support vector machine allows some samples to violate the condition $y_i(w^T x_i+b)\ge 1$. Of course, while maximizing the margin, the number of samples violating this condition should be kept as small as possible. We therefore introduce a penalty coefficient $C>0$ and a slack variable $\xi_i\ge 0$ for each sample $(x_i,y_i)$, and rewrite (10) as

$$\begin{cases}\min\limits_{w,b,\xi}\ \left(\dfrac{1}{2}\|w\|^2+C\sum_{i=1}^{n}\xi_i\right)\\ \text{s.t.}\ \ y_i(w^T x_i+b)\ge 1-\xi_i,\quad i=1,2,\dots,n\\ \qquad\ \xi_i\ge 0\end{cases} \tag{19}$$

In (19) the constraint becomes $y_i(w^T x_i+b)\ge 1-\xi_i$, meaning that the margin plus the slack variable must be at least 1. The objective becomes $\min_{w,b,\xi}\left(\frac{1}{2}\|w\|^2+C\sum_{i=1}^{n}\xi_i\right)$, meaning that each slack variable $\xi_i$ incurs a cost $C\xi_i$. The larger $C$ is, the heavier the penalty for misclassification; the smaller $C$ is, the lighter the penalty.
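
The effect of the slack variables and of $C$ can be seen on a small example. The sketch below assumes a fixed hyperplane and toy samples chosen for illustration; for fixed $w$ and $b$, the smallest slack that satisfies the constraints of (19) is $\xi_i=\max(0,\,1-y_i(w^T x_i+b))$.

```python
def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def slack(w, b, x, y):
    """Smallest xi >= 0 that satisfies y (w^T x + b) >= 1 - xi."""
    return max(0.0, 1.0 - y * (dot(w, x) + b))

def soft_margin_objective(w, b, X, y, C):
    """Objective of (19) with each slack set to its smallest feasible value."""
    return 0.5 * dot(w, w) + C * sum(slack(w, b, xi, yi) for xi, yi in zip(X, y))

# Fixed hyperplane and toy samples (assumptions for illustration).
w, b = [0.5, 0.5], 0.0
X = [[2.0, 2.0], [0.5, 0.5], [-1.0, -1.0], [-2.0, -2.0]]
y = [1, 1, 1, -1]

print([slack(w, b, xi, yi) for xi, yi in zip(X, y)])  # [0.0, 0.5, 2.0, 0.0]
print(soft_margin_objective(w, b, X, y, C=1.0))       # 0.25 + 1 * 2.5 = 2.75
print(soft_margin_objective(w, b, X, y, C=10.0))      # 0.25 + 10 * 2.5 = 25.25
```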
Equation (19) is the primal problem of the soft-margin support vector machine. It can be shown that the solution for $w$ is unique, while the solution for $b$ is not unique and lies in an interval. Suppose the optimal hyperplane obtained by solving the soft-margin maximization problem is $(w^{*})^T x+b^{*}=0$; the corresponding classification decision function is

$$f(x)=\operatorname{sign}\left((w^{*})^T x+b^{*}\right) \tag{20}$$

$f(x)$ is called the soft-margin support vector machine.
Similar to (12), the Lagrangian of (19) is obtained by the method of Lagrange multipliers:

$$L(w,b,\xi,\alpha,\mu)=\frac{1}{2}\|w\|^2+C\sum_{i=1}^{n}\xi_i-\sum_{i=1}^{n}\alpha_i\left(y_i(w^T x_i+b)-1+\xi_i\right)-\sum_{i=1}^{n}\mu_i\xi_i \tag{21}$$

where $\alpha_i\ge 0$ and $\mu_i\ge 0$ are Lagrange multipliers.
Setting the partial derivatives of $L(w,b,\xi,\alpha,\mu)$ with respect to $w$, $b$, and $\xi_i$ to zero gives:

$$\begin{cases}w=\sum_{i=1}^{n}\alpha_i y_i x_i\\ \sum_{i=1}^{n}\alpha_i y_i=0\\ C=\alpha_i+\mu_i\end{cases} \tag{22}$$

Substituting (22) into (21) gives the dual problem of (19):

$$\begin{cases}\max\limits_{\alpha}\ \sum_{i=1}^{n}\alpha_i-\dfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j x_i^T x_j\\ \text{s.t.}\ \ \sum_{i=1}^{n}\alpha_i y_i=0\\ \qquad\ 0\le\alpha_i\le C,\quad i=1,2,\dots,n\end{cases} \tag{23}$$
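
The reason the slack variables disappear from the dual objective, and the origin of the box constraint on $\alpha_i$, can be written out as follows. This derivation uses only (21) and (22).

```latex
\begin{aligned}
L(w,b,\xi,\alpha,\mu)
&= \tfrac{1}{2}\|w\|^2
   - \sum_{i=1}^{n} \alpha_i \big( y_i (w^T x_i + b) - 1 \big)
   + \sum_{i=1}^{n} (C - \alpha_i - \mu_i)\, \xi_i \\
&= \tfrac{1}{2}\|w\|^2
   - \sum_{i=1}^{n} \alpha_i \big( y_i (w^T x_i + b) - 1 \big)
   \qquad \text{since } C = \alpha_i + \mu_i \\
&= \sum_{i=1}^{n} \alpha_i
   - \tfrac{1}{2} \sum_{i=1}^{n} \sum_{j=1}^{n} \alpha_i \alpha_j y_i y_j\, x_i^T x_j ,
\\[4pt]
\mu_i = C - \alpha_i \ge 0 \ \text{ and } \ \alpha_i \ge 0
&\ \Longrightarrow\ 0 \le \alpha_i \le C .
\end{aligned}
```

The last equality of the first chain repeats the hard-margin substitution, because the remaining expression has the same form as (12).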

Comparing the dual problems of the soft-margin and hard-margin support vector machines, the only difference is the constraint on the dual variables: $0\le\alpha_i\le C$ for the soft margin versus $0\le\alpha_i$ for the hard margin. The same method used for the hard-margin support vector machine can therefore be used to solve (23). Likewise, after introducing the kernel method, a support vector expansion of the same form as (15) is obtained.
Similar to (16), the KKT conditions for the soft-margin support vector machine require:

$$\begin{cases}\alpha_i\ge 0,\quad \mu_i\ge 0\\ y_i(w^T x_i+b)-1+\xi_i\ge 0\\ \alpha_i\left(y_i(w^T x_i+b)-1+\xi_i\right)=0\\ \xi_i\ge 0,\quad \mu_i\xi_i=0\end{cases} \tag{24}$$

As with the hard-margin support vector machine, for any training sample $(x_i,y_i)$ either $\alpha_i=0$ or $y_i(w^T x_i+b)=1-\xi_i$. If $\alpha_i=0$, the sample has no influence on the decision surface. If $\alpha_i>0$, then $y_i(w^T x_i+b)=1-\xi_i$ must hold, so the sample is a support vector. From (22): if $\alpha_i<C$, then $\mu_i>0$ and hence $\xi_i=0$, so the sample lies exactly on the maximum-margin boundary; if $\alpha_i=C$, then $\mu_i=0$, and the sample lies inside the margin when $\xi_i\le 1$ and is misclassified when $\xi_i>1$. So the final model of the soft-margin support vector machine also depends only on the support vectors.
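
The case analysis above can be summarized as a small lookup. This is only an illustrative sketch: the function name, the returned labels, and the tolerance `tol` used to compare floating-point values are assumptions made here.

```python
def sample_role(alpha_i, xi_i, C, tol=1e-8):
    """Classify a training sample by its dual variable and slack, following (22) and (24)."""
    if alpha_i <= tol:
        return "not a support vector"
    if alpha_i < C - tol:
        # 0 < alpha_i < C implies mu_i > 0, hence xi_i = 0.
        return "support vector on the margin boundary"
    if xi_i <= 1.0:
        # alpha_i = C and xi_i <= 1.
        return "support vector inside the margin"
    return "support vector, misclassified"

C = 1.0
print(sample_role(0.0, 0.0, C))  # not a support vector
print(sample_role(0.4, 0.0, C))  # support vector on the margin boundary
print(sample_role(1.0, 0.5, C))  # support vector inside the margin
print(sample_role(1.0, 2.0, C))  # support vector, misclassified
```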

## 4. Nonlinear Support Vector Machine

In real tasks, the original sample space may not contain any hyperplane that correctly separates the two classes. For example, in the problem shown in Figure 4, no hyperplane separates the two classes well.
For such problems, the samples can be mapped from the original space into a feature space in which they become linearly separable. For example, applying the feature mapping $z=x^2+y^2$ to the data in Figure 5 gives the sample distribution shown in Figure 6, which can be separated linearly very well.

Figure 5

Figure 6
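
The effect of the mapping $z=x^2+y^2$ can be reproduced on synthetic data. The two rings below are an assumption standing in for the data of Figure 5, which is not available here: one class sits on a circle of radius 0.5 and the other on a circle of radius 2, so no straight line separates them in the plane.

```python
import math

def feature_map(point):
    """Map a 2-D point (x, y) to the single feature z = x^2 + y^2."""
    x, y = point
    return x * x + y * y

# Synthetic rings (assumptions for illustration).
angles = [2.0 * math.pi * k / 8 for k in range(8)]
inner = [(0.5 * math.cos(t), 0.5 * math.sin(t)) for t in angles]  # class -1
outer = [(2.0 * math.cos(t), 2.0 * math.sin(t)) for t in angles]  # class +1

z_inner = [feature_map(p) for p in inner]  # all close to 0.25
z_outer = [feature_map(p) for p in outer]  # all close to 4.0

# In the mapped space a single threshold on z separates the classes.
threshold = 1.0
print(max(z_inner) < threshold < min(z_outer))  # True
```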

Let $\phi(x)$ denote the feature vector obtained by mapping the sample point $x$. As in the linearly separable support vector machine, the model corresponding to the separating hyperplane in feature space can be written as

$$f(x)=w^T\phi(x)+b \tag{25}$$

where $w$ and $b$ are the model parameters to be solved. Similar to (10), we have

$$\begin{cases}\min\limits_{w,b}\ \dfrac{1}{2}\|w\|^2\\ \text{s.t.}\ \ y_i\left(w^T\phi(x_i)+b\right)\ge 1,\quad i=1,2,\dots,n\end{cases} \tag{26}$$

Its Lagrangian dual problem is

$$\begin{cases}\max\limits_{\alpha}\ \sum_{i=1}^{n}\alpha_i-\dfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\,\phi(x_i)^T\phi(x_j)\\ \text{s.t.}\ \ \alpha_i\ge 0,\quad i=1,2,\dots,n\\ \qquad\ \sum_{i=1}^{n}\alpha_i y_i=0\end{cases} \tag{27}$$

Solving (27) requires computing $\phi(x_i)^T\phi(x_j)$, the inner product of the samples after mapping to feature space. Because the feature space may have very high, even infinite, dimension, computing $\phi(x_i)^T\phi(x_j)$ directly is usually very difficult. As noted above, however, we do not care about individual samples, only about inner products of pairs of samples in feature space. So there is no need to map each sample to feature space explicitly; it suffices to compute the inner product that a pair of samples would have there. To this end, suppose there is a kernel function:

$$\kappa(x_i,x_j)=\phi(x_i)^T\phi(x_j) \tag{28}$$

That is, the inner product of $x_i$ and $x_j$ in feature space equals the value of the function $\kappa(\cdot,\cdot)$ computed in the original space, which greatly simplifies the solution. Equation (27) can then be written as:

$$\begin{cases}\max\limits_{\alpha}\ \sum_{i=1}^{n}\alpha_i-\dfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}\alpha_i\alpha_j y_i y_j\,\kappa(x_i,x_j)\\ \text{s.t.}\ \ \alpha_i\ge 0,\quad i=1,2,\dots,n\\ \qquad\ \sum_{i=1}^{n}\alpha_i y_i=0\end{cases} \tag{29}$$
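
The identity in (28) can be checked on a case where the feature map is small enough to write down. The sketch below uses the degree-2 polynomial kernel $\kappa(x,z)=(x^T z)^2$ on 2-D inputs, whose explicit feature map is $\phi(x)=(x_1^2,\ \sqrt{2}\,x_1x_2,\ x_2^2)$; the choice of this kernel and of the sample points is an assumption for illustration.

```python
import math

def poly2_kernel(x, z):
    """Degree-2 polynomial kernel (x^T z)^2, computed in the original 2-D space."""
    return (x[0] * z[0] + x[1] * z[1]) ** 2

def phi(x):
    """Explicit feature map for the degree-2 polynomial kernel on 2-D inputs."""
    return [x[0] ** 2, math.sqrt(2.0) * x[0] * x[1], x[1] ** 2]

x, z = [1.0, 2.0], [3.0, 4.0]
explicit = sum(a * b for a, b in zip(phi(x), phi(z)))  # inner product in feature space

print(poly2_kernel(x, z))                          # (3 + 8)^2 = 121.0
print(math.isclose(explicit, poly2_kernel(x, z)))  # True
```

The kernel needs one 2-D inner product, while the explicit route needs the 3-D mapping first; for higher degrees and dimensions the gap grows quickly.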

Again, we only care about the inner products between samples in the high-dimensional space, not about how the samples are mapped there. Solving (29) gives

$$f(x)=w^T\phi(x)+b=\sum_{i=1}^{n}\alpha_i y_i\,\phi(x_i)^T\phi(x)+b=\sum_{i=1}^{n}\alpha_i y_i\,\kappa(x_i,x)+b \tag{30}$$

What remains is to solve for $\alpha_i$, and then for $w$ and $b$, to obtain the best hyperplane.
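
Equation (30) shows that prediction needs only kernel evaluations against the support vectors. The sketch below evaluates it with a Gaussian kernel; the support vectors, the values of $\alpha$ and $b$, and the bandwidth are hypothetical numbers chosen for illustration, not the result of solving (29).

```python
import math

def rbf_kernel(x, z, sigma=1.0):
    """Gaussian kernel exp(-||x - z||^2 / (2 sigma^2))."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))

def decision_value(x, support_X, support_y, alpha, b, kernel):
    """f(x) = sum_i alpha_i y_i kappa(x_i, x) + b, equation (30)."""
    return sum(a * yi * kernel(xi, x)
               for a, xi, yi in zip(alpha, support_X, support_y)) + b

def predict(x, support_X, support_y, alpha, b, kernel):
    """Sign of the decision value; a value of exactly 0 is mapped to +1 here."""
    return 1 if decision_value(x, support_X, support_y, alpha, b, kernel) >= 0 else -1

# Hypothetical support vectors and dual values (assumptions, not a trained model).
support_X = [[0.0, 0.0], [1.0, 1.0]]
support_y = [-1, 1]
alpha = [1.0, 1.0]  # satisfies sum_i alpha_i y_i = 0
b = 0.0

print(predict([0.1, 0.0], support_X, support_y, alpha, b, rbf_kernel))  # -1
print(predict([0.9, 1.2], support_X, support_y, alpha, b, rbf_kernel))  # 1
```

Note that $w$ is never formed explicitly; with a Gaussian kernel the feature space is infinite-dimensional, so the kernel expansion is the only practical form of the model.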

## Support vector regression

Support vector machines can be used not only for classification but also for regression, in which case the method is called Support Vector Regression (SVR).
For a sample $(x_i,y_i)$, the loss is usually computed from the difference between the model output $f(x_i)$ and the true value (ground truth) $y_i$, and is zero if and only if $f(x_i)=y_i$. The basic idea of SVR is to tolerate a deviation of at most $\epsilon$ between the prediction $f(x_i)$ and $y_i$: when $|f(x_i)-y_i|\le\epsilon$ the prediction is considered correct and incurs no loss, and loss is counted only when $|f(x_i)-y_i|>\epsilon$. The SVR problem can be written as:

$$\min_{w,b}\ \left(\frac{1}{2}\|w\|^2+C\sum_{i=1}^{n}L_\epsilon\left(f(x_i)-y_i\right)\right) \tag{31}$$

where $C\ge 0$ is the penalty coefficient and $L_\epsilon$ is the $\epsilon$-insensitive loss function, defined as:

$$L_\epsilon(z)=\begin{cases}0, & |z|\le\epsilon\\ |z|-\epsilon, & \text{otherwise}\end{cases} \tag{32}$$
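
The loss in (32) and the objective in (31) are short enough to write directly. The sketch below assumes a linear model $f(x)=w^T x+b$ and a tiny 1-D data set invented for illustration.

```python
def eps_insensitive_loss(z, eps):
    """Equation (32): zero inside the tube |z| <= eps, linear outside it."""
    return 0.0 if abs(z) <= eps else abs(z) - eps

def svr_objective(w, b, X, y, C, eps):
    """Objective of (31) for the linear model f(x) = w^T x + b."""
    def f(x):
        return sum(wi * xi for wi, xi in zip(w, x)) + b
    penalty = sum(eps_insensitive_loss(f(xi) - yi, eps) for xi, yi in zip(X, y))
    return 0.5 * sum(wi * wi for wi in w) + C * penalty

print(eps_insensitive_loss(0.3, 0.5))   # inside the tube: 0.0
print(eps_insensitive_loss(-2.0, 0.5))  # outside the tube: 2.0 - 0.5 = 1.5

# Toy 1-D data (an assumption for illustration) with f(x) = x.
X = [[1.0], [2.0], [3.0]]
y = [1.2, 4.0, 2.9]
print(svr_objective([1.0], 0.0, X, y, C=2.0, eps=0.5))  # 0.5 + 2 * 1.5 = 3.5
```

Only the second sample, whose residual is 2.0, contributes to the penalty; the other two fall inside the tube of half-width 0.5.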

Introducing slack variables $\xi_i$ and $\hat\xi_i$, the optimization problem becomes:

$$\begin{cases}\min\limits_{w,b,\xi,\hat\xi}\ \left(\dfrac{1}{2}\|w\|^2+C\sum_{i=1}^{n}(\xi_i+\hat\xi_i)\right)\\ \text{s.t.}\ \ f(x_i)-y_i\le\epsilon+\xi_i,\quad i=1,2,\dots,n\\ \qquad\ y_i-f(x_i)\le\epsilon+\hat\xi_i\\ \qquad\ \xi_i\ge 0,\quad \hat\xi_i\ge 0\end{cases} \tag{33}$$
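
The two slack variables in (33) measure overshoot and undershoot separately. As a sketch with made-up prediction and target values, the smallest feasible slacks for a given prediction are computed below; at most one of them is nonzero, and their sum is the amount by which the residual exceeds $\epsilon$.

```python
def svr_slacks(f_x, y, eps):
    """Smallest (xi, xi_hat) satisfying the constraints of (33) for one sample."""
    xi = max(0.0, f_x - y - eps)      # prediction above the tube
    xi_hat = max(0.0, y - f_x - eps)  # prediction below the tube
    return xi, xi_hat

# Illustrative values (assumptions): eps = 0.5.
print(svr_slacks(3.0, 1.0, 0.5))  # (1.5, 0.0)  prediction too high
print(svr_slacks(1.0, 3.0, 0.5))  # (0.0, 1.5)  prediction too low
print(svr_slacks(1.2, 1.0, 0.5))  # (0.0, 0.0)  inside the tube
```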

This is the primal problem of SVR. Introducing Lagrange multipliers $\mu_i\ge 0$, $\hat\mu_i\ge 0$, $\alpha_i\ge 0$, $\hat\alpha_i\ge 0$, the corresponding Lagrangian is:

$$\begin{aligned}L(w,b,\alpha,\hat\alpha,\xi,\hat\xi,\mu,\hat\mu)={}&\frac{1}{2}\|w\|^2+C\sum_{i=1}^{n}(\xi_i+\hat\xi_i)-\sum_{i=1}^{n}\mu_i\xi_i-\sum_{i=1}^{n}\hat\mu_i\hat\xi_i\\&+\sum_{i=1}^{n}\alpha_i\left(f(x_i)-y_i-\epsilon-\xi_i\right)+\sum_{i=1}^{n}\hat\alpha_i\left(y_i-f(x_i)-\epsilon-\hat\xi_i\right)\end{aligned} \tag{34}$$

Setting the partial derivatives of $L(w,b,\alpha,\hat\alpha,\xi,\hat\xi,\mu,\hat\mu)$ with respect to $w$, $b$, $\xi_i$, and $\hat\xi_i$ to zero gives:

$$\begin{cases}w=\sum_{i=1}^{n}(\hat\alpha_i-\alpha_i)x_i\\ \sum_{i=1}^{n}(\hat\alpha_i-\alpha_i)=0\\ C=\alpha_i+\mu_i\\ C=\hat\alpha_i+\hat\mu_i\end{cases} \tag{35}$$

Substituting (35) into (34) gives the dual problem of SVR:

$$\begin{cases}\max\limits_{\alpha,\hat\alpha}\ \sum_{i=1}^{n}\left(y_i(\hat\alpha_i-\alpha_i)-\epsilon(\hat\alpha_i+\alpha_i)\right)-\dfrac{1}{2}\sum_{i=1}^{n}\sum_{j=1}^{n}(\hat\alpha_i-\alpha_i)(\hat\alpha_j-\alpha_j)x_i^T x_j\\ \text{s.t.}\ \ \sum_{i=1}^{n}(\hat\alpha_i-\alpha_i)=0\\ \qquad\ 0\le\alpha_i,\hat\alpha_i\le C\end{cases} \tag{36}$$

The KKT conditions are:

$$\begin{cases}\alpha_i\left(f(x_i)-y_i-\epsilon-\xi_i\right)=0\\ \hat\alpha_i\left(y_i-f(x_i)-\epsilon-\hat\xi_i\right)=0\\ \alpha_i\hat\alpha_i=0,\quad \xi_i\hat\xi_i=0\\ (C-\alpha_i)\xi_i=0,\quad (C-\hat\alpha_i)\hat\xi_i=0\end{cases} \tag{37}$$

The solution of SVR is:

$$f(x)=\sum_{i=1}^{n}(\hat\alpha_i-\alpha_i)x_i^T x+b \tag{38}$$

With a kernel function, SVR can be expressed as:

$$f(x)=\sum_{i=1}^{n}(\hat\alpha_i-\alpha_i)\kappa(x_i,x)+b \tag{39}$$

where $\kappa(x_i,x)=\phi(x_i)^T\phi(x)$ is the kernel function.

## Common kernel functions

| Name | Expression | Parameters |
| --- | --- | --- |
| Linear kernel | $\kappa(x_i,x_j)=x_i^T x_j$ | |
| Polynomial kernel | $\kappa(x_i,x_j)=(x_i^T x_j)^d$ | $d\ge 1$ is the degree of the polynomial |
| Gaussian kernel (RBF) | $\kappa(x_i,x_j)=\exp\left(-\frac{\lVert x_i-x_j\rVert^2}{2\sigma^2}\right)$ | $\sigma>0$ is the bandwidth of the Gaussian kernel |
| Laplacian kernel | $\kappa(x_i,x_j)=\exp\left(-\frac{\lVert x_i-x_j\rVert}{\sigma}\right)$ | $\sigma>0$ |
| Sigmoid kernel | $\kappa(x_i,x_j)=\tanh(\beta x_i^T x_j+\theta)$ | $\tanh$ is the hyperbolic tangent function |
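
The kernels in the table translate directly into code. The following is a minimal sketch using only the Python standard library; the function names, the parameter names `d`, `sigma`, `beta`, `theta`, and their default values are assumptions chosen here for illustration.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sq_dist(u, v):
    return sum((ui - vi) ** 2 for ui, vi in zip(u, v))

def linear_kernel(x, z):
    return dot(x, z)

def polynomial_kernel(x, z, d=2):
    """(x^T z)^d with degree d >= 1."""
    return dot(x, z) ** d

def gaussian_kernel(x, z, sigma=1.0):
    """exp(-||x - z||^2 / (2 sigma^2)) with bandwidth sigma > 0."""
    return math.exp(-sq_dist(x, z) / (2.0 * sigma ** 2))

def laplacian_kernel(x, z, sigma=1.0):
    """exp(-||x - z|| / sigma) with sigma > 0."""
    return math.exp(-math.sqrt(sq_dist(x, z)) / sigma)

def sigmoid_kernel(x, z, beta=1.0, theta=-1.0):
    """tanh(beta * x^T z + theta)."""
    return math.tanh(beta * dot(x, z) + theta)

x, z = [1.0, 2.0], [3.0, 4.0]
print(linear_kernel(x, z))           # 3 + 8 = 11.0
print(polynomial_kernel(x, z, d=2))  # 11^2 = 121.0
print(gaussian_kernel(x, x))         # identical points: 1.0
print(laplacian_kernel(x, x))        # identical points: 1.0
```

Any of these functions can be passed wherever a kernel $\kappa(\cdot,\cdot)$ is needed, for example in the expansions (30) and (39).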

## 5. Advantages and disadvantages of SVM

SVM handles nonlinear relationships between data and features well for small and medium sample sizes. It avoids the structure-selection and local-minimum problems of neural networks, is highly interpretable, and can handle high-dimensional problems.
SVM is sensitive to missing data, has no universal solution for nonlinear problems, and makes choosing the right kernel function difficult. Its computational complexity is also high: mainstream algorithms can reach $O(n^2)$, which is prohibitive for large-scale data.


```matlab
tic % start timer
%% Clear environment variables
close all
clear
clc
%format compact
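% mappedX must hold the 300 feature rows (5 classes x 60 samples) used below.
% The original load commands are commented out in the source:
% load CMPE original
% mappedX=X;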
%% Data Extraction
zc=mappedX(1:60,:);% feature input
lie=mappedX(61:120,:);
mo=mappedX(121:180,:);
que=mappedX(181:240,:);
duan=mappedX(241:300,:);
mm=size(zc,1);
nn=20;

a=ones(mm,1);% class labels 1..5, one label vector per class
b=2*ones(mm,1);
c=3*ones(mm,1);
d=4*ones(mm,1);
f=5*ones(mm,1);

n1=randperm(size(zc,1));
n2=randperm(size(lie,1));
n3=randperm(size(mo,1));
n4=randperm(size(que,1));
n5=randperm(size(duan,1));

train_wine = [zc(n1(1:nn),:);lie(n2(1:nn),:);mo(n3(1:nn),:);que(n4(1:nn),:) ;duan(n5(1:nn),:)];
% The labels of the corresponding training set should also be separated
train_wine_labels = [a(1:nn,:);b(1:nn,:);c(1:nn,:);d(1:nn,:);f(1:nn,:)];
% Use the remaining mm-nn samples of each class as the test set
test_wine = [zc(n1((nn+1):mm),:);lie(n2((nn+1):mm),:);mo(n3((nn+1):mm),:) ;que(n4((nn+1):mm),:);duan(n5((nn+1):mm),:)];
% The labels of the corresponding test set should also be separated
test_wine_labels = [a((nn+1):mm,:);b((nn+1):mm,:);c((nn+1):mm,:);d((nn+1): mm,:);f((nn+1):mm,:)];
%% data preprocessing
% Data preprocessing, normalize the training set and test set to the interval [0,1]
[mtrain,ntrain] = size(train_wine);
[mtest,ntest] = size(test_wine);

dataset = [train_wine;test_wine];
% mapminmax is the normalization function that comes with MATLAB
[dataset_scale,ps] = mapminmax(dataset',0,1);
dataset_scale = dataset_scale';

train_wine = dataset_scale(1:mtrain,:);
test_wine = dataset_scale( (mtrain+1):(mtrain+mtest),: );

%% Default parameters

n=10;% Population size, typically 10 to 40
N_gen=150;% Number of generations
A=0.5;% Loudness (constant or decreasing)
r=0.5;% Pulse rate (constant or decreasing)
% This frequency range determines the scalings
% You should change these values  if necessary
Qmin=0;% Frequency minimum
Qmax=2;% Frequency maximum
% Iteration parameters
N_iter=0;% Total number of function evaluations (counter, starts at 0)
% Dimension of the search variables
d=2;% Number of dimensions
% Lower limit/bounds/a vector
Lb=[0.01,0.01];% lower bound of parameter value
Ub=[100,100];% upper bound of parameter value
% Initializing arrays
Q=zeros(n,1);% Frequency
v=zeros(n,d);% Velocities
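% Initialize the population/solutions
% NOTE: the optimization loop is missing from the source. It must define
% best (1x2 vector [c, g]), fmin, and AAA (convergence curve of length N_gen)
% before the lines below can run.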

% Output/display
disp(['Number of evaluations:',num2str(N_iter)]);
disp(['Best =',num2str(best),' fmin=',num2str(fmin)]);

%% Use the best parameters for SVM network training
% libsvm expects a space between each option flag and its value
cmd_gwosvm = ['-c ',num2str(best(:,1)),' -g ',num2str(best(:,2))];
model_gwosvm = svmtrain(train_wine_labels,train_wine,cmd_gwosvm);
%% SVM network prediction
[predict_label] = svmpredict(test_wine_labels,test_wine,model_gwosvm);
total = length(test_wine_labels);% number of test samples
right = sum(predict_label == test_wine_labels);% number of correct predictions
Accuracy=right/total;
% Print test set classification accuracy
fprintf('Accuracy = %g%% (%d/%d)\n',100*Accuracy,right,total);
%% Result analysis
% Actual classification and predicted classification map of the test set
figure;
hold on;
plot(test_wine_labels,'o');
plot(predict_label,'r*');
xlabel('Test set sample','FontSize',12);
ylabel('Category Label','FontSize',12);
legend('Actual test set classification','Predicted test set classification');
title('The actual classification and prediction classification map of the test set','FontSize',12);
grid on
snapnow

figure
plot(1:N_gen,AAA);% convergence curve produced by the optimization loop
```