CN108154178A - Semi-supervised shilling attack detection method based on an improved SVM-KNN algorithm - Google Patents

Semi-supervised shilling attack detection method based on an improved SVM-KNN algorithm

Info

Publication number
CN108154178A
CN108154178A (application CN201711416340.2A)
Authority
CN
China
Prior art keywords
user
sample
classification
svm
knn
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201711416340.2A
Other languages
Chinese (zh)
Inventor
沈琦
牛立坤
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201711416340.2A priority Critical patent/CN108154178A/en
Publication of CN108154178A publication Critical patent/CN108154178A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00Network architectures or network communication protocols for network security
    • H04L63/14Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416Event detection, e.g. attack signature detection

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a semi-supervised shilling attack detection method based on an improved SVM-KNN algorithm, comprising: collecting labeled users as a training set and training an initial SVM classifier; performing preliminary classification on the unlabeled user set with the initial SVM classifier; merging the user data of normal users into the training set, and performing secondary classification on the remaining user data with the KNN algorithm, using an improved KNN similarity formula as the distance measure of the KNN algorithm; updating the training set and retraining a new SVM classifier; judging whether the classification result reaches the optimal detection performance, and if so, outputting the final classifier, otherwise looping back to classify the users in the unlabeled user set; and performing shilling attack detection on user data with the final classifier. The technical scheme of the invention improves the generalization ability and detection accuracy of shilling attack detection; with only a small amount of labeled information and in a continually changing environment, it outperforms previous attack detection algorithms.

Description

Semi-supervised shilling attack detection method based on an improved SVM-KNN algorithm
Technical field
The present invention relates to the technical field of network security, and more particularly to a semi-supervised shilling attack detection method based on an improved SVM-KNN algorithm.
Background technology
In a real network environment, the identity of a large number of users cannot be determined, and the shilling attacks a system faces grow ever more complex. On a shopping website such as Taobao, for example, some users can be determined to be genuine by conditions such as activity level, positive-feedback rate or crown-level membership; most users, however, merely complete a few shopping steps and never even leave a rating, so it cannot be determined whether they are genuine. Meanwhile, as attackers learn more about a website, they can construct increasingly complex attack models. Existing attack detection algorithms perform unsatisfactorily when facing such novel, increasingly complex shilling attacks with only a small number of users of confirmed identity.
Invention content
In view of at least one of the above problems, the present invention provides a semi-supervised shilling attack detection method based on an improved SVM-KNN algorithm. A labeled user data set and an unlabeled user data set are first established; an initial SVM classifier is then trained on the small amount of labeled user data; the distance between each unlabeled user sample and the boundary of the initial SVM classifier is calculated, and if the distance exceeds a set threshold the sample is classified by the SVM, otherwise by KNN; the newly labeled data are added to the training set and the SVM classifier is retrained; and the above process is iterated until an SVM classifier of high classification accuracy is finally obtained. The method exploits both the accuracy of the labeled data, as a semi-supervised detector, and the distribution regularity of the unlabeled data, and combines the SVM and KNN algorithms, thereby improving generalization ability and detection accuracy; with only a small amount of labeled information and in a continually changing environment, it outperforms previous attack detection algorithms.
To achieve the above object, the present invention provides a semi-supervised shilling attack detection method based on an improved SVM-KNN algorithm, comprising: dividing the user set into a labeled user set and an unlabeled user set, and training an initial SVM classifier with the labeled user set as the training set; performing preliminary classification on each sample user in the unlabeled user set with the initial SVM classifier; merging the user data that is labeled in the preliminary classification into the training set, and merging the remaining user data into a near-boundary vector set; performing secondary classification on the users in the near-boundary vector set, using the improved KNN similarity formula (1) as the distance measure of the KNN algorithm,

Sim(xi, dj) = a·Sim1(xi, dj) + b·Sim2(xi, C̄j) + c·Sim3(xi, dj) (1)

where Sim1 is the weighted cosine similarity between the sample to be classified and a training sample, Sim2 is the weighted cosine similarity between the sample and the class-center vector C̄j, Sim3 is the Hamming similarity, i.e. the proportion of feature items the two samples share, and a + b + c = 1;

merging the labeled user data obtained by the KNN classification into the training set, and retraining a new SVM classifier with the updated training set; judging whether the classification result reaches the optimal detection performance, and if so, outputting the final classifier, otherwise looping back to classify the users in the unlabeled user set; and performing shilling attack detection on user data with the final classifier.
In the above technical scheme, preferably, performing preliminary classification on each sample user in the unlabeled user set with the initial SVM classifier specifically comprises: selecting a sample user from the unlabeled user set, and calculating the value of the classification decision function f(x) with the SVM calculation formula (2),

f(x) = Σi=1..l αi·yi·K(xi, x) + b (2)

judging whether the absolute value |f(x)| of the classification decision function exceeds a given classification threshold ε (0 < ε < 1); and if so, labeling the sample as a normal user.
In the above technical scheme, preferably, using the improved KNN similarity formula (1) as the distance measure of the KNN algorithm to perform secondary classification on the users in the near-boundary vector set specifically comprises: vectorizing the user data in the training set consistently with the sample user data to be classified in the near-boundary vector set; calculating with the distance measure the distance between the sample to be classified and each sample in the training set, and selecting the k nearest samples as the nearest neighbors of the sample to be classified; calculating in turn the weight with which each sample among the nearest neighbors belongs to each class; and comparing the weights with which the sample belongs to the different classes, and assigning the sample to the class with the largest weight.
In the above technical scheme, preferably, the weight vector with which a sample belongs to each class cj is qj = (qj1, qj2, ..., qjp), where qj1 + qj2 + ... + qjp = 1; the weight qjk is the weight corresponding to the feature item tk, and the size of the weight represents the importance of tk in the different classes.
In the above technical scheme, preferably, judging whether the classification result reaches the optimal detection performance, outputting the final classifier if so, and otherwise looping back to classify the users in the unlabeled user set specifically comprises:

calculating the precision and recall of the classification process from the classification result, wherein the classification result comprises four kinds of data, true positives, true negatives, false positives and false negatives; the true positives and true negatives are the data correctly judged as attack users and genuine users respectively, and the false positives and false negatives are the data wrongly judged as attack users and genuine users respectively; the precision is calculated as precision = true positives / (true positives + false positives), and the recall is calculated as recall = true positives / (true positives + false negatives); judging whether the precision and the recall reach preset optimal thresholds; and if they do, outputting the SVM-KNN classifier that produced the classification result as the final classifier, otherwise looping back to classify the users in the unlabeled user set.
Compared with the prior art, the beneficial effects of the present invention are as follows. With the semi-supervised shilling attack detection method based on an improved SVM-KNN algorithm provided by the invention, a labeled user data set and an unlabeled user data set are first established; an initial SVM classifier is then trained on the small amount of labeled user data; the distance between each unlabeled user sample and the boundary of the initial SVM classifier is calculated, and if the distance exceeds the set threshold the sample is classified by SVM, otherwise by KNN; the newly labeled data are added to the training set and the SVM classifier is retrained; and this process is iterated until an SVM classifier of high classification accuracy is finally obtained. The method exploits both the accuracy of the labeled data, as a semi-supervised detector, and the distribution regularity of the unlabeled data, and combines the SVM and KNN algorithms, thereby improving generalization ability and detection accuracy; with only a small amount of labeled information and in a continually changing environment, it outperforms previous attack detection algorithms.
Description of the drawings
Fig. 1 is a schematic flowchart of the semi-supervised shilling attack detection method based on an improved SVM-KNN algorithm disclosed in an embodiment of the present invention;
Fig. 2 is a schematic diagram of the classification principle of the SVM-KNN classifier disclosed in an embodiment of the present invention;
Fig. 3 to Fig. 8 are data graphs of the attack detection experiments disclosed in an embodiment of the present invention.
Specific embodiment
To make the purpose, technical scheme and advantages of the embodiments of the present invention clearer, the technical scheme in the embodiments of the present invention is described clearly and completely below in conjunction with the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention rather than all of them. All other embodiments obtained by persons of ordinary skill in the art on the basis of the embodiments of the present invention without creative work shall fall within the protection scope of the present invention.
The present invention is described in further detail below in conjunction with the accompanying drawings:
As shown in Fig. 1 and Fig. 2, the semi-supervised shilling attack detection method based on an improved SVM-KNN algorithm provided by the present invention comprises: step S11, dividing the user set into a labeled user set and an unlabeled user set, and training an initial SVM classifier with the labeled user set as the training set; step S12, performing preliminary classification on each sample user in the unlabeled user set with the initial SVM classifier; step S13, merging the user data that is labeled in the preliminary classification into the training set, and merging the remaining user data into a near-boundary vector set; step S14, performing secondary classification on the users in the near-boundary vector set, using the improved KNN similarity formula (1) as the distance measure of the KNN algorithm,

Sim(xi, dj) = a·Sim1(xi, dj) + b·Sim2(xi, C̄j) + c·Sim3(xi, dj) (1)

where Sim1(xi, dj) is the cosine similarity after the weights are added in, Sim2(xi, C̄j) is the weighted cosine similarity between the sample and the class center, Sim3(xi, dj) is the ratio of the number of common feature items to the total number of feature items, and the parameters satisfy a + b + c = 1; xi and dj are the feature vectors of samples, qjk is the weight of the feature item tk, wik and wjk (1 ≤ k ≤ p) are the weights of the k-th feature item in samples di and dj respectively, xik and wjk (1 ≤ k ≤ p) are the coordinates in the k-th dimension, p is the number of feature items, and w̄jk is the weight (component) of the class-center vector;

step S15, merging the labeled user data obtained by the KNN classification into the training set, and retraining a new SVM classifier with the updated training set; step S16, judging whether the classification result reaches the optimal detection performance, and if so, outputting the final classifier, otherwise looping back to classify the users in the unlabeled user set; and step S17, performing shilling attack detection on user data with the final classifier.
In the above embodiment, preferably, performing preliminary classification on each sample user in the unlabeled user set with the initial SVM classifier specifically comprises: selecting a sample user from the unlabeled user set, and calculating the value of the classification decision function f(x) with the SVM calculation formula (2); judging whether the absolute value |f(x)| of the classification decision function exceeds the given classification threshold ε (0 < ε < 1); and if so, labeling the sample as a normal user.
In the above embodiment, preferably, using the improved KNN similarity formula (1) as the distance measure of the KNN algorithm to perform secondary classification on the users in the near-boundary vector set specifically comprises: vectorizing the user data in the training set consistently with the sample user data to be classified in the near-boundary vector set; calculating with the distance measure the distance between the sample to be classified and each sample in the training set, and selecting the k nearest samples as the nearest neighbors of the sample to be classified; calculating in turn the weight with which each sample among the nearest neighbors belongs to each class; and comparing the weights with which the sample belongs to the different classes, and assigning the sample to the class with the largest weight.
In the above embodiment, preferably, the weight vector with which a sample belongs to each class cj is qj = (qj1, qj2, ..., qjp), where qj1 + qj2 + ... + qjp = 1; the weight qjk is the weight corresponding to the feature item tk, and the size of the weight represents the importance of tk in the different classes.
In the above embodiment, preferably, judging whether the classification result reaches the optimal detection performance, outputting the final classifier if so, and otherwise looping back to classify the users in the unlabeled user set specifically comprises: calculating the precision and recall of the classification process from the classification result, wherein the classification result comprises four kinds of data, true positives, true negatives, false positives and false negatives; the true positives and true negatives are the data correctly judged as attack users and genuine users respectively, and the false positives and false negatives are the data wrongly judged as attack users and genuine users respectively; the precision is calculated as precision = true positives / (true positives + false positives), and the recall as recall = true positives / (true positives + false negatives); judging whether the precision and the recall reach preset optimal thresholds; and if they do, outputting the SVM-KNN classifier that produced the classification result as the final classifier, otherwise looping back to classify the users in the unlabeled user set.
In this embodiment, a semi-supervised machine-learning attack detection method is adopted for the shilling attack problem. Common machine-learning-based attack detection methods comprise supervised learning, unsupervised learning and semi-supervised learning.
Specifically, supervised learning refers to the process of adjusting the parameters of a classifier with a set of samples of known class until the required performance is reached. Shilling attacks are constructed from different attack models, so attack profiles necessarily differ from genuine users in certain features, and shilling attacks can be detected using these features. Existing research has proposed many feature indicators, which can be divided into generic indicators, model-based indicators and intra-profile indicators. Generic indicators capture the difference between shilling attackers and normal users from the difference in the rating distributions of target items and filler items, such as the Rating Deviation from Mean Agreement (RDMA) and the nearest-neighbor similarity (DegSim). Model-based indicators are constructed according to the rating patterns peculiar to different attack models, i.e. different attack profiles, to distinguish shilling attackers from normal users, such as the variation of a user's ratings about their mean (MeanVar) and the average deviation of the highest-rated items from the remaining item set (FMTD). Intra-profile indicators distinguish attackers by statistical differences among the ratings within a profile, such as the degree of focus on target items (TMF).
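As an illustration of how such generic indicators are computed from a rating matrix, the following sketch computes RDMA and DegSim with numpy. It is illustrative only: the helper names are ours, the matrix layout (users × items, NaN for missing ratings) is an assumption, and the exact variants of both indicators differ across the literature.

```python
import numpy as np

def rdma(R):
    """Rating Deviation from Mean Agreement, one score per user.

    R: (n_users, n_items) rating matrix with np.nan for missing ratings.
    For each user, average the absolute deviation of each of their ratings
    from the item mean, weighted by 1 / (number of ratings the item has).
    """
    item_mean = np.nanmean(R, axis=0)           # mean rating of each item
    item_count = np.sum(~np.isnan(R), axis=0)   # number of ratings per item
    scores = np.zeros(R.shape[0])
    for u in range(R.shape[0]):
        rated = ~np.isnan(R[u])
        dev = np.abs(R[u, rated] - item_mean[rated]) / item_count[rated]
        scores[u] = dev.sum() / rated.sum()
    return scores

def degsim(R, k=10):
    """DegSim: average similarity of each user to its k most similar users."""
    Rz = np.where(np.isnan(R), 0.0, R)
    Rc = Rz - Rz.mean(axis=1, keepdims=True)    # center each user's ratings
    norms = np.linalg.norm(Rc, axis=1) + 1e-12
    sim = (Rc @ Rc.T) / np.outer(norms, norms)  # Pearson-like user-user similarity
    np.fill_diagonal(sim, -np.inf)              # exclude self-similarity
    return np.sort(sim, axis=1)[:, -k:].mean(axis=1)
```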
Unsupervised learning refers to solving various problems in pattern recognition from training samples of unknown (unlabeled) class. Unsupervised detection algorithms generally rely on clustering to separate different users and, according to the model used, can be divided into dimensionality-reduction-based algorithms and model-based algorithms. Dimensionality-reduction algorithms follow the idea of finding the most important variables that express the rating matrix, then detect shilling attacks from the difference between normal users and attackers. Existing research includes an unsupervised user-selection algorithm based on principal component analysis (PCA Select Users): using the principle of PCA dimensionality reduction, it extracts mutually independent features, and users with low correlation among the features are likely to be shilling attackers. There is also the unsupervised attack-profile mining algorithm UnRAP, which uses the degree to which a user matches the rating matrix to decide whether the user is a shilling attacker. Model-based algorithms use the latent representation of different models to reconstruct the rating matrix and detect shilling attacks by the differences under the model. Existing research includes a shilling attack detection algorithm based on probabilistic latent semantic analysis (PLSA): the average distance of each user is calculated and measured with a "density" indicator, and since the "density" of shilling attackers is relatively large, the attacks can be detected. In addition, an SVD detection algorithm based on singular value decomposition has been proposed: since the ratings of shilling attackers follow certain rules, the low-dimensional model of an attacker after singular value decomposition differs considerably from that of a normal user.
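A minimal sketch of the PCA-based user-selection idea just described, under our own reading of it (the function name, the use of SVD and the flagging rule are illustrative assumptions, not the cited algorithm's exact specification): users whose standardized profiles load weakly on the first principal components are the ones most mutually correlated, and are flagged as likely shilling attackers.

```python
import numpy as np

def pca_select_users(R, n_components=3, n_flag=50):
    """Flag the n_flag users contributing least to the first principal
    components of the item-standardized rating matrix."""
    Rz = np.where(np.isnan(R), 0.0, R)                     # missing ratings -> 0
    X = (Rz - Rz.mean(axis=0)) / (Rz.std(axis=0) + 1e-12)  # standardize each item
    U, S, Vt = np.linalg.svd(X, full_matrices=False)       # PCA via SVD, rows = users
    loading = np.sum(U[:, :n_components] ** 2, axis=1)     # per-user loading
    return np.argsort(loading)[:n_flag]                    # smallest loadings first
```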
Semi-supervised learning refers to pattern recognition that uses a large amount of unlabeled data together with labeled data. A semi-supervised detection algorithm knows the profiles of only a small fraction of shilling attack samples and constructs a detector from the profile features of this small fraction; for example, the Semi-SAD algorithm first trains a preliminary classifier on labeled data with a naive Bayes classifier and then improves the classifier on unlabeled data, raising its detection performance. Semi-supervised learning, i.e. synthetically using labeled sample data and unlabeled sample data to produce a suitable classification model, proceeds as follows: a basic classifier is trained with a small number of labeled samples; the basic classifier is then used to label the unlabeled sample data; the samples labeled by the basic classifier are taken as a new training sample set to retrain the classifier; and a classifier trained from the small set of samples of known label and the large set of samples of unknown label is finally obtained.
The present invention combines the SVM algorithm with an improved KNN algorithm and performs attack detection by way of semi-supervised learning.
Specifically, the nearest-neighbor method (NN) is one of the most important nonparametric methods in pattern recognition; the original nearest-neighbor method was proposed by Cover and Hart in 1967. The so-called k-nearest-neighbor method examines the K samples most similar to the sample to be classified and judges the class attribute of that sample from the classes of these K samples. The basic principle of the NN classifier is: for a sample vector x to be classified, with all training samples as representative points, find the K most similar samples among the representative points, take these K samples as candidate classes, use the similarity between x and the K samples as the weighting measure and, with a similarity threshold set, determine the class of x.
The degree of correlation between two similar samples is called similarity. When samples are represented as vectors, the distance between samples can be used to measure the degree of similarity between them. There are many methods of computing the distance between two samples, such as the Euclidean distance, the cosine distance, the city-block distance, the correlation distance and the Hamming distance.

In the vector space model, after feature extraction each sample can be converted into a relatively low-dimensional space vector composed of a group of feature items (t1, t2, t3, ..., tp); each feature ti has a corresponding weight wi (representing the importance of the feature ti in the sample). The features t1, t2, t3, ..., tp chosen after feature selection can be regarded as the axes of a p-dimensional coordinate system, and w1, w2, w3, ..., wp are the feature values on each axis. With this coordinate representation of sample vectors, the similarity between samples can be measured. Let the sample vectors be di = (wi1, wi2, wi3, ..., wip) and dj = (wj1, wj2, wj3, ..., wjp); several distances are defined below.
(1) Euclidean distance

For a p-dimensional feature space the distance is defined as:

D(di, dj) = √(Σk=1..p (wik − wjk)²) (3)

where di and dj are the feature vectors of the samples, p is the dimension of the feature vector space, wik denotes the k-th coordinate of sample di, and wjk the k-th coordinate of sample dj. The smaller the distance between two samples, the higher their degree of similarity and the more likely they belong to the same class; conversely, the more dissimilar they are, the more likely they belong to different classes.
(2) Cosine distance

The commonly used cosine distance is computed from the inner product of the feature vectors or from the cosine of their angle θ; the smaller the angle between the vectors, i.e. the larger the cosine value, the higher the similarity. The formula for the similarity of two feature vectors by the cosine distance is:

Sim(di, dj) = cos θ = Σk=1..p wik·wjk / (√(Σk=1..p wik²)·√(Σk=1..p wjk²)) (4)

or, with di and dj first normalized to unit length,

Sim(di, dj) = di·dj (5)

where wik and wjk (1 ≤ k ≤ p) are the weights of the k-th feature item in samples di and dj respectively, and p is the number of feature items, i.e. the dimension of the feature vector space. Formula (5) in fact first normalizes the feature vectors to unit length and then takes the inner product; the purpose of the normalization is to make texts of inconsistent length comparable.
(3) City-block distance

The city-block distance between sample feature vectors di and dj is defined as follows:

D(di, dj) = |wi1−wj1| + |wi2−wj2| + ... + |wip−wjp| (6)

where di = (wi1, wi2, ..., wip) and dj = (wj1, wj2, ..., wjp) denote the feature vectors of samples Wi and Wj, and D(di, dj) denotes the distance between the sample points di and dj in the sample set.
(4) Correlation distance

Let the i-th row of the sample matrix, di = (wi1, wi2, ..., wip), be the feature vector of a sample and cj (1 ≤ j ≤ c) the sample classes. The correlation distance of any two sample vectors is defined as:

D(di, dj) = 1 − r(di, dj) (7)

where r(di, dj) = Σk=1..p (wik − w̄i)(wjk − w̄j) / (√(Σk=1..p (wik − w̄i)²)·√(Σk=1..p (wjk − w̄j)²)) is the correlation coefficient of di and dj, and w̄i and w̄j are the means of the components of di and dj respectively.
(5) Hamming distance

In binary coding theory, the Hamming weight is the number of "1" symbols in a code word, also called the code weight and abbreviated W; for example, for the code word "110010" the code length is p = 6 and the code weight is W = 3.

The Hamming distance between two code words x = (x1 x2 ... xk ... xp) and y = (y1 y2 ... yk ... yp) of code length p is defined as:

D(x, y) = Σk=1..p (xk ⊕ yk) (8)

where ⊕ denotes modulo-2 addition and xk ∈ {0,1}, yk ∈ {0,1}. In D(x, y), x and y are code words, and D(x, y) is the number of positions at which the symbols of the two code words differ; its size embodies the degree of difference between the two code words, and the larger the value of formula (8), the greater the difference between the two code words.

After feature selection, with Boolean weighting a text can be arranged into a code word of length p; for example, sample W1 can be expressed as d1 = (10011100101010.....101), where 0 and 1 correspond to the two states of the sample: a component position carrying no sample information is recorded as 0 and one carrying sample information as 1. In this way the sample set is placed in one-to-one correspondence with a code-word set, and the problem of text similarity is in fact that of the Hamming distance between two code words. If the code words corresponding to texts W1 and W2 are d1 and d2 respectively, the Hamming distance between the two samples can be expressed by formula (8). The value of D(d1, d2) lies between 0 and p in the vector space model: when the p-bit code words of two sample vectors are completely identical the Hamming distance between them is 0, and when the code words are entirely different the Hamming distance is p, so D(x, y) quantitatively describes the degree of difference between different texts.
When classifying samples, the sample set is first converted into a code-word set. For the code word d1 = (x1 x2 ... xk ... xp) of sample W1 and the code word d2 = (y1 y2 ... yk ... yp) of sample W2, the similarity can be defined by the following formula:

Sim(d1, d2) = 1 − (1/p)·Σk=1..p (xk ⊕ yk) (9)

where xk and yk denote the values, 0 or 1, of the k-th components of d1 (corresponding to sample W1) and d2 (corresponding to sample W2) respectively. The degree of similarity is described by formula (9): when two samples are essentially similar, i.e. their code words are essentially identical, the similarity Sim(d1, d2) approaches 1; conversely, when the code words are entirely different, Sim(d1, d2) approaches 0.
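The five distances and the code-word similarity of formula (9) translate directly into code. The sketch below assumes dense numpy vectors (0/1 vectors in the Hamming case) and is for illustration only:

```python
import numpy as np

def euclidean(di, dj):          # formula (3)
    return np.sqrt(np.sum((di - dj) ** 2))

def cosine_sim(di, dj):         # formula (4)
    return di @ dj / (np.linalg.norm(di) * np.linalg.norm(dj))

def city_block(di, dj):         # formula (6)
    return np.sum(np.abs(di - dj))

def correlation_dist(di, dj):   # formula (7): 1 - Pearson correlation
    return 1.0 - np.corrcoef(di, dj)[0, 1]

def hamming_dist(x, y):         # formula (8); x, y are 0/1 vectors
    return int(np.sum(x != y))

def hamming_sim(x, y):          # formula (9)
    return 1.0 - hamming_dist(x, y) / len(x)
```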
The basic algorithm steps of KNN comprise the following (a code sketch follows the steps):

Step 1: vectorize the data in the training set;

Step 2: vectorize the data to be classified by the KNN algorithm consistently with the training set;

Step 3: according to a distance formula, such as the cosine formula

Sim(di, dj) = Σk=1..p wik·wjk / (√(Σk=1..p wik²)·√(Σk=1..p wjk²)) (10)

calculate the distance between the sample to be classified and each sample in the training set, and select the k nearest samples as the k nearest neighbors of that sample;

Step 4: from the k neighbors selected, calculate in turn the weight of belonging to each class, for example as the sum of Sim(x, dl)·I(dl, cj) over the samples dl in KNN(x), where KNN(x) denotes the k nearest neighbors of x and the indicator I(dl, cj) equals 1 if dl belongs to class cj and 0 otherwise; other specific methods for the weight I(dl, cj) exist in the prior art and are not described here;

Step 5: compare the weights; the class with the largest weight is the class the sample belongs to.
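A minimal sketch of steps 3 to 5, assuming the samples are already vectorized as numpy arrays; the similarity-weighted vote is one common choice for the class weight:

```python
import numpy as np

def cosine_sim(di, dj):
    # cosine similarity, formula (10)
    return di @ dj / (np.linalg.norm(di) * np.linalg.norm(dj) + 1e-12)

def knn_classify(x, train_X, train_y, k, sim=cosine_sim):
    """Steps 3-5: classify x by its k most similar training samples.

    The weight of class cj is the sum of sim(x, dl) * I(dl, cj) over the
    neighbors dl in KNN(x), I(dl, cj) being 1 if dl belongs to cj, else 0.
    """
    sims = np.array([sim(x, d) for d in train_X])
    nearest = np.argsort(sims)[-k:]               # indices of the k nearest neighbors
    weights = {}
    for i in nearest:
        weights[train_y[i]] = weights.get(train_y[i], 0.0) + sims[i]
    return max(weights, key=weights.get)          # class with the largest weight
```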
Study and experiment readily reveal the following two problems with the similarity formula of the above KNN algorithm:

(1) the similarity formula does not consider whether, and to what degree, the test sample xi is similar to the different classes of the training set; it considers neither the degree of aggregation of each class of samples nor the distance from the test sample point to the class center of each class cj in the training set, whereas the greater the aggregation of a class cj and the nearer xi lies to its class center, the more similar the test sample is to that class;

(2) the formula does not consider that the training sets of different classes may influence the classification result differently because they contain different numbers of feature items: some classes lean on certain prominent feature items while other classes lean on other feature items, and different classes contain different feature items; that is, the number of features co-owned by the sample to be classified and each class of the sample set is not considered.

As can be seen from the above cosine distance formula (10), dividing by the denominator is equivalent to dividing by the length of the sample vector, i.e. a normalization that eliminates the influence of sample vector length on the classification effect.
The first problem is considered first. Denote the class center vector of each class cj by

C̄j = (1/n)·Σd∈cj d (11)

i.e. the result of summing all sample vectors of the class in the training set and dividing by their number n. The similarity of the sample to be classified xi to the class center of each class cj can then be expressed as:

Sim(xi, C̄j) = Σk=1..p xik·w̄jk / (√(Σk=1..p xik²)·√(Σk=1..p w̄jk²)) (12)

where w̄jk is the k-th component of the class center vector C̄j.
Next consider the second problem: the number of features co-owned by the sample to be classified xi and the samples of each class cj in the training set. Denote by Sim3 the ratio of the number of common feature items to the total number of feature items (the dimension p of the feature vector space), defined by the Hamming similarity:

Sim3(xi, dj) = 1 − (1/p)·Σk=1..p (xik ⊕ wjk) (13)

where xi = (xi1, xi2, ..., xip) denotes the sample to be classified, dj = (wj1, wj2, ..., wjp) is any sample in the training set, and the coordinates xik, wjk (1 ≤ k ≤ p) in the k-th dimension take the value 0 or 1.
Considering the above two points, the improved cosine-distance similarity is defined as:

Sim(xi, dj) = a·Sim(xi, dj) + b·Sim(xi, C̄j) + c·Sim3(xi, dj) (14)

where a + b + c = 1; the optimal allocation of the proportions can be sought through many experiments, so that the classification effect of the improved similarity is optimal. The improved cosine-distance formula solves the above two problems, but it is clear from the formula that the feature items carry no weights, so all feature items have equal weight. A final step is therefore needed: introducing weights, which allows finer adjustment and improves the classification effect. According to the class attribute, we set the weight vector with which a sample belongs to each class cj as qj = (qj1, qj2, ..., qjp), where qj1 + qj2 + ... + qjp = 1; the weight qjk is the weight corresponding to the feature item tk, and the size of qjk represents the importance of tk in the different classes.
In conclusion the cosine similarity formula after weighted value is added in formula (10):
Ibid, sample to be sorted and each class cjThe similarity formula at class center also introduce weight on the basis of formula (12) Value:
Comprehensive (15) formula, (16) formula, (13) formula show that the similarity formula finally improved is:
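Formulas (11) to (17) combine into a single scoring routine. The sketch below is one illustrative reading: the class-center vector and the class weight vector qj are supplied by the caller, and nonzero components are treated as 1 for the Hamming term (13).

```python
import numpy as np

def weighted_cos(q, x, d):
    # weighted cosine similarity, formulas (15)/(16)
    qx, qd = q * x, q * d
    return qx @ qd / (np.linalg.norm(qx) * np.linalg.norm(qd) + 1e-12)

def improved_sim(x, d, center, q, a, b, c):
    """Improved similarity of formula (17), with a + b + c = 1.

    x: sample to be classified; d: a training sample of class cj;
    center: class-center vector C_j of formula (11); q: weight vector qj.
    """
    sim1 = weighted_cos(q, x, d)          # weighted cosine, formula (15)
    sim2 = weighted_cos(q, x, center)     # weighted class-center similarity, (16)
    sim3 = np.mean((x > 0) == (d > 0))    # Hamming similarity, formula (13)
    return a * sim1 + b * sim2 + c * sim3
```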
Next, the support vector machine (SVM) is a machine learning algorithm proposed by Vapnik et al. in 1995 on the basis of the structural risk minimization principle. Its distinguishing features are structural risk minimization and strong generalization ability. For classification problems, the support vector machine algorithm can be sketched as follows: the samples of the input space are mapped by some nonlinear function into a feature space in which the two classes of samples are linearly separable, and the optimal linear separating hyperplane of the samples in this feature space is sought. The SVM algorithm covers linear classification and nonlinear classification.
(1) Linear classification

SVM defines the optimal hyperplane and converts the search for the optimal linear hyperplane into the solution of a quadratic programming problem. Based on Mercer's theorem, the sample space is mapped by a nonlinear mapping into a high-dimensional feature space, so that the highly nonlinear problem in the sample space can be solved there with linear methods.
The support vector machine was proposed for binary classification. Suppose the training samples are (xi, yi), i = 1, 2, ..., l, x ∈ Rd, yi ∈ {−1, 1}, and there exists a separating hyperplane w·x + b = 0. For the separating plane to classify all samples correctly and to possess a class interval, it must satisfy:

yi[(w·xi)+b]−1 ≥ 0 (18)

The class interval is calculated to be:

2/||w|| (19)

Maximizing the class interval 2/||w|| means minimizing ||w||, so the problem of solving the optimal hyperplane can be expressed as a constrained optimization problem, namely minimizing, under the constraint of formula (18), the function:

Φ(w) = (1/2)||w||² (20)
Introduce the Lagrange function:

L(w, b, α) = (1/2)||w||² − Σi=1..l αi{yi[(w·xi)+b] − 1} (21)

where the αi > 0 are Lagrange coefficients. Taking the partial derivatives of formula (21) with respect to w and b and setting them equal to 0 gives

w = Σi=1..l αi·yi·xi (22)

Σi=1..l αi·yi = 0 (23)

and the above problem can be converted into a simpler dual problem. Substituting formulas (22) and (23) into (21) yields the dual optimization problem: maximize the function

W(α) = Σi=1..l αi − (1/2)·Σi=1..l Σj=1..l αi·αj·yi·yj·(xi·xj) (24)

subject to

Σi=1..l αi·yi = 0 (25)

where αi ≥ 0, i = 1, ..., l.
This is a quadratic programming (QP, Quadratic Programming) extreme-value problem under inequality constraints. According to the Karush-Kuhn-Tucker (KKT) conditions, the solution of this optimization problem must satisfy:

αi ≥ 0, i = 1, ..., l (26)

αi{yi[(w·xi)+b] − 1} = 0, i = 1, ..., l (27)

Therefore the αi corresponding to most samples are 0, and the samples with αi ≠ 0, for which the equality in formula (18) holds, are called support vectors. In the support vector machine algorithm the support vectors are the key elements of the training set: they lie closest to the decision boundary, and if all other training samples were removed and training repeated, the same separating plane would be obtained.
After the above quadratic programming problem is solved, the classification decision function can be expressed as:

f(x) = sgn(Σi=1..l αi*·yi·(xi·x) + b*) (28)

The summation in the formula is carried out only over the support vectors, i.e. only the training samples whose αi is nonzero determine the classification result, while the other samples are irrelevant to it; b* is the classification threshold. When the training sample set is linearly inseparable, non-negative slack variables ξi, i = 1, 2, ..., l, are introduced, and the optimization problem for the generalized optimal separating plane becomes:

min Φ(w, ξ) = (1/2)||w||² + C·Σi=1..l ξi (29)

Its dual problem is to maximize over α the function:

W(α) = Σi=1..l αi − (1/2)·Σi=1..l Σj=1..l αi·αj·yi·yj·(xi·xj), 0 ≤ αi ≤ C (30)

s.t. yi[(w·xi)+b] ≥ 1−ξi (31)

where C > 0 is a constant called the error penalty parameter, which controls the degree of penalty on misclassified samples, and the ξi are the non-negative slack variables introduced when the training samples are linearly inseparable.
(2) Nonlinear classification

For nonlinear classification problems, an appropriate inner-product kernel function K(xi, xj) can realize the linear classification that follows some nonlinear transformation; the objective function to be optimized then becomes:

W(α) = Σi=1..l αi − (1/2)·Σi=1..l Σj=1..l αi·αj·yi·yj·K(xi, xj) (32)

s.t. Σi=1..l αi·yi = 0, 0 ≤ αi ≤ C (33)

and the corresponding classification decision function is expressed as:

f(x) = sgn(Σi=1..l αi*·yi·K(xi, x) + b*) (34)

The above classification decision function is exactly the support vector machine. As described above, the original problem is converted into its dual problem, so that the computational complexity no longer depends on the dimension of the space but on the number of samples, in particular the number of support vectors; this feature enables the support vector machine to cope effectively with high-dimensional problems.
In SVM, the introduction of the kernel function K(xi, xj) converts the inner-product operation of the high-dimensional space into a kernel computation over inner products in the original space, realizing nonlinear classification without increasing the complexity of the algorithm. Different kernel functions construct different SVMs, and the choice of kernel function is of prime importance. Four kernel functions are in common use (collected in code after the list):

(1) linear inner-product kernel

K(xi, xj) = (xi·xj) (35)

(2) polynomial kernel

K(xi, xj) = [(xi·xj)+C]^q, q > 0 (36)

(3) radial basis kernel

K(xi, xj) = exp(−||xi−xj||²/2σ²) (37)

(4) two-layer neural network (sigmoid) kernel

K(xi, xj) = tanh(v(xi·xj)+θ) (38)
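The four kernels (35) to (38) in code, for reference; σ, C, q, v and θ are hyperparameters to be chosen by the user:

```python
import numpy as np

def linear_kernel(xi, xj):                     # formula (35)
    return xi @ xj

def poly_kernel(xi, xj, C=1.0, q=3):           # formula (36)
    return (xi @ xj + C) ** q

def rbf_kernel(xi, xj, sigma=1.0):             # formula (37)
    return np.exp(-np.sum((xi - xj) ** 2) / (2.0 * sigma ** 2))

def sigmoid_kernel(xi, xj, v=1.0, theta=0.0):  # formula (38)
    return np.tanh(v * (xi @ xj) + theta)
```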
As shown in Fig. 2, analysis of the distribution of the sample points misclassified by SVM shows that, like other classifiers, the SVM classifier has its error sample points concentrated near the interface; classification performance can therefore be improved by raising the classification precision of the samples near the interface. The SVM can be regarded as a 1NN classifier in which each class has only one representative point. When a sample point lies near the interface, since SVM takes only one representative point per class of support vectors, that representative point sometimes cannot represent the class well; combining SVM with KNN at this point, i.e. taking all support vectors of each class as representative points, gives the classifier higher classification accuracy. Specifically, for a sample x to be identified, the difference between the distances from x to the representative points x+ and x− of the two classes of support vectors is calculated. If the distance difference exceeds a given threshold, i.e. x lies far from the separating plane, as in regions I and II of Fig. 2, SVM classification is generally correct. If the distance difference is below the given threshold, i.e. x lies near the interface and falls into region III, SVM classification, which computes only the distances from x to the single representative points taken for the two classes, misclassifies more easily; KNN is then used to classify the sample point, taking each support vector as a representative point, computing the distance between the sample to be identified and each support vector, and obtaining the judgment from these distances.
The basic SVM-KNN classifier algorithm comprises the following steps (a code sketch follows the steps):

Step 1: obtain the support vectors and the constant b with the traditional SVM algorithm. Let T be the test set, Tsv the support vector set, k the number of neighbors taken and ε the classification threshold, typically set to 1; if ε is 0, the system degenerates into a traditional SVM classifier;

Step 2: if T ≠ ∅, take an x ∈ T; if T = ∅, stop;

Step 3: substitute the values into the formula for calculation, choosing the linearly separable or the linearly inseparable formula as the case may be;

Step 4: if |f(x)| > ε, output f(x) directly; if |f(x)| ≤ ε, pass Tsv, x and k to the KNN algorithm for classification, with Tsv as the whole sample set, and take the returned value as the output value;

Step 5: set T = T − {x} and go to Step 1.
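A sketch of these steps using scikit-learn's SVC for the SVM part, the decision_function value playing the role of f(x); the library choice and the simple majority vote over the k nearest support vectors are our assumptions, not part of the patent:

```python
import numpy as np
from sklearn.svm import SVC

def svm_knn_predict(svm, X_sv, y_sv, x, k, eps):
    """Basic SVM-KNN: trust the SVM far from the boundary (|f(x)| > eps),
    fall back to KNN over the support vector set Tsv otherwise."""
    fx = svm.decision_function(x.reshape(1, -1))[0]
    if abs(fx) > eps:
        return 1 if fx > 0 else -1                 # step 4: SVM decides directly
    d = np.linalg.norm(X_sv - x, axis=1)           # distances to the support vectors
    nearest = np.argsort(d)[:k]                    # k nearest support vectors
    return 1 if y_sv[nearest].sum() >= 0 else -1   # majority vote among them

# Step 1: the support vector set comes from a trained classifier, e.g.
# svm = SVC(kernel="rbf").fit(X_train, y_train)
# X_sv, y_sv = svm.support_vectors_, y_train[svm.support_]
```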
Since the samples misclassified by SVM are concentrated near the interface, and the sample points near the interface are mostly support vectors, the KNN classifier can be combined with SVM to improve its classification performance, applying different classification methods to sample points distributed differently in space and improving classifier performance by way of semi-supervised learning. Shilling attack detection is an iterative process: a labeled user data set and an unlabeled user data set are first established; next, an initial SVM classifier is trained on the small amount of labeled user data, and the distance between each unlabeled user sample and the boundary of the initial SVM classifier is calculated; if the distance exceeds the set threshold the sample is classified by SVM, otherwise by KNN; the newly labeled data are added to the training set and the SVM classifier is retrained; and the process is iterated until an SVM classifier of high classification accuracy is finally obtained.

The semi-supervised SVM-KNN is applied to shilling attack detection with improvements to the algorithm. According to different situations the algorithm uses different classifiers, merges the newly labeled data into the training set for retraining, and iterates until a classifier of higher precision is finally trained; therefore, to improve the performance of the algorithm, SVM and KNN can each also be improved independently to some extent, for example SVM with respect to convergence speed and problems of unbalanced data.
In the concrete practice of the present invention, the improved semi-supervised SVM-KNN attack detection method comprises the following steps (a code sketch follows the steps):

Step 1: divide the known user set into two parts. One part is the labeled set L = {(u1, cj), (u2, cj), ..., (um, cj)}, where m denotes the number of labeled users and cj denotes the class; j takes the values 1 and 2, since attack detection is a binary classification problem with only two classes in all, c1 = 1 denoting normal users and c2 = −1 denoting attack users. The other part is the unlabeled user set U = {u`1, u`2, ..., u`n}, where n denotes the number of unlabeled users. Take the labeled user set as the training set and train an initial SVM classifier;

Step 2: select a sample u`i from the unlabeled user data set U and obtain the value of the classification decision function f(x) by the SVM calculation formula, which in the nonlinear case is:

f(x) = Σi=1..l αi·yi·K(xi, x) + b

Step 3: when |f(x)| > ε, it can be judged that u`i lies far from the classification boundary, the classification result can be output directly, and the newly labeled data are merged directly into the training set; when |f(x)| < ε, it can be judged that u`i lies near the classification boundary, ε being the given classification threshold (0 < ε < 1), and this part of the data, close to the classification boundary, is added to the near-boundary vector set Usv = {ui ∈ U, i = 1, 2, ..., k}, where k is the number of near-boundary vectors;

Step 4: reclassify the user data in the set Usv with the improved KNN;

Step 5: put the newly labeled boundary user data classified by KNN into the original training set, and train a new SVM classifier with the expanded, updated training set;

Step 6: judge whether the result reaches the best detection performance; if so, output the final classifier; if not, go to Step 2 to re-optimize the training user data set, perform SVM training, and iterate the loop.
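Steps 1 to 6 amount to the following self-training loop. This is a simplified sketch under stated assumptions: labels are 1 for normal and −1 for attack users as in Step 1, the improved-KNN step is stubbed by the knn_classify sketch given earlier, and the stopping test of Step 6 is reduced to an iteration limit.

```python
import numpy as np
from sklearn.svm import SVC

def semi_supervised_svm_knn(X_l, y_l, X_u, eps=0.5, k=5, max_iter=10):
    """Grow the training set from unlabeled user profiles, then return
    the final classifier (Steps 1-6 of the improved method)."""
    X_train, y_train = X_l.copy(), y_l.copy()
    for _ in range(max_iter):                           # Step 6: iterate the loop
        svm = SVC(kernel="rbf").fit(X_train, y_train)   # Steps 1/5: (re)train SVM
        if len(X_u) == 0:
            break
        f = svm.decision_function(X_u)                  # Step 2: f(x) per user
        far = np.abs(f) > eps                           # Step 3: far from the boundary?
        X_far, y_far = X_u[far], np.where(f[far] > 0, 1, -1)
        X_near = X_u[~far]                              # near-boundary set Usv
        y_near = np.array([knn_classify(x, X_train, y_train, k)
                           for x in X_near])            # Step 4: (improved) KNN
        X_train = np.vstack([X_train, X_far, X_near])   # Step 5: expand training set
        y_train = np.concatenate([y_train, y_far, y_near])
        X_u = X_u[:0]                                   # all users are now labeled
    return SVC(kernel="rbf").fit(X_train, y_train)      # final classifier
```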
Detection performance is assessed by the two indicators precision and recall. The classification data comprise four kinds: true positives and true negatives denote the numbers correctly judged as attack users and genuine users respectively, while false positives and false negatives denote the numbers wrongly judged as attack users and genuine users respectively. The calculation formulas of precision and recall are: precision = true positives / (true positives + false positives), recall = true positives / (true positives + false negatives).
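In code, with the attack class (label −1 in the notation of Step 1) treated as the positive class, the two indicators are computed as follows (a trivial sketch):

```python
import numpy as np

def precision_recall(y_true, y_pred, attack_label=-1):
    """Precision and recall of attack detection, attack users as positives."""
    tp = np.sum((y_pred == attack_label) & (y_true == attack_label))  # true positives
    fp = np.sum((y_pred == attack_label) & (y_true != attack_label))  # false positives
    fn = np.sum((y_pred != attack_label) & (y_true == attack_label))  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```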
In the above algorithm, during each iteration of classifier training, the labeling quality of the boundary samples added to the training set has a great influence on the classification effect; but since the labeled user data are insufficient, the classification ability of the initial SVM is weak, and boundary samples are easily misclassified. Therefore, each time the training set is updated here, KNN is introduced to classify the boundary samples and assist the SVM in optimizing the labeling quality of the boundary data, thereby improving the detection precision of the final classifier.
The improved SVM-KNN semi-supervised shilling attack detection algorithm was tested experimentally.

The experiment uses the MovieLens 100K data set. The data set contains the 1-to-5 rating data of 943 users on 1682 movies, and every user has rated at least 20 movies. The original users are taken by default to be ordinary users whose ratings are normal and credible. Attack users were constructed with an attack size of 15% and filler sizes of 3%, 5%, 10%, 15% and 20%; the attack types are the random attack, the average attack and the bandwagon attack. The data set is divided into a training set and a test set: the training set contains 189 normal users and 128 attack users, and the test set contains 754 normal users and 113 attack users (for comparison with similar schemes, the data distribution of the experiment fully follows that of other published work). In the experiment, the SVM uses the radial basis function.
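For orientation, random, average and bandwagon attack profiles can be generated along the following lines. This is a sketch following the common definitions of these attack models, not code from the patent; the parameter names and the rating-matrix layout are assumptions.

```python
import numpy as np

def make_attack_profiles(R, target, n_attackers, filler_size, kind="average",
                         popular=None, seed=0):
    """Generate shilling profiles: rate the target item 5 and fill a random
    filler set of size filler_size * n_items with ratings drawn according
    to the attack model; popular = indices of popular items (bandwagon)."""
    rng = np.random.default_rng(seed)
    n_items = R.shape[1]
    item_mean = np.nanmean(R, axis=0)
    g_mean, g_std = np.nanmean(R), np.nanstd(R)
    profiles = np.full((n_attackers, n_items), np.nan)
    for a in range(n_attackers):
        filler = rng.choice(n_items, int(filler_size * n_items), replace=False)
        if kind == "average":                  # item means as filler ratings
            profiles[a, filler] = item_mean[filler].round()
        else:                                  # random / bandwagon: N(mean, std)
            profiles[a, filler] = np.clip(
                rng.normal(g_mean, g_std, filler.size).round(), 1, 5)
        if kind == "bandwagon" and popular is not None:
            profiles[a, popular] = 5           # push the selected popular items
        profiles[a, target] = 5                # push the target item
    return profiles
```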
Fig. 3 to Fig. 8 show the precision and recall of the different detection methods under the different attacks. It can be seen that the ordinary semi-supervised SVM-KNN attack detection is better than the SVM classifier, and that the improved SVM-KNN algorithm (with the optimized KNN) performs better still: its precision and recall under the different attack modes are both higher than those of ordinary SVM-KNN detection and of SVM classifier detection.
The above are embodiments of the present invention. Considering the technical problem in the prior art that the KNN component causes the attack detection precision of SVM-KNN to be poor and the detection effect unsatisfactory, the present invention proposes a semi-supervised shilling attack detection method based on an improved SVM-KNN algorithm: a labeled user data set and an unlabeled user data set are first established; an initial SVM classifier is then trained on the small amount of labeled user data; the distance between each unlabeled user sample and the boundary of the initial SVM classifier is calculated, and if the distance exceeds the set threshold the sample is classified by SVM, otherwise by KNN; the newly labeled data are added to the training set and the SVM classifier is retrained; and the process is iterated until an SVM classifier of high classification accuracy is finally obtained. The method exploits both the accuracy of the labeled data, as a semi-supervised detector, and the distribution regularity of the unlabeled data, and combines the SVM and KNN algorithms, thereby improving generalization ability and detection accuracy; with only a small amount of labeled information and in a continually changing environment, it outperforms previous attack detection algorithms.
The above are only preferred embodiments of the present invention and are not intended to limit the present invention; for those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall be included in the protection scope of the present invention.

Claims (5)

1. A semi-supervised shilling attack detection method based on an improved SVM-KNN algorithm, characterized by comprising:

dividing the user set into a labeled user set and an unlabeled user set, and training an initial SVM classifier with the labeled user set as the training set;

performing preliminary classification on each sample user in the unlabeled user set with the initial SVM classifier;

merging the user data that is labeled in the preliminary classification into the training set, and merging the remaining user data into a near-boundary vector set;

using the improved KNN similarity formula (1) as the distance measure of the KNN algorithm to form an SVM-KNN classifier, and performing secondary classification on the users in the near-boundary vector set,

Sim(xi, dj) = a·Sim1(xi, dj) + b·Sim2(xi, C̄j) + c·Sim3(xi, dj) (1)

where Sim1(xi, dj) is the cosine similarity after the weights are added in, Sim2(xi, C̄j) is the weighted cosine similarity between the sample and the class center, Sim3(xi, dj) is the ratio of the number of common feature items to the total number of feature items, the parameters satisfy a + b + c = 1, xi and dj are the feature vectors of samples, qjk is the weight of the feature item tk, wik and wjk (1 ≤ k ≤ p) are the weights of the k-th feature item in samples di and dj respectively, xik and wjk (1 ≤ k ≤ p) are the coordinates in the k-th dimension, p is the number of feature items, and w̄jk is the weight (component) of the class-center vector;

merging the labeled user data obtained by the KNN classification into the training set, and retraining a new SVM classifier with the updated training set;

judging whether the classification result reaches the optimal detection performance, and if so, outputting the final classifier, otherwise looping back to classify the users in the unlabeled user set;

performing shilling attack detection on user data with the final classifier.
2. The semi-supervised shilling attack detection method based on an improved SVM-KNN algorithm according to claim 1, characterized in that performing preliminary classification on each sample user in the unlabeled user set with the initial SVM classifier specifically comprises:

selecting a sample user from the unlabeled user set, and calculating the value of the classification decision function f(x) with the SVM calculation formula (2);

judging whether the absolute value |f(x)| of the classification decision function exceeds a given classification threshold ε (0 < ε < 1);

if so, labeling the sample as a normal user.
3. The semi-supervised shilling attack detection method based on an improved SVM-KNN algorithm according to claim 1, characterized in that using the improved KNN similarity formula (1) as the distance measure of the KNN algorithm to perform secondary classification on the users in the near-boundary vector set specifically comprises:

vectorizing the user data in the training set consistently with the sample user data to be classified in the near-boundary vector set;

calculating with the distance measure the distance between the sample to be classified and each sample in the training set, and selecting the k nearest samples as the nearest neighbors of the sample to be classified;

calculating in turn the weight with which each sample among the nearest neighbors belongs to each class;

comparing the weights with which the sample belongs to the different classes, and assigning the sample to the class with the largest weight.
4. The semi-supervised shilling attack detection method based on an improved SVM-KNN algorithm according to claim 3, characterized in that the weight vector with which a sample belongs to each class cj is qj = (qj1, qj2, ..., qjp), where qj1 + qj2 + ... + qjp = 1; the weight qjk is the weight corresponding to the feature item tk, and the size of the weight represents the importance of tk in the different classes.
5. The semi-supervised shilling attack detection method based on an improved SVM-KNN algorithm according to claim 1, characterized in that judging whether the classification result reaches the optimal detection performance, outputting the final classifier if so, and otherwise looping back to classify the users in the unlabeled user set specifically comprises:

calculating the precision and recall of the classification process from the classification result, wherein the classification result comprises four kinds of data, true positives, true negatives, false positives and false negatives; the true positives and true negatives are the data correctly judged as attack users and genuine users respectively, and the false positives and false negatives are the data wrongly judged as attack users and genuine users respectively; the precision is calculated as precision = true positives / (true positives + false positives), and the recall is calculated as recall = true positives / (true positives + false negatives);

judging whether the precision and the recall reach preset optimal thresholds;

if the precision and recall reach the preset optimal thresholds, outputting the SVM-KNN classifier that produced the classification result as the final classifier, otherwise looping back to classify the users in the unlabeled user set.
CN201711416340.2A 2017-12-25 2017-12-25 Semi-supervised shilling attack detection method based on an improved SVM-KNN algorithm Pending CN108154178A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201711416340.2A CN108154178A (en) Semi-supervised shilling attack detection method based on an improved SVM-KNN algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201711416340.2A CN108154178A (en) Semi-supervised shilling attack detection method based on an improved SVM-KNN algorithm

Publications (1)

Publication Number Publication Date
CN108154178A true CN108154178A (en) 2018-06-12

Family

ID=62464444

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201711416340.2A Pending CN108154178A (en) 2017-12-25 2017-12-25 Semi-supervised support attack detection method based on improved SVM-KNN algorithms

Country Status (1)

Country Link
CN (1) CN108154178A (en)

Patent Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20050100992A1 (en) * 2002-04-17 2005-05-12 Noble William S. Computational method for detecting remote sequence homology
CN104239436A (en) * 2014-08-27 2014-12-24 南京邮电大学 Network hot event detection method based on text classification and clustering analysis
CN105426426A (en) * 2015-11-04 2016-03-23 北京工业大学 KNN text classification method based on improved K-Medoids
CN106250442A (en) * 2016-07-26 2016-12-21 新疆大学 The feature selection approach of a kind of network security data and system
CN106557785A (en) * 2016-11-23 2017-04-05 山东浪潮云服务信息科技有限公司 A kind of support vector machine method of optimization data classification
CN106951466A (en) * 2017-03-01 2017-07-14 常州大学怀德学院 Field text feature and system based on KNN SVM

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Lü Chengshu et al.: "Semi-supervised support attack detection method based on SVM-KNN" (基于SVM-KNN的半监督托攻击检测方法), Computer Engineering and Applications (《计算机工程与应用》) *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109299741A (en) * 2018-06-15 2019-02-01 北京理工大学 A kind of network attack kind identification method based on multilayer detection
CN109299741B (en) * 2018-06-15 2022-03-04 北京理工大学 Network attack type identification method based on multi-layer detection
CN108769079A (en) * 2018-07-09 2018-11-06 四川大学 A kind of Web Intrusion Detection Techniques based on machine learning
CN109087482A (en) * 2018-09-18 2018-12-25 西安交通大学 A kind of falling detection device and method
CN109903166A (en) * 2018-12-25 2019-06-18 阿里巴巴集团控股有限公司 A kind of data Risk Forecast Method, device and equipment
CN109903166B (en) * 2018-12-25 2024-01-30 创新先进技术有限公司 Data risk prediction method, device and equipment
CN110428458A (en) * 2018-12-26 2019-11-08 西安电子科技大学 Depth information measurement method based on the intensive shape coding of single frames
CN109818929A (en) * 2018-12-26 2019-05-28 天翼电子商务有限公司 Based on the unknown threat cognitive method actively from step study, system, storage medium, terminal
CN109934004A (en) * 2019-03-14 2019-06-25 中国科学技术大学 The method of privacy is protected in a kind of machine learning service system
CN110020532A (en) * 2019-04-15 2019-07-16 苏州浪潮智能科技有限公司 A kind of information filtering method, system, equipment and computer readable storage medium
CN110225055A (en) * 2019-06-22 2019-09-10 福州大学 A kind of network flow abnormal detecting method and system based on KNN semi-supervised learning model
CN110602090A (en) * 2019-09-12 2019-12-20 天津理工大学 Block chain-based support attack detection method
CN110808968A (en) * 2019-10-25 2020-02-18 新华三信息安全技术有限公司 Network attack detection method and device, electronic equipment and readable storage medium
CN114039794A (en) * 2019-12-11 2022-02-11 支付宝(杭州)信息技术有限公司 Abnormal flow detection model training method and device based on semi-supervised learning
CN113079123A (en) * 2020-01-03 2021-07-06 中国移动通信集团广东有限公司 Malicious website detection method and device and electronic equipment
CN111757328A (en) * 2020-06-23 2020-10-09 南京林业大学 Cross-technology communication cheating attack detection method
CN112153000A (en) * 2020-08-21 2020-12-29 杭州安恒信息技术股份有限公司 Method and device for detecting network flow abnormity, electronic device and storage medium
CN112153000B (en) * 2020-08-21 2023-04-18 杭州安恒信息技术股份有限公司 Method and device for detecting network flow abnormity, electronic device and storage medium
CN112288015A (en) * 2020-10-30 2021-01-29 国网四川省电力公司电力科学研究院 Distribution network electrical topology identification method and system based on edge calculation improved KNN
CN112529108A (en) * 2020-12-28 2021-03-19 内蒙动力机械研究所 Machine learning-based nondestructive testing data prediction method for solid rocket engine
CN113255474A (en) * 2021-05-07 2021-08-13 华中科技大学 Automobile engine fault diagnosis method and device
CN113722607A (en) * 2021-06-25 2021-11-30 河海大学 Improved clustering-based support attack detection method
CN113722607B (en) * 2021-06-25 2023-12-08 河海大学 Support attack detection method based on improved clustering
CN113469251A (en) * 2021-07-02 2021-10-01 南京邮电大学 Method for classifying unbalanced data
CN113420772A (en) * 2021-08-24 2021-09-21 常州微亿智造科技有限公司 Defect detection method and device based on multi-classifier and SVDD (support vector data description) cooperative algorithm
CN116881828A (en) * 2023-07-19 2023-10-13 西华师范大学 Abnormal detection method of KNN algorithm based on subspace similarity
CN116881828B (en) * 2023-07-19 2024-05-17 西华师范大学 Abnormal detection method of KNN algorithm based on subspace similarity

Similar Documents

Publication Publication Date Title
CN108154178A (en) Semi-supervised support attack detection method based on improved SVM-KNN algorithms
US20210390355A1 (en) Image classification method based on reliable weighted optimal transport (rwot)
Schubert et al. On evaluation of outlier rankings and outlier scores
CN111126482B (en) Remote sensing image automatic classification method based on multi-classifier cascade model
CN103309953B (en) Method for labeling and searching for diversified pictures based on integration of multiple RBFNN classifiers
Liang et al. Learning very fast decision tree from uncertain data streams with positive and unlabeled samples
Fang et al. Confident learning-based domain adaptation for hyperspectral image classification
Yu et al. Cutset-type possibilistic c-means clustering algorithm
CN108877947A (en) Depth sample learning method based on iteration mean cluster
CN107679138A (en) Spectrum signature system of selection based on local scale parameter, entropy and cosine similarity
CN111815582B (en) Two-dimensional code region detection method for improving background priori and foreground priori
CN115577357A (en) Android malicious software detection method based on stacking integration technology
Khezri et al. A novel semi-supervised ensemble algorithm using a performance-based selection metric to non-stationary data streams
Xue et al. Deep constrained low-rank subspace learning for multi-view semi-supervised classification
Seyghaly et al. Interference recognition for fog enabled IoT architecture using a novel tree-based method
Poongodi et al. Support vector machine with information gain based classification for credit card fraud detection system.
Zhou et al. Credit card fraud identification based on principal component analysis and improved AdaBoost algorithm
CN113837266A (en) Software defect prediction method based on feature extraction and Stacking ensemble learning
Singhal et al. Image classification using bag of visual words model with FAST and FREAK
Cong et al. Exact and consistent interpretation of piecewise linear models hidden behind APIs: A closed form solution
CN106529585A (en) Piano music score difficulty identification method based on large-interval projection space learning
CN113128556B (en) Deep learning test case sequencing method based on mutation analysis
Lin et al. Automated classification of Wuyi rock tealeaves based on support vector machine
Zheng et al. An Improved k-Nearest Neighbor Classification Algorithm Using Shared Nearest Neighbor Similarity.
Pryor et al. Deepfake detection analyzing hybrid dataset utilizing CNN and SVM

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20180612