CN108121998A

CN108121998A - A support vector machine training method based on the Spark framework

Info

Publication number: CN108121998A
Application number: CN201711269096.1A
Authority: CN
Inventors: 许千帆; 王宇; 陈玫
Original assignee: Beijing Send Cloud Dingcheng Technology Co Ltd
Current assignee: Beijing Send Cloud Dingcheng Technology Co Ltd
Priority date: 2017-12-05
Filing date: 2017-12-05
Publication date: 2018-06-05
Anticipated expiration: 2037-12-05
Also published as: CN108121998B

Abstract

The present invention provides a support vector machine training method based on the Spark framework, comprising: obtaining a training sample set, all sample vectors of which are stored in a distributed manner on the data nodes of the Spark framework; extracting from the training sample set the sample vector V₂ that most severely violates the KKT conditions, and selecting the sample vector V₁ whose sphere-center distance differs most from that of V₂; performing an iterative optimization calculation on V₁ and V₂ to obtain the updated sample vectors V₁^new and V₂^new; broadcasting V₁^new and V₂^new to the Spark data nodes, computing on each data node the difference produced by the change in V₁ and V₂, and computing the updated sphere center from these differences; and then updating the sphere-center distance of each sample vector on the data nodes and the sphere radius. By using the Spark distributed computing framework, the method distributes the compute-intensive work of a single machine to the worker nodes; as the data grows, the system can be scaled out horizontally, and storage is no longer limited by a single machine.

Description

A support vector machine training method based on the Spark framework

Technical field

The present invention relates to the field of computer technology, and more particularly to a support vector machine training method based on the Spark framework.

Background art

Since their introduction, support vector machines (Support Vector Machine, SVM) have been widely applied in fields such as information security, image processing, pattern recognition, fault diagnosis and anomaly detection. In 1999, Tax, Schölkopf, Duin et al. proposed two one-class SVM algorithms, one based on a hyperplane and one based on a hypersphere. Support vector data description (SVDD) is the hypersphere-based one-class classification technique; its goal is to use the training data to describe a hypersphere that serves as the discriminant model for classification.

The software packages currently in common use for SVM pattern recognition and regression are scikit-learn for Python and LIBSVM, developed by Professor Chih-Jen Lin of National Taiwan University. scikit-learn is a Python-based machine learning module released under the BSD open-source license; the project was started by David Cournapeau in 2007 and is currently maintained by community volunteers. LIBSVM is a simple, easy-to-use, fast and effective software package for SVM pattern recognition and regression designed and developed by Professor Chih-Jen Lin et al. It provides not only compiled executables for Windows systems but also the source code, which makes it easy to improve, modify and port to other operating systems. The software requires relatively little tuning of the SVM parameters and provides many default parameters with which many problems can be solved; it also provides a cross-validation function. It can solve C-SVM, ν-SVM, ε-SVR and ν-SVR problems, including multi-class pattern recognition based on the one-versus-one algorithm.

However, with the exponential growth of data volumes, the memory and CPU of a single machine can no longer meet demand, and the need for parallelized solvers is increasingly urgent. Solving support vector data description (SVDD) with the SMO algorithm requires solving many quadratic programming sub-problems and has high computational complexity, so the SVDD running time increases sharply with the number of training samples. The memory required to store the kernel matrix K_ij grows rapidly with the number N of training points, because the size of the kernel matrix is quadratic in the number of samples; applying SVDD directly to data anomaly detection therefore leads to excessive computation and memory overflow.

Summary of the invention

In the prior art, solving SVDD with the SMO algorithm requires solving many quadratic programming sub-problems and has high computational complexity, so the SVDD running time increases sharply with the number of training samples, and applying SVDD directly to data anomaly detection leads to excessive computation and memory overflow. To solve these problems, a support vector machine training method based on the Spark framework is proposed.

The method provided by the invention comprises:

S1, obtaining a training sample set, all sample vectors of which are stored in a distributed manner on the data nodes of the Spark framework;

S2, extracting from the training sample set the sample vector V₂ that most severely violates the KKT conditions, and selecting the sample vector V₁ whose sphere-center distance differs most from that of V₂;

S3, performing an iterative optimization calculation on the sample vectors V₁ and V₂ to obtain the updated sample vectors V₁^new and V₂^new;

S4, broadcasting the updated sample vectors V₁^new and V₂^new to the Spark data nodes, computing on each data node the difference produced by the change in V₁ and V₂, and computing the updated sphere center a^new from the differences computed on the data nodes;

S5, updating, according to the updated sphere center a^new, the sphere-center distance of each sample vector on the Spark data nodes, and at the same time updating the sphere radius R.

Step S1 further comprises: on each data node, reading the sample vectors of the training samples stored on that data node, and generating a unique data identifier for each sample vector.

Preferably, the unique data identifier is composed of the partition number of the data node and the local timestamp of the data node.

Step S1 further comprises initializing the calculation parameters required by the iterative optimization calculation, wherein the calculation parameters include the Lagrange multipliers α of all sample vectors, the sphere center a, and the sphere-center distance d² of each sample vector.

Initializing the calculation parameters of the iterative optimization calculation specifically comprises:

initializing the Lagrange multiplier α of every sample vector to 1/N, where N is the number of sample vectors in the training sample set;

initializing the square of the sphere radius, R², so that R² = 0;

initializing the sphere center according to the following formula:

$$a^2 = \sum_i \sum_j \alpha_i \alpha_j K_{ij}$$

where a is the sphere center, α_i and α_j are the Lagrange multipliers of any two sample vectors in the training sample set, and K_ij is the kernel function;

and calculating the sphere-center distance d² of each sample vector from the initialized sphere center.
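The distance formula itself is not reproduced in this text. Under the usual SVDD assumption that the sphere center is the kernel-space combination a = Σ_j α_j Φ(x_j), which is consistent with the expression for a² used in the initialization, the squared distance of sample i from the center would expand as follows. This is the standard textbook form and is an assumption here; the original formula may be written differently.

```latex
d_i^2 = \lVert \Phi(x_i) - a \rVert^2
      = K_{ii} - 2\sum_j \alpha_j K_{ij} + \sum_j \sum_k \alpha_j \alpha_k K_{jk}
      = K_{ii} - 2\sum_j \alpha_j K_{ij} + a^2
```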

Preferably, in step S2, the sample vector V₂ that most severely violates the KKT conditions is extracted from the training sample set by sampling without replacement.

Selecting, in step S2, the sample vector V₁ whose sphere-center distance differs most from that of V₂ specifically comprises:

for each data node, obtaining the sample vector on that data node whose sphere-center distance differs most from that of the sample vector V₂;

in the Driver Program of the Spark framework, obtaining, from the per-node sample vectors whose sphere-center distance differs most from that of V₂, the sample vector V₁ whose sphere-center distance differs most from that of V₂ overall.

In step S4, computing the updated sphere center a^new specifically comprises: accumulating, in the Driver Program of the Spark framework, the differences computed on all data nodes, and computing the new sphere center a^new.

After step S5, the method further comprises: according to the updated sphere-center distance of each vector and the sphere radius R, removing the bound sample vectors, retaining all non-bound samples, and returning to step S1.

After step S5, the method further comprises: stopping training when the Lagrange multipliers of all sample vectors in the training sample set satisfy the KKT conditions, or when the target loss function loss of the sample vectors V₁ and V₂ is less than a preset threshold.

In the method provided by the invention, the Spark distributed computing framework distributes the compute-intensive work of a single machine to the worker nodes, and the kernel matrix K_ij, which would otherwise occupy a large amount of storage on a single machine, is distributed across the data nodes. As the data grows, the system can be scaled out horizontally; because the nodes compute independently, the computation time does not increase significantly, and storage is no longer limited by a single machine. In addition, the incremental calculation avoids the full recomputation that would otherwise be needed in every iteration, which saves a large number of computation cycles and speeds up the solution process.

Description of the drawings

Fig. 1 is a flowchart of a support vector machine training method based on the Spark framework according to one embodiment of the invention;

Fig. 2 is a structural diagram of the Spark framework in a support vector machine training method based on the Spark framework according to one embodiment of the invention;

Fig. 3 is a flowchart of a support vector machine training method based on the Spark framework according to another embodiment of the invention.

Detailed description of the embodiments

The specific embodiments of the invention are described in further detail below with reference to the drawings and examples. The following examples illustrate the invention and do not limit its scope.

Referring to Fig. 1, which is a flowchart of a support vector machine training method based on the Spark framework according to one embodiment of the invention, the method comprises:

S1, obtaining a training sample set, all sample vectors of which are stored in a distributed manner on the data nodes of the Spark framework.

Specifically, after the training sample set is received, it is placed in distributed storage, so that the sample vectors in the sample set are stored across the data nodes of the Spark framework.

As shown in Fig. 2, Apache Spark is a fast, general-purpose engine designed for large-scale distributed in-memory data processing. It is an open-source, Hadoop MapReduce-like general parallel framework developed by the AMP Lab at the University of California, Berkeley. Because Spark can keep the intermediate output of a job in memory and therefore no longer needs to read and write HDFS, it is better suited to iterative algorithms such as those used in data mining and machine learning. Many parallel algorithms have been implemented on Spark.

In this way, the sample vectors of the training set are stored across multiple data nodes, and the system can be scaled out horizontally as the data grows.

S2, extracting from the training sample set the sample vector V₂ that most severely violates the KKT conditions, and selecting the sample vector V₁ whose sphere-center distance differs most from that of V₂.

Specifically, the iterative optimization uses the SMO algorithm, that is, two sample vectors are selected and optimized at a time. The two sample vectors being optimized are denoted V₁ and V₂. The chosen stopping condition determines how to select the points that contribute most to the convergence of the algorithm; for example, when the feasibility gap is monitored, the points that most severely violate the KKT conditions are optimized first and taken as V₂. According to the KKT conditions, the iterative relation between V₁ and V₂ can be determined as:

λ₁ = α₁ + α₂ − λ₂

where K is the kernel function, α is the Lagrange multiplier, and d² is the sphere-center distance.

To make the update step of each optimization as large as possible, the corresponding expression must be maximized, which determines V₁: the sample vector whose sphere-center distance differs most from that of V₂.
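The selection of V₂ can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes the standard SVDD KKT conditions (α = 0 implies d² ≤ R², 0 < α < C implies d² = R², α = C implies d² ≥ R²), and the function names, the `eps` tolerance and the `visited` set used for sampling without replacement are hypothetical.

```python
def kkt_violation(alpha, d2, r2, c, eps=1e-8):
    """Size of the KKT violation of one sample (0.0 when the conditions hold).

    Assumed conditions (standard SVDD): alpha == 0 implies d2 <= r2,
    0 < alpha < c implies d2 == r2, and alpha == c implies d2 >= r2.
    """
    violation = 0.0
    if alpha < c - eps:   # multiplier may still grow, so the point must not lie outside
        violation = max(violation, d2 - r2)
    if alpha > eps:       # multiplier may still shrink, so the point must not lie inside
        violation = max(violation, r2 - d2)
    return violation


def pick_v2(samples, r2, c, visited=frozenset()):
    """Return the id of the unvisited sample with the largest violation, or None.

    `samples` maps a sample id to (alpha, d2); `visited` holds ids already drawn
    in the current sweep, which gives sampling without replacement.
    """
    best_id, best_violation = None, 0.0
    for sample_id, (alpha, d2) in samples.items():
        if sample_id in visited:
            continue
        violation = kkt_violation(alpha, d2, r2, c)
        if violation > best_violation:
            best_id, best_violation = sample_id, violation
    return best_id


# Hypothetical state: id -> (alpha, d2), with R^2 = 1.0 and C = 1.0.
samples = {"a": (0.0, 1.5), "b": (0.5, 0.2), "c": (1.0, 2.0)}
first = pick_v2(samples, r2=1.0, c=1.0)
second = pick_v2(samples, r2=1.0, c=1.0, visited={first})
```

Returning `None` when no unvisited sample violates the conditions gives the caller a natural signal that the current sweep is finished.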

S3, performing an iterative optimization calculation on the sample vectors V₁ and V₂ to obtain the updated sample vectors V₁^new and V₂^new.

S4, broadcasting the updated sample vectors V₁^new and V₂^new to the multiple Spark data nodes, computing on each data node the difference produced by the change in V₁ and V₂, and computing the updated sphere center a^new from the differences computed on the data nodes.

Specifically, the iterative optimization uses the SMO algorithm. For the minimum enclosing sphere model of the one-class SVM, the objective function is:

s.t. ||Φ(x_i) − a||² ≤ R² + ζ_i

ζ_i ≥ 0

where R is the sphere radius, a is the sphere center, and ζ_i are the slack variables.

Solving the corresponding quadratic programming problem yields the sphere center and the radius.
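The objective and the quadratic program are not reproduced in this text. As a reference point only, the standard SVDD formulation (an assumption here, with C the penalty factor and ζ_i the slack variables named in the description) has the following primal and dual forms:

```latex
\min_{R,\,a,\,\zeta} \; R^2 + C \sum_i \zeta_i
\quad \text{s.t.} \quad \lVert \Phi(x_i) - a \rVert^2 \le R^2 + \zeta_i, \qquad \zeta_i \ge 0

\max_{\alpha} \; \sum_i \alpha_i K_{ii} - \sum_i \sum_j \alpha_i \alpha_j K_{ij}
\quad \text{s.t.} \quad \sum_i \alpha_i = 1, \qquad 0 \le \alpha_i \le C
```

In this form the multipliers sum to 1, which is consistent with the initial value α_i = 1/N used in step S1.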

All parameters are updated according to the steps of Fig. 3. The updated parameters include the Lagrange multipliers α of V₁ and V₂, the sphere center a, the sphere-center distance d² of each sample vector, and the sphere radius R. Specifically, the Lagrange multipliers α of V₁ and V₂ are first updated according to the following formula:

λ₁ = α₁ + α₂ − λ₂

After V₁ and V₂ have been updated, the updated sample vectors V₁^new and V₂^new are obtained. V₁, V₂, V₁^new, V₂^new and the original sphere center a are broadcast to each Spark data node, and the sphere center a is updated. The update formula is:

$$a^2 = \sum_i \sum_j \alpha_i \alpha_j K_{ij}$$

where α_i and α_j are the Lagrange multipliers of any two sample vectors in the training sample set and K_ij is the kernel function. Because only the parameters α of the sample vectors V₁ and V₂ change, only the terms involving V₁ and V₂ change, so a can be computed by a difference. The specific formula is as follows:

where a^old is the original sphere center, a^new is the updated sphere center, and K is the kernel function. The difference calculation is performed in a distributed manner on each Spark data partition, and the results are accumulated in the Spark Driver Program.
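The difference formula is not reproduced in this text, so the following is only a sketch of one way such an incremental update can work. It assumes a² = Σ_i Σ_j α_i α_j K_ij with a symmetric kernel, so that changing α₁ and α₂ by Δ₁ and Δ₂ gives a²_new = a²_old + 2 Σ_j α_j^old (Δ₁ K_1j + Δ₂ K_2j) + Δ₁² K₁₁ + 2 Δ₁ Δ₂ K₁₂ + Δ₂² K₂₂. Plain Python lists stand in for Spark partitions, and all names and the Gaussian kernel parameter are hypothetical.

```python
import math


def rbf(x, y, gamma=0.5):
    """Gaussian kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(x, y)))


def full_a2(partitions):
    """Reference value: a^2 = sum_i sum_j alpha_i * alpha_j * K_ij over all samples."""
    rows = [row for part in partitions for row in part]
    return sum(ai * aj * rbf(xi, xj) for xi, ai in rows for xj, aj in rows)


def partial_cross_term(partition, x1, x2, delta1, delta2):
    """Data-node side: local part of sum_j alpha_j_old * (delta1*K_1j + delta2*K_2j)."""
    return sum(a * (delta1 * rbf(x1, x) + delta2 * rbf(x2, x)) for x, a in partition)


def update_a2(a2_old, partials, x1, x2, delta1, delta2):
    """Driver side: accumulate the partial sums and add the V1/V2 self terms."""
    self_term = (delta1 ** 2 * rbf(x1, x1) + 2.0 * delta1 * delta2 * rbf(x1, x2)
                 + delta2 ** 2 * rbf(x2, x2))
    return a2_old + 2.0 * sum(partials) + self_term


# Two "partitions" of (vector, alpha) rows; alpha starts at 1/N with N = 4.
partitions = [[((0.0, 0.0), 0.25), ((1.0, 0.0), 0.25)],
              [((0.0, 2.0), 0.25), ((3.0, 1.0), 0.25)]]
x1, x2, delta1, delta2 = (0.0, 0.0), (3.0, 1.0), 0.1, -0.1  # the two deltas sum to zero

a2_old = full_a2(partitions)
partials = [partial_cross_term(p, x1, x2, delta1, delta2) for p in partitions]
a2_new = update_a2(a2_old, partials, x1, x2, delta1, delta2)

# The same pair update applied to the stored multipliers, for comparison.
updated = [[(x, a + (delta1 if x == x1 else delta2 if x == x2 else 0.0)) for x, a in p]
           for p in partitions]
```

Each data node evaluates only two kernel values per local sample, against V₁ and V₂, instead of the full double sum, and the driver receives one number per partition.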

Once the new sphere-center parameter is available, the sphere-center distance of each sample vector can be updated by applying the difference formula:

The formula relates the new sphere-center distance of each sample vector to its original sphere-center distance, where a is the sphere center and K is the kernel function. This step is computed in a distributed manner on the Spark data nodes.
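As with the sphere center, the distance formula is missing from this text. A sketch under the assumption d² = K(x, x) − 2 Σ_j α_j K(x, x_j) + a² gives the incremental form d²_new = d²_old − 2 (Δ₁ K(x, x₁) + Δ₂ K(x, x₂)) + (a²_new − a²_old). The helper names and kernel parameter below are hypothetical; in this demonstration a²_new is recomputed in full purely to check the result, whereas in the method it would come from the preceding sphere-center update.

```python
import math


def rbf(x, y, gamma=0.5):
    """Gaussian kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(x, y)))


def a2_of(rows):
    """a^2 = sum_j sum_k alpha_j * alpha_k * K_jk."""
    return sum(aj * ak * rbf(xj, xk) for xj, aj in rows for xk, ak in rows)


def full_d2(x, rows):
    """Reference value: d^2 = K(x, x) - 2 * sum_j alpha_j * K(x, x_j) + a^2."""
    return rbf(x, x) - 2.0 * sum(aj * rbf(x, xj) for xj, aj in rows) + a2_of(rows)


def update_d2(d2_old, x, x1, x2, delta1, delta2, a2_old, a2_new):
    """Incremental update that touches only the kernel values against V1 and V2."""
    return d2_old - 2.0 * (delta1 * rbf(x, x1) + delta2 * rbf(x, x2)) + (a2_new - a2_old)


rows_old = [((0.0, 0.0), 0.25), ((1.0, 0.0), 0.25), ((0.0, 2.0), 0.25), ((3.0, 1.0), 0.25)]
x1, x2, delta1, delta2 = (0.0, 0.0), (3.0, 1.0), 0.1, -0.1
rows_new = [(x, a + (delta1 if x == x1 else delta2 if x == x2 else 0.0)) for x, a in rows_old]

a2_old, a2_new = a2_of(rows_old), a2_of(rows_new)
d2_new = {x: update_d2(full_d2(x, rows_old), x, x1, x2, delta1, delta2, a2_old, a2_new)
          for x, _ in rows_old}
```

Every sample needs only its own stored d² and the broadcast values for V₁ and V₂, so each data node can apply the update locally without any exchange between nodes.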

Finally, the sphere radius R is updated. Specifically, when V₁ and V₂ are both non-bound samples, that is, when ξ < α_i < C, where ξ is a small number close to 0 and C is the penalty factor, the update formula for R is:

When V₁ and V₂ are both bound samples, that is, when α_i ≤ ξ or α_i ≥ C, the update formula is:

In this way, the Spark distributed computing framework distributes the compute-intensive work of a single machine to the worker nodes, and the kernel matrix K_ij, which would otherwise occupy a large amount of storage on a single machine, is distributed across the worker nodes. As the data grows, the system can be scaled out horizontally; because the nodes compute independently, the computation time does not increase significantly, and storage is no longer limited by a single machine. In addition, the incremental calculation avoids the full recomputation that would otherwise be needed in every iteration, which saves a large number of computation cycles and speeds up the solution process.

On the basis of the above embodiment, step S1 further comprises: on each data node, reading the sample vectors of the training samples stored on that data node, and generating a unique data identifier for each sample vector.

Specifically, before the optimization starts, once all sample vectors of the training sample set have been stored across the data nodes of the Spark framework, each data node reads the data in its local data blocks, and a non-repeating random number is generated for each sample vector to form the unique data identifier id. Because sample vectors in the training sample set may have identical parameters, the unique data identifier id is used to distinguish all sample vectors.

Preferably, the id may be composed of the partition sequence number and the local timestamp. The unique id can be used to distinguish sample vectors that have the same memory address on different Executors.
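A minimal sketch of such an identifier, assuming a string of the form partition index, local timestamp and a per-partition counter. The counter is an addition not found in the text, included because two samples read within the same clock tick would otherwise receive the same timestamp; the function name is hypothetical.

```python
import itertools
import time


def make_id_generator(partition_index):
    """Return a function producing ids of the form '<partition>-<timestamp_ns>-<counter>'."""
    counter = itertools.count()

    def next_id():
        # time.time_ns() reads the node-local clock; the counter is an added safeguard
        # (not in the text) for samples read within the same clock tick.
        return f"{partition_index}-{time.time_ns()}-{next(counter)}"

    return next_id


# One generator per partition; each sample read on that partition gets one id.
gen0, gen1 = make_id_generator(0), make_id_generator(1)
ids = [gen0() for _ in range(1000)] + [gen1() for _ in range(1000)]
```

The partition prefix keeps ids from different partitions apart even when their local clocks agree, and the counter keeps ids within one partition apart.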

On the basis of the above embodiments, step S1 further comprises initializing the calculation parameters required by the iterative optimization calculation, wherein the calculation parameters include the Lagrange multipliers α of all sample vectors, the sphere center a, and the sphere-center distance d² of each sample vector.

Preferably, initializing the calculation parameters of the iterative optimization calculation specifically comprises:

initializing the square of the sphere radius, R², so that R² = 0;

initializing the sphere center according to the following formula:

$$a^2 = \sum_i \sum_j \alpha_i \alpha_j K_{ij}$$

where a is the sphere center, α_i and α_j are the Lagrange multipliers of any two sample vectors in the training sample set, and K_ij is the kernel function.

Specifically, before the iterative optimization calculation, the iteration parameters of the support vector machine are initialized. First, the Lagrange multiplier α of each sample vector is initialized; preferably, the initial value is set to 1/N, where N is the number of all sample vectors in the training sample set. This process is computed in a distributed manner, separately on each data node.

Then the square of the sphere radius, R², is initialized; preferably, it is set to 0, that is, R² = 0.

Then the sphere center is initialized according to the following formula:

$$a^2 = \sum_i \sum_j \alpha_i \alpha_j K_{ij}$$

where a is the sphere center, α_i and α_j are the Lagrange multipliers of any two sample vectors in the training sample set, and K_ij is the Gaussian kernel function.

Finally, the distance d² from each sample vector to the sphere center a is calculated. This step requires a full computation over the data set, and the results are stored in a HashMap keyed by sample vector.
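The initialization can be sketched on a single machine as follows. This is an illustration under assumptions: the distance expression d² = K_ii − 2 Σ_j α_j K_ij + a² is the standard SVDD form rather than a formula taken from this text, the Gaussian kernel parameter `gamma` is arbitrary, and the results are keyed by a sample index standing in for the unique id.

```python
import math


def rbf(x, y, gamma=0.5):
    """Gaussian kernel K(x, y) = exp(-gamma * ||x - y||^2)."""
    return math.exp(-gamma * sum((p - q) ** 2 for p, q in zip(x, y)))


def initialize(points, gamma=0.5):
    """Return (alpha, a2, d2, r2) for the first iteration.

    alpha and d2 are dicts keyed by sample index, mirroring the keyed map
    described in the text.
    """
    n = len(points)
    alpha = {i: 1.0 / n for i in range(n)}                      # alpha_i = 1/N
    # k_alpha[i] = sum_j alpha_j * K_ij; this is the full pass over the data set.
    k_alpha = {i: sum(alpha[j] * rbf(points[i], points[j], gamma) for j in range(n))
               for i in range(n)}
    a2 = sum(alpha[i] * k_alpha[i] for i in range(n))           # a^2 = sum_ij alpha_i alpha_j K_ij
    d2 = {i: rbf(points[i], points[i], gamma) - 2.0 * k_alpha[i] + a2 for i in range(n)}
    r2 = 0.0                                                    # R^2 = 0
    return alpha, a2, d2, r2


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (3.0, 1.0)]
alpha, a2, d2, r2 = initialize(points)
```

Keying the results by an index or id rather than by the vector itself avoids collisions when two samples have identical coordinates, which is the situation the unique data identifier is meant to handle.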

On the basis of the above embodiment, in step S2, the sample vector V₂ that most severely violates the KKT conditions is extracted from the training sample set by sampling without replacement.

Specifically, when the sample vector V₂ that most severely violates the KKT conditions is extracted, sampling without replacement is used, so that all samples are traversed within one full iteration cycle.

On the basis of the above embodiments, the selection in step S2 of the sample vector V₁ whose sphere-center distance differs most from that of V₂ is performed as follows.

Specifically, as shown in Fig. 3, after the sample vector V₂ has been extracted, each Spark data node finds its local sample vector whose sphere-center distance differs most from that of V₂; the Driver Program of the Spark framework then selects, among these candidates, the sample vector with the globally largest difference as V₁.
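The two-level selection described above can be sketched with plain Python lists standing in for Spark partitions. The function names are hypothetical; on a real cluster the first function would correspond to a per-partition pass and the second to a reduction on the driver.

```python
def local_candidate(partition, d2_v2, v2_id):
    """Data-node side: best local (gap, sample_id), or None for an empty partition."""
    best = None
    for sample_id, d2 in partition:
        if sample_id == v2_id:      # V2 cannot be paired with itself
            continue
        candidate = (abs(d2 - d2_v2), sample_id)
        if best is None or candidate > best:
            best = candidate
    return best


def pick_v1(partitions, d2_v2, v2_id):
    """Driver side: choose the global maximum among the per-partition candidates."""
    candidates = [c for c in (local_candidate(p, d2_v2, v2_id) for p in partitions)
                  if c is not None]
    return max(candidates)[1] if candidates else None


# Each partition holds (sample_id, sphere-center distance d^2) pairs; V2 is "a".
partitions = [[("a", 1.0), ("b", 0.2)], [("c", 1.9), ("d", 0.9)], []]
v1 = pick_v1(partitions, d2_v2=1.0, v2_id="a")
```

Only one candidate per partition travels to the driver, so the amount of data collected there grows with the number of partitions rather than with the number of samples.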

On the basis of the above embodiments, in step S4, computing the updated sphere center a^new specifically comprises: accumulating, in the Driver Program of the Spark framework, the differences computed on all data nodes, and computing the new sphere center a^new.

Specifically, as shown in Fig. 3, after the iterative optimization calculation on the sample vectors V₁ and V₂ has produced the updated sample vectors V₁^new and V₂^new, the updated V₁ and V₂ are broadcast to each Spark data node. Each data node then computes the difference produced by the change in V₁ and V₂, and the Driver Program of the Spark framework accumulates the differences computed on all data nodes to obtain the new sphere center a^new.

On the basis of the above embodiments, the method further comprises, after step S5: according to the updated sphere-center distance of each vector and the sphere radius R, removing the bound sample vectors, retaining all non-bound samples, and returning to step S1.

Specifically, after one iterative optimization calculation is complete, the next pair V₁ and V₂ is selected and the next iteration is carried out. A heuristic selection method is used: non-bound samples are selected for calculation first, and bound samples second. Preferably, all bound sample vectors may be removed, since the subsequent discriminant calculation requires the sample vectors of the non-bound samples.
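The non-bound test ξ < α < C and the resulting selection order can be sketched as follows. The function names and the default value of `xi` are hypothetical; the text only states that ξ is a small number close to 0.

```python
def is_non_bound(alpha, c, xi=1e-8):
    """Non-bound sample: xi < alpha < c, with xi a small number close to 0."""
    return xi < alpha < c


def keep_non_bound(alphas, c, xi=1e-8):
    """Drop bound samples (alpha <= xi or alpha >= c) and keep the rest."""
    return {sid: a for sid, a in alphas.items() if is_non_bound(a, c, xi)}


def selection_order(alphas, c, xi=1e-8):
    """Sample ids with non-bound samples first and bound samples second."""
    # sorted() is stable, so the original order is kept within each group.
    return sorted(alphas, key=lambda sid: not is_non_bound(alphas[sid], c, xi))


alphas = {"s1": 0.0, "s2": 0.3, "s3": 0.5, "s4": 0.2}
working_set = keep_non_bound(alphas, c=0.5)
order = selection_order(alphas, c=0.5)
```

`keep_non_bound` corresponds to the preferred variant that removes bound samples entirely, while `selection_order` corresponds to the heuristic that merely visits them later.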

On the basis of the above embodiments, training is stopped when the Lagrange multipliers of all sample vectors in the training sample set satisfy the KKT conditions, or when the target loss function loss of the sample vectors V₁ and V₂ is less than a preset threshold.

Specifically, when the Lagrange multipliers α of all points satisfy the KKT conditions, or when the optimization target loss function loss falls below a preset threshold after a certain number of iterations, the optimization is considered to have approximately reached the KKT conditions, and training can be stopped.

In this way, a target loss function loss below the preset threshold indicates that further optimization would bring little improvement; training is stopped at this point to reduce the overall amount of computation.

Finally, the methods described in this application are merely preferred embodiments and are not intended to limit the scope of protection of the invention. Any modification, equivalent replacement or improvement made within the spirit and principles of the invention shall fall within its scope of protection.

Claims

1. A support vector machine training method based on the Spark framework, characterized by comprising:

S1, obtaining a training sample set, all sample vectors of which are stored in a distributed manner on the data nodes of the Spark framework;

S2, extracting from the training sample set the sample vector V₂ that most severely violates the KKT conditions, and selecting the sample vector V₁ whose sphere-center distance differs most from that of V₂;

S3, performing an iterative optimization calculation on the sample vectors V₁ and V₂ to obtain the updated sample vectors V₁^new and V₂^new;

S4, broadcasting the updated sample vectors V₁^new and V₂^new to the Spark data nodes, computing on each data node the difference produced by the change in V₁ and V₂, and computing the updated sphere center a^new from the differences computed on the data nodes;

S5, updating, according to the updated sphere center a^new, the sphere-center distance of each sample vector on the Spark data nodes, and at the same time updating the sphere radius R.

2. The method according to claim 1, characterized in that step S1 further comprises: on each data node, reading the sample vectors of the training samples stored on that data node, and generating a unique data identifier for each sample vector.

3. The method according to claim 2, characterized in that the unique data identifier is composed of the partition number of the data node and the local timestamp of the data node.

4. The method according to claim 2, characterized in that step S1 further comprises initializing the calculation parameters required by the iterative optimization calculation;

wherein the calculation parameters include the Lagrange multipliers α of all sample vectors, the sphere center a, and the sphere-center distance d² of each sample vector.

5. The method according to claim 4, characterized in that initializing the calculation parameters of the iterative optimization calculation specifically comprises:

initializing the Lagrange multiplier α of every sample vector to 1/N;

where N is the number of sample vectors in the training sample set;

initializing the square of the sphere radius, R², so that R² = 0;

initializing the sphere center according to the following formula:

$$a^2 = \sum_i \sum_j \alpha_i \cdot \alpha_j \cdot K_{ij}$$

6. The method according to claim 1, characterized in that, in step S2, the sample vector V₂ that most severely violates the KKT conditions is extracted from the training sample set by sampling without replacement.

7. The method according to claim 1, characterized in that selecting, in step S2, the sample vector V₁ whose sphere-center distance differs most from that of V₂ specifically comprises:

for each data node, obtaining the sample vector on that data node whose sphere-center distance differs most from that of the sample vector V₂;

in the Driver Program of the Spark framework, obtaining, from the per-node sample vectors whose sphere-center distance differs most from that of V₂, the sample vector V₁ whose sphere-center distance differs most from that of V₂ overall.

8. The method according to claim 1, characterized in that, in step S4, computing the updated sphere center a^new specifically comprises: accumulating, in the Driver Program of the Spark framework, the differences computed on all data nodes, and computing the new sphere center a^new.

9. The method according to claim 1, characterized in that the method further comprises, after step S5: according to the updated sphere-center distance of each vector and the sphere radius R, removing the bound sample vectors, retaining all non-bound samples, and returning to step S1.

10. The method according to claim 8, characterized in that the method further comprises, after step S5: stopping training when the Lagrange multipliers of all sample vectors in the training sample set satisfy the KKT conditions, or when the target loss function loss of the sample vectors V₁ and V₂ is less than a preset threshold.