CN1588342A - Cross merge method for reducing support vector and training time - Google Patents


Info

Publication number
CN1588342A
Authority
CN
China
Prior art keywords
training
support vector
sets
training set
sample
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN 200410053659
Other languages
Chinese (zh)
Other versions
CN100353355C (en)
Inventor
文益民 (Yimin Wen)
吕宝粮 (Bao-Liang Lu)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai Jiaotong University
Original Assignee
Shanghai Jiaotong University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Jiaotong University filed Critical Shanghai Jiaotong University
Priority to CNB200410053659XA priority Critical patent/CN100353355C/en
Publication of CN1588342A publication Critical patent/CN1588342A/en
Application granted granted Critical
Publication of CN100353355C publication Critical patent/CN100353355C/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The cross-merge method for reducing support vectors and training time, in the field of intelligent information processing technology, comprises three steps. The training-set decomposition step classifies the training sample set, extracts the samples, decomposes each class's sample set into two subsets, and combines the subsets to obtain four training sets. The hierarchical data-screening step based on support vectors processes the four training sets with the support vector machine method to obtain four support-vector sets, merges these four sets according to the cross-merge rule into two groups serving as two training sets, processes in parallel the two classification problems represented by these two training sets with the support vector machine method to obtain two support-vector sets, and merges these two sets to obtain the final training set. Training a support vector machine with the final training set yields the final classifier.

Description

Cross-merge method for reducing support vectors and training time
Technical field
The present invention relates to a hierarchical parallel machine-learning method based on the nature of support vectors, and specifically to a cross-merge method for reducing support vectors and training time. It belongs to the field of intelligent information processing technology.
Background technology
With the development of science and technology, mankind has accumulated massive amounts of data in every field, and these data keep growing at ever higher speed. The analysis and understanding of these data are of great significance to the further development of human society, and may even lead to important discoveries about nature. Meanwhile, with statistical learning theory as its solid theoretical foundation, the support vector machine method has become a widely used pattern classification method. There are two kinds of methods for solving large-scale pattern classification problems with support vector machines. Incremental learning methods divide a large-scale problem into several subproblems and process the subproblems serially; the working-set methods for training support vector machines belong to this class. A major advantage of these methods is that their memory demand is only linear, i.e., the required memory is proportional to the number of training samples. When handling large-scale pattern classification problems, however, incremental learning methods suffer from too many iterations and long training times; the training time complexity of such methods is normally around O(N^2). Parallel learning methods decompose the original problem into several subproblems according to the divide-and-conquer principle, process the subproblems in parallel, and integrate the results afterwards. Their advantage is that, being built on parallel computation, they can shorten the training time and have good modifiability and scalability; however, the results of all submodules must be kept after training finishes, which increases the number of support vectors.
The support vector is the key concept of the support vector machine method. A search of the prior art literature shows that, concerning the nature of support vectors, Syed, N.A. demonstrated through extensive numerical simulations in the document (Incremental Learning with Support Vector Machines. In: Proceedings of the Workshop on Support Vector Machines at the International Joint Conference on Artificial Intelligence. Stockholm, Sweden, 1999) that the support vector set contains the classification information of the training set and is essential, in the sense that the number of support vectors cannot be reduced by more than about 10% of their total; however, how far the number of support vectors can actually be reduced was not discussed further. So far no document identical with the present invention has been reported.
Summary of the invention
The objective of the invention is to address the long training time of existing support-vector-machine methods on large-scale problems by providing a cross-merge method for reducing support vectors and training time, which shortens learning time while reducing the number of support vectors. In the training-sample screening process the invention adopts a cross-merge combination method to guarantee the consistency of the finally obtained training sample set with the original training sample set.
The invention is achieved by the following technical solution. The inventive method comprises three steps: training-set decomposition, hierarchical data screening based on support vectors, and generation of the final classifier.
1) Training-set decomposition: after the training sample set containing two classes of samples is separated by class, each class's sample set is decomposed into two subsets according to a predefined decomposition rate r; subsets from different classes are then combined, yielding four training sets. The two-class classification problems represented by these four training sets are all smaller in scale than the original training sample set.
2) Hierarchical data screening based on support vectors: the four two-class problems are processed in parallel with the support vector machine method, yielding four support-vector sets. According to the cross-merge rule, the four support-vector sets are merged pairwise into two combinations, giving two training sets. The two classification problems represented by these two training sets are processed in parallel with the support vector machine method, yielding two support-vector sets, which are merged to produce one training set: the final training set. Because the support-vector set of a training set contains the classification information of that training set, this process progressively screens out non-support vectors, reducing the number of training samples and hence the training time. Through two layers of data screening the invention finally obtains a training set that is equivalent to the original training set but contains fewer samples.
3) Generation of the final classifier: the final training set obtained by hierarchical screening is used to train a support vector machine, producing the final classifier.
The inventive method is further described below:
1. Training-set decomposition
Suppose that in the original two-class classification problem the samples belonging to class C1 are P = {Xi : i = 1, ..., Lm} and the samples belonging to class C2 are N = {Xi : i = 1, ..., Ln}, where Xi denotes a sample and Lm and Ln denote the numbers of samples in the two classes. The whole training sample set can then be written T = P ∪ N. According to a predetermined decomposition rate r (0 < r ≤ 0.5), the original sets P and N are each decomposed into two subsets:

P1 = {Xi : i = 1, ..., LP1},  P2 = {Xi : i = LP1+1, ..., Lm},  N1 = {Xi : i = 1, ..., LN1},  N2 = {Xi : i = LN1+1, ..., Ln}    (1)

where LP1 and LN1 denote the numbers of samples in P1 and N1 respectively (their defining expressions in terms of r appear only as an image in the original document and are not reproduced here). The original two-class classification problem T can thus be decomposed into the following four smaller two-class classification problems:

T1 = P1 ∪ N1,  T2 = P2 ∪ N2,  T3 = P1 ∪ N2,  T4 = P2 ∪ N1    (2)
If these two-class classification problems are still too large, each of them can be further decomposed by the above method into four two-class classification problems of smaller scale.
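The decomposition of step 1 can be sketched as follows. This is a minimal illustration, assuming the decomposition rate r simply takes the first ceil(r · L) samples of each class as the first subset; the patent defines LP1 and LN1 through an expression not reproduced in this text, so that split rule is an assumption, and the sample names below are purely illustrative.

```python
# Sketch of training-set decomposition (eq. (1)-(2)); split rule is assumed.
import math

def decompose(P, N, r):
    """Split class samples P and N by rate r and cross-combine them into
    four smaller two-class training sets T1..T4."""
    assert 0 < r <= 0.5
    lp1 = math.ceil(r * len(P))   # assumed definition of LP1
    ln1 = math.ceil(r * len(N))   # assumed definition of LN1
    P1, P2 = P[:lp1], P[lp1:]
    N1, N2 = N[:ln1], N[ln1:]
    # Combine subsets from *different* classes, as in eq. (2):
    T1 = (P1, N1)
    T2 = (P2, N2)
    T3 = (P1, N2)
    T4 = (P2, N1)
    return T1, T2, T3, T4

# Example: 6 positive and 4 negative samples, r = 0.5.
P = [f"p{i}" for i in range(6)]
N = [f"n{i}" for i in range(4)]
T1, T2, T3, T4 = decompose(P, N, 0.5)
print(T1)  # (['p0', 'p1', 'p2'], ['n0', 'n1'])
```

Note that every original sample appears in exactly two of the four subproblems, which is what lets the later merges recover the full classification information.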
2. Hierarchical data screening based on support vectors
Using the standard support vector machine method, four support vector machines are obtained by parallel training on these four smaller two-class classification problems. Their support-vector sets are SV1, SV2, SV3 and SV4 respectively. Following the cross-merge rule, the support-vector sets SV1 and SV2 of T1 and T2 are merged into T12, and the support-vector sets SV3 and SV4 of T3 and T4 are merged into T34. The so-called cross-merge rule avoids repeating subsets of the same class within T1 and T2, or within T3 and T4, and thus avoids artificially causing an imbalance of the training data in T12 and T34 and a loss of classification information:

T12 = SV1 ∪ SV2,  T34 = SV3 ∪ SV4    (3)

Because a support-vector set contains the classification information of its training set, T12 and T34 preserve the information of the original training sample set from two different angles, avoiding the loss of classification information that the data division could otherwise cause. At the same time, in passing from T1 and T2 to T12, and from T3 and T4 to T34, the non-support-vector samples are screened out. Taking T12 and T34 as training sets, two support vector machines are obtained by parallel processing. Their support-vector sets are SV12 and SV34 respectively, and the two are merged:

T_final = SV12 ∪ SV34    (4)

This yields the final training set. T_final therefore contains all the classification information of the training set T. Since only support vectors are retained while non-support vectors are progressively screened out in the above process, T_final keeps relatively few training data compared with the original training set T.
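The two-layer screening just described can be sketched end to end as follows. The patent does not prescribe a particular SVM implementation; this sketch uses scikit-learn's SVC as the base trainer, and the toy data, parameter values, and function names are illustrative assumptions. As the description requires, the same kernel parameters are shared by every machine.

```python
# Sketch of the two-layer support-vector screening (step 2); base trainer,
# toy data, and parameters are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC

def support_vector_indices(X, y, **svm_params):
    """Train an SVM and return the local indices of its support vectors."""
    clf = SVC(**svm_params)
    clf.fit(X, y)
    return set(clf.support_.tolist())

rng = np.random.default_rng(0)
# Toy two-class problem: two Gaussian blobs, 40 samples per class.
Xp = rng.normal(loc=+2.0, size=(40, 2))
Xn = rng.normal(loc=-2.0, size=(40, 2))
X = np.vstack([Xp, Xn])
y = np.array([1] * 40 + [0] * 40)

# Global indices of the class subsets P1, P2, N1, N2 (r = 0.5).
P1, P2 = list(range(0, 20)), list(range(20, 40))
N1, N2 = list(range(40, 60)), list(range(60, 80))
params = dict(kernel="rbf", C=1000.0, gamma=0.001)  # shared parameters

# First layer: T1..T4 per the cross-combination of eq. (2).
subsets = [P1 + N1, P2 + N2, P1 + N2, P2 + N1]
svs = []
for idx in subsets:
    idx = np.array(idx)
    local = support_vector_indices(X[idx], y[idx], **params)
    svs.append({int(idx[i]) for i in local})  # map back to global indices
SV1, SV2, SV3, SV4 = svs

# Cross-merge rule, eq. (3).
T12 = SV1 | SV2
T34 = SV3 | SV4

# Second layer: screen T12 and T34 again, then merge, eq. (4).
finals = []
for idx in (sorted(T12), sorted(T34)):
    idx = np.array(idx)
    local = support_vector_indices(X[idx], y[idx], **params)
    finals.append({int(idx[i]) for i in local})
T_final = finals[0] | finals[1]
print(len(T_final), "samples survive screening out of", len(X))
```

In a real deployment the four first-layer trainings (and the two second-layer trainings) would run in parallel; here they run serially for simplicity.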
3. Generation of the final classifier
Using T_final as the new training set, a support vector machine SVM_final is obtained. This support vector machine serves as the final pattern classifier; it uses fewer support vectors, which shortens the recognition time.
The above process can be described as an algorithm:
Given:
training sample set T = P ∪ N and decomposition rate r.
Algorithm:
(1) Decompose P and N according to r, then combine the subsets into four smaller classification problems T1, T2, T3 and T4;
(2) if the problem scales of T1, T2, T3 and T4 satisfy the memory restriction, go to (3); otherwise go to (1);
(3) process T1, T2, T3 and T4 in parallel with the support vector machine method, obtaining the four corresponding support-vector sets SV1, SV2, SV3 and SV4;
(4) combine them into two classification problems T12 and T34 according to the cross-merge principle, and process these in parallel with the support vector machine method to obtain the two support-vector sets SV12 and SV34;
(5) let T_final = SV12 ∪ SV34;
(6) use T_final as the new training set to obtain the final support vector machine, which serves as the pattern classifier in the recognition phase.
The invention ensures that the classification information contained in the final training set obtained after hierarchical screening is consistent with that of the original training set, so that the recognition accuracy of the classifier obtained from the hierarchically screened training samples is consistent with that of the classifier obtained from the original whole training set. Multiple tests of the invention show that the proposed method reduces both the training time and the number of support vectors. A further effect of the invention is that the decomposition method reduces the problem scale without reducing the recognition accuracy of the classifier.
Description of drawings
Fig. 1 is a flow diagram of the inventive method.
Fig. 2 shows the data distribution and decomposition of experiment one of the embodiment of the invention.
Embodiment
The invention is further described below by way of example and with reference to the accompanying drawings:
As shown in Fig. 1, if the problem is a multi-class problem, it must first be converted into two-class problems. The inventive method then comprises the following steps:
First, the training samples are preprocessed by extracting them class by class, so that the samples belonging to each class form one set. This preprocessing can be carried out while the training samples are being collected, which reduces its time complexity. In the two-class case, the training samples are preprocessed into T = P ∪ N, where P and N denote the training sample sets belonging to the two classes respectively.
Second, P and N are decomposed according to the predefined decomposition rate r into P1, P2 and N1, N2 respectively. In Fig. 2, a checkerboard over [0,200] × [0,200] is divided into four quadrants, with all sample points evenly distributed over them. The samples located in [0,100] × [0,100] and [100,200] × [100,200] are positive samples, and the samples in the remaining space are negative samples. Taking the decomposition rate r = 0.5 gives the division shown in Fig. 2. Hierarchical screening is then carried out according to the method shown in Fig. 1 to obtain the final training set T_final. The process of merging SV12 and SV34 into T_final is a de-duplication and union process. To reduce its time complexity, when merging SV12 and SV34 one can take the sequence numbers of the training samples of SV12 and SV34 in the original training set T to form two index sets, take their de-duplicated union, and then fetch the corresponding training samples according to the result, finally forming T_final.
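The index-based de-duplicating merge described above can be sketched as follows; the index values and toy samples are illustrative, not from the patent's experiments.

```python
# Sketch of the de-duplicating merge: each support vector is represented by
# its sequence number in the original training set T, the two index sets are
# unioned (which removes duplicates), and the samples are fetched back.
T = [[0.1 * i, 0.2 * i] for i in range(10)]   # toy original training set
sv12_idx = {0, 3, 5, 7}                        # indices of SV12 in T (toy)
sv34_idx = {3, 4, 7, 9}                        # indices of SV34 in T (toy)

final_idx = sorted(sv12_idx | sv34_idx)        # set union removes overlap
T_final = [T[i] for i in final_idx]            # fetch corresponding samples
print(final_idx)  # [0, 3, 4, 5, 7, 9]
```

Comparing integer indices rather than full sample vectors makes the de-duplication cheap even when the samples are high-dimensional.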
Third, with T_final as the training set, the general support vector machine training method yields the final classifier SVM_final. Note that each support-vector set in Fig. 1 is obtained using identical parameters; for example, when a Gaussian kernel is adopted, the same C and σ must be used throughout.
The classifier SVM_final is then used to recognize the samples to be identified.
The two experimental data sets in this embodiment come from an artificial problem and a practical problem respectively. The experimental platform is a Pentium 4 PC with a 2.4 GHz CPU and 512 MB RAM.
In experiment one, in order to test the robustness of the invention, four different training sets and one common test set were generated at random, forming four two-class problems A1, A2, A3 and A4. Each training set contains 5000 positive and 5000 negative samples, and the test set contains 10000 positive and 10000 negative samples. A Gaussian kernel is adopted, with parameters C = 1000 and σ = 31.62.
Table 1. Experimental data sets of experiment one

         Training                Testing (common)
         Positive   Negative     Positive   Negative
  A1     5000       5000         10000      10000
  A2     5000       5000
  A3     5000       5000
  A4     5000       5000
In experiment two, the text classification data come from the text classification database provided by the Japanese Yomiuri Shimbun. After feature extraction, the dimension of the feature space is 5000. Three classes of data, as shown in Table 2, were extracted from this database. Any two of these classes form one two-class classification problem, giving three two-class problems A5, A6 and A7. The parameters are chosen as σ = 2, C = 64 and r = 0.5.
Table 2. Experimental data sets of experiment two

  Category     Training   Test
  Accidents    34044      8483
  Health       35932      7004
  By-time      33590      7702
To verify the actual effect of the proposed method, the hierarchical-screening support vector machine method proposed by the invention was compared experimentally with the support vector machine method that learns the whole training sample set at once. For convenience, the proposed method is denoted C-SVM (Cascade SVM) and the latter S-SVM (Standard SVM). The experimental results are given in Tables 3 and 4:
Table 3. Experimental results of experiment one

        Method   Train accuracy(%)   Test accuracy(%)   Training time(s)   Number of SVs
  A1    S-SVM    99.84               99.81              46.39              93
        C-SVM    99.78               99.72              13.08              81
  A2    S-SVM    99.89               99.72              38.00              96
        C-SVM    99.85               99.70              15.34              83
  A3    S-SVM    99.93               99.84              32.44              88
        C-SVM    99.86               99.75              13.45              79
  A4    S-SVM    99.89               99.81              35.50              94
        C-SVM    99.92               99.83              19.87              84
  avg   S-SVM    99.89               99.80              38.08              93
        C-SVM    99.85               99.75              15.44              82
Table 4. Experimental results of experiment two

                          Method   A5      A6      A7
  Training accuracy(%)    S-SVM    97.74   97.93   96.67
                          C-SVM    97.73   97.75   96.67
  Test accuracy(%)        S-SVM    95.81   96.01   93.62
                          C-SVM    95.83   96.02   93.62
  Training time(s)        S-SVM    12664   7458    18566
                          C-SVM    9519    4491    15060
  Number of SVs           S-SVM    10933   9445    12750
                          C-SVM    10553   9222    12387
From the above data it can be seen that:
1. the invention reduces the training time while preserving the recognition accuracy of the classifier, and the method is robust with respect to the training samples; 2. the invention reduces the number of support vectors; this does not contradict the 1999 result of Syed N.A., but illustrates to what degree the number of support vectors can actually be reduced, which is of great significance for improving the recognition speed of classifiers and for using classifiers in real-time monitoring.

Claims (3)

1. A cross-merge method for reducing support vectors and training time, characterized in that it comprises three steps: training-set decomposition, hierarchical data screening based on support vectors, and generation of the final classifier:
1) training-set decomposition: after the training sample set containing two classes of samples is separated by class, each class's sample set is decomposed into two subsets according to a predefined decomposition rate r; subsets from different classes are then combined, yielding four training sets, the two-class classification problems represented by these four training sets all being smaller in scale than the original training sample set;
2) hierarchical data screening based on support vectors: the four two-class classification problems are processed in parallel with the support vector machine method to obtain four support-vector sets; according to the cross-merge rule, the four support-vector sets are merged pairwise into two combinations, giving two training sets; the two classification problems represented by these two training sets are processed in parallel with the support vector machine method to obtain two support-vector sets, which are merged to produce one training set, this training set being the final training set;
3) generation of the final classifier: the final training set obtained by hierarchical screening is used to train a support vector machine, producing the final classifier.
2. The cross-merge method for reducing support vectors and training time of claim 1, characterized in that, in step 1), after the training samples are extracted by class, the sample sets of each class in the training set are decomposed according to the predefined decomposition rate r and combined into four two-class classification problems; if each problem is still too large, it is further decomposed by the same decomposition method, the decomposition rate r determining how the computational burden within a layer is apportioned.
3. The cross-merge method for reducing support vectors and training time of claim 1, characterized in that, in step 2), after the support vectors of the four classification problems are extracted, the four support-vector sets are integrated into two classification problems according to the cross-merge rule, each of which embodies the classification information of the original training set from a certain angle; support vectors are then extracted from the two resulting classification problems in parallel, and the two support-vector sets so obtained are merged, integrating the classification information from the two angles, so that the classification information contained in SV12 ∪ SV34 is consistent with that of the original whole training set, and the resulting classifier therefore has consistent recognition accuracy.
CNB200410053659XA 2004-08-12 2004-08-12 Cross merge method for reducing support vector and training time Expired - Fee Related CN100353355C (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNB200410053659XA CN100353355C (en) 2004-08-12 2004-08-12 Cross merge method for reducing support vector and training time


Publications (2)

Publication Number Publication Date
CN1588342A true CN1588342A (en) 2005-03-02
CN100353355C CN100353355C (en) 2007-12-05

Family

ID=34602950

Family Applications (1)

Application Number Title Priority Date Filing Date
CNB200410053659XA Expired - Fee Related CN100353355C (en) 2004-08-12 2004-08-12 Cross merge method for reducing support vector and training time

Country Status (1)

Country Link
CN (1) CN100353355C (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101206667B (en) * 2007-12-06 2010-06-02 上海交通大学 Method for reducing training time and supporting vector
CN107194411A (en) * 2017-04-13 2017-09-22 哈尔滨工程大学 A kind of SVMs parallel method of improved layering cascade

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6760715B1 (en) * 1998-05-01 2004-07-06 Barnhill Technologies Llc Enhancing biological knowledge discovery using multiples support vector machines
US6192360B1 (en) * 1998-06-23 2001-02-20 Microsoft Corporation Methods and apparatus for classifying text and for building a text classifier
AU780050B2 (en) * 1999-05-25 2005-02-24 Health Discovery Corporation Enhancing knowledge discovery from multiple data sets using multiple support vector machines
CN1245696C (en) * 2003-06-13 2006-03-15 北京大学计算机科学技术研究所 Text classification incremental training learning method supporting vector machine by compromising key words


Also Published As

Publication number Publication date
CN100353355C (en) 2007-12-05


Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
C17 Cessation of patent right
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20071205

Termination date: 20100812