CN108596268A - Data classification method - Google Patents

Data classification method

Info

Publication number
CN108596268A
CN108596268A
Authority
CN
China
Prior art keywords
training
sample
ripper
weak classifier
classification
Prior art date
Legal status
Pending
Application number
CN201810415714.7A
Other languages
Chinese (zh)
Inventor
赵寒枫
陈佐
杨胜刚
陈邦道
梅雪松
余湘军
李浩之
王芍
Current Assignee
Hunan Huda Jinke Technology Development Co ltd
Original Assignee
Hunan University
Priority date
Filing date
Publication date
Application filed by Hunan University
Priority to CN201810415714.7A priority Critical patent/CN108596268A/en
Publication of CN108596268A publication Critical patent/CN108596268A/en
Pending legal-status Critical Current


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2413Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on distances to training or reference patterns

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention discloses a data classification method comprising the following steps. S1: obtain the training set samples used to train the classifier, and divide them into equal parts according to the number of iterations required for training, to obtain multiple training subset samples. S2: train the training subset samples with multiple weak classifiers under the Adaboost algorithm; when each weak classifier is trained, part of a training subset sample is selected and combined with part of the misclassified samples produced by the previous weak classifier to form the final training sample, and after training the weak classifiers are combined into the final ADB strong classifier. S3: classify the data to be classified with the trained ADB strong classifier and output the classification result. With this method the training data remain complete during training, multiplicative growth of the training data and overfitting are avoided, and the method offers a simple implementation principle together with high classification efficiency and accuracy.

Description

Data classification method
Technical field
The present invention relates to the technical field of data processing, and more particularly to a data classification method.
Background technology
Data classification maps data into specified categories. Adaboost (Adaptive Boosting) is an adaptive data classification algorithm: different classifiers (weak classifiers) are trained on the same training set and then combined into a stronger final classifier (strong classifier). It is adaptive in the sense that samples misclassified by the previous weak classifier are emphasised, and the reweighted samples are used to train the next base classifier; a new weak classifier is added in each round until a predetermined, sufficiently small error rate is reached or the specified maximum number of iterations is exhausted. The Adaboost algorithm has a strong capacity for iterative learning and can effectively combine and reinforce weak classifiers, which makes it of considerable research value.
The Adaboost algorithm has several advantages: 1. high accuracy; 2. the sub-classifiers (weak classifiers) can be built with a variety of methods, Adaboost only providing the algorithmic framework; 3. when simple classifiers are used, the results are easy to interpret and the weak classifiers have a simple structure. Nevertheless, the classification accuracy and performance of the traditional Adaboost algorithm still leave room for improvement. Traditional Adaboost training always iterates over the full sample set. Specifically, the weight distribution of the training data is first initialised: with N samples, each sample is initially assigned the same weight 1/N, and a weak classifier is trained. During training, if a sample is classified correctly, its weight is lowered when the next training set is constructed; conversely, if a sample is misclassified, its weight is raised. The reweighted samples are then used to train the next classifier, and the whole training process iterates in this way; the weak classifiers obtained in each round are combined into the strong classifier. After each weak classifier is trained, the weight of a weak classifier with a smaller classification error rate is increased so that it plays a larger decisive role in the final classification function, while the weight of a weak classifier with a larger classification error rate is reduced so that it plays a smaller role. In other words, the core idea of Adaboost is to raise the sample weights of misclassified data and lower the sample weights of correctly classified data, so that the two groups each account for 50% of the total weight.
However, the full-sample iterative training used by the Adaboost algorithm suffers from the following problems:
(1) when the full sample set is used for iterative training, the quantity of samples grows multiplicatively after each iteration, which increases the difficulty of training;
(2) when random sampling is used to realise the corresponding weight proportions, part of the samples may be missed, so that training is incomplete;
(3) for samples that are misclassified repeatedly, the original algorithm keeps increasing their weights; if such a sample is an outlier, subsequent classifiers will over-train on the outlier and deviate from the real data distribution.
At present there are mainly two approaches to improving a classification algorithm: improving the algorithm itself, and combining and stacking several algorithms. Improvements to the algorithm itself usually modify certain characteristics of the algorithm, for example adding a discrimination formula, merging in other algorithms, or changing the algorithm structure; however, because machine learning algorithms are generally complex, such improvements are mostly tied to specific application scenarios and lack generality, and modifying the algorithm itself is difficult and tends to produce redundant, complicated algorithms. The combination approach, in contrast, does not disturb the structural characteristics of each algorithm; different algorithms can compensate for one another according to their respective characteristics, which is a great advantage and gives strong applicability. For the Adaboost algorithm, however, the existing combination-based improvements are usually simple combinations that do not address the problems of the Adaboost training process itself described above, such as the multiplicative growth of the sample quantity, the resulting training difficulty, incomplete training, and over-training.
Invention content
The technical problem to be solved by the present invention is as follows: in view of the technical problems in the prior art, the present invention provides a data classification method that has a simple implementation principle and high classification efficiency and accuracy, keeps the data complete during training, and avoids multiplicative growth of the training data and overfitting.
In order to solve the above technical problems, the technical solution proposed by the present invention is:
A data classification method, characterized in that it comprises the steps of:
S1. obtaining the training set samples used to train the classifier, and dividing the obtained training set samples into equal parts according to the number of iterations required for training, to obtain multiple training subset samples;
S2. training the training subset samples with multiple weak classifiers under the Adaboost algorithm, wherein, when each weak classifier is trained, part of a training subset sample is selected and combined with part of the misclassified samples produced by the previous weak classifier to form the final training sample, and the weak classifiers obtained after training are combined into the final ADB strong classifier;
S3. classifying the data to be classified with the trained ADB strong classifier, and outputting the classification result.
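For readability, the weighted vote that step S3 relies on can be expressed in a few lines of code. The following is a minimal sketch, not the patented implementation itself; the function and argument names are illustrative, and binary labels encoded as -1/+1 are assumed.

```python
import numpy as np

def adb_predict(weak_classifiers, weights, X):
    """Classify X with the ADB strong classifier: a weighted vote of the weak classifiers.

    weak_classifiers : list of fitted classifiers, each with a .predict(X) method
    weights          : list of the classifier weights w1, ..., wN obtained during training
    X                : array of samples to classify
    """
    # Weighted sum of the individual weak-classifier votes (labels assumed to be -1/+1).
    votes = sum(w * clf.predict(X) for clf, w in zip(weak_classifiers, weights))
    # The final class is the sign of the weighted vote.
    return np.sign(votes)
```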
As a further improvement of the present invention, Ripper (rule induction learning) weak classifiers are specifically used in step S2, i.e. the Ripper algorithm is used to train each training subset sample, and the trained Ripper weak classifiers are combined into the final Ripper-ADB strong classifier.
As a further improvement of the present invention, in step S2 the training subset sample and the misclassified samples obtained from the previous weak classifier each account for 50% of the final training sample.
As a further improvement of the present invention, when the equal division is performed in step S1, the number of equal parts is taken as the number of iterations required for training; that is, according to the required number of iterations N, the training set sample S is divided into N training subset samples S1, S2, ..., SN.
As a further improvement of the present invention, the number of iterations required for training, and hence the number of equal parts of the training set sample, is specifically not less than 10.
As a further improvement of the present invention, when the training set samples are obtained in step S1, a feature selection step is further performed on each obtained training subset sample in order to reduce the number of training features.
As a further improvement of the present invention, during feature selection, after each RIPPER classification the feature attributes whose number of occurrences is below a specified threshold are deleted, and the screened feature attribute set is classified with RIPPER again, until the accuracy of the generated RIPPER classification model or the number of features reaches the preset requirement and the final training subset sample is output.
As a further improvement of the present invention, the feature selection step specifically performs several rounds of RIPPER classification on the obtained training subset samples; after each RIPPER classification the feature attributes in the training subset sample are screened according to the classification results, and RIPPER classification is performed again on the screened training subset sample, until the required RIPPER classification model is generated and the final training subset sample is output.
As a further improvement of the present invention, the specific steps of the feature selection are:
S11. classifying the current training subset sample with a RIPPER classifier, counting the weight of each feature attribute according to the number of times it appears in the classification results, and sorting the feature attributes by the counted weights to obtain a sorted feature attribute set;
S12. deleting from the sorted feature attribute set those feature attributes whose number of occurrences is below a preset threshold, to obtain an updated feature attribute set;
S13. performing RIPPER classification on the updated feature attribute set obtained in step S12, and judging whether the accuracy of the currently obtained RIPPER classification model or the number of features reaches the preset requirement; if so, obtaining the final RIPPER classification model and outputting the current feature attribute set as the final training subset sample; otherwise, returning to step S11.
As a further improvement of the present invention, the specific steps of step S2 are:
S21. obtaining the first training subset sample and performing classification training with a weak classifier, to obtain weak classifier a1 and the misclassified samples R1, and computing statistics on the classification results of weak classifier a1 to obtain its weight w1;
S22. resampling the misclassified samples Ri obtained by the previous weak classifier ai according to a specified proportion, to obtain an expanded misclassified sample set Rip, and adding the expanded misclassified sample set Rip to the next training subset sample Si+1 to obtain a new training sample subset Si+1R, where i = 1, 2, ..., N and N is the number of training subset samples;
S23. performing classification training with a weak classifier on the new training sample subset Si+1R, to obtain weak classifier ai+1 and the misclassified samples Ri+1, and computing statistics on the classification results of weak classifier ai+1 to obtain its weight wi+1;
S24. repeating steps S22 and S23 until all training subset samples have been trained, to obtain the weak classifiers a1, a2, ..., ai, ..., aN;
S25. weighting the obtained weak classifiers a1, a2, ..., ai, ..., aN with their corresponding weights w1, w2, ..., wi, ..., wN, to obtain the final ADB strong classifier.
Compared with the prior art, the advantages of the present invention are as follows:
1) In the data classification method of the present invention, on the basis of the Adaboost algorithm framework, the training set is divided into equal parts, and during training part of a training subset sample is combined with part of the misclassified samples produced by the previous weak classifier to form the final training sample, which realises a cyclic, interleaved sample training method. Because only part of an equally divided subset is selected in each round, the number of expanded misclassified samples is a fixed value and does not grow multiplicatively; because the whole data set is divided into equal parts, every portion of the data takes part in the stacked training, no sampled data are omitted, and complete training is ensured. At the same time, every round of misclassified-sample expansion not only accumulates training on the erroneous data, but also, thanks to the addition of new samples, avoids over-training on repeatedly misclassified data.
2) The data classification method of the present invention uses Ripper weak classifiers on top of the Adaboost algorithm, so that the characteristics of Ripper classification can be combined to realise the Ripper-ADB combined classification method. This solves problems of the Adaboost training process such as the multiplicative growth of training data and overfitting, gives full play to the advantages of Ripper classification and the Adaboost algorithm, and effectively improves the accuracy and performance of data classification.
Description of the drawings
Fig. 1 is a schematic diagram of the implementation flow of the data classification method of this embodiment.
Fig. 2 is a schematic diagram of the specific flow of training the Ripper-ADB classifier in this embodiment.
Fig. 3 is a schematic comparison with the experimental results of the Ripper algorithm (training set) in the specific embodiment of the invention.
Fig. 4 is a schematic comparison with the experimental results of the Ripper algorithm (test set) in the specific embodiment of the invention.
Fig. 5 is a schematic comparison with the experimental results of two other conventional algorithms (training set) in the specific embodiment of the invention.
Fig. 6 is a schematic comparison with the experimental results of two other conventional algorithms (test set) in the specific embodiment of the invention.
Specific implementation mode
The invention is further described below with reference to the accompanying drawings and specific preferred embodiments, which, however, do not thereby limit the scope of protection of the invention.
As shown in Fig. 1, the data classification method of this embodiment comprises the steps of:
S1. obtaining the training set samples used to train the classifier, and dividing the obtained training set samples into equal parts according to the number of iterations required for training, to obtain multiple training subset samples;
S2. training the training subset samples with multiple weak classifiers under the Adaboost algorithm, wherein, when each weak classifier is trained, part of a training subset sample is selected and combined with part of the misclassified samples produced by the previous weak classifier to form the final training sample, and the weak classifiers obtained after training are combined into the final ADB strong classifier;
S3. classifying the data to be classified with the trained ADB strong classifier, and outputting the classification result.
On the basis of the Adaboost algorithm framework, this embodiment divides the training set into equal parts and, during training, selects only part of a training subset sample and combines it with part of the misclassified samples produced by the previous weak classifier to form the final training sample, which realises a cyclic, interleaved sample training method. Because only part of an equally divided subset is selected in each round, the number of expanded misclassified samples is a fixed value and does not grow multiplicatively; because the whole data set is divided into equal parts, every portion of the data takes part in the stacked training and no sampled data are omitted, which ensures complete training; and every round of misclassified-sample expansion not only accumulates training on the erroneous data but, thanks to the addition of new samples, also avoids over-training on repeatedly misclassified data.
The detailed process by which the Adaboost algorithm trains a classifier is first described as follows:
(1) Initialise the weight distribution of the training data; each training sample is initially given the same weight 1/N:
D_1 = (w_{1,1}, w_{1,2}, ..., w_{1,N}),  w_{1,i} = 1/N,  i = 1, 2, ..., N    (1)
where D_1 denotes the weighted training sample set, w denotes the weight of each training sample, N denotes the total number of samples, and i indexes the samples.
(2) Perform several rounds of iteration, with m = 1, 2, ..., M denoting the iteration round:
a. Learn on the training data set with weight distribution D_m to obtain the base classifier
G_m(x): \chi \to \{-1, +1\}    (2)
where G_m(x) denotes the m-th classifier and \chi denotes the input space of the classifier.
b. Compute the classification error rate of G_m(x) on the training data set:
e_m = P(G_m(x_i) \neq y_i) = \sum_{i=1}^{N} w_{m,i} \, I(G_m(x_i) \neq y_i)    (3)
where P is the error-probability expression, G_m(x_i) denotes the prediction of the m-th classifier for the i-th sample, y_i denotes the true class of the i-th sample, I denotes the indicator of the condition, and w denotes the weight of a misclassified sample.
It follows from the above formula that the error rate e_m of G_m(x) on the training data set is simply the sum of the weights of the samples misclassified by G_m(x).
c. Compute the coefficient of G_m(x), i.e. the weight of the base classifier in the final classifier, where a_m denotes the importance of G_m(x) in the final classifier:
a_m = \frac{1}{2} \ln \frac{1 - e_m}{e_m}    (4)
It follows from the above formula that a_m >= 0 when e_m <= 1/2, and that a_m increases as e_m decreases; that is, the smaller the classification error rate of a base classifier, the larger its role in the final classifier.
d. Update the weight distribution of the training data set to obtain the sample weights for the next round of iteration:
D_{m+1} = (w_{m+1,1}, w_{m+1,2}, ..., w_{m+1,i}, ..., w_{m+1,N}),  w_{m+1,i} = \frac{w_{m,i}}{Z_m} \exp(-a_m y_i G_m(x_i))    (5)
where exp is the exponential function with base e, and Z_m is a normalisation factor that makes D_{m+1} a probability distribution:
Z_m = \sum_{i=1}^{N} w_{m,i} \exp(-a_m y_i G_m(x_i))    (6)
Through this update, the weights of the samples misclassified by the base classifier G_m(x) are increased and the weights of the correctly classified samples are decreased, so that training can "focus" on the samples that are harder to classify.
(3) The weak classifiers are combined according to the following formula:
f(x) = \sum_{m=1}^{M} a_m G_m(x)    (7)
and the final classifier is obtained as follows:
G(x) = sign(f(x)) = sign(\sum_{m=1}^{M} a_m G_m(x))    (8)
where sign denotes the sign function: the result is 1 when the argument is greater than 0, 0 when it equals 0, and -1 when it is less than 0.
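For reference, the per-round computation of formulas (3) to (6) can be written out directly. The following is a short sketch of the standard Adaboost update for labels in {-1, +1}, shown only to make the formulas concrete; the clipping constant is an illustrative safeguard, not part of the algorithm.

```python
import numpy as np

def adaboost_round(weights, y_true, y_pred):
    """One round of the standard Adaboost update for labels in {-1, +1}.

    weights : current sample weight distribution D_m (non-negative, sums to 1)
    y_true  : true labels y_i
    y_pred  : predictions G_m(x_i) of the base classifier
    Returns the coefficient a_m and the updated weight distribution D_{m+1}.
    """
    miss = (y_pred != y_true)
    e_m = np.sum(weights[miss])                        # formula (3): weighted error rate
    e_m = np.clip(e_m, 1e-10, 1 - 1e-10)               # guard against division by zero
    a_m = 0.5 * np.log((1.0 - e_m) / e_m)              # formula (4): classifier coefficient
    new_w = weights * np.exp(-a_m * y_true * y_pred)   # formula (5): reweight the samples
    new_w /= new_w.sum()                               # formula (6): normalisation by Z_m
    return a_m, new_w
```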
This embodiment trains the classifier within the above Adaboost framework: the training set is first divided into equal parts, and during training part of a training subset sample is selected and combined with part of the misclassified samples produced by the previous weak classifier to form the final training sample. This realises the cyclic, interleaved sample training method, avoids the multiplicative growth of training data and overfitting, and at the same time ensures that training is complete.
In this embodiment, Ripper classifiers are specifically used in step S2, i.e. the Ripper algorithm is used to train each training subset sample, and the trained Ripper weak classifiers are combined into the final Ripper-ADB strong classifier. In this way the characteristics of Ripper classification are combined to realise the Ripper-ADB combined classification method, which solves problems of the Adaboost training process such as the multiplicative growth of training data and overfitting, gives full play to the advantages of Ripper classification and the Adaboost algorithm, and effectively improves the accuracy and performance of data classification.
In this embodiment, the training subset sample and the misclassified samples produced by the previous weak classifier each account for 50% of the final training sample in step S2; that is, when a weak classifier is trained, the training subset sample and the misclassified samples are taken in equal proportion to form the final training sample. Specifically, after the previous weak classifier has produced its classification results, the misclassified samples are resampled and expanded to the same magnitude as one equal subset (50%), giving an expanded misclassified sample set; this expanded set is then added to the current training subset sample (the other 50%) and used for training, yielding the final training sample. This stacked training with equal proportions of training samples and misclassified samples further ensures that the training samples are adequately trained while keeping the expanded misclassified data volume moderate, so that each training round has enough fresh training samples and enough erroneous data participating in it, which further improves training efficiency and performance.
When the equal division is performed in step S1 of this embodiment, the number of equal parts is taken as the number of iterations required for training; that is, according to the required number of iterations N, the training set sample S is divided into N training subset samples S1, S2, ..., SN, which are subsequently trained one by one.
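A minimal sketch of this equal division, together with the 50/50 mixing described in the previous paragraph, is given below under the assumption that the data are held in NumPy arrays; the helper names and the shuffling step are illustrative choices, not details taken from the patent.

```python
import numpy as np

def split_into_subsets(X, y, n_iterations, seed=0):
    """Step S1: shuffle the training set and divide it into n_iterations roughly equal subsets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    parts = np.array_split(order, n_iterations)
    return [(X[idx], y[idx]) for idx in parts]

def mix_with_errors(X_subset, y_subset, X_err, y_err, seed=0):
    """Resample the previous round's misclassified samples to the size of one subset,
    so that the subset and the errors each make up about 50% of the mixed training sample."""
    if len(X_err) == 0:                       # nothing was misclassified in the previous round
        return X_subset, y_subset
    rng = np.random.default_rng(seed)
    take = rng.choice(len(X_err), size=len(X_subset), replace=True)
    X_mix = np.concatenate([X_subset, X_err[take]])
    y_mix = np.concatenate([y_subset, y_err[take]])
    return X_mix, y_mix
```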
In this embodiment, when the training set samples are obtained in step S1, a feature selection step is further performed on each obtained training subset sample to reduce the number of training features; that is, feature selection is performed on the training set samples to remove redundant features and reduce the feature count, which further improves training efficiency. Combined with the cyclic, interleaved sample training method described above, and because the misclassified samples add sample information, classification accuracy is ensured while classification efficiency is improved.
In this embodiment, the feature selection step specifically performs several rounds of RIPPER classification on the obtained training subset samples: after each RIPPER classification the feature attributes in the training subset sample are screened according to the classification results, and RIPPER classification is performed again on the screened training subset sample, until the required RIPPER classification model is generated and the final training subset sample is output. By performing feature selection with RIPPER classification and screening the feature attributes according to the classification results after each round, the workload of training on multi-dimensional features can be greatly reduced and the efficiency of data classification improved.
During feature selection in this embodiment, after each RIPPER classification the feature attributes whose number of occurrences is below a specified threshold are deleted, and the screened feature attribute set is classified with RIPPER again, until the accuracy of the generated RIPPER classification model or the number of features reaches the preset requirement and the final training subset sample is output. Deleting, after each RIPPER classification, the feature attributes that do not appear or appear only rarely removes irrelevant and redundant features, so that the number of features N becomes smaller; because the number of features is reduced, some duplicated instances can also be removed, so that the number of instances P decreases as well. This effectively avoids the "curse of dimensionality" and combinatorial explosion, and, because both N and P decrease, the model learning time is reduced and classification efficiency is further improved.
In this embodiment, the specific steps of the feature selection are:
S11. classifying the current training subset sample with a RIPPER classifier, counting the weight of each feature attribute according to the number of times it appears in the classification results, and sorting the feature attributes by the counted weights to obtain a sorted feature attribute set;
S12. deleting from the sorted feature attribute set those feature attributes whose number of occurrences is below a preset threshold, to obtain an updated feature attribute set;
S13. performing RIPPER classification on the updated feature attribute set obtained in step S12, and judging whether the accuracy of the currently obtained RIPPER classification model or the number of features reaches the preset requirement; if so, obtaining the final RIPPER classification model and outputting the current feature attribute set as the final training subset sample; otherwise, returning to step S11.
The feature selection described here may also be replaced by other feature selection algorithms as required.
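To make steps S11 to S13 concrete, the sketch below implements the same screening loop with a scikit-learn decision tree standing in for the RIPPER rule learner (the tree's feature_importances_ play the role of the attribute-occurrence counts in the induced rules); the threshold, minimum feature count and round limit are illustrative assumptions rather than values from the patent.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def screen_features(X, y, threshold=0.01, min_features=10, max_rounds=20):
    """Iteratively drop attributes that the rule/tree model barely uses (steps S11-S13).

    X            : (n_samples, n_features) array
    y            : labels
    threshold    : attributes whose usage falls below this are deleted
    min_features : stop before the retained set would shrink below this size
    Returns the indices of the retained feature attributes.
    """
    keep = np.arange(X.shape[1])
    for _ in range(max_rounds):
        clf = DecisionTreeClassifier(random_state=0).fit(X[:, keep], y)  # stand-in for RIPPER
        usage = clf.feature_importances_     # proxy for how often each attribute appears in rules
        weak = usage < threshold             # attributes that (almost) never appear
        if not weak.any() or len(keep) - weak.sum() < min_features:
            break                            # preset requirement reached
        keep = keep[~weak]                   # S12: delete rarely used attributes and repeat
    return keep
```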
In this embodiment, the specific steps of step S2 are:
S21. obtaining the first training subset sample and performing classification training with a weak classifier, to obtain weak classifier a1 and the misclassified samples R1, and computing statistics on the classification results of weak classifier a1 to obtain its weight w1;
S22. resampling the misclassified samples Ri obtained by the previous weak classifier ai according to a specified proportion, to obtain an expanded misclassified sample set Rip, and adding the expanded misclassified sample set Rip to the next training subset sample Si+1 to obtain a new training sample subset Si+1R, where i = 1, 2, ..., N and N is the number of training subset samples;
S23. performing classification training with a weak classifier on the new training sample subset Si+1R, to obtain weak classifier ai+1 and the misclassified samples Ri+1, and computing statistics on the classification results of weak classifier ai+1 to obtain its weight wi+1;
S24. repeating steps S22 and S23 until all training subset samples have been trained, to obtain the weak classifiers a1, a2, ..., ai, ..., aN;
S25. weighting the obtained weak classifiers a1, a2, ..., ai, ..., aN with their corresponding weights w1, w2, ..., wi, ..., wN, to obtain the final ADB strong classifier.
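Putting steps S21 to S25 together, a compact sketch of the training loop might look as follows. It reuses the split_into_subsets and mix_with_errors helpers sketched earlier, takes a generic weak-learner factory with a scikit-learn-style fit/predict interface, and derives each classifier weight from its error rate with the standard Adaboost coefficient, which is one plausible reading of "computing statistics on the classification results"; none of these choices are mandated by the patent.

```python
import numpy as np

def train_adb(subsets, make_weak_learner):
    """Steps S21-S25: one weak classifier per subset, each round trained on 50% fresh
    subset data and 50% misclassified samples resampled from the previous round."""
    classifiers, weights = [], []
    X_err, y_err = None, None
    for X_sub, y_sub in subsets:
        if X_err is not None and len(X_err) > 0:
            X_train, y_train = mix_with_errors(X_sub, y_sub, X_err, y_err)  # S22
        else:
            X_train, y_train = X_sub, y_sub                                 # S21: first round
        clf = make_weak_learner().fit(X_train, y_train)                     # S23: train weak classifier
        pred = clf.predict(X_sub)                                           # evaluate on the fresh subset
        err = np.clip(np.mean(pred != y_sub), 1e-10, 1 - 1e-10)
        weights.append(0.5 * np.log((1 - err) / err))   # classifier weight from its error rate
        classifiers.append(clf)
        miss = pred != y_sub
        X_err, y_err = X_sub[miss], y_sub[miss]         # misclassified samples for the next round
    return classifiers, weights                         # S25: combined by weighted vote at prediction
```

At prediction time, the returned classifier and weight lists feed the weighted vote sketched after step S3 in the summary of the invention above.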
Through the above steps, on the basis of the Adaboost algorithm framework and using the NSL-KDD data set (an improved version of the KDD CUP 1999 data-mining competition data set), training samples and misclassified samples are stacked cyclically to realise the cyclic, interleaved sample training method; the weak classifiers specifically use Ripper classifiers, and the final Ripper-ADB strong classifier is obtained.
As shown in Fig. 2, the detailed process of training the Ripper-ADB classifier in this embodiment is:
1. First divide the training set sample into equal parts according to the number of iterations, obtaining N training subset samples S1, S2, ..., SN;
2. Train the first training sample S1 with the Ripper algorithm, obtaining classifier a1 and the misclassified samples R1;
3. Compute statistics on the classification results of a1 to obtain the weight w1 of classifier a1;
4. Resample the samples R1 misclassified by a1 to the same magnitude as one equal subset (50%), obtaining the expanded misclassified samples R1p;
5. Add the expanded misclassified samples R1p to the second training sample S2, obtaining the new sample S2R;
6. Train the new sample S2R with the Ripper algorithm again, generating classifier a2 and the misclassified samples R2;
7. Compute statistics on the classification results of classifier a2 to obtain the weight w2 of classifier a2;
8. Repeat the above steps until all samples have been trained;
9. Combine all the trained classifiers with their weights to constitute the final strong classifier Ripper-ADB.
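If the weak learners are to be actual RIPPER rule inducers, as in this embodiment, one possibility in Python is the third-party wittgenstein package, which is commonly described as providing a RIPPER implementation with a scikit-learn-style fit/predict interface; its availability and exact API are assumptions of this sketch, and a decision tree or any other rule learner could be substituted in the factory.

```python
# Assumption: the third-party package `wittgenstein` (pip install wittgenstein) provides
# a RIPPER learner with a scikit-learn-style fit/predict; verify its API before relying on it.
import wittgenstein as lw

def make_ripper():
    # Factory passed to train_adb() from the earlier sketch; each call returns a fresh learner.
    return lw.RIPPER()

# Illustrative wiring of the earlier sketches (labels assumed to be encoded as -1/+1):
# subsets = split_into_subsets(X, y, n_iterations=10)
# classifiers, weights = train_adb(subsets, make_ripper)
# predictions = adb_predict(classifiers, weights, X_test)
```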
In this embodiment, the above classification training is iterated until the obtained classifier reaches the specified accuracy or the preset number of iterations is reached, at which point training exits.
To verify the effectiveness of the invention, the combined classifier Ripper-ADB and the original Ripper algorithm were used to classify the same data, while taking into account the influence of the number of iterations and of the sampled data volume on the overall classifier. The combined classifier Ripper-ADB was therefore built with three different numbers of iterations, and the data division rules of the three configurations are as follows:
First configuration: Ripper-3ADB
The classifier training steps are:
a. The original data are divided into roughly 3 parts, each containing about 40,000 records;
b. Classifier a1 is first trained on the first part of the data;
c. The data misclassified by classifier a1 are resampled so that they make up 50% of the new training set and are merged into the second part of the data, forming about 80,000 training records;
d. Classifier a2 is trained on the roughly 80,000 records newly formed in step c;
e. The above steps are repeated until all classifiers have been trained.
Second configuration: Ripper-6ADB
The classifier training steps are:
a. The original data are divided into roughly 6 parts, each containing about 20,000 records;
b. Classifier a1 is first trained on the first part of the data;
c. The data misclassified by classifier a1 are resampled so that they make up 50% of the new training set and are merged into the second part of the data, forming about 40,000 training records;
d. Classifier a2 is trained on the roughly 40,000 records newly formed in step c;
e. The above steps are repeated until all classifiers have been trained.
Third configuration: Ripper-10ADB
The classifier training steps are:
a. The original data are divided into roughly 10 parts, each containing about 12,000 records;
b. Classifier a1 is first trained on the first part of the data;
c. The data misclassified by classifier a1 are resampled so that they make up 50% of the new training set and are merged into the second part of the data, forming about 24,000 training records;
d. Classifier a2 is trained on the roughly 24,000 records newly formed in step c;
e. The above steps are repeated until all classifiers have been trained.
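The per-round training-set sizes quoted for the three configurations follow directly from the division rules; the small computation below merely reproduces that arithmetic, assuming a training set of roughly 120,000 records as implied by the figures above.

```python
# Roughly 120,000 training records, as implied by the per-part sizes quoted above.
total = 120_000
for n_parts in (3, 6, 10):
    part = total // n_parts      # size of one equal subset
    mixed = 2 * part             # subset (50%) plus resampled misclassified data (50%)
    print(f"Ripper-{n_parts}ADB: {n_parts} parts of ~{part:,} records, "
          f"~{mixed:,} records per training round")
# -> 3 parts of ~40,000 (~80,000 per round); 6 of ~20,000 (~40,000); 10 of ~12,000 (~24,000)
```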
The combined classifiers Ripper-ADB with the above three numbers of iterations and the original Ripper algorithm were then used to run classification tests on the training set, and the classification accuracies were compared for different numbers of feature attributes. The experimental comparison is shown in Fig. 3 and the detailed experimental data are given in Table 1; the comparison on the test set is shown in Fig. 4, with the detailed data in Table 2. Fig. 3 and Table 1 show that, down to 17 attributes, all four classifiers essentially maintain a high classification accuracy. The original Ripper is trained on the full sample set, so its fit to the training set is good; for the Ripper-ADB classifiers of the present invention, under the three numbers of iterations, the more samples each iteration trains on, the better the fit. Below 17 feature attributes, the reduction of feature attributes leaves less data information and classification accuracy begins to drop. Although the traditional classification algorithm remains fairly accurate, training on the full sample set leads to classifier overfitting, whereas the Ripper-3ADB, Ripper-6ADB and Ripper-10ADB classifiers of the present invention are based on multiple classifiers: the specific sample selection for each classifier in Ripper-ADB provides a good balancing effect among the classifiers, so the training data are not fitted excessively and overfitting is avoided.
Table 1: Classification accuracies of Ripper-ADB and Ripper on the training set.
As shown in Fig. 4 and Table 2, the Ripper-3ADB classifier essentially matches the classification accuracy of Ripper until the feature attributes are reduced to 17; below 17 attributes, because of the reduction of attributes, the balancing effect of the Ripper-3ADB classifier is limited, i.e. when the classifier weights differ little, two classifiers together hold an absolute majority of the vote, which leads to a relatively low classification accuracy. Ripper-6ADB and Ripper-10ADB, in contrast, reach a classification accuracy of around 88% both with the initial 41 attributes and with 32 attributes, because a larger number of classifiers produces a good balancing effect and the accumulated training on the data misclassified on the training set has an obvious benefit; their accuracy as attributes are reduced is also higher than that of the original classifier. The comparison of the test-set and training-set results further confirms that the original Ripper algorithm, trained on the full samples, suffers from overfitting.
Table 2: Classification accuracies of Ripper-ADB and Ripper on the test set.
The above experimental results show that the Ripper-ADB combined classification method of the present invention achieves a higher classification accuracy than the original Ripper algorithm.
This embodiment further compares the best-performing Ripper-10ADB with the common decision tree and SVM algorithms, to verify that the Ripper-ADB combined classification method of the present invention also achieves higher classification accuracy and performance than other machine learning algorithms. The experimental results are shown in Fig. 5 and Fig. 6, and the detailed test data are given in Table 3 and Table 4.
As shown in Fig. 5, on the training set the Ripper-ADB classification method of the present invention and the decision tree (C4.5) have almost identical classification accuracy as the attributes are gradually reduced, whereas SVM, which handles large-scale data with difficulty and cannot effectively process data containing much noise, has a relatively low classification accuracy.
Table 3: Classification accuracies of Ripper-ADB and the other algorithms on the training set.
As shown in Fig. 6, thanks to the effect of the Ripper-ADB multi-classifier and the repeated training on the erroneous data, no overfitting to the training set occurs, and the classification accuracy is higher than that of the C4.5 and SVM algorithms, reaching a maximum of 88.5814% with 32 attributes.
Table 4: Classification accuracies of Ripper-ADB and the other algorithms on the test set.
The above are merely preferred embodiments of the present invention and do not limit the present invention in any form. Although the present invention has been disclosed above by way of preferred embodiments, they are not intended to limit the invention. Any simple modification, equivalent change or variation made to the above embodiments according to the technical essence of the present invention, without departing from the content of the technical solution of the present invention, shall fall within the scope of protection of the technical solution of the present invention.

Claims (10)

1. A data classification method, characterized in that it comprises the steps of:
S1. obtaining the training set samples used to train the classifier, and dividing the obtained training set samples into equal parts according to the number of iterations required for training, to obtain multiple training subset samples;
S2. training the training subset samples with multiple weak classifiers under the Adaboost algorithm, wherein, when each weak classifier is trained, part of a training subset sample is selected and combined with part of the misclassified samples produced by the previous weak classifier to form the final training sample, and the weak classifiers obtained after training are combined into the final ADB strong classifier;
S3. classifying the data to be classified with the trained ADB strong classifier, and outputting the classification result.
2. The data classification method according to claim 1, characterized in that Ripper weak classifiers are specifically used in step S2, i.e. the Ripper algorithm is used to train each training subset sample, and the trained Ripper weak classifiers are combined into the final Ripper-ADB strong classifier.
3. The data classification method according to claim 1, characterized in that in step S2 the training subset sample and the misclassified samples obtained from the previous weak classifier each account for 50% of the final training sample.
4. The data classification method according to claim 1, characterized in that when the equal division is performed in step S1, the number of equal parts is taken as the number of iterations required for training, i.e. according to the required number of iterations N the training set sample S is divided into N training subset samples S1, S2, ..., SN, where i numbers the parts and Si is the i-th part.
5. The data classification method according to claim 4, characterized in that the number of iterations required for training, and hence the number of equal parts of the training set sample, is specifically not less than 10.
6. The data classification method according to any one of claims 1 to 5, characterized in that when the training set samples are obtained in step S1, a feature selection step is further performed on each obtained training subset sample in order to reduce the number of training features.
7. The data classification method according to claim 6, characterized in that the feature selection step specifically performs several rounds of RIPPER classification on the obtained training subset samples; after each RIPPER classification the feature attributes in the training subset sample are screened according to the classification results, and RIPPER classification is performed again on the screened training subset sample, until the required RIPPER classification model is generated and the final training subset sample is output.
8. The data classification method according to claim 7, characterized in that during feature selection, after each RIPPER classification the feature attributes whose number of occurrences is below a specified threshold are deleted, and the screened feature attribute set is classified with RIPPER again, until the accuracy of the generated RIPPER classification model or the number of features reaches the preset requirement and the final training subset sample is output.
9. The data classification method according to claim 8, characterized in that the specific steps of the feature selection are:
S11. classifying the current training subset sample with a RIPPER classifier, counting the weight of each feature attribute according to the number of times it appears in the classification results, and sorting the feature attributes by the counted weights to obtain a sorted feature attribute set;
S12. deleting from the sorted feature attribute set those feature attributes whose number of occurrences is below a preset threshold, to obtain an updated feature attribute set;
S13. performing RIPPER classification on the updated feature attribute set obtained in step S12, and judging whether the accuracy of the currently obtained RIPPER classification model or the number of features reaches the preset requirement; if so, obtaining the final RIPPER classification model and outputting the current feature attribute set as the final training subset sample; otherwise, returning to step S11.
10. The data classification method according to any one of claims 1 to 5, characterized in that the specific steps of step S2 are:
S21. obtaining the first training subset sample and performing classification training with a weak classifier, to obtain weak classifier a1 and the misclassified samples R1, and computing statistics on the classification results of weak classifier a1 to obtain its weight w1;
S22. resampling the misclassified samples Ri obtained by the previous weak classifier ai according to a specified proportion, to obtain an expanded misclassified sample set Rip, and adding the expanded misclassified sample set Rip to the next training subset sample Si+1 to obtain a new training sample subset Si+1R, where i = 1, 2, ..., N and N is the number of training subset samples;
S23. performing classification training with a weak classifier on the new training sample subset Si+1R, to obtain weak classifier ai+1 and the misclassified samples Ri+1, and computing statistics on the classification results of weak classifier ai+1 to obtain its weight wi+1;
S24. repeating steps S22 and S23 until all training subset samples have been trained, to obtain the weak classifiers a1, a2, ..., ai, ..., aN;
S25. weighting the obtained weak classifiers a1, a2, ..., ai, ..., aN with their corresponding weights w1, w2, ..., wi, ..., wN, to obtain the final ADB strong classifier.
CN201810415714.7A 2018-05-03 2018-05-03 A kind of data classification method Pending CN108596268A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810415714.7A CN108596268A (en) 2018-05-03 2018-05-03 A kind of data classification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810415714.7A CN108596268A (en) 2018-05-03 2018-05-03 A kind of data classification method

Publications (1)

Publication Number Publication Date
CN108596268A true CN108596268A (en) 2018-09-28

Family

ID=63619700

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810415714.7A Pending CN108596268A (en) 2018-05-03 2018-05-03 A kind of data classification method

Country Status (1)

Country Link
CN (1) CN108596268A (en)


Cited By (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109671076A (en) * 2018-12-20 2019-04-23 上海联影智能医疗科技有限公司 Blood vessel segmentation method, apparatus, electronic equipment and storage medium
CN109961090A (en) * 2019-02-28 2019-07-02 广州杰赛科技股份有限公司 A kind of behavior classification method and system based on intelligent wearable device
CN110390272A (en) * 2019-06-30 2019-10-29 天津大学 A kind of EEG signal feature dimension reduction method based on weighted principal component analyzing
CN110390272B (en) * 2019-06-30 2023-07-18 天津大学 EEG signal feature dimension reduction method based on weighted principal component analysis
CN110443289B (en) * 2019-07-19 2022-02-08 清华大学 Method and system for detecting deviating distributed samples
CN110443289A (en) * 2019-07-19 2019-11-12 清华大学 Detection deviates the method and system of distribution sample
CN110708321A (en) * 2019-10-12 2020-01-17 广元市公安局 Internet new media matrix supervision system and method
CN111126444A (en) * 2019-11-28 2020-05-08 天津津航技术物理研究所 Classifier integration method
CN111339910A (en) * 2020-02-24 2020-06-26 支付宝实验室(新加坡)有限公司 Text processing method and device and text classification model training method and device
CN111339910B (en) * 2020-02-24 2023-11-28 支付宝实验室(新加坡)有限公司 Text processing and text classification model training method and device
CN112633900A (en) * 2020-12-16 2021-04-09 北京国电通网络技术有限公司 Industrial Internet of things data verification method based on machine learning
CN113657460A (en) * 2021-07-28 2021-11-16 上海影谱科技有限公司 Boosting-based attribute identification method and device
CN114860797A (en) * 2022-03-16 2022-08-05 电子科技大学 Data derivation processing method


Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination
TA01 Transfer of patent application right

Effective date of registration: 20200515

Address after: Guanxi Town, Dingcheng District, Changde, Hunan Province

Applicant after: Hunan Huda Jinke Technology Development Co.,Ltd.

Address before: College of Information Science and Engineering, Hunan University, Lushan South Road, Yuelu District, Changsha, Hunan Province 410082

Applicant before: HUNAN University

TA01 Transfer of patent application right
RJ01 Rejection of invention patent application after publication

Application publication date: 20180928

RJ01 Rejection of invention patent application after publication