CN101604394A - Incremental learning classification method under limited storage resources - Google Patents

Incremental learning classification method under limited storage resources

Info

Publication number
CN101604394A
CN101604394A CNA2008102463159A CN200810246315A
Authority
CN
China
Prior art keywords
sample
subset
classifier
new sample
screening
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008102463159A
Other languages
Chinese (zh)
Inventor
桑农
程婷
张天序
曹治国
唐奇伶
程志利
张�荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CNA2008102463159A priority Critical patent/CN101604394A/en
Publication of CN101604394A publication Critical patent/CN101604394A/en
Pending legal-status Critical Current

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an incremental learning classification method under limited storage resources, belonging to the field of pattern recognition. The method uses a minimum-distance classifier and works as follows: new samples are first pre-classified; correctly pre-classified new samples are added to the corresponding subsets of the classifier, while incorrectly pre-classified samples are added to an error-sample set, whose samples are then partitioned by K-means clustering. Representative samples are then screened for each subset of the classifier and of the error-sample set; the screened subsets of the error-sample set are added to the classifier, the classifier is updated, and finally the updated classifier is used to classify new samples. By screening representative samples, the invention both preserves knowledge already learned and acquires new knowledge, achieving higher sample recognition accuracy while reducing storage and computation overhead.

Description

Incremental learning classification method under limited storage resources
Technical field
The invention belongs to the field of pattern recognition, and specifically relates to an incremental learning classification method that, under limited storage resources, continually updates a classifier by learning new samples, thereby improving its classification ability.
Background technology
With the rapid development of networks and the rapid growth of the volume of information, traditional information-mining and knowledge-acquisition techniques face great challenges, and data classification techniques with incremental learning capability are gradually becoming a key technology of intelligent information discovery and mining. Compared with ordinary data classification techniques, incremental learning classification has significant advantages, mainly in two respects: on the one hand, it need not save historical data, reducing the storage space occupied; on the other hand, it can fully exploit previous learning results during new training, so that learning is continuous and the time for subsequent training is greatly reduced.
The various incremental learning methods proposed in recent years all have shortcomings of varying degrees, specifically: (1) some require the previously trained samples when performing incremental learning, occupying storage space; (2) some cannot add new class information during learning, so their classification performance is poor; (3) some are online learning methods that realize incremental learning by iterating over samples one at a time, so their classification efficiency is low.
Summary of the invention
The object of the invention is to provide an incremental learning classification method that improves classification accuracy and efficiency by continually refining the classifier.
The incremental learning classification method under limited storage resources uses a minimum-distance classifier; the method is as follows:
Step (1): pre-classify all new samples with the current classifier; add correctly pre-classified new samples to the corresponding subsets of the current classifier, and add incorrectly pre-classified samples to a set W;
Step (2): if the number of samples in W exceeds a preset sample-count threshold, go to step (3); otherwise, finish;
Step (3): for each subset Ω_i, i = 1, 2, …, m, of the current classifier, m being the total number of subsets in the current classifier, screen representative samples according to the ratio r_i/N of samples already present in Ω_i before pre-classification and the ratio k_i/N of samples newly added to Ω_i by pre-classification, where the value of N is 9·r_i to 10·r_i, r_i is the number of samples already present in Ω_i before pre-classification, k_i is the number of samples newly added to Ω_i after pre-classification, and p_i is the preset number of representative samples kept for Ω_i after screening; p_i satisfies the relation:

Σ_{i=1}^{m} p_i ≤ T,  p_1/V_1 = p_2/V_2 = … = p_m/V_m

where T is the preset total number of representative samples over all subsets of the current classifier, and V_i is the sample covariance matrix of subset Ω_i;
Step (4): cluster the samples in W by K-means according to their classes;
Step (5): search all subsets of W and of the current classifier, and determine whether sample overlap exists; if it does, perform inter-subset adjustment; if not, go to step (6). Sample overlap means that samples that should belong to a first subset are wrongly assigned to a second subset, the first and second subsets belonging to different classes;
Step (6): in W, for each subset whose sample count is at least the subset sample-count threshold, first randomly screen representative samples at a ratio of 1/10 to 1/9 of the subset's current sample count, then add the screened subset to the current classifier, and modify the classifier's structure and parameters to update the classifier;
Step (7): classify all new samples with the updated classifier, and add misclassified new samples to W; if the number of samples in W exceeds the new-sample retention threshold, screen the error samples of W; otherwise, finish.
The error-sample screening is performed as follows:
(7a) compute the distance difference and the distance sum of each sample in W;
(7b) find the sample with the largest distance difference, or the sample with the smallest distance sum, and remove it from W;
(7c) if the number of samples in W is still at least the new-sample retention threshold, return to step (7b); otherwise, finish.
The distance difference and distance sum are computed as follows: for a sample t ∈ W, among the subsets of the current classifier belonging to the same class as t, search for the minimum distance d_s_min between those subsets and t; among the subsets of the current classifier belonging to different classes from t, search for the maximum distance d_d_max between those subsets and t; the distance difference of t is then Δd_t− = d_s_min − d_d_max, and the distance sum is Δd_t+ = d_s_min + d_d_max.
In step (5), inter-subset adjustment is performed as follows: apply k-means clustering separately to the first and second subsets between which sample overlap exists, decomposing each into two smaller subsets; if the sample overlap is not eliminated, continue decomposing, until the overlap is eliminated or the sample count of a resulting subset falls below the subset sample-count threshold.
On the basis of minimum-distance classification (for example minimum Mahalanobis distance), the present invention provides an incremental learning classification method for a minimum-distance classifier under limited storage resources. The method lets the classifier preserve the knowledge it has already learned while effectively learning new knowledge, so that its ability to recognize samples improves continually; the classifier needs only a small number of representative samples and limited storage to review its original knowledge, greatly reducing storage and computation overhead. Experimental results show that although only limited resources are stored, the representative samples maintain high recognition accuracy: while effectively recognizing new samples, the method also keeps a very high recognition rate on previously learned samples, runs fast, and consumes relatively little storage compared with other incremental learning algorithms. It is suited to situations where only part of the whole is known and to dynamic environments, which helps the classifier quickly acquire the ability to recognize new sample types as they appear.
Description of drawings
Fig. 1 is the overall flowchart of the present invention;
Fig. 2 is the representative-sample screening flowchart of the present invention;
Fig. 3 is a schematic diagram of the sample-overlap phenomenon of the present invention.
Embodiment
The classifier adopted by the present invention is a minimum-distance classifier; the Mahalanobis distance, the Euclidean distance, and so on may be selected. Fig. 1 illustrates one embodiment of the invention, which adopts the Mahalanobis distance as the classification criterion.
1. Obtain all current new samples and pre-classify them with the current classifier:
Compute the Mahalanobis distance between each new sample t and every subset of the current classifier, and find the subset S corresponding to the minimum Mahalanobis distance. If the class of S is the same as the known class label of t, add t directly to S and update the new-sample count of S as k = k + 1; if not, add t to the set W.
The Mahalanobis distance between a new sample t and a subset S is d = √((x⃗ − μ⃗)ᵀ V⁻¹ (x⃗ − μ⃗)), where x⃗ is the coordinate (feature) vector of t, μ⃗ is the centre coordinate vector of S, and V is the sample covariance matrix of S.
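The pre-classification step above can be sketched as follows. This is an illustrative Python/NumPy sketch, not part of the patent; all function names and the subset data structure are invented for the example:

```python
import numpy as np

def mahalanobis(x, mu, V):
    """Mahalanobis distance between sample x and a subset with centre mu
    and sample covariance matrix V."""
    d = np.asarray(x, dtype=float) - mu
    return float(np.sqrt(d @ np.linalg.inv(V) @ d))

def preclassify(x, label, subsets, W):
    """Assign x to the nearest subset; on a wrong pre-classification, put x in W.

    Each subset is a dict with keys 'mu', 'V', 'label', 'samples', 'k'.
    Returns True when the pre-classification was correct."""
    dists = [mahalanobis(x, s['mu'], s['V']) for s in subsets]
    nearest = subsets[int(np.argmin(dists))]
    if nearest['label'] == label:
        nearest['samples'].append(x)  # correct: sample joins subset S
        nearest['k'] += 1             # update the new-sample count, k = k + 1
        return True
    W.append((x, label))              # wrong: sample goes to the error set W
    return False
```

For two well-separated subsets around (0, 0) and (10, 10), a sample near (1, 1) labelled like the first subset is pre-classified correctly, while a sample near (9, 9) with that same label lands in W.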
2. Determine whether the number of samples in W exceeds the preset sample-count threshold, a value chosen by the user; if it does, go to step 3; otherwise, finish.
3. Representative-sample screening: to reduce the storage space occupied, the present invention screens some representative samples from the samples of each subset of the current classifier; these representative samples suffice to reflect the characteristics of their subset, and the preset number of representative samples of a subset is 1/10 to 1/9 of the subset's total sample count. An equal-density screening algorithm for representative samples is adopted, whose theoretical validity rests on the premise that every sample is equally important.
Let the subsets of the classifier be {Ω_1, Ω_2, …, Ω_m}, where m is the number of subsets and V_i is the sample covariance matrix of subset Ω_i, i ∈ [1, m].
Let T be the preset total number of representative samples over all subsets of the classifier, and let p_i be the preset number of representative samples kept for Ω_i after screening. By the equal-density principle, p_i should satisfy:

Σ_{i=1}^{m} p_i ≤ T,  p_1/V_1 = p_2/V_2 = … = p_m/V_m
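The two constraints above can be solved in closed form and rounded down so the budget T is never exceeded. The sketch below is an illustration, not the patent's implementation; it assumes each V_i is summarized by a single scalar spread (e.g. the determinant of the covariance matrix), since the ratio p_i/V_i requires a scalar:

```python
import numpy as np

def allocate_quotas(spreads, T):
    """Choose per-subset representative-sample counts p_i with p_i/V_i constant
    (equal-density principle) and sum(p_i) <= T.

    `spreads` holds one scalar V_i per subset, e.g. det of its covariance."""
    v = np.asarray(spreads, dtype=float)
    raw = T * v / v.sum()              # exact continuous solution of the constraints
    return np.floor(raw).astype(int)   # round down so the budget T is respected
```

For example, with spreads [1, 1, 2] and a budget T = 8, the quotas come out as [2, 2, 4]: the widest subset keeps twice as many representatives as each narrow one.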
Suppose subset Ω_i has r_i samples before pre-classification and gains k_i new samples from pre-classification, and take the parameter N in the range 9·r_i to 10·r_i. Then among the representative samples obtained by screening, the fraction retained of the r_i samples already present before pre-classification is r_i/N, and the fraction retained of the k_i samples newly added by pre-classification is k_i/N.
Fig. 2 is a flowchart of one embodiment of representative-sample screening, described as follows:
Compute the Mahalanobis distance between each sample {x_1, x_2, …, x_N} of subset Ω_i and the subset centre, sort the samples by Mahalanobis distance in ascending order, and retain samples taken at intervals from the resulting queue.
Divide the samples within Ω_i into two smaller sets, one containing the samples present before pre-classification and the other the samples newly added by pre-classification, and screen representative samples in each of the two sets according to the aforementioned ratios.
Taking the set of newly added samples as an example: compute the Mahalanobis distance between each sample in the set and the centre of Ω_i, and sort the samples by Mahalanobis distance in ascending order. In the sorted queue, select one sample out of every l as a representative sample, and remove the unselected samples from the set. The spacing l equals the inverse of the ratio k_i/N, i.e. l = N/k_i.
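The sort-and-subsample procedure above can be sketched as follows (an illustrative Python/NumPy sketch, not from the patent; the parameter `keep_ratio` plays the role of k_i/N, and its inverse gives the spacing l):

```python
import numpy as np

def screen_representatives(samples, mu, V, keep_ratio=0.1):
    """Sort samples by Mahalanobis distance to the subset centre (ascending)
    and keep one sample out of every l = round(1 / keep_ratio)."""
    X = np.asarray(samples, dtype=float)
    Vinv = np.linalg.inv(V)
    diffs = X - mu
    # per-sample Mahalanobis distance to the centre
    d = np.sqrt(np.einsum('ij,jk,ik->i', diffs, Vinv, diffs))
    order = np.argsort(d)                      # ascending-distance queue
    l = max(1, int(round(1.0 / keep_ratio)))   # spacing = inverse of the keep ratio
    return X[order[::l]]                       # every l-th sample is a representative
```

With 20 samples and a keep ratio of 1/10, the spacing is l = 10 and two evenly spaced representatives survive, so the retained samples still span the subset from its centre outward.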
4. Cluster the samples in W by K-means according to their classes.
5. Adjust the subsets of W and of the current classifier jointly: search the subsets obtained in step 4 and the subsets of the current classifier for the sample-overlap phenomenon; if it exists, perform inter-subset adjustment; if not, no adjustment is needed.
The sample-overlap phenomenon is the following: referring to Fig. 3, the regions of influence of subsets A2 and B2 overlap spatially, i.e. in the overlapping region there exist samples that originally belong to subset A2 but are wrongly assigned to subset B2, while A2 and B2 do not belong to the same class; this situation is herein called "sample overlap".
The inter-subset adjustment is shown in Fig. 3: decompose the overlapping subsets A2 and B2 each into two smaller subsets by k-means clustering (with k = 2 here); if the sample overlap is not eliminated, continue decomposing; when the sample count of some resulting subset becomes small, i.e. falls below the subset sample-count threshold, that subset stops being decomposed. The subset sample-count threshold is chosen by the user, and may be taken as 1/150 to 1/100 of the total number of new samples used in this round of classification.
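The splitting step can be sketched as below. This is an illustrative Python/NumPy sketch, not the patent's code; it uses Euclidean 2-means with a deterministic initialisation at the bounding-box corners, which is an assumption made for the example — the patent only requires k-means clustering with k = 2:

```python
import numpy as np

def two_means_split(X, iters=20):
    """Decompose a subset X into two smaller subsets with k-means, k = 2."""
    X = np.asarray(X, dtype=float)
    centres = np.array([X.min(axis=0), X.max(axis=0)])  # deterministic init
    for _ in range(iters):
        # assign every sample to its nearest centre
        labels = np.argmin(((X[:, None, :] - centres[None]) ** 2).sum(-1), axis=1)
        for k in (0, 1):
            if np.any(labels == k):                     # update non-empty centres
                centres[k] = X[labels == k].mean(axis=0)
    return X[labels == 0], X[labels == 1]
```

The adjustment of step 5 would then call this split repeatedly on each half while the overlap persists and the halves stay above the subset sample-count threshold.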
6. Process each subset of W whose sample count exceeds the subset sample-count threshold as follows: first randomly screen some samples, at a ratio of 1/10 to 1/9 of the subset's sample count; then add the screened subset to the classifier and modify the classifier's structure and parameters (class count and subset count) to update the classifier.
7. Recognize the new samples again with the updated classifier, and add misrecognized new samples to W.
If the number of samples in W exceeds the new-sample retention threshold (taken as 1/10 to 1/9 of the total number of new samples), screen the error samples.
The error-sample screening method is described as follows:
(8.1) Compute the distance difference and the distance sum of each sample in W. Taking a sample t in W as an example: first compute the Mahalanobis distance between t and every subset of the current classifier; then find the minimum Mahalanobis distance d_s_min between t and the subsets of the same class, and the maximum Mahalanobis distance d_d_max between t and the subsets of different classes; finally compute the distance difference Δd_t− = d_s_min − d_d_max and the distance sum Δd_t+ = d_s_min + d_d_max of t. A same-class subset is one whose class is identical to that of t; a different-class subset is one whose class differs.
(8.2) Find the sample with the largest distance difference, or the sample with the smallest distance sum, and remove it from W.
(8.3) If the number of samples in W is still at least the new-sample retention threshold (taken as 1/10 to 1/9 of the total number of new samples), return to step (8.2); otherwise, finish.
This completes one round of incremental learning classification; the samples remaining in W are added to the new samples of the next round of learning, and W is then emptied.
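Steps (8.1)–(8.3) can be sketched as below. This is an illustrative Python sketch, not the patent's code; it assumes the per-sample distances d_s_min and d_d_max have already been computed, and it resolves the "largest difference or smallest sum" choice of step (8.2) by always dropping the largest-difference sample, which is only one possible reading:

```python
def prune_error_set(samples, d_same_min, d_diff_max, keep):
    """Remove samples from the error set W until fewer than `keep` remain.

    d_same_min[i]: minimum distance from sample i to its same-class subsets.
    d_diff_max[i]: maximum distance from sample i to different-class subsets."""
    idx = list(range(len(samples)))
    while len(idx) >= keep:
        # distance difference: large means the sample sits far from its own class
        diff = {i: d_same_min[i] - d_diff_max[i] for i in idx}
        worst = max(idx, key=lambda i: diff[i])
        # (per step (8.2), the sample with the smallest distance sum
        #  d_same_min + d_diff_max could be removed here instead)
        idx.remove(worst)
    return [samples[i] for i in idx]
```

On ties the earliest sample is removed; the loop stops as soon as the set drops below the retention threshold, mirroring the test in step (8.3).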

Claims (2)

1. An incremental learning classification method under limited storage resources, using a minimum-distance classifier, the method being as follows:
Step (1): pre-classify all new samples with the current classifier; add correctly pre-classified new samples to the corresponding subsets of the current classifier, and add incorrectly pre-classified samples to a set W;
Step (2): if the number of samples in W exceeds a preset sample-count threshold, go to step (3); otherwise, finish;
Step (3): for each subset Ω_i, i = 1, 2, …, m, of the current classifier, m being the total number of subsets in the current classifier, screen representative samples according to the ratio r_i/N of samples already present in Ω_i before pre-classification and the ratio k_i/N of samples newly added to Ω_i by pre-classification, where the value of N is 9·r_i to 10·r_i, r_i is the number of samples already present in Ω_i before pre-classification, k_i is the number of samples newly added to Ω_i after pre-classification, and p_i is the preset number of representative samples kept for Ω_i after screening; p_i satisfies the relation:

Σ_{i=1}^{m} p_i ≤ T,  p_1/V_1 = p_2/V_2 = … = p_m/V_m

where T is the preset total number of representative samples over all subsets of the current classifier, and V_i is the sample covariance matrix of subset Ω_i;
Step (4): cluster the samples in W by K-means according to their classes;
Step (5): search all subsets of W and of the current classifier, and determine whether sample overlap exists; if it does, perform inter-subset adjustment; if not, go to step (6); sample overlap means that samples that should belong to a first subset are wrongly assigned to a second subset, the first and second subsets belonging to different classes;
Step (6): in W, for each subset whose sample count is at least the subset sample-count threshold, first randomly screen representative samples at a ratio of 1/10 to 1/9 of the subset's current sample count, then add the screened subset to the current classifier, and modify the classifier's structure and parameters to update the classifier;
Step (7): classify all new samples with the updated classifier, and add misclassified new samples to W; if the number of samples in W exceeds the new-sample retention threshold, screen the error samples of W; otherwise, finish;
the error-sample screening being performed as follows:
(7a) compute the distance difference and the distance sum of each sample in W;
(7b) find the sample with the largest distance difference, or the sample with the smallest distance sum, and remove it from W;
(7c) if the number of samples in W is still at least the new-sample retention threshold, return to step (7b); otherwise, finish;
the distance difference and distance sum being computed as follows: for a sample t ∈ W, among the subsets of the current classifier belonging to the same class as t, search for the minimum distance d_s_min between those subsets and t; among the subsets of the current classifier belonging to different classes from t, search for the maximum distance d_d_max between those subsets and t; the distance difference of t is then Δd_t− = d_s_min − d_d_max, and the distance sum is Δd_t+ = d_s_min + d_d_max.
2. The incremental learning classification method under limited storage resources according to claim 1, characterized in that step (5) performs inter-subset adjustment as follows: apply k-means clustering separately to the first and second subsets between which sample overlap exists, decomposing each into two smaller subsets; if the sample overlap is not eliminated, continue decomposing, until the overlap is eliminated or the sample count of a resulting subset falls below the subset sample-count threshold.
CNA2008102463159A 2008-12-30 2008-12-30 Incremental learning classification method under limited storage resources Pending CN101604394A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2008102463159A CN101604394A (en) 2008-12-30 2008-12-30 Incremental learning classification method under limited storage resources

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2008102463159A CN101604394A (en) 2008-12-30 2008-12-30 Incremental learning classification method under limited storage resources

Publications (1)

Publication Number Publication Date
CN101604394A true CN101604394A (en) 2009-12-16

Family

ID=41470114

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2008102463159A Pending CN101604394A (en) 2008-12-30 2008-12-30 Incremental learning classification method under limited storage resources

Country Status (1)

Country Link
CN (1) CN101604394A (en)


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102831432A (en) * 2012-05-07 2012-12-19 江苏大学 Redundant data reducing method suitable for training of support vector machine
CN105844660A (en) * 2015-01-16 2016-08-10 江苏慧眼数据科技股份有限公司 Particle filter pedestrian tracking method based on spatial BRS
CN104866587A (en) * 2015-05-28 2015-08-26 成都艺辰德迅科技有限公司 Data mining method based on Internet of Things
CN105844286A (en) * 2016-03-11 2016-08-10 博康智能信息技术有限公司 Newly added vehicle logo identification method and apparatus
CN106127257A (en) * 2016-06-30 2016-11-16 联想(北京)有限公司 A kind of data classification method and electronic equipment
CN109389162B (en) * 2018-09-28 2019-11-19 北京达佳互联信息技术有限公司 Sample image screening technique and device, electronic equipment and storage medium
CN110659667A (en) * 2019-08-14 2020-01-07 平安科技(深圳)有限公司 Picture classification model training method and system and computer equipment
CN111092894A (en) * 2019-12-23 2020-05-01 厦门服云信息科技有限公司 Webshell detection method based on incremental learning, terminal device and storage medium
CN111368926A (en) * 2020-03-06 2020-07-03 腾讯科技(深圳)有限公司 Image screening method, device and computer readable storage medium
CN111368926B (en) * 2020-03-06 2021-07-06 腾讯科技(深圳)有限公司 Image screening method, device and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN101604394A (en) Incremental learning classification method under limited storage resources
CN110764063B (en) Radar signal sorting method based on combination of SDIF and PRI transformation method
Kotsiantis et al. Mixture of expert agents for handling imbalanced data sets
Duarte et al. Vehicle classification in distributed sensor networks
Yang et al. Proportional k-interval discretization for naive-Bayes classifiers
CN105389480A (en) Multiclass unbalanced genomics data iterative integrated feature selection method and system
CN101496035B (en) Method for classifying modes
US6532305B1 (en) Machine learning method
CN100560025C (en) The method for detecting human face that has the combination coefficient of Weak Classifier
CN103617429A (en) Sorting method and system for active learning
CN106228389A (en) Network potential usage mining method and system based on random forests algorithm
CN103942562A (en) Hyperspectral image classifying method based on multi-classifier combining
CN102779281A (en) Vehicle type identification method based on support vector machine and used for earth inductor
CN104391835A (en) Method and device for selecting feature words in texts
CN103092931A (en) Multi-strategy combined document automatic classification method
CN103412888A (en) Point of interest (POI) identification method and device
CN105574547A (en) Integrated learning method and device adapted to weight of dynamically adjustable base classifier
CN105404901A (en) Training method of classifier, image detection method and respective system
CN101251896B (en) Object detecting system and method based on multiple classifiers
CN105825232A (en) Classification method and device for electromobile users
Hu et al. A new rough sets model based on database systems
CN103336771A (en) Data similarity detection method based on sliding window
WO2020024444A1 (en) Group performance grade recognition method and apparatus, and storage medium and computer device
CN110188196A (en) A kind of text increment dimension reduction method based on random forest
US6938049B2 (en) Creating ensembles of decision trees through sampling

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20091216