CN101604394A - Incremental learning classification method under limited storage resources - Google Patents
- Publication number
- CN101604394A (application CNA2008102463159A / CN200810246315A)
- Authority
- CN
- China
- Prior art keywords
- sample
- subset
- classifier
- new sample
- screening
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention provides an incremental learning classification method under limited storage resources, belonging to the field of pattern recognition. The method uses a minimum-distance classifier. Specifically: new samples are first pre-classified; correctly pre-classified new samples are added to the corresponding subsets of the classifier, wrongly pre-classified new samples are added to an error-sample set, and the samples in the error-sample set are clustered by K-means. Representative samples are then screened for each subset of the classifier and of the error-sample set; after screening, the subsets of the error-sample set are added to the classifier, the classifier is updated, and finally the updated classifier is used to classify new samples. Through representative-sample screening, the invention both preserves learned knowledge and acquires new knowledge, achieving higher sample recognition accuracy while reducing storage and computation overhead.
Description
Technical field
The invention belongs to the field of pattern recognition and specifically relates to an incremental learning classification method that, under limited storage resources, continually updates the classifier by learning new samples, improving classification capability.
Background technology
With the rapid development of networks and the rapid growth of information volume, traditional information mining and knowledge acquisition techniques face great challenges, and data classification techniques with an incremental learning capability are gradually becoming a key technology of intelligent information discovery and mining. Compared with ordinary data classification techniques, incremental learning classification has significant advantages in two respects: on the one hand, it need not save historical data, reducing the storage space occupied; on the other hand, it can make full use of historical learning results in new training, giving the learning continuity and significantly reducing the time of subsequent training.
The various incremental learning methods proposed in recent years all have defects of varying severity, specifically: (1) some require the previously trained samples in order to realize incremental learning, occupying storage space; (2) some cannot add new class information during learning, giving poorer classification performance; (3) some are online learning methods that realize incremental learning by iterating over samples one at a time, giving low classification efficiency.
Summary of the invention
The object of the invention is to provide an incremental learning classification method that improves classification accuracy and efficiency by continually refining the classifier.
An incremental learning classification method under limited storage resources uses a minimum-distance classifier and proceeds as follows:
Step (1): Pre-classify all new samples with the current classifier; add correctly pre-classified new samples to the corresponding subsets of the current classifier, and add wrongly pre-classified samples to a set W.
Step (2): Judge whether the number of samples in W exceeds a predetermined sample count; if so, go to step (3), otherwise finish.
Step (3): For each subset Ω_i of the current classifier, i = 1, 2, …, m (m being the total number of subsets in the current classifier), screen representative samples in proportion to the number of samples r_i already present in Ω_i before pre-classification and the number of samples k_i newly added to Ω_i after pre-classification; the parameter N takes a value of 9·r_i to 10·r_i. The predetermined number p_i of representative samples retained for Ω_i after screening satisfies an equal-density relation in which T is the predetermined total of the representative-sample quotas of all subsets in the current classifier and V_i is the sample covariance matrix of Ω_i.
Step (4): Cluster the samples in W by their class using K-means.
Step (5): Search all subsets of W and of the current classifier and judge whether sample overlap exists; if so, perform inter-subset adjustment; if not, go to step (6). Sample overlap means that samples that should belong to a first subset are wrongly assigned to a second subset, the first and second subsets belonging to different classes.
Step (6): In W, for each subset whose sample count is at least the subset-size threshold, first randomly screen representative samples at a ratio of 1/10 to 1/9 of the subset's current sample count, then add the screened subset to the current classifier and revise the classifier's structure and parameters to update the classifier.
Step (7): Classify all new samples with the updated classifier and add misclassified new samples to W; if the number of samples in W exceeds the new-sample retention threshold, screen the error samples in W, otherwise finish.
The error-sample screening is performed as follows:
(7a) Compute the sample distance difference and the sample distance sum of each sample in W;
(7b) Find the sample with the largest distance difference, or the sample with the smallest distance sum, and remove it from W;
(7c) If the number of samples in W is still at least the new-sample retention threshold, return to step (7b), otherwise finish.
The sample distance difference and distance sum are computed as follows: for a sample t ∈ W, among the subsets of the current classifier belonging to the same class as t, search for the minimum distance d_same_min between those subsets and t; among the subsets of the current classifier belonging to a different class from t, search for the maximum distance d_diff_max between those subsets and t. The sample distance difference of t is then Δd_t- = d_same_min - d_diff_max, and the sample distance sum is Δd_t+ = d_same_min + d_diff_max.
In step (5), the inter-subset adjustment is performed as follows: apply k-means clustering separately to the first and second subsets exhibiting sample overlap, decomposing each into two smaller subsets; if the sample overlap is not eliminated, continue decomposing until the overlap is eliminated or the sample count of a resulting subset falls below the subset-size threshold.
Based on a minimum-distance classification technique (for example, minimum Mahalanobis distance), the invention provides an incremental learning classification method for minimum-distance classifiers under limited storage resources. The method lets the classifier preserve learned knowledge under limited storage while effectively learning new knowledge, so that the classifier continually improves its recognition capability; only a small number of representative samples and limited storage are needed for the classifier to review original knowledge, greatly reducing storage and computation overhead. Experimental results show that although the method stores only limited resources, the representative samples maintain high recognition accuracy: while effectively recognizing new samples, a very high recognition rate is also kept on the learned samples, the method runs fast, and its storage consumption is small relative to other incremental learning algorithms. It adapts to situations and dynamic environments where only part of the whole is known, which helps the classifier rapidly acquire the ability to recognize new samples when samples of new types appear.
Description of drawings
Fig. 1 is the overall flowchart of the invention;
Fig. 2 is the representative-sample screening flowchart of the invention;
Fig. 3 is a schematic diagram of the sample overlap phenomenon of the invention.
Embodiment
The classifier used by the invention is a minimum-distance classifier; Mahalanobis distance, Euclidean distance, etc. may be chosen as the distance measure. Fig. 1 is a schematic diagram of an embodiment of the invention; the embodiment uses Mahalanobis distance as the classification criterion.
1. Obtain all current new samples and pre-classify them with the current classifier:
Compute the Mahalanobis distance between each new sample t and each subset of the current classifier, and find the subset S corresponding to the minimum Mahalanobis distance. Judge whether the class of subset S is identical to the known class label of new sample t: if so, add t directly to subset S and update S's newly added sample count k = k + 1; if not, add t to the set W.
The Mahalanobis distance between new sample t and subset S is d(t, S) = sqrt((x_t - c_S)^T V^{-1} (x_t - c_S)), where x_t is the coordinate vector of new sample t, c_S is the center coordinate of subset S, and V is the sample covariance matrix of subset S.
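As an illustrative sketch of this pre-classification step (the dict-based subset representation, the helper names, and the use of NumPy are assumptions made here for illustration, not part of the patent text), the logic might be written as:

```python
import numpy as np

def mahalanobis(x, center, cov):
    # d(t, S) = sqrt((x_t - c_S)^T V^{-1} (x_t - c_S))
    diff = x - center
    return float(np.sqrt(diff @ np.linalg.inv(cov) @ diff))

def preclassify(x, label, subsets, W):
    """Add x to the nearest subset if the class matches its known label,
    otherwise put (x, label) into the error set W."""
    nearest = min(subsets, key=lambda s: mahalanobis(x, s["center"], s["cov"]))
    if nearest["label"] == label:
        nearest["samples"].append(x)            # correct pre-classification
        nearest["k"] = nearest.get("k", 0) + 1  # new-sample count k = k + 1
    else:
        W.append((x, label))                    # wrong pre-classification
```

With two well-separated subsets, a sample whose nearest subset carries a different class label ends up in W, matching the behavior described above.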
2. Judge whether the number of samples in W exceeds the predetermined sample count, a value set by the user; if so, go to step 3, otherwise finish.
3. Representative-sample screening: to reduce storage use, the invention screens representative samples from the samples of each subset of the current classifier; these representative samples suffice to reflect the characteristics of their subset, and the representative-sample quota of a subset accounts for 1/10 to 1/9 of the subset's total sample count. An equal-density screening algorithm is used, whose theoretical validity rests on the premise that "every sample is equally important".
Let the subsets of the classifier be {Ω_1, Ω_2, …, Ω_m}, where m is the number of subsets of the classifier and V_i is the sample covariance matrix of subset Ω_i, i ∈ [1, m].
Define T as the predetermined total number of representative samples over all subsets of the classifier, and let p_i be the predetermined number of representative samples retained for Ω_i after screening; according to the equal-density principle, p_i should satisfy an equal-density requirement determined by T and the covariance matrices V_i.
Suppose Ω_i already had r_i samples before pre-classification (the parameter N ranges over 9·r_i to 10·r_i) and gained k_i new samples after pre-classification. Then, among the representative samples obtained by screening Ω_i, those retained from the samples present before pre-classification and those retained from the newly added samples are in proportion to r_i and k_i respectively.
Fig. 2 is a schematic flowchart of one embodiment of representative-sample screening, described as follows:
Compute the Mahalanobis distance between each sample {x_1, x_2, …, x_N} of subset Ω_i and the subset center, sort the samples by ascending Mahalanobis distance, and retain samples taken at intervals from the sorted queue.
Divide the samples in Ω_i into two small sets, one being the set of samples present before pre-classification and the other the set of samples newly added after pre-classification, and screen representative samples from the two sets separately according to the aforementioned ratio.
Taking the newly added set as an example: compute the Mahalanobis distance between each of its samples and the center of Ω_i, and sort the samples by ascending Mahalanobis distance. In the sorted queue, select one sample every l samples as a representative sample and remove the unselected samples from the set; the interval l equals the reciprocal of the retention ratio.
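The interval-sampling procedure above might be sketched as follows, assuming a uniform retention ratio of about 1/10 (the function name and the `keep_ratio` parameter are illustrative assumptions):

```python
import numpy as np

def screen_representatives(samples, center, cov, keep_ratio=0.1):
    """Sort samples by ascending Mahalanobis distance to the subset center,
    then keep one sample every l positions, where l is the reciprocal of
    the retention ratio."""
    inv = np.linalg.inv(cov)
    dist = [float(np.sqrt((x - center) @ inv @ (x - center))) for x in samples]
    order = np.argsort(dist)               # ascending-distance queue
    l = max(1, round(1.0 / keep_ratio))    # sampling interval l
    return [samples[i] for i in order[::l]]
```

Taking every l-th element of the sorted queue approximates the equal-density principle: the kept samples are spread evenly over the subset's distance distribution rather than clustered near the center.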
4. Cluster the samples in W by their class using K-means.
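Step 4 uses standard K-means; a minimal self-contained version is sketched below (initializing the centers from the first k points is an assumption made here, since the text does not specify an initialization scheme):

```python
import numpy as np

def kmeans(points, k, iters=20):
    """Cluster points into k groups by alternating nearest-center assignment
    and center-update steps; returns per-point labels and the centers."""
    pts = np.asarray(points, dtype=float)
    centers = pts[:k].copy()
    for _ in range(iters):
        # assign each point to its nearest center (squared Euclidean distance)
        labels = np.argmin(((pts[:, None, :] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):        # skip empty clusters
                centers[j] = pts[labels == j].mean(axis=0)
    return labels, centers
```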
5. Unified adjustment of the subsets in W and in the current classifier: search among the subsets obtained in step 4 and the subsets of the current classifier for the sample overlap phenomenon; if overlap exists, perform inter-subset adjustment, otherwise no adjustment is needed.
The sample overlap phenomenon means (see Fig. 3): the scopes of subsets A2 and B2 overlap in space, i.e., in the overlapping region there exist samples that originally belong to subset A2 but are wrongly assigned to subset B2, while A2 and B2 do not belong to the same class; this situation is herein called "sample overlap".
The inter-subset adjustment method, shown in Fig. 3: decompose the overlapping subsets A2 and B2 each into two smaller subsets by k-means clustering (k = 2 here); if the sample overlap is not eliminated, continue decomposing; when a subset's sample count becomes small, i.e. falls below the subset-size threshold, that subset stops decomposing. The subset-size threshold is set by the user and may be 1/150 to 1/100 of the total number of new samples used in this round of classification.
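The overlap test and one k = 2 decomposition step can be sketched as below; the overlap criterion used here (a sample of one subset lying closer to the other subset's center than to its own) and all names are illustrative assumptions, not the patent's exact formulation:

```python
import numpy as np

def two_means(pts, iters=15):
    """k-means with k = 2; the first and last points seed the centers (an assumption)."""
    pts = np.asarray(pts, dtype=float)
    c = pts[[0, -1]].copy()
    for _ in range(iters):
        lab = np.argmin(((pts[:, None, :] - c[None]) ** 2).sum(-1), axis=1)
        for j in (0, 1):
            if np.any(lab == j):
                c[j] = pts[lab == j].mean(axis=0)
    return lab

def has_overlap(samples_a, center_a, center_b):
    """Sample overlap (Fig. 3): some sample of subset A lies closer to the
    other subset's center than to its own."""
    return any(np.linalg.norm(x - center_b) < np.linalg.norm(x - center_a)
               for x in samples_a)

def decompose(subset, min_size):
    """Split a subset into two smaller subsets by 2-means clustering; a subset
    below the size threshold is left undecomposed."""
    if len(subset) < max(2, min_size):
        return [subset]
    lab = two_means(subset)
    return [[x for x, l in zip(subset, lab) if l == 0],
            [x for x, l in zip(subset, lab) if l == 1]]
```

Applied repeatedly, `decompose` shrinks an overlapping subset until the overlap disappears or the pieces reach the size threshold, as the adjustment method above prescribes.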
6. For each subset in W whose sample count exceeds the subset-size threshold: first randomly screen some samples at a ratio of 1/10 to 1/9 of the subset's sample count, then add the screened subset to the classifier and revise the classifier's structure and parameters (class count and subset count) to update the classifier.
7. Recognize the new samples again with the updated classifier, and add misrecognized new samples to W.
If the number of samples in W exceeds the new-sample retention threshold (whose value is 1/10 to 1/9 of the total number of new samples), perform error-sample screening.
The error-sample screening method is described as follows:
(8.1) Compute the sample distance difference and distance sum of each sample in W. Taking a sample t in W as an example: first compute the Mahalanobis distance between t and each subset of the current classifier, then find the minimum Mahalanobis distance d_same_min between t and the subsets of the same class, and the maximum Mahalanobis distance d_diff_max between t and the subsets of different classes; finally compute the sample distance difference Δd_t- = d_same_min - d_diff_max and the distance sum Δd_t+ = d_same_min + d_diff_max. Same-class subsets are those whose class is identical to that of sample t; different-class subsets are those whose class differs.
(8.2) Find the sample with the largest distance difference, or the sample with the smallest distance sum, and remove it from W;
(8.3) If the number of samples in W is still at least the new-sample retention threshold (1/10 to 1/9 of the total number of new samples), return to step (8.2), otherwise finish.
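A sketch of the error-sample screening loop (steps 8.1 to 8.3). Euclidean distance to the subset centers stands in for the Mahalanobis distance here for brevity, and the data layout is an assumption; only the largest-difference removal rule is shown:

```python
import numpy as np

def screen_errors(W, subsets, keep):
    """Repeatedly remove from W the sample with the largest distance
    difference d_same_min - d_diff_max until only `keep` samples remain.
    (The text also allows removing the smallest distance sum instead.)"""
    W = list(W)
    while len(W) > keep:
        diffs = []
        for x, lab in W:
            d_same = min(np.linalg.norm(x - s["center"])
                         for s in subsets if s["label"] == lab)
            d_diff = max(np.linalg.norm(x - s["center"])
                         for s in subsets if s["label"] != lab)
            diffs.append(d_same - d_diff)   # sample distance difference
        W.pop(int(np.argmax(diffs)))        # drop the worst outlier
    return W
```

A sample far from its own class and close to a wrong class has a large distance difference, so it is discarded first; the retained samples are the ones most consistent with their labeled class.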
This completes one round of incremental learning classification; at the next round of learning, the samples contained in W are added to the incoming new samples, and W is then emptied.
Claims (2)
1. An incremental learning classification method under limited storage resources, using a minimum-distance classifier, the method being as follows:
Step (1): pre-classify all new samples with the current classifier, add correctly pre-classified new samples to the corresponding subsets of the current classifier, and add wrongly pre-classified samples to a set W;
Step (2): judge whether the number of samples in W exceeds a predetermined sample count; if so, go to step (3), otherwise finish;
Step (3): for each subset Ω_i of the current classifier, i = 1, 2, …, m (m being the total number of subsets in the current classifier), screen representative samples in proportion to the number of samples r_i already present in Ω_i before pre-classification and the number of samples k_i newly added to Ω_i after pre-classification, the parameter N taking a value of 9·r_i to 10·r_i; the predetermined number p_i of representative samples retained for Ω_i after screening satisfies an equal-density relation in which T is the predetermined total of the representative-sample quotas of all subsets in the current classifier and V_i is the sample covariance matrix of Ω_i;
Step (4): cluster the samples in W by their class using K-means;
Step (5): search all subsets of W and of the current classifier and judge whether sample overlap exists; if so, perform inter-subset adjustment; if not, go to step (6); said sample overlap means that samples that should belong to a first subset are wrongly assigned to a second subset, the first and second subsets belonging to different classes;
Step (6): in W, for each subset whose sample count is at least the subset-size threshold, first randomly screen representative samples at a ratio of 1/10 to 1/9 of the subset's current sample count, then add the screened subset to the current classifier and revise the classifier's structure and parameters to update the classifier;
Step (7): classify all new samples with the updated classifier and add misclassified new samples to W; if the number of samples in W exceeds the new-sample retention threshold, screen the error samples in W, otherwise finish;
said error-sample screening being performed as follows:
(7a) compute the sample distance difference and the sample distance sum of each sample in W;
(7b) find the sample with the largest distance difference, or the sample with the smallest distance sum, and remove it from W;
(7c) if the number of samples in W is still at least the new-sample retention threshold, return to step (7b), otherwise finish;
the sample distance difference and distance sum being computed as follows: for a sample t ∈ W, among the subsets of the current classifier belonging to the same class as t, search for the minimum distance d_same_min between those subsets and t; among the subsets of the current classifier belonging to a different class from t, search for the maximum distance d_diff_max between those subsets and t; the sample distance difference of t is then Δd_t- = d_same_min - d_diff_max, and the sample distance sum is Δd_t+ = d_same_min + d_diff_max.
2. The incremental learning classification method under limited storage resources according to claim 1, characterized in that the inter-subset adjustment of step (5) is performed as follows: apply k-means clustering separately to the first and second subsets exhibiting sample overlap, decomposing each into two smaller subsets; if the sample overlap is not eliminated, continue decomposing until the overlap is eliminated or the sample count of a resulting subset falls below the subset-size threshold.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CNA2008102463159A CN101604394A (en) | 2008-12-30 | 2008-12-30 | Incremental learning classification method under limited storage resources
Publications (1)
Publication Number | Publication Date |
---|---|
CN101604394A true CN101604394A (en) | 2009-12-16 |
Family
ID=41470114
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CNA2008102463159A Pending CN101604394A (en) | 2008-12-30 | 2008-12-30 | Increment study classification method under a kind of limited storage resources |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN101604394A (en) |
Cited By (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102831432A (en) * | 2012-05-07 | 2012-12-19 | 江苏大学 | Redundant data reducing method suitable for training of support vector machine |
CN104866587A (en) * | 2015-05-28 | 2015-08-26 | 成都艺辰德迅科技有限公司 | Data mining method based on Internet of Things |
CN105844286A (en) * | 2016-03-11 | 2016-08-10 | 博康智能信息技术有限公司 | Newly added vehicle logo identification method and apparatus |
CN105844660A (en) * | 2015-01-16 | 2016-08-10 | 江苏慧眼数据科技股份有限公司 | Particle filter pedestrian tracking method based on spatial BRS |
CN106127257A (en) * | 2016-06-30 | 2016-11-16 | 联想(北京)有限公司 | A kind of data classification method and electronic equipment |
CN109389162B (en) * | 2018-09-28 | 2019-11-19 | 北京达佳互联信息技术有限公司 | Sample image screening technique and device, electronic equipment and storage medium |
CN110659667A (en) * | 2019-08-14 | 2020-01-07 | 平安科技(深圳)有限公司 | Picture classification model training method and system and computer equipment |
CN111092894A (en) * | 2019-12-23 | 2020-05-01 | 厦门服云信息科技有限公司 | Webshell detection method based on incremental learning, terminal device and storage medium |
CN111368926A (en) * | 2020-03-06 | 2020-07-03 | 腾讯科技(深圳)有限公司 | Image screening method, device and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
C06 | Publication | ||
PB01 | Publication | ||
C10 | Entry into substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
C02 | Deemed withdrawal of patent application after publication (patent law 2001) | ||
WD01 | Invention patent application deemed withdrawn after publication |
Open date: 20091216 |