CN101604394A - Incremental learning classification method under limited storage resources - Google Patents

Incremental learning classification method under limited storage resources

Info

Publication number
CN101604394A
CN101604394A CNA2008102463159A CN200810246315A
Authority
CN
China
Prior art keywords
sample
subset
classifier
new sample
screening
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CNA2008102463159A
Other languages
Chinese (zh)
Inventor
桑农
程婷
张天序
曹治国
唐奇伶
程志利
张�荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Huazhong University of Science and Technology
Original Assignee
Huazhong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huazhong University of Science and Technology filed Critical Huazhong University of Science and Technology
Priority to CNA2008102463159A priority Critical patent/CN101604394A/en
Publication of CN101604394A publication Critical patent/CN101604394A/en
Pending legal-status Critical Current

Links

Images

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention provides an incremental learning classification method under limited storage resources, belonging to the field of pattern recognition. The method uses a minimum-distance classifier and works as follows: new samples are first pre-classified; correctly pre-classified new samples are added to the corresponding subsets of the classifier, while incorrectly pre-classified samples are added to an error-sample set, whose samples are then partitioned by K-means clustering. Representative samples are then screened for each subset of the classifier and of the error-sample set; the screened subsets of the error-sample set are added to the classifier, the classifier is updated, and finally the updated classifier is used to classify new samples. By screening representative samples, the invention both preserves knowledge already learned and acquires new knowledge, achieving higher sample recognition accuracy while reducing storage and computation overhead.

Description

Incremental learning classification method under limited storage resources
Technical field
The invention belongs to the field of pattern recognition, and specifically relates to an incremental learning classification method that, under limited storage resources, continually updates a classifier by learning new samples, thereby improving its classification ability.
Background technology
With the rapid development of networks and the rapid growth of the volume of information, traditional information-mining and knowledge-acquisition techniques face great challenges, and data classification techniques with incremental learning capability are gradually becoming a key technology of intelligent information discovery and mining. Compared with ordinary data classification techniques, incremental learning classification has significant advantages, mainly in two respects: on the one hand, it need not save historical data, reducing the storage space occupied; on the other hand, it can fully exploit previous learning results during new training, so that learning is continuous and the time for subsequent training is greatly reduced.
The various incremental learning methods proposed in recent years all have shortcomings of varying degrees, specifically: (1) some require the previously trained samples when performing incremental learning, occupying storage space; (2) some cannot add new class information during learning, so their classification performance is poor; (3) some are online learning methods that realize incremental learning by iterating over samples one at a time, so their classification efficiency is low.
Summary of the invention
The object of the invention is to provide an incremental learning classification method that improves classification accuracy and efficiency by continually refining the classifier.
The incremental learning classification method under limited storage resources uses a minimum-distance classifier; the method is as follows:
Step (1): pre-classify all new samples with the current classifier; add correctly pre-classified new samples to the corresponding subsets of the current classifier, and add incorrectly pre-classified samples to a set W;
Step (2): if the number of samples in W exceeds a preset sample-count threshold, go to step (3); otherwise, finish;
Step (3): for each subset Ω_i, i = 1, 2, …, m, of the current classifier, m being the total number of subsets in the current classifier, screen representative samples according to the ratio r_i/N of samples already present in Ω_i before pre-classification and the ratio k_i/N of samples newly added to Ω_i by pre-classification, where the value of N is 9·r_i to 10·r_i, r_i is the number of samples already present in Ω_i before pre-classification, k_i is the number of samples newly added to Ω_i after pre-classification, and p_i is the preset number of representative samples kept for Ω_i after screening; p_i satisfies the relation:

Σ_{i=1}^{m} p_i ≤ T,  p_1/V_1 = p_2/V_2 = … = p_m/V_m

where T is the preset total number of representative samples over all subsets of the current classifier, and V_i is the sample covariance matrix of subset Ω_i;
Step (4): cluster the samples in W by K-means according to their classes;
Step (5): search all subsets of W and of the current classifier, and determine whether sample overlap exists; if it does, perform inter-subset adjustment; if not, go to step (6). Sample overlap means that samples that should belong to a first subset are wrongly assigned to a second subset, the first and second subsets belonging to different classes;
Step (6): in W, for each subset whose sample count is at least the subset sample-count threshold, first randomly screen representative samples at a ratio of 1/10 to 1/9 of the subset's current sample count, then add the screened subset to the current classifier, and modify the classifier's structure and parameters to update the classifier;
Step (7): classify all new samples with the updated classifier, and add misclassified new samples to W; if the number of samples in W exceeds the new-sample retention threshold, screen the error samples of W; otherwise, finish.
The error-sample screening is performed as follows:
(7a) compute the distance difference and the distance sum of each sample in W;
(7b) find the sample with the largest distance difference, or the sample with the smallest distance sum, and remove it from W;
(7c) if the number of samples in W is still at least the new-sample retention threshold, return to step (7b); otherwise, finish.
The distance difference and distance sum are computed as follows: for a sample t ∈ W, among the subsets of the current classifier belonging to the same class as t, search for the minimum distance d_s_min between those subsets and t; among the subsets of the current classifier belonging to different classes from t, search for the maximum distance d_d_max between those subsets and t; the distance difference of t is then Δd_t− = d_s_min − d_d_max, and the distance sum is Δd_t+ = d_s_min + d_d_max.
In step (5), inter-subset adjustment is performed as follows: apply k-means clustering separately to the first and second subsets between which sample overlap exists, decomposing each into two smaller subsets; if the sample overlap is not eliminated, continue decomposing, until the overlap is eliminated or the sample count of a resulting subset falls below the subset sample-count threshold.
On the basis of minimum-distance classification (for example minimum Mahalanobis distance), the present invention provides an incremental learning classification method for a minimum-distance classifier under limited storage resources. The method lets the classifier preserve the knowledge it has already learned while effectively learning new knowledge, so that its ability to recognize samples improves continually; the classifier needs only a small number of representative samples and limited storage to review its original knowledge, greatly reducing storage and computation overhead. Experimental results show that although only limited resources are stored, the representative samples maintain high recognition accuracy: while effectively recognizing new samples, the method also keeps a very high recognition rate on previously learned samples, runs fast, and consumes relatively little storage compared with other incremental learning algorithms. It is suited to situations where only part of the whole is known and to dynamic environments, which helps the classifier quickly acquire the ability to recognize new sample types as they appear.
Description of drawings
Fig. 1 is the overall flowchart of the present invention;
Fig. 2 is the representative-sample screening flowchart of the present invention;
Fig. 3 is a schematic diagram of the sample-overlap phenomenon of the present invention.
Embodiment
The classifier adopted by the present invention is a minimum-distance classifier; the Mahalanobis distance, the Euclidean distance, and so on may be selected. Fig. 1 illustrates one embodiment of the invention, which adopts the Mahalanobis distance as the classification criterion.
1. Obtain all current new samples and pre-classify them with the current classifier:
Compute the Mahalanobis distance between each new sample t and every subset of the current classifier, and find the subset S corresponding to the minimum Mahalanobis distance. If the class of S is the same as the known class label of t, add t directly to S and update the new-sample count of S as k = k + 1; if not, add t to the set W.
The Mahalanobis distance between a new sample t and a subset S is d = √((x⃗ − μ⃗)ᵀ V⁻¹ (x⃗ − μ⃗)), where x⃗ is the coordinate (feature) vector of t, μ⃗ is the centre coordinate vector of S, and V is the sample covariance matrix of S.
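The pre-classification step above can be sketched as follows. This is an illustrative Python/NumPy sketch, not part of the patent; all function names and the subset data structure are invented for the example:

```python
import numpy as np

def mahalanobis(x, mu, V):
    """Mahalanobis distance between sample x and a subset with centre mu
    and sample covariance matrix V."""
    d = np.asarray(x, dtype=float) - mu
    return float(np.sqrt(d @ np.linalg.inv(V) @ d))

def preclassify(x, label, subsets, W):
    """Assign x to the nearest subset; on a wrong pre-classification, put x in W.

    Each subset is a dict with keys 'mu', 'V', 'label', 'samples', 'k'.
    Returns True when the pre-classification was correct."""
    dists = [mahalanobis(x, s['mu'], s['V']) for s in subsets]
    nearest = subsets[int(np.argmin(dists))]
    if nearest['label'] == label:
        nearest['samples'].append(x)  # correct: sample joins subset S
        nearest['k'] += 1             # update the new-sample count, k = k + 1
        return True
    W.append((x, label))              # wrong: sample goes to the error set W
    return False
```

For two well-separated subsets around (0, 0) and (10, 10), a sample near (1, 1) labelled like the first subset is pre-classified correctly, while a sample near (9, 9) with that same label lands in W.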
2. Determine whether the number of samples in W exceeds the preset sample-count threshold, a value chosen by the user; if it does, go to step 3; otherwise, finish.
3. Representative-sample screening: to reduce the storage space occupied, the present invention screens some representative samples from the samples of each subset of the current classifier; these representative samples suffice to reflect the characteristics of their subset, and the preset number of representative samples of a subset is 1/10 to 1/9 of the subset's total sample count. An equal-density screening algorithm for representative samples is adopted, whose theoretical validity rests on the premise that every sample is equally important.
Let the subsets of the classifier be {Ω_1, Ω_2, …, Ω_m}, where m is the number of subsets and V_i is the sample covariance matrix of subset Ω_i, i ∈ [1, m].
Let T be the preset total number of representative samples over all subsets of the classifier, and let p_i be the preset number of representative samples kept for Ω_i after screening. By the equal-density principle, p_i should satisfy:

Σ_{i=1}^{m} p_i ≤ T,  p_1/V_1 = p_2/V_2 = … = p_m/V_m
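The two constraints above can be solved in closed form and rounded down so the budget T is never exceeded. The sketch below is an illustration, not the patent's implementation; it assumes each V_i is summarized by a single scalar spread (e.g. the determinant of the covariance matrix), since the ratio p_i/V_i requires a scalar:

```python
import numpy as np

def allocate_quotas(spreads, T):
    """Choose per-subset representative-sample counts p_i with p_i/V_i constant
    (equal-density principle) and sum(p_i) <= T.

    `spreads` holds one scalar V_i per subset, e.g. det of its covariance."""
    v = np.asarray(spreads, dtype=float)
    raw = T * v / v.sum()              # exact continuous solution of the constraints
    return np.floor(raw).astype(int)   # round down so the budget T is respected
```

For example, with spreads [1, 1, 2] and a budget T = 8, the quotas come out as [2, 2, 4]: the widest subset keeps twice as many representatives as each narrow one.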
Suppose subset Ω_i has r_i samples before pre-classification and gains k_i new samples from pre-classification, and take the parameter N in the range 9·r_i to 10·r_i. Then among the representative samples obtained by screening, the fraction retained of the r_i samples already present before pre-classification is r_i/N, and the fraction retained of the k_i samples newly added by pre-classification is k_i/N.
Fig. 2 is a flowchart of one embodiment of representative-sample screening, described as follows:
Compute the Mahalanobis distance between each sample {x_1, x_2, …, x_N} of subset Ω_i and the subset centre, sort the samples by Mahalanobis distance in ascending order, and retain samples taken at intervals from the resulting queue.
Divide the samples within Ω_i into two smaller sets, one containing the samples present before pre-classification and the other the samples newly added by pre-classification, and screen representative samples in each of the two sets according to the aforementioned ratios.
Taking the set of newly added samples as an example: compute the Mahalanobis distance between each sample in the set and the centre of Ω_i, and sort the samples by Mahalanobis distance in ascending order. In the sorted queue, select one sample out of every l as a representative sample, and remove the unselected samples from the set. The spacing l equals the inverse of the ratio k_i/N, i.e. l = N/k_i.
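The sort-and-subsample procedure above can be sketched as follows (an illustrative Python/NumPy sketch, not from the patent; the parameter `keep_ratio` plays the role of k_i/N, and its inverse gives the spacing l):

```python
import numpy as np

def screen_representatives(samples, mu, V, keep_ratio=0.1):
    """Sort samples by Mahalanobis distance to the subset centre (ascending)
    and keep one sample out of every l = round(1 / keep_ratio)."""
    X = np.asarray(samples, dtype=float)
    Vinv = np.linalg.inv(V)
    diffs = X - mu
    # per-sample Mahalanobis distance to the centre
    d = np.sqrt(np.einsum('ij,jk,ik->i', diffs, Vinv, diffs))
    order = np.argsort(d)                      # ascending-distance queue
    l = max(1, int(round(1.0 / keep_ratio)))   # spacing = inverse of the keep ratio
    return X[order[::l]]                       # every l-th sample is a representative
```

With 20 samples and a keep ratio of 1/10, the spacing is l = 10 and two evenly spaced representatives survive, so the retained samples still span the subset from its centre outward.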
4. Cluster the samples in W by K-means according to their classes.
5. Adjust the subsets of W and of the current classifier jointly: search the subsets obtained in step 4 and the subsets of the current classifier for the sample-overlap phenomenon; if it exists, perform inter-subset adjustment; if not, no adjustment is needed.
The sample-overlap phenomenon is the following: referring to Fig. 3, the regions of influence of subsets A2 and B2 overlap spatially, i.e. in the overlapping region there exist samples that originally belong to subset A2 but are wrongly assigned to subset B2, while A2 and B2 do not belong to the same class; this situation is herein called "sample overlap".
The inter-subset adjustment is shown in Fig. 3: decompose the overlapping subsets A2 and B2 each into two smaller subsets by k-means clustering (with k = 2 here); if the sample overlap is not eliminated, continue decomposing; when the sample count of some resulting subset becomes small, i.e. falls below the subset sample-count threshold, that subset stops being decomposed. The subset sample-count threshold is chosen by the user, and may be taken as 1/150 to 1/100 of the total number of new samples used in this round of classification.
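The splitting step can be sketched as below. This is an illustrative Python/NumPy sketch, not the patent's code; it uses Euclidean 2-means with a deterministic initialisation at the bounding-box corners, which is an assumption made for the example — the patent only requires k-means clustering with k = 2:

```python
import numpy as np

def two_means_split(X, iters=20):
    """Decompose a subset X into two smaller subsets with k-means, k = 2."""
    X = np.asarray(X, dtype=float)
    centres = np.array([X.min(axis=0), X.max(axis=0)])  # deterministic init
    for _ in range(iters):
        # assign every sample to its nearest centre
        labels = np.argmin(((X[:, None, :] - centres[None]) ** 2).sum(-1), axis=1)
        for k in (0, 1):
            if np.any(labels == k):                     # update non-empty centres
                centres[k] = X[labels == k].mean(axis=0)
    return X[labels == 0], X[labels == 1]
```

The adjustment of step 5 would then call this split repeatedly on each half while the overlap persists and the halves stay above the subset sample-count threshold.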
6. Process each subset of W whose sample count exceeds the subset sample-count threshold as follows: first randomly screen some samples, at a ratio of 1/10 to 1/9 of the subset's sample count; then add the screened subset to the classifier and modify the classifier's structure and parameters (class count and subset count) to update the classifier.
7. Recognize the new samples again with the updated classifier, and add misrecognized new samples to W.
If the number of samples in W exceeds the new-sample retention threshold (taken as 1/10 to 1/9 of the total number of new samples), screen the error samples.
The error-sample screening method is described as follows:
(8.1) Compute the distance difference and the distance sum of each sample in W. Taking a sample t in W as an example: first compute the Mahalanobis distance between t and every subset of the current classifier; then find the minimum Mahalanobis distance d_s_min between t and the subsets of the same class, and the maximum Mahalanobis distance d_d_max between t and the subsets of different classes; finally compute the distance difference Δd_t− = d_s_min − d_d_max and the distance sum Δd_t+ = d_s_min + d_d_max of t. A same-class subset is one whose class is identical to that of t; a different-class subset is one whose class differs.
(8.2) Find the sample with the largest distance difference, or the sample with the smallest distance sum, and remove it from W.
(8.3) If the number of samples in W is still at least the new-sample retention threshold (taken as 1/10 to 1/9 of the total number of new samples), return to step (8.2); otherwise, finish.
This completes one round of incremental learning classification; the samples remaining in W are added to the new samples of the next round of learning, and W is then emptied.
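Steps (8.1)–(8.3) can be sketched as below. This is an illustrative Python sketch, not the patent's code; it assumes the per-sample distances d_s_min and d_d_max have already been computed, and it resolves the "largest difference or smallest sum" choice of step (8.2) by always dropping the largest-difference sample, which is only one possible reading:

```python
def prune_error_set(samples, d_same_min, d_diff_max, keep):
    """Remove samples from the error set W until fewer than `keep` remain.

    d_same_min[i]: minimum distance from sample i to its same-class subsets.
    d_diff_max[i]: maximum distance from sample i to different-class subsets."""
    idx = list(range(len(samples)))
    while len(idx) >= keep:
        # distance difference: large means the sample sits far from its own class
        diff = {i: d_same_min[i] - d_diff_max[i] for i in idx}
        worst = max(idx, key=lambda i: diff[i])
        # (per step (8.2), the sample with the smallest distance sum
        #  d_same_min + d_diff_max could be removed here instead)
        idx.remove(worst)
    return [samples[i] for i in idx]
```

On ties the earliest sample is removed; the loop stops as soon as the set drops below the retention threshold, mirroring the test in step (8.3).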

Claims (2)

1. An incremental learning classification method under limited storage resources, using a minimum-distance classifier, the method being as follows:
Step (1): pre-classify all new samples with the current classifier; add correctly pre-classified new samples to the corresponding subsets of the current classifier, and add incorrectly pre-classified samples to a set W;
Step (2): if the number of samples in W exceeds a preset sample-count threshold, go to step (3); otherwise, finish;
Step (3): for each subset Ω_i, i = 1, 2, …, m, of the current classifier, m being the total number of subsets in the current classifier, screen representative samples according to the ratio r_i/N of samples already present in Ω_i before pre-classification and the ratio k_i/N of samples newly added to Ω_i by pre-classification, where the value of N is 9·r_i to 10·r_i, r_i is the number of samples already present in Ω_i before pre-classification, k_i is the number of samples newly added to Ω_i after pre-classification, and p_i is the preset number of representative samples kept for Ω_i after screening; p_i satisfies the relation:

Σ_{i=1}^{m} p_i ≤ T,  p_1/V_1 = p_2/V_2 = … = p_m/V_m

where T is the preset total number of representative samples over all subsets of the current classifier, and V_i is the sample covariance matrix of subset Ω_i;
Step (4): cluster the samples in W by K-means according to their classes;
Step (5): search all subsets of W and of the current classifier, and determine whether sample overlap exists; if it does, perform inter-subset adjustment; if not, go to step (6); sample overlap means that samples that should belong to a first subset are wrongly assigned to a second subset, the first and second subsets belonging to different classes;
Step (6): in W, for each subset whose sample count is at least the subset sample-count threshold, first randomly screen representative samples at a ratio of 1/10 to 1/9 of the subset's current sample count, then add the screened subset to the current classifier, and modify the classifier's structure and parameters to update the classifier;
Step (7): classify all new samples with the updated classifier, and add misclassified new samples to W; if the number of samples in W exceeds the new-sample retention threshold, screen the error samples of W; otherwise, finish;
the error-sample screening being performed as follows:
(7a) compute the distance difference and the distance sum of each sample in W;
(7b) find the sample with the largest distance difference, or the sample with the smallest distance sum, and remove it from W;
(7c) if the number of samples in W is still at least the new-sample retention threshold, return to step (7b); otherwise, finish;
the distance difference and distance sum being computed as follows: for a sample t ∈ W, among the subsets of the current classifier belonging to the same class as t, search for the minimum distance d_s_min between those subsets and t; among the subsets of the current classifier belonging to different classes from t, search for the maximum distance d_d_max between those subsets and t; the distance difference of t is then Δd_t− = d_s_min − d_d_max, and the distance sum is Δd_t+ = d_s_min + d_d_max.
2. The incremental learning classification method under limited storage resources according to claim 1, characterized in that step (5) performs inter-subset adjustment as follows: apply k-means clustering separately to the first and second subsets between which sample overlap exists, decomposing each into two smaller subsets; if the sample overlap is not eliminated, continue decomposing, until the overlap is eliminated or the sample count of a resulting subset falls below the subset sample-count threshold.
CNA2008102463159A 2008-12-30 2008-12-30 Incremental learning classification method under limited storage resources Pending CN101604394A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CNA2008102463159A CN101604394A (en) 2008-12-30 2008-12-30 Incremental learning classification method under limited storage resources

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CNA2008102463159A CN101604394A (en) 2008-12-30 2008-12-30 Incremental learning classification method under limited storage resources

Publications (1)

Publication Number Publication Date
CN101604394A true CN101604394A (en) 2009-12-16

Family

ID=41470114

Family Applications (1)

Application Number Title Priority Date Filing Date
CNA2008102463159A Pending CN101604394A (en) 2008-12-30 2008-12-30 Incremental learning classification method under limited storage resources

Country Status (1)

Country Link
CN (1) CN101604394A (en)


Cited By (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102831432A (en) * 2012-05-07 2012-12-19 江苏大学 Redundant data reducing method suitable for training of support vector machine
CN105844660A (en) * 2015-01-16 2016-08-10 江苏慧眼数据科技股份有限公司 Particle filter pedestrian tracking method based on spatial BRS
CN104866587A (en) * 2015-05-28 2015-08-26 成都艺辰德迅科技有限公司 Data mining method based on Internet of Things
CN105844286A (en) * 2016-03-11 2016-08-10 博康智能信息技术有限公司 Newly added vehicle logo identification method and apparatus
CN106127257A (en) * 2016-06-30 2016-11-16 联想(北京)有限公司 A kind of data classification method and electronic equipment
CN109389162B (en) * 2018-09-28 2019-11-19 北京达佳互联信息技术有限公司 Sample image screening technique and device, electronic equipment and storage medium
CN110659667A (en) * 2019-08-14 2020-01-07 平安科技(深圳)有限公司 Picture classification model training method and system and computer equipment
CN111092894A (en) * 2019-12-23 2020-05-01 厦门服云信息科技有限公司 Webshell detection method based on incremental learning, terminal device and storage medium
CN111368926A (en) * 2020-03-06 2020-07-03 腾讯科技(深圳)有限公司 Image screening method, device and computer readable storage medium
CN111368926B (en) * 2020-03-06 2021-07-06 腾讯科技(深圳)有限公司 Image screening method, device and computer readable storage medium

Similar Documents

Publication Publication Date Title
CN101604394A (en) Incremental learning classification method under limited storage resources
CN110764063B (en) Radar signal sorting method based on combination of SDIF and PRI transformation method
Kotsiantis et al. Mixture of expert agents for handling imbalanced data sets
Duarte et al. Vehicle classification in distributed sensor networks
Yang et al. Proportional k-interval discretization for naive-Bayes classifiers
CN105389480A (en) Multiclass unbalanced genomics data iterative integrated feature selection method and system
CN101496035B (en) Method for classifying modes
US6532305B1 (en) Machine learning method
CN100560025C (en) The method for detecting human face that has the combination coefficient of Weak Classifier
CN103617429A (en) Sorting method and system for active learning
CN106228389A (en) Network potential usage mining method and system based on random forests algorithm
CN103942562A (en) Hyperspectral image classifying method based on multi-classifier combining
CN102779281A (en) Vehicle type identification method based on support vector machine and used for earth inductor
CN104391835A (en) Method and device for selecting feature words in texts
CN103092931A (en) Multi-strategy combined document automatic classification method
CN103412888A (en) Point of interest (POI) identification method and device
CN105574547A (en) Integrated learning method and device adapted to weight of dynamically adjustable base classifier
CN105404901A (en) Training method of classifier, image detection method and respective system
CN101251896B (en) Object detecting system and method based on multiple classifiers
CN105825232A (en) Classification method and device for electromobile users
Hu et al. A new rough sets model based on database systems
CN103336771A (en) Data similarity detection method based on sliding window
WO2020024444A1 (en) Group performance grade recognition method and apparatus, and storage medium and computer device
CN110188196A (en) A kind of text increment dimension reduction method based on random forest
US6938049B2 (en) Creating ensembles of decision trees through sampling

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Open date: 20091216