CN101827002B

CN101827002B - Concept drift detection method of data flow classification

Info

Publication number: CN101827002B
Application number: CN2010101847267A
Authority: CN
Inventors: 文益民
Original assignee: Guilin University of Electronic Technology
Current assignee: Guilin University of Electronic Technology
Priority date: 2010-05-27
Filing date: 2010-05-27
Publication date: 2012-05-09
Anticipated expiration: 2030-05-27
Also published as: CN101827002A

Abstract

The invention discloses a concept drift detection method of data flow classification, comprising the following steps: (1) data flow partitioning: according to the preset scale d of data blocks, training a classifier when d training samples are collected according to data arriving sequence; (2) adjustment of sliding window: setting the amount K of the classifiers hi in the sliding window; when the amount of the classifiers hi in the sliding window is less than K, automatically adding the newest training classifier hi in the sliding window; when the amount of the classifiers hi in the sliding window is equal to K, updating the classifiers hi in the sliding window; (3) detection of concept drift: when concept detection is required, selecting proper classifier to give out concept judgment from the sliding window with credible majority voting. The invention is the concept drift detection method of data flow classification with simple principle, reliable operation, high detection precision, quick detection speed and broad application range.

Description

A kind of concept drift detection method of data flow classification

Technical field

The present invention is mainly concerned with the intelligent information processing technology field, refers in particular to a kind of detection method of concept drift, is applicable to network invasion monitoring, the user data flow classification problem such as product classification on prediction, the streamline of doing shopping.

Background technology

In social practice, it is the notion time to time change that data comprise that one type of problem is arranged, and just notion produces drift.On the automatic production line, the defective product of close reason can occur continuously, and the variation owing to reason causes the characteristic of defective product also to change thereupon then; In the commercial activity, client's purchase interest time to time change; In the network security, the access module of network changes with the user is different.The common feature of these problems is: the data that constantly produce form a stream; It is unpredictable when new ideas in the data flow produce; The quantity of the notion that data flow comprises is uncertain.Concept drift detects selects proper classifier that new test data is carried out the classification judgement exactly from existing grader, to realize the classification judgement more accurately of this test data.

The data flow classification problem has caused numerous scholars' concern.Schlimmer has studied the data flow classification problem first; STAGGER algorithm (Incremental learning from noisy data [J] Machine Learning has been proposed; 1986; 1 (3): the Incremental Learning Algorithm [J] of a 317-354 noise data. machine learning, 1986,1 (3): 317-354).Widmer, Salganicoff, Harries and Domingos five equilibrium you can well imagine out FLORA, PECS, SPLICE and VFDT.Behind the improvement VFDT such as Wang Tao fVFDT has been proposed.The research of Wang etc. shows: the model that above algorithm is learnt has only reflected the notion that the part latest data comprises; This can cause usually than mistake (Mining concept-drifting data streams usingensemble classifiers [C] //Proceeding of the 9th ACM SIGKDD International Conference onKnowledge Discovery and Data Mining.USA; Washington; 2003:226-23 5 uses data flow [C] // 9th Knowledge Discovery and the data mining international conference collection of thesis that integrated classifier excavates has concept drift; The U.S., Washington, 2003:226-235).Therefore, Chinese scholars begins to attempt utilizing the integrated study strategy to come the concept drift problem of deal with data traffic classification.Street etc. proposed the SEA algorithm (A streaming ensemble algorithm for large-scaleclassification [C] //Proceeding of the 7th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining.USA; San Francisco, 2001:377-382 are used to solve integrated classifier flow algorithm [C] // 7th Knowledge Discovery and the data mining international conference proceeding of extensive classification problem for one kind.The U.S.; San Francisco; 2001:377-382), this algorithm at first keeps the constant method of grader sum to realize the study to concept drift according to old grader in the superseded sliding window of standards of grading, adopts most algorithms of voting to realize concept drift is detected then.Wang etc. then use the most ballot of cum rights algorithm to realize concept drift is detected; The weights of each grader be inversely proportional to its error rate respectively to the data set of most recent collection (Mining concept-drifting data streams using ensembleclassifiers [C] //Proceeding of the 9th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining.USA; Washington; 2003:226-23 5 uses data flow [C] // 9th Knowledge Discovery and the data mining international conference collection of thesis that integrated classifier excavates has concept drift; The U.S.; Washington, 2003:226-235).Kolter etc. proposed the most ballot of dynamic cum rights algorithms (Dynamic weighted majority:a new ensemble method fortracking concept drift [C] //Proceedings of the 3th IEEE Conference on Data Mining.USA; LosAlamitos; The 2003:123-130 most ballot methods [C] of dynamic cum rights // the 3rd data mining international conference of following the tracks of concept drift. the U.S.; Los Alamitos, 2003:123-130).This algorithm is made amendment to the weights of the grader in the sliding window according to the sample that most recent collects; Also use this sample that the grader in the sliding window is carried out incremental learning simultaneously or train a new grader, to improve the detection speed of algorithm concept drift.Sun Yue etc. have proposed a kind of concept drift mining algorithm based on multi-categorizer and (have excavated [J] based on the concept drift in the data flow of multi-categorizer.The automation journal, 2008,34 (1): 93-96).With respect to the SEA algorithm, the common feature of the algorithm of Wang, Kolter and Sun Yue is to eliminate the grader in the sliding window according to weights, utilize weights to realize the detection to concept drift simultaneously, and the calculating of weights all is the sample of gathering according to most recent.Therefore, effective realization of whole algorithms all has individual prerequisite more than---need set the size of sliding window in advance.Yet, in practical problem, be difficult to accomplish this point.

Summary of the invention

The technical problem that the present invention will solve just is: to the technical problem that prior art exists, the present invention provides the concept drift detection method of the data flow classification that a kind of principle is simple, reliable, accuracy of detection is high, detection speed is fast, applied widely.

For solving the problems of the technologies described above, the present invention adopts following technical scheme:

A kind of concept drift detection method of data flow classification is characterized in that step is:

1. data flow piecemeal: the scale d of setting data piece; Sequencing according to data arrives in the data flow; Whenever collect d data, just provide the classification of this d data and be a training set, the data block that is collected is docile and obedient preface is designated as S with the data block that this d data are formed _i, wherein the maximum of 0≤i and i is by the total quantity decision of current training sample, and first data block is designated as S ₀At each S _iGrader h of last training _i, with S _iAs test set by h _iProvide test result TR _i, storage S _i, h _iAnd TR _i

2. sliding window adjustment: set grader h in the sliding window _iQuantity K, grader h in sliding window _iQuantity when being less than K, the grader h of up-to-date training _iAutomatically add sliding window; Grader h in sliding window _iQuantity when equaling K, to the grader h in the sliding window _iUpgrade;

3. concept drift detects: establish grader h in the current sliding window _iQuantity be K ₀, K ₀≤K, carry out when needs carry out two steps of concept drift detection time-division to test data X:

3.1, with all the grader h in the test data X input sliding window _i, calculate by grader in order

Classification results that provides and classification confidence level,

3.2, select in the sliding window the higher grader of classification confidence level to carry out majority ballot automatically, provide the classification of test data X judged that completion is to the detection of concept drift.

As further improvement of the present invention:

In the said step 3.1, establishing current grader is h _j, 0≤j＜K wherein ₀, y is the real classification of X, T _j(X) be grader h _jTo the classification confidence level of test data X, the classification confidence level computational methods as shown in the formula shown in (1),

T_{j} (X) = \{\begin{matrix} \frac{Tp + 1}{Tp + Fp + 1} & if & h_{j} (X) = y \\ \frac{Tp}{Tp + Fp + 1} & if & h_{j} (X) &NotEqual; y \end{matrix} - - - (1)

Tp in the following formula (1) is that test data X is at S _jIn m neighbour in by h _jBe judged as ω _jType and really belong to ω _jThe quantity of the data of class, and Fp is that test data X is at S _jIn m neighbour in by h _jBe judged as ω _jType and don't belong to ω _jThe quantity of the data of class.

The idiographic flow of said step 3.2 is: at first will

By ordering from small to large, with array A [K ₀] storage the adjusted confidence level of respectively classifying subscript, still use

Value after the expression ordering; Calculate T _Shift[j]=T _J+1(X)-T _j(X), 0≤j＜K ₀-1; Scan array T from small to large _Shift, the maximum jump of judgment value is made as k, be designated as under like this in the sliding window A [k+1], A [k+2] ..., A [K ₀-1] grader } is the higher grader of classification confidence level, uses these graders to carry out the majority ballot, provides at last the classification of test data X is judged

Compared with prior art; The invention has the advantages that: the principle of the invention is simple, reliable, accuracy of detection is high, detection speed is fast, applied widely; Through foundation classification confidence level selection sort device; Automatically shielded those graders of those unlikely correct classification X, and selected relatively more sure those graders that X is correctly classified to carry out the majority ballot as far as possible, thereby the real concept drift detects.Therefore, as long as include the relatively more sure grader that X is correctly classified in the sliding window, the size of sliding window does not constitute influence to the classification of X, thereby has reduced the influence that the sliding window size detects concept drift.A plurality of experiments according to adopting this method to carry out show: the present invention has improved generalization ability; Can in the very first time that new ideas produce, detect concept drift, the detectability of concept drift and the learning ability of new ideas are not received the big or small influence of sliding window.

Description of drawings

Fig. 1 is a schematic flow sheet of the present invention;

Fig. 2 is the detailed process sketch map of the present invention in instantiation;

Fig. 3 is the schematic flow sheet when carrying out the concept drift detection among the present invention;

Fig. 4 is an accuracy rate sketch map relatively in the time of can comprising 13 graders at most in the sliding window;

Fig. 5 is an accuracy rate sketch map relatively in the time of can comprising 25 graders at most in the sliding window;

Fig. 6 is an accuracy rate sketch map relatively in the time of can comprising 37 graders at most in the sliding window;

Fig. 7 is an accuracy rate sketch map relatively in the time of can comprising 50 graders at most in the sliding window;

Fig. 8 is an accuracy rate sketch map relatively in the time of can comprising 67 graders at most in the sliding window;

Fig. 9 is the sketch map that how to use training set and test set in the data flow classification;

Figure 10 is the grader quantity K in sliding window ₀Sliding window adjustment sketch map during＜K;

Figure 11 is the grader quantity K in sliding window ₀Sliding window adjustment sketch map during=K.

Embodiment

Below with reference to Figure of description and specific embodiment the present invention is explained further details.

Like Fig. 1, Fig. 2 and shown in Figure 3, the concept drift detection method of data flow classification of the present invention, its idiographic flow is:

1, data flow piecemeal:

The scale d of setting data piece rule of thumb; Sequencing according to data arrives in the data flow whenever collects d data; Just provide the classification of this d data and be a training set, the data block that is collected is docile and obedient preface is designated as S with the data block that this d data are formed by the expert _i, wherein the maximum of 0≤i and i is by the total quantity decision of current training sample, and first data block is designated as S ₀At each S _iGrader h of last training _i, with S _iAs test set by h _iProvide test result TR _i, storage S _i, h _iAnd TR _i

2, sliding window adjustment:

Set the quantity K of grader in the sliding window in advance, when grader quantity was less than K in the sliding window, the grader of up-to-date training added sliding window automatically; And when grader quantity equals K in the sliding window, the grader in the sliding window is upgraded.Promptly when 1≤i＜K+1, grader h _I-1Automatically add sliding window, be designated as E _I-1(like Fig. 2 and shown in Figure 10); When K+1≤i, then the grader in the sliding window is upgraded.The mode of upgrading can take document (A streamingensemble algorithm for large-scale classification [C] //Proceeding of the 7th ACM SIGKDDInternational Conference on Knowledge Discovery and Data Mining.USA; San Francisco, 2001:377-382 are used to solve integrated classifier flow algorithm [C] // 7th Knowledge Discovery and the data mining international conference proceeding of extensive classification problem for one kind.The U.S., San Francisco, the method in 2001:377-382) is calculated grader and grader h in the sliding window respectively _I-1Scoring., the minimum grader of scoring (is made as E when being arranged in sliding window _J0), use grader h _I-1Replacement E _J0, use S simultaneously _I-1And TR _I-1Upgrade S _J0And TR _J0(like Fig. 2 and shown in Figure 11).

The parameter of learning algorithm is relevant with particular problem.As shown in Figure 9, the d value can be set at 4, and the K value can be set at 6, and the i value is 5 to the maximum.

3, concept drift detects:

According to training data stream in the consistent order of sequencing that occurs of notion test data is imported the grader in the sliding window, can check that every study finishes behind the training data piece grader in the sliding window to the detectability (as shown in Figure 9) of concept drift.(the grader quantity of establishing in the current sliding window was K when concept drift detected when carrying out test data X ₀, K ₀≤K) carry out in two steps:

The first step: with all graders in the test data X input sliding window, classification results and classification confidence level that order computation is provided by grader

.If current grader is h _j(0≤j＜K ₀), y is the real classification of X, T _j(X) be grader h _jClassification confidence level to X.The computational methods of classification confidence level are suc as formula shown in (1).

T_{j} (X) = \{\begin{matrix} \frac{Tp + 1}{Tp + Fp + 1} & if & h_{j} (X) = y \\ \frac{Tp}{Tp + Fp + 1} & if & h_{j} (X) &NotEqual; y \end{matrix} - - - (1)

(1) Tp in is that X is at S _jIn m neighbour in by h _jBe judged as ω _jType and really belong to ω _jThe quantity of the data of class, and Fp is that X is at S _jIn m neighbour in by h _jBe judged as ω _jType and don't belong to ω _jThe quantity of the data of class.When each grader is to the classification confidence level of X in calculating sliding window, need to set in advance the big or small m of neighborhood, the size of m is relevant with particular problem, needs the dependence experience to confirm.

Second step: the higher grader of confidence level of selecting automatically to classify in the sliding window carries out the majority ballot.Method is following: right By ordering from small to large, with array A [K ₀] storage the adjusted confidence level of respectively classifying subscript, still use

Value after the expression ordering.Calculate T _Shift[j]=T _J+1(X)-T _j(X), 0≤j＜K ₀-1.Scan array T from small to large _Shift, the maximum jump of judgment value is made as k.Be designated as under like this in the sliding window A [k+1], A [k+2] ..., A [K ₀-1] grader } is the higher grader of classification confidence level.Use these graders to carry out the majority ballot, provide at last the classification of test data X is judged.

Through above step, can be test data X (grader that comprises in the sliding window) from existing grader and select proper classifier to come it is carried out the classification judgement, thereby realize detection concept drift.

Application example: experiment porch is 2.8GHz CPU and 4G RAM; Operating system platform is windows; LibSVM is used in the training of base grader, and the size of buffer memory is used default setting.

The classical data set SEA of test data traffic classification algorithm has been used in experiment.These data centralization data are three-dimensional vector (x ₁, x ₂, x ₃), x _i∈ R, 0.0≤x _i≤10.0.Notion is described as x by order ₁+ x ₂≤b, b ∈ 8,9,7,9.5}, x ₃With x ₁And x ₂Uncorrelated.Therefore, the SEA data set comprises 4 kinds of SEA notions in proper order.Each notion is produced 12500 data respectively at random to be used for training and to be used for testing with 2500 data.D=500, m=5 in experiment.Because d=500, therefore the training set of every conception of species has comprised 25 data blocks in proper order.When sliding window is configured to K=25, can guarantee that each the basic grader in sliding window sometime belongs to a notion.

Experiment divides two kinds, and the notion that sliding window comprises in first kind of experiment is no more than 3 kinds.In this experiment, notion successively is arranged to b=8, b=9, b=7, b=9.5.Therefore, concept drift will appear in the data flow 3 times.In each time experiment, sliding window is arranged to K=13, K=25, K=37, K=50 respectively.The sliding window size is configured to K=63 in second kind of experiment, and the notion that comprises in the sliding window has 3 kinds at least.Notion successively is arranged to b=8, b=9, b=7, b=8, b=9.5, and just notion b=8 is repeated once.4 concept drifts appear in the data flow.Therefore, when the notion of second b=8 occurs, also include the data block that belongs to first b=8 notion in the sliding window certainly.

Each experiment is repeated 100 times, and experimental result is the mean value of 100 experiments.Experimental result such as Fig. 4-shown in Figure 8.SEA method among Fig. 4-Fig. 8 from (A streaming ensemble algorithm for large-scaleclassification [C] //Proceeding of the 7th ACM SIGKDD International Conference on KnowledgeDiscovery and Data Mining.USA; San Francisco, 2001:377-382 are used to solve integrated classifier flow algorithm [C] // 7th Knowledge Discovery and the data mining international conference proceeding of extensive classification problem for one kind.The U.S., San Francisco, 2001:377-382), and the method that CMV-SEA is the present invention to be proposed.

Can find out from Fig. 4-Fig. 7: (1) under various sliding window size conditions, the CMV_SEA algorithm is all fast than SEA algorithm to the detection speed of concept drift.After first data block that belongs to new ideas was learnt, the generalization ability of CMV_SEA algorithm was obviously promoted at once.And the SEA algorithm need be waited until several data blocks that belong to new ideas and learnt later generalization ability and just can get a promotion; When (2) the sliding window size was K=37 or K=50, the SEA algorithm descended to the recognition capability of new ideas, the detection appearance of new ideas is delayed time, and the recognition capability of new ideas is difficult to recover, and the CMV_SEA algorithm is very stable to the recognition capability of new ideas.As can beappreciated from fig. 8: when notion b=7 changed to second b=8 notion, the CMV_SEA algorithm did not occur occurring changing significantly as the accuracy rate of SEA algorithm before and after occurring when second b=8 notion, but remains unchanged.

Can know that by Fig. 4-Fig. 8 effect of the present invention is: through foundation classification confidence level selection sort device; Automatically those graders of those unlikely correct classification X have been shielded; And select relatively more sure those graders that X is correctly classified to carry out the majority ballot as far as possible, thereby the real concept drift detects.Therefore, as long as include the relatively more sure grader that X is correctly classified in the sliding window, the size of sliding window does not constitute influence to the classification of X, thereby has reduced the influence that the sliding window size detects concept drift.A plurality of experiments according to adopting this method to carry out show: the present invention has improved generalization ability; Can in the very first time that new ideas produce, detect concept drift; The detectability of concept drift and the learning ability of new ideas are not received the big or small influence of sliding window.

Below only be preferred implementation of the present invention, protection scope of the present invention also not only is confined to the foregoing description, and all technical schemes that belongs under the thinking of the present invention all belong to protection scope of the present invention.Should be pointed out that for those skilled in the art some improvement and retouching not breaking away under the principle of the invention prerequisite should be regarded as protection scope of the present invention.

Claims

1. the concept drift detection method of a data flow classification is characterized in that step is:

Classification results that provides and classification confidence level; In the said step 3.1, establishing current grader is h _j, 0≤j＜K wherein ₀, y is the real classification of X, T _j(X) be grader h _jTo the classification confidence level of test data X, the classification confidence level computational methods as shown in the formula shown in (1),

T_{j} (X) = \{\begin{matrix} \frac{Tp + 1}{Tp + Fp + 1} & if & h_{j} (X) = y \\ \frac{Tp}{Tp + Fp + 1} & if & h_{j} (X) &NotEqual; y \end{matrix} - - - (1)

Tp in the following formula (1) is that test data X is at S _jIn m neighbour in by h _jBe judged as ω _jType and really belong to ω _jThe quantity of the data of class, and Fp is that test data X is at S _jIn m neighbour in by h _jBe judged as ω _jType and don't belong to ω _jThe quantity of the data of class;

3.2, select in the sliding window the higher grader of classification confidence level to carry out majority ballot automatically, provide the classification of test data X judged that completion is to the detection of concept drift; The idiographic flow of said step 3.2 is: at first will

Value after the expression ordering; Calculate T _Shift[j]=T _J+1(X)-T _j(X), 0≤j＜K ₀-1; Scan array T from small to large _Shift, the maximum jump of judgment value is made as k, be designated as under like this in the sliding window A [k+1], A [k+2] ..., A [K ₀-1] grader } is the higher grader of classification confidence level, uses these graders to carry out the majority ballot, provides at last the classification of test data X is judged.