CN103593470A

CN103593470A - Double-degree integrated unbalanced data stream classification algorithm

Info

Publication number: CN103593470A
Application number: CN201310624425.5A
Authority: CN
Inventors: 张重生
Original assignee: Henan University
Current assignee: Henan University
Priority date: 2013-11-29
Filing date: 2013-11-29
Publication date: 2014-02-19
Anticipated expiration: 2033-11-29
Also published as: CN103593470B

Abstract

The invention discloses a double-degree integrated unbalanced data stream classification algorithm. The double-degree integrated unbalanced data stream classification algorithm includes a balanced data stream classification model prediction stage, a classification reliability evaluation stage and an unbalanced data stream classification model prediction stage. In the balanced data stream classification model prediction stage, firstly, a balanced data stream classification model predicts the classification of each data record. In the classification reliability evaluation stage, reliability evaluation is conducted on the classification results obtained in the balanced data stream classification model prediction stage, the classification results of the records with high reliability are directly sent back to a user, and the data records with low reliability need to be classified again in the unbalanced data stream classification model prediction stage. The method embodied in the double-degree integrated unbalanced data stream classification algorithm can be widely applied to applications such as computer-assisted clinical diagnosis and real-time intrusion detection, and the invention belongs to the field of artificial intelligence applications.

Description

A double-degree integrated unbalanced data stream classification algorithm

Technical field

The present invention relates to a data stream classification algorithm, and in particular to a double-degree integrated unbalanced data stream classification algorithm.

Background technology

In recent years, data mining technology has been applied more and more widely across industries, including computer-assisted clinical diagnosis, Internet-based recommendation and advertising systems, customer segmentation, financial data analysis, and abnormal transaction monitoring. Such industry-oriented intelligent analysis and decision systems have been widely accepted.

In many practical applications the distribution of data is unbalanced, also called skewed. For example, 90% of the data records belong to class A, so A is called the majority class, while only 10% of the data records belong to class B, so B is called the minority class. In financial data analysis, for instance, most transactions are normal and only a very small number are abnormal. When a classification technique is used to discover the patterns of abnormal transactions, learning those patterns from a small number of abnormal transaction records and building an abnormal-transaction classification model is an extremely challenging task: the model must identify abnormal transactions fairly accurately and, at the same time, must not misjudge normal transactions as abnormal. In other words, the model should classify both abnormal and normal transactions fairly accurately.

Many practical data mining applications must process not only static data but also large volumes of streaming data, that is, data streams. Examples include social media mining, website click-stream analysis, stock trading analysis, event detection, and sensor data processing. In these applications, data streams with unbalanced (skewed) distributions are common. Although existing classification algorithms can improve the classification accuracy for the minority class in an unbalanced data stream, they reduce the classification accuracy for the majority class. A better classification algorithm for unbalanced data streams is therefore needed, one that predicts the minority-class data records in an unbalanced data stream fairly accurately while still maintaining the classification accuracy for majority-class data records.

Summary of the invention

The object of the present invention is to provide a double-degree integrated unbalanced data stream classification algorithm that predicts the minority class in an unbalanced data stream fairly accurately while still maintaining the classification accuracy for majority-class data records.

The present invention adopts the following technical solution:

A double-degree integrated unbalanced data stream classification algorithm, comprising the following steps:

A: Training stage of the balanced and unbalanced data stream classification models: each newest data stream record block in the training data set is divided into a training set and a validation set; one balanced classification model and one unbalanced classification model are trained on the training set; the n balanced classification models and the n unbalanced classification models with the highest classification accuracy on the validation set are retained;

B: The n balanced data stream classification models and the n unbalanced data stream classification models from step A are used to classify the data records in the validation set and to evaluate the reliability of the results, finally yielding the optimized confidence threshold δ;

C: The n balanced data stream classification models and the n unbalanced data stream classification models from step A are used to classify each data record in the test data set, and the final classification result is output.

In step B, a data-driven method is used to determine the optimized confidence threshold δ on the validation set, as follows:

Let m1 denote the classification accuracy and m2 denote the geometric mean of the classification sensitivity and specificity. Initialize the variables d = 1.0 and t = 0. On the validation data set, the following operation is executed in a loop: starting from 0, the value of δ is increased by 0.02 each time, and the distance l between the point (m1, m2) corresponding to this δ value and the point (1, 1) is computed; if l is less than d, then d = l and t = δ. The loop ends when δ = 1. After the loop ends, the current value of t is assigned to δ, and this δ value is the optimized confidence threshold.
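The quantities used in this search can be written out explicitly. The block below is a sketch under stated assumptions, not a formula given in the patent: the text says only "distance", so the Euclidean distance is assumed, and the minority class is assumed to be the positive class when computing sensitivity and specificity. TP, TN, FP and FN (the confusion-matrix counts on the validation set for a given δ) are notation introduced here for illustration.

```latex
\begin{aligned}
m_1(\delta) &= \frac{TP + TN}{TP + TN + FP + FN}, \\
m_2(\delta) &= \sqrt{\frac{TP}{TP + FN} \cdot \frac{TN}{TN + FP}}, \\
l(\delta)   &= \sqrt{\bigl(1 - m_1(\delta)\bigr)^2 + \bigl(1 - m_2(\delta)\bigr)^2}, \\
\delta^{*}  &= \arg\min_{\delta \in \{0,\, 0.02,\, \dots,\, 1\}} l(\delta).
\end{aligned}
```

Under these assumptions a perfect classifier sits at the point (1, 1), so l measures how far a candidate threshold is from perfect accuracy and perfect balance between the two classes.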

In step C, classifying each data record u in the test data set comprises the following steps:

C1: the ensemble of the n retained balanced data stream classification models first classifies u;

C2: the confidence r(u) of the classification result for u is computed; a classification result whose confidence r(u) is greater than the optimized confidence threshold δ is returned directly to the user;

C3: if the confidence r(u) of the classification of u is lower than the optimized confidence threshold δ, the ensemble of the n unbalanced classification models classifies u again and returns the final classification result.

In step A, training a balanced data stream classification model comprises the following steps:

A11: simple random sampling is performed on the training set with sample size s, without distinguishing the classes of the samples; the resulting sample is denoted T1;

A12: a classification algorithm is used to train a classification model on T1; this classification model is called one balanced data stream classification model;

A13: the existing balanced data stream classification models are tested: if the total number of balanced data stream classification models exceeds n, the existing models are tested one by one on the validation set and the model with the worst classification accuracy is eliminated, until the number of remaining balanced data stream classification models equals n.
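Steps A11 to A13 can be sketched in Python as follows. This is a minimal illustration, not the patent's implementation: the patent does not name the classification algorithm, so `fit` is a hypothetical caller-supplied training function, and a model is represented simply as a callable from features to a label. All function names are assumptions introduced here.

```python
import random
from typing import Callable, List, Sequence, Tuple

Record = Tuple[Sequence[float], int]        # (features, label)
Model = Callable[[Sequence[float]], int]    # maps features to a predicted label


def accuracy(model: Model, validation: Sequence[Record]) -> float:
    """Fraction of validation records that the model labels correctly."""
    correct = sum(1 for x, y in validation if model(x) == y)
    return correct / len(validation)


def train_balanced_model(train: Sequence[Record], s: int,
                         fit: Callable[[List[Record]], Model],
                         rng: random.Random) -> Model:
    """A11-A12: draw s records without regard to class, then fit a model on them."""
    t1 = rng.sample(list(train), s)   # simple random sample T1, requires s <= len(train)
    return fit(t1)                    # `fit` is a hypothetical learner, not from the patent


def prune_pool(pool: Sequence[Model], validation: Sequence[Record], n: int) -> List[Model]:
    """A13: repeatedly drop the least accurate model until at most n remain."""
    kept = list(pool)
    while len(kept) > n:
        worst = min(kept, key=lambda m: accuracy(m, validation))
        kept.remove(worst)
    return kept
```

Re-scoring the pool on every eviction keeps the sketch close to the wording of A13. Sorting the pool once by validation accuracy would give the same result with fewer evaluations.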

In step A, training one unbalanced data stream classification model comprises the following steps:

A21: the minority-class data records in the training set of each data stream record block are collected and put into a minority-class record container; if the total number of data records in the container exceeds the specified number s, the oldest data records are eliminated until the number of remaining data records equals s;

A22: the training set Tr is first sampled by simple random sampling with sample size s/2, without distinguishing the classes of the samples; the data records in the minority-class record container are then sampled by simple random sampling, also with sample size s/2; the two samples are combined to form the newest sample, denoted T2;

A23: a classification algorithm is used to train a classification model on T2; this classification model is called one unbalanced data stream classification model;

A24: the existing unbalanced data stream classification models are tested: if the total number of unbalanced data stream classification models exceeds n, the existing models are tested one by one on the validation set Va and the model with the worst classification accuracy is eliminated, until the number of remaining unbalanced data stream classification models equals n.
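The minority-class record container of A21 and the two-part sampling of A22 might look like the sketch below. It rests on some assumptions: a `collections.deque` with `maxlen=s` stands in for the container because it discards its oldest entries automatically, and when the container holds fewer than s/2 records (for example during the first blocks) the sketch simply takes all of them, a case the patent does not address. The function names are hypothetical.

```python
import random
from collections import deque
from typing import Deque, List, Sequence, Tuple

Record = Tuple[Sequence[float], int]   # (features, label)


def update_minority_container(container: Deque[Record],
                              train: Sequence[Record],
                              minority_label: int) -> None:
    """A21: append this block's minority records; deque(maxlen=s) evicts the oldest."""
    for record in train:
        if record[1] == minority_label:
            container.append(record)


def sample_t2(train: Sequence[Record], container: Deque[Record],
              s: int, rng: random.Random) -> List[Record]:
    """A22: s/2 records from Tr (class-blind) plus s/2 records from the container."""
    half = s // 2
    from_train = rng.sample(list(train), half)
    # Assumption: if the container is still small, take everything it holds.
    from_minority = rng.sample(list(container), min(half, len(container)))
    return from_train + from_minority
```

Usage: create the container once with `deque(maxlen=s)`, call `update_minority_container` for each new block, then pass the result of `sample_t2` to the chosen learner to obtain the unbalanced model of A23.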

By using three stages, namely balanced data stream classification model prediction, classification reliability evaluation, and unbalanced data stream classification model prediction, the present invention classifies new, unclassified data in an unbalanced data stream in real time and fairly accurately. With the present invention, minority-class records can be predicted fairly accurately, and the probability that the classifier misjudges majority-class records as minority-class is greatly reduced. The method is therefore of great significance for classifying data streams with unbalanced distributions. It addresses the problem, in data stream applications, of how to discover classification rules from an unbalanced real-time data stream and classify new, unclassified data in real time and fairly accurately. The method belongs to the field of artificial intelligence applications and can be widely used in applications such as computer-assisted clinical diagnosis and real-time intrusion detection.

Brief description of the drawings

Fig. 1 is a schematic flow chart of the present invention.

Embodiment

As shown in Fig. 1, the double-degree integrated unbalanced data stream classification algorithm uses a window-based data stream model, which cuts the data stream successively into data stream record blocks of equal size, so that every data stream record block contains the same number of data records. The main parameters used in this patent are:

b: the number of data records in a data stream record block.

s: the sample size for sampling, with s < b; s is also the size of the minority-class record container.

n: the number of data stream record blocks that the window-based data stream model can retain.

The algorithm comprises the following steps:

A: Training stage of the balanced and unbalanced data stream classification models: each newest data stream record block is divided, in a ratio of 90% to 10%, into a training data set Tr and a validation data set Va, and one balanced data stream classification model and one unbalanced data stream classification model are trained on Tr.
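Cutting the stream into blocks of b records and splitting each block 90% to 10% can be sketched as below. Two points are assumptions rather than statements from the patent: a trailing block with fewer than b records is dropped (the text requires all blocks to have the same size), and the split is made after a random shuffle (the text gives only the ratio). The function names are hypothetical.

```python
import random
from itertools import islice
from typing import Iterable, Iterator, List, Tuple, TypeVar

T = TypeVar("T")


def record_blocks(stream: Iterable[T], b: int) -> Iterator[List[T]]:
    """Yield consecutive blocks of exactly b records; a trailing partial block is dropped."""
    if b <= 0:
        raise ValueError("b must be positive")
    it = iter(stream)
    while True:
        block = list(islice(it, b))
        if len(block) < b:
            return
        yield block


def split_block(block: Iterable[T], rng: random.Random,
                train_fraction: float = 0.9) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the block and split it into (Tr, Va)."""
    records = list(block)
    rng.shuffle(records)                       # assumption: random rather than ordered split
    cut = round(len(records) * train_fraction)
    return records[:cut], records[cut:]
```

Because `record_blocks` is a generator, it consumes the stream lazily and never holds more than one block in memory at a time.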

In step A, training one balanced data stream classification model comprises the following steps:

A11: simple random sampling is performed on Tr with sample size s, without distinguishing the classes of the samples; the resulting sample is denoted T1;

A13: the existing balanced data stream classification models are tested: if the total number of balanced data stream classification models exceeds n, the existing models are tested one by one on Va and the model with the worst classification accuracy is eliminated, until the number of remaining balanced data stream classification models equals n;

A21: the minority-class data records in the training set of each data stream record block are collected and put into a minority-class record container. If the total number of data records in the container exceeds the specified number s, the oldest data records are eliminated until the number of remaining data records equals s;

B: The n balanced data stream classification models and the n unbalanced data stream classification models from step A are used to classify the data records in Va and to evaluate the reliability of the results, yielding the optimized confidence threshold δ.

E1 is the classifier formed by integrating the n balanced data stream classification models from step A, and E2 is the classifier formed by integrating the n unbalanced data stream classification models from step A. E1 and E2 each predict the class of a data record by majority voting among their member models.
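Majority voting among member models can be sketched as follows. The function name and the representation of a model as a callable are assumptions made for illustration. The patent does not say how ties are broken; here the label seen first among the tied labels wins, which is how `Counter.most_common` orders equal counts.

```python
from collections import Counter
from typing import Callable, Hashable, Sequence, Tuple


def majority_vote(models: Sequence[Callable[[object], Hashable]],
                  x: object) -> Tuple[Hashable, float]:
    """Return the label chosen by most member models and its share of the votes."""
    votes = Counter(model(x) for model in models)
    label, count = votes.most_common(1)[0]
    return label, count / len(models)
```

The returned vote share is one possible way to obtain a class probability for an ensemble, but the patent does not state that E1 derives its probabilities this way.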

In step B, the confidence of a classification result is computed as follows:

B1: for a binary classifier, r(x) is defined as the absolute value of the difference between the probabilities, predicted by the classifier, that a data record x belongs to each of the two classes. Let P(x ∈ A) denote the probability that x belongs to class A and P(x ∈ B) the probability that x belongs to class B; then r(x) = |P(x ∈ A) - P(x ∈ B)|, where P(x ∈ A) + P(x ∈ B) = 1. The larger the value of r(x), the higher the confidence of the binary classifier's result; conversely, the smaller the value of r(x), the lower the confidence.
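The definition in B1 translates directly into code. In this small sketch the function and parameter names are hypothetical; `p_a` stands for P(x ∈ A) as predicted by the classifier.

```python
def confidence(p_a: float) -> float:
    """r(x) = |P(x in A) - P(x in B)|, with P(x in B) = 1 - P(x in A)."""
    if not 0.0 <= p_a <= 1.0:
        raise ValueError("p_a must be a probability in [0, 1]")
    p_b = 1.0 - p_a
    return abs(p_a - p_b)
```

A prediction of 0.5 for class A gives r(x) = 0 (the classifier cannot tell the classes apart), while a prediction of 0 or 1 gives r(x) = 1.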

In step B, the confidence threshold δ is computed as follows:

B2: a data-driven method is used to determine the optimized confidence threshold. Let m1 denote the classification accuracy and m2 denote the geometric mean of the classification sensitivity and specificity. Initialize the variables d = 1.0 and t = 0. On the validation data set Va from step A, the following operation is executed in a loop: starting from 0, the value of δ is increased by 0.02 each time, and the distance between the point (m1, m2) corresponding to this δ value and the point (1, 1) is computed; each iteration retains the δ value whose point (m1, m2) is closest to the point (1, 1). The loop ends when δ = 1. The specific procedure is as follows:

Data-driven method (Algorithm 1):

Input: the validation data set Va; the classifier E1 integrating the n balanced data stream classification models from step A; the classifier E2 integrating the n unbalanced data stream classification models from step A

Output: the optimal value of the parameter δ

begin

    δ ← 0, d ← 1.0;

    for t = 0:0.02:1 {    // t ranges over [0, 1] and increases by 0.02 per iteration

        for each u in Va {    // u is a data record in Va

            first classify u with classifier E1;

            then compute r(u) from the classification result;

            if (r(u) < t) {

                reclassify u with classifier E2;
            }
        }

        compute (m1, m2) over Va and its distance l to the point (1, 1);

        if (l < d) {

            δ = t;

            d = l;
        }
    }

end

After the loop ends, the value of δ is the optimized confidence threshold.
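The search above can be sketched in runnable Python. This is an illustration under stated assumptions, not the patent's code: `e1` is a hypothetical callable returning a `(label, confidence)` pair, `e2` is a hypothetical callable returning a label, the minority (positive) class is labelled 1, and the distance is taken to be Euclidean. The starting value d = 1.0 follows the text.

```python
import math
from typing import Callable, Sequence, Tuple


def evaluate(y_true: Sequence[int], y_pred: Sequence[int], positive: int = 1) -> Tuple[float, float]:
    """Return (m1, m2): accuracy and the geometric mean of sensitivity and specificity."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    m1 = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return m1, math.sqrt(sensitivity * specificity)


def search_threshold(validation: Sequence[Tuple[object, int]],
                     e1: Callable[[object], Tuple[int, float]],
                     e2: Callable[[object], int],
                     step: float = 0.02) -> float:
    """Grid-search the threshold whose (m1, m2) point lies closest to (1, 1)."""
    y_true = [y for _, y in validation]
    # Both ensembles are evaluated once per record; only the routing depends on t.
    e1_out = [e1(x) for x, _ in validation]
    e2_out = [e2(x) for x, _ in validation]
    delta, d = 0.0, 1.0                       # initial values taken from the text
    for k in range(round(1 / step) + 1):      # t = 0, 0.02, ..., 1
        t = k * step
        y_pred = [lab1 if r >= t else lab2
                  for (lab1, r), lab2 in zip(e1_out, e2_out)]
        m1, m2 = evaluate(y_true, y_pred)
        l = math.hypot(1 - m1, 1 - m2)        # assumed Euclidean distance to (1, 1)
        if l < d:
            delta, d = t, l
    return delta
```

Stepping through integer multiples of `step` rather than repeatedly adding 0.02 avoids accumulating floating-point error over the 51 iterations.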

C: Each data record in the test data set Test is classified.

In step C, classifying any data record u in the test data set Test comprises the following steps:

C1: u is first classified using the classifier E1 from step B;

C2: r(u) is computed using the confidence computation method of step B1;

C3: if r(u) ≥ δ, the classification result of E1 is output; if r(u) < δ, u is classified again using the classifier E2 from step B, and the classification result of E2 is output.
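Steps C1 to C3 amount to a two-stage cascade, sketched below. As an assumption made for illustration, `e1` is a hypothetical callable that returns both its label and the confidence r(u), and `e2` is a hypothetical callable that returns a label.

```python
from typing import Callable, Hashable, Tuple


def classify(u: object,
             e1: Callable[[object], Tuple[Hashable, float]],
             e2: Callable[[object], Hashable],
             delta: float) -> Hashable:
    """C1-C3: return E1's label when r(u) >= delta, otherwise defer to E2."""
    label, r = e1(u)       # C1 and C2: classify with E1 and obtain r(u)
    if r >= delta:         # C3: E1's result is trusted
        return label
    return e2(u)           # C3: low confidence, so E2 reclassifies u
```

Only records that E1 is unsure about reach E2, which is how the scheme limits the number of majority-class records exposed to the minority-biased models.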

The present invention divides the classification of an unbalanced data stream into three stages: balanced data stream classification model prediction, classification reliability evaluation, and unbalanced data stream classification model prediction. In the balanced data stream classification model prediction stage, the classifier E1 from step B, which integrates the n balanced classifiers, predicts the class of each data record. In the classification reliability evaluation stage, the reliability of E1's classification results is evaluated; the classification results of records with high confidence are returned directly to the user and do not pass through the unbalanced data stream classification model prediction stage. Records with low confidence are classified again by the classifier E2 from step B, which integrates the n unbalanced classifiers, and the classification result of E2 is output.

The overall flow of the present invention is as follows:

Overall algorithm:

Input: the training set Train of the data stream; the test set Test of the data stream

Output: the classification results for the Test data set

begin

    divide Train into n data stream record blocks D1, D2, ..., Dn, each of size b;

    for i = 1:1:n {    // i ranges over [1, n] and increases by 1 per iteration

        divide the data stream record block Di into a training set Tr and a validation set Va;

        train 1 balanced classifier and 1 unbalanced classifier on Tr using the method of step A;

        retain the n balanced classifiers and the n unbalanced classifiers with the best classification performance on Va;

        solve for the optimal threshold δ on Va using Algorithm 1 (the data-driven method);
    }

    for each u in Test {    // u is a data record in Test

        first classify u with the classifier E1 integrating the n balanced classifiers;

        then compute r(u) from the classification result of E1;

        if (r(u) < δ) {

            reclassify u with the classifier E2 integrating the n unbalanced classifiers;
        }
    }

end

Claims

1. A double-degree integrated unbalanced data stream classification algorithm, characterized by comprising the following steps:

2. The double-degree integrated unbalanced data stream classification algorithm according to claim 1, characterized in that: in step B, a data-driven method is used to determine the optimized confidence threshold δ on the validation set, as follows:

3. The double-degree integrated unbalanced data stream classification algorithm according to claim 1, characterized in that: in step C, classifying each data record u in the test data set comprises the following steps:

4. The double-degree integrated unbalanced data stream classification algorithm according to any one of claims 1-3, characterized in that: in step A, training a balanced data stream classification model comprises the following steps:

A13: the existing balanced data stream classification models are tested: if the total number of balanced data stream classification models exceeds n, the existing models are tested one by one on the validation set and the model with the worst classification accuracy is eliminated, until the number of remaining balanced data stream classification models equals n.

5. The double-degree integrated unbalanced data stream classification algorithm according to claim 4, characterized in that: in step A, training one unbalanced data stream classification model comprises the following steps:

A23: a classification algorithm is used to train a classification model on T2; this classification model is called one unbalanced data stream classification model;