CN112000705B - Unbalanced data stream mining method based on active drift detection - Google Patents

Unbalanced data stream mining method based on active drift detection

Info

Publication number
CN112000705B
CN112000705B (application CN202010239770.7A)
Authority
CN
China
Prior art keywords
data
drift
data stream
sample
majority
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010239770.7A
Other languages
Chinese (zh)
Other versions
CN112000705A (en)
Inventor
张平
邵亨康
李方
陈昕叶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202010239770.7A priority Critical patent/CN112000705B/en
Publication of CN112000705A publication Critical patent/CN112000705A/en
Application granted granted Critical
Publication of CN112000705B publication Critical patent/CN112000705B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Abstract

Aiming at the concept drift characteristic and the imbalance characteristic of data streams, the invention provides an unbalanced data stream mining method based on active drift detection, which comprises the following three steps: (1) unbalanced data processing: some form of imbalance may exist in the data stream, and the unbalanced data needs to be handled efficiently; (2) active concept drift detection: the concept drift phenomenon present in the data stream needs to be detected in real time; (3) drift adaptation: after concept drift is detected, the algorithm needs to be adjusted to adapt to the concept drift. The invention can effectively resolve the imbalance phenomenon existing in the data stream, flexibly cope with various concept drift scenarios including sudden drift, gradual drift, incremental drift and cyclic drift, and effectively detect these drift scenarios and respond in time, thereby improving the accuracy and efficiency of data stream information mining.

Description

Unbalanced data stream mining method based on active drift detection
Technical Field
The invention belongs to the field of intelligent manufacturing and data mining, and particularly relates to an unbalanced data stream mining method based on active drift detection.
Background Art
With the continuing development of computer technology, people now have the ability to store large amounts of data, and computer network technology has brought the collection speed, propagation speed and scale of data to an unprecedented level, enabling globalized information sharing. Traditional data mining techniques mainly study batch processing algorithms, in which a model is trained on all of the data, the data needs to be scanned multiple times, the data come from the same distribution, and the data volume is limited. However, in many practical application scenarios, on the one hand, all the data cannot be acquired at once for model training, and on the other hand, the feature distribution of the data changes continuously over time, as with stock exchange data, medical data, mobile data and sensor network data; such data is called a data stream.
In intelligent manufacturing, since the environment in which a data stream is generated is unstable, the data stream may undergo concept drift, that is, its feature distribution may change continuously as the environment changes. Among the methods for handling concept drift, active drift detection methods are widely used: they first monitor the stability of the data distribution with an explicit concept drift detection mechanism, and then trigger a drift adaptation mechanism to adjust the model so that it adapts to the new environment. For example, the method for classifying stream data based on concept drift (publication number: CN108764322A) handles drift through ensemble learning; the concept drift detection method for multi-label data streams based on class and feature distributions (publication number: CN106934035B) judges the type of drift by the degree of difference between class and feature distributions; and the double-window concept drift detection method based on sample distribution statistical tests (publication number: CN110717543A) combines a fixed window with support vector regression for drift detection. However, each of these algorithms can cope with only a certain type of drift, cannot properly identify the various concept drift situations, and does not consider the imbalance characteristic of the data stream, so their application prospects are limited.
Disclosure of Invention
The invention provides an unbalanced data stream mining method based on active drift detection, which can not only handle the imbalance phenomenon existing in data streams but also detect various concept drift phenomena in real time, and therefore has broad application prospects. The method can efficiently mine data streams in intelligent manufacturing scenarios. It mainly comprises three steps: unbalanced data processing, active concept drift detection, and drift adaptation.
The object of the invention is achieved by at least one of the following technical solutions.
An unbalanced data stream mining method based on active drift detection comprises the following steps:
S1, data imbalance processing: a data stream S is acquired and divided into data blocks B_1, B_2, ..., B_n of equal size, and the unbalanced data in the data stream is processed with the data block as the unit;
S2, drift detection: the concept drift phenomenon existing in the data stream is detected in real time;
S3, drift adaptation: after concept drift is detected, the algorithm is adjusted to adapt to the concept drift.
Further, the idea of sampling is used to process the unbalanced data in the data stream. Sampling methods are divided into oversampling and undersampling, but oversampling may cause the data to be overfitted, while undersampling easily loses some important information. Effectively combining oversampling and undersampling therefore achieves the best sampling effect; the step S1 includes the following steps:
S1.1, sampling the original data block to obtain majority class data and minority class data;
S1.2, forming a plurality of balanced data sub-blocks from the majority class data and the minority class data obtained in step S1.1;
S1.3, combining the plurality of balanced data sub-blocks obtained in step S1.2 into a final balanced data block.
Further, the ratio of the majority data to the minority data in the original data block is set to be IR1, the ratio of the majority data to the minority data after the sampling operation is set to be IR2, and the sampling operation of the original data block in step S1.1 is performed according to the relation between IR1 and IR2, specifically as follows:
increasing the data amount of the minority class data using the SMOTE oversampling method, and decreasing the data amount of the majority class data by identifying noise points or boundary points among the majority class data by a clustering method and then removing the identified noise points and boundary points.
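For illustration, a minimal Python sketch of the majority-class reduction described above is given below. The patent does not name a specific clustering algorithm, so DBSCAN from scikit-learn is used here purely as an example, and the function name reduce_majority and its eps/min_samples parameters are illustrative rather than part of the invention.

    import numpy as np
    from sklearn.cluster import DBSCAN

    def reduce_majority(majority_x, eps=0.5, min_samples=5):
        # Cluster the majority-class points; DBSCAN labels outliers as -1.
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(majority_x)
        keep = labels != -1  # drop points flagged as noise
        return majority_x[keep]

Removing boundary points would require an additional rule on top of this noise-removal step; it is omitted here for brevity.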
Further, the SMOTE oversampling method is an improvement on the random oversampling technique that synthesizes new samples from existing samples; the SMOTE algorithm includes the following steps:
S1.1.1, for each sample X_i, i = 1…I, in the minority class data, where I denotes the number of samples in the minority class data, calculate the Euclidean distance from X_i to all other minority class data samples and obtain the k nearest neighbour samples of X_i;
S1.1.2, determine a sampling rate based on the sample imbalance ratio IR1;
S1.1.3, for each minority class data sample X_i, randomly select M neighbour samples Y_k, k = 1…M, from the k nearest neighbours of X_i;
S1.1.4, for each neighbour sample Y_k, synthesize a new sample of the minority class data according to the following formula:
X_new = Y_k + rand(0,1) * |X_i - Y_k| (1)
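A minimal Python sketch of steps S1.1.1 to S1.1.4 using formula (1) is shown below; the parameter names k and m, the use of scikit-learn's NearestNeighbors, and the handling of the sampling rate are illustrative assumptions and not prescribed by the patent.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote_like_oversample(minority_x, k=5, m=2, rng=None):
        # For each minority sample X_i, find its k nearest minority neighbours,
        # pick M of them at random, and apply formula (1):
        #     X_new = Y_k + rand(0,1) * |X_i - Y_k|
        rng = np.random.default_rng() if rng is None else rng
        nn = NearestNeighbors(n_neighbors=k + 1).fit(minority_x)  # +1: the sample itself
        _, idx = nn.kneighbors(minority_x)
        synthetic = []
        for i, x_i in enumerate(minority_x):
            neighbours = minority_x[rng.choice(idx[i][1:], size=m, replace=False)]
            for y_k in neighbours:
                synthetic.append(y_k + rng.random() * np.abs(x_i - y_k))
        return np.vstack(synthetic)

Calling smote_like_oversample(minority_x) returns the synthetic samples, which are appended to the original minority class data so that the majority-to-minority ratio approaches IR2.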
Further, in step S1.2, since the ratio between the majority class data and the minority class data is now IR2, the majority class data is divided, by random sampling without replacement, into N = IR2 mutually disjoint majority class sub-blocks {MajorityData_j, j = 1, 2, ..., N}, and each of the majority class sub-blocks is then combined with the new samples in the minority class data to form a new balanced data sub-block {BalancedData_j, j = 1, 2, ..., N}, respectively.
Further, in step S1.3, data are randomly extracted from the N = IR2 balanced data sub-blocks {BalancedData_j, j = 1, 2, ..., N} generated in step S1.2 and sequentially combined into the final balanced data block.
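The sub-block splitting and recombination of steps S1.2 and S1.3 can be sketched as follows; rounding N = IR2 to an integer and omitting the class labels are simplifications made here for illustration only.

    import numpy as np

    def build_balanced_block(majority_x, minority_x, ir2, rng=None):
        # Split the majority data into N = IR2 disjoint sub-blocks (sampling without
        # replacement), pair each with the (oversampled) minority data, then recombine
        # everything into one final balanced data block. Labels are omitted for brevity.
        rng = np.random.default_rng() if rng is None else rng
        n = max(int(round(ir2)), 1)                 # N = IR2, rounded for illustration
        perm = rng.permutation(len(majority_x))     # random split without replacement
        majority_subblocks = np.array_split(majority_x[perm], n)
        balanced_subblocks = [np.vstack([sub, minority_x]) for sub in majority_subblocks]
        # draw the sub-blocks' rows in random order and concatenate into the final block
        final = np.vstack([sub[rng.permutation(len(sub))] for sub in balanced_subblocks])
        return final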
Further, since the information of the data stream S is limited, this limited information must be used to determine whether feature drift occurs; both the supervised information (with labels) and the unsupervised information (without labels) in the data stream can be used. The supervised information includes the classification error rate, and the unsupervised information includes the sample mean and variance. In step S2, each data block divided in step S1 contains d_i data points, and the sample mean M_i of data block B_i is defined as follows:

M_i = (1/d_i) * Σ_{j=1..d_i} B_{i,j}

where B_{i,j} refers to the jth data point in the ith data block;
Given d_i and M_i, the sample variance V_i of the ith data block B_i is defined as follows:

V_i = (1/(d_i - 1)) * Σ_{j=1..d_i} (B_{i,j} - M_i)^2
The sample mean M and the sample variance V are unsupervised information of the data stream and can be used to characterize the stability of the data stream S; when the data stream is stable, M and V should follow a stable normal distribution. Based on the samples within data block B_i, the confidence intervals corresponding to M_i and V_i can be calculated, where α is the confidence level of the t distribution and the χ² (chi-square) distribution used to calculate the confidence intervals of M and V;
Because the data stream is continuously input, the interval overlap ratio R_M of the mean and the interval overlap ratio R_V of the variance can be calculated based on two adjacent data blocks.
The interval overlap ratio is the quotient of the intersection of the two adjacent intervals and their union, for example R_M = |CI(M_i) ∩ CI(M_{i+1})| / |CI(M_i) ∪ CI(M_{i+1})|, so its value lies in [0, 1], and the closer the interval overlap ratio is to 1, the closer the two intervals are. Combining the interval overlap ratio of the mean with the interval overlap ratio of the variance gives the internal stability measurement index R of the data stream.
R represents the internal stability of the current data stream; if R is less than the interval overlap threshold θ, the data stream is considered internally unstable at this time, and feature drift may occur.
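The stability check described above can be sketched in Python as follows. The patent names the t distribution and the chi-square distribution, but the exact interval formulas and the way R_M and R_V are combined are not reproduced in this text, so the standard t-based and chi-square-based intervals and a simple average of R_M and R_V are assumptions of this sketch; it also assumes one-dimensional data blocks.

    import numpy as np
    from scipy import stats

    def interval_overlap(a, b):
        # Overlap ratio of two intervals: |intersection| / |union|, in [0, 1].
        lo, hi = max(a[0], b[0]), min(a[1], b[1])
        inter = max(0.0, hi - lo)
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 1.0

    def block_intervals(block, alpha=0.95):
        # t-based interval for the mean and chi-square-based interval for the
        # variance of one data block (assumed forms; see note above).
        d = len(block)
        m, v = block.mean(), block.var(ddof=1)
        t = stats.t.ppf((1 + alpha) / 2, d - 1)
        ci_m = (m - t * np.sqrt(v / d), m + t * np.sqrt(v / d))
        chi_hi = stats.chi2.ppf((1 + alpha) / 2, d - 1)
        chi_lo = stats.chi2.ppf((1 - alpha) / 2, d - 1)
        ci_v = ((d - 1) * v / chi_hi, (d - 1) * v / chi_lo)
        return ci_m, ci_v

    def stability_index(prev_block, cur_block, alpha=0.95):
        # R from two adjacent blocks; averaging R_M and R_V is an illustrative choice.
        ci_m1, ci_v1 = block_intervals(prev_block, alpha)
        ci_m2, ci_v2 = block_intervals(cur_block, alpha)
        r_m = interval_overlap(ci_m1, ci_m2)
        r_v = interval_overlap(ci_v1, ci_v2)
        return (r_m + r_v) / 2.0

A drift alarm would then be raised whenever stability_index(...) falls below the threshold θ.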
Further, the feature drift is divided into two cases: virtual feature drift and real feature drift. Virtual feature drift does not cause the decision boundary to change; it is caused by the incompleteness of the data distribution, so the algorithm does not need to be retrained and only a small amount of new data needs to be added to improve the model. Real feature drift causes the decision boundary to change, and the model must be retrained to adapt to the current dynamic environment;
To judge whether the drift is virtual feature drift or real feature drift, data with supervised information must be introduced for assistance; the classification error rate is effective supervised information. The classification error rate E_i of data block B_i is defined as follows:

E_i = (1/d_i) * Σ_{j=1..d_i} ε_j

where ε_j indicates the classification result of the jth data point in data block B_i, y_j represents the predicted classification result of the data point and label_j represents the true result of the data point; ε_j = 0 if the classification result is correct, otherwise ε_j = 1. Likewise, the confidence interval of E_i can be calculated;
where σ is an interval parameter used to calculate the confidence interval of E and is set according to the drift condition of the data stream: σ is set to 1 when drift is slow and to 3 when drift is fast; Mean() refers to the mean and Var() refers to the variance. Based on two adjacent data blocks, if the classification error rate of the new data block falls outside the confidence interval of E, real feature drift is considered to have occurred at this time; if it falls within the confidence interval, only virtual feature drift is considered to have occurred at this time.
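The real-versus-virtual decision can be sketched as follows; the interval Mean(E) ± σ·sqrt(Var(E)) over the recent blocks' error rates, and the rule that an error rate above the upper bound signals real drift, are assumed readings of the comparison described above.

    import numpy as np

    def drift_type(error_history, new_error, sigma=1.0):
        # error_history: classification error rates E_1..E_i of previous blocks.
        # The interval Mean(E) +/- sigma*sqrt(Var(E)) is an assumed form; sigma=1
        # for slow drift, sigma=3 for fast drift, as described above.
        e = np.asarray(error_history, dtype=float)
        upper = e.mean() + sigma * np.sqrt(e.var())
        return "real" if new_error > upper else "virtual"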
Further, in the step S3, after the concept drift is detected, an algorithm needs to be adjusted to adapt to the concept drift;
the adjustment is mainly divided into two parts, namely virtual concept drift and real concept drift. Virtual concept drift does not change decision boundaries, so the current data block B is i Supplementing the content of B into the existing algorithm i Adding the training set to the original training set, and training the algorithm. The true conceptual drift has changed the decision boundary, which means that the previously obtained decision function has not been suitable for the current data stream environment, so the previous decision function needs to be discarded, the original training data is discarded, and B is utilized i Is a data retraining algorithm.
Compared with the prior art, the invention has the beneficial effects that:
(1) By combining the oversampling technique with the undersampling technique, the data volume of the minority class data is increased on the one hand, and the data volume of the majority class data is reduced on the other hand. This effectively alleviates the imbalance phenomenon in the data stream and thus improves the accuracy of data stream mining.
(2) Meanwhile, the internal stability measurement index R is calculated using the supervised information (classification error rate) and the unsupervised information (sample mean and variance) of the data stream, and the type of drift can be judged according to the magnitude of R: if R < 0.3, sudden drift is occurring; if 0.3 < R < 0.6, gradual drift is occurring; if 0.6 < R < 0.9, incremental drift is occurring; if R > 0.9, no feature drift has occurred; and if drift repeatedly occurs within a short time, cyclic drift is occurring.
Drawings
FIG. 1 is a schematic diagram of an unbalanced data stream mining method based on active drift detection in an embodiment of the present invention;
FIG. 2 is a flowchart of an unbalanced data stream mining method based on active drift detection in an embodiment of the present invention;
FIG. 3 is a schematic diagram of an unbalanced data flow process according to an embodiment of the present invention.
Detailed Description
Specific implementations of the present invention will be described in further detail below with reference to examples and drawings, but embodiments of the present invention are not limited to this example.
Examples:
the invention comprises the following steps:
an unbalanced data stream mining method based on active drift detection, as shown in fig. 1, comprises the following steps:
S1, data imbalance processing. As shown in FIG. 1, the data stream S can be acquired from an external sensor; following a data-block-based mining method, it is divided into data blocks B_1, B_2, ..., B_n of equal size, and the unbalanced data in the data stream is processed with the data block as the unit;
unbalanced data in a data stream is processed mainly by using a sampling idea, and a sampling method is divided into over sampling and under sampling, but the over sampling can cause data to be over-fitted, and the under sampling can easily lose some important information. So that the combination of oversampling and undersampling is effective in achieving an optimal sampling result, said step S1 comprises the steps of:
S1.1, sampling the original data block to obtain majority class data MajorityData and minority class data MinorityData;
First, the ratio IR1 of the majority class data MajorityData to the minority class data MinorityData in the original data block is determined, and the ratio of majority class data to minority class data after the sampling operation is set as IR2; to ensure that the sampled data are balanced, IR2 is usually chosen between 0.5 and 1. The original data block is then sampled according to the relation between IR1 and IR2, as follows:
The data volume of the minority class data MinorityData is increased and the data volume of the majority class data MajorityData is reduced: the increase uses the SMOTE oversampling method, and the reduction is achieved by identifying noise points or boundary points in the majority class data MajorityData with a clustering method and then removing the identified noise points and boundary points;
It must be ensured that, after the sampling operation, the ratio between the data volume of the majority class data MajorityData and the data volume of the minority class data MinorityData is IR2.
The SMOTE oversampling method is an improvement on the random oversampling technique that synthesizes new samples from existing samples; it comprises the following steps:
S1.1.1, for each sample X_i, i = 1…I, in the minority class data, where I denotes the number of samples in the minority class data, calculate the Euclidean distance from X_i to all other minority class data samples and obtain the k nearest neighbour samples of X_i;
S1.1.2, determine a sampling rate based on the sample imbalance ratio IR1;
S1.1.3, for each minority class data sample X_i, randomly select M neighbour samples Y_k, k = 1…M, from the k nearest neighbours of X_i;
S1.1.4, for each neighbour sample Y_k, synthesize a new sample of the minority class data according to the following formula:
X_new = Y_k + rand(0,1) * |X_i - Y_k| (1)
S1.2, since the ratio between the majority class data MajorityData and the minority class data MinorityData is now IR2, the majority class data MajorityData is divided, by random sampling without replacement, into N = IR2 mutually disjoint majority class sub-blocks {MajorityData_j, j = 1, 2, ..., N}, and each majority class sub-block is then combined with the new samples in the minority class data MinorityData to form a new balanced data sub-block {BalancedData_j, j = 1, 2, ..., N}.
S1.3, data are randomly extracted from the N = IR2 balanced data sub-blocks {BalancedData_j, j = 1, 2, ..., N} generated in step S1.2 and sequentially combined into the final balanced data block.
S2, drift detection: the concept drift phenomenon existing in the data stream is detected in real time.
Since the data stream S has limited information, this limited information must be used to determine whether feature drift occurs; both the supervised information (with labels) and the unsupervised information (without labels) in the data stream can be used. The supervised information includes the classification error rate, and the unsupervised information includes the sample mean and variance. Each data block divided in step S1 contains d_i data points, and the sample mean M_i of data block B_i is defined as follows:

M_i = (1/d_i) * Σ_{j=1..d_i} B_{i,j}

where B_{i,j} refers to the jth data point in the ith data block;
Given d_i and M_i, the sample variance V_i of the ith data block B_i is defined as follows:

V_i = (1/(d_i - 1)) * Σ_{j=1..d_i} (B_{i,j} - M_i)^2
The sample mean M and the sample variance V are unsupervised information of the data stream and can be used to characterize the stability of the data stream S; when the data stream is stable, M and V should follow a stable normal distribution. Based on the samples within data block B_i, the confidence intervals corresponding to M_i and V_i can be calculated, where α is the confidence level of the t distribution and the χ² (chi-square) distribution used to calculate the confidence intervals of M and V;
Because the data stream is continuously input, the interval overlap ratio R_M of the mean and the interval overlap ratio R_V of the variance can be calculated based on two adjacent data blocks.
The interval overlap ratio is the quotient of the intersection of the two adjacent intervals and their union, for example R_M = |CI(M_i) ∩ CI(M_{i+1})| / |CI(M_i) ∪ CI(M_{i+1})|, so its value lies in [0, 1], and the closer the interval overlap ratio is to 1, the closer the two intervals are. Combining the interval overlap ratio of the mean with the interval overlap ratio of the variance gives the internal stability measurement index R of the data stream.
R represents the internal stability condition of the current data stream; if R is smaller than the interval overlap threshold θ, the data stream is considered internally unstable at this time and feature drift may occur. The interval overlap thresholds θ are generally set to 0.3, 0.6 and 0.9: if R < 0.3, sudden drift is occurring; if 0.3 < R < 0.6, gradual drift is occurring; if 0.6 < R < 0.9, incremental drift is occurring; if R > 0.9, no feature drift has occurred; and if drift repeatedly occurs within a short period of time, cyclic drift is occurring.
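For illustration, the mapping from R to a drift category can be written as the small helper below; the bookkeeping for "drift repeatedly occurring within a short time" (recent_drift_count and recur_limit) is an assumption, since the patent does not fix a window or count.

    def classify_by_r(r, recent_drift_count=0, recur_limit=3):
        # Map the stability index R onto the drift categories listed above.
        if r > 0.9:
            return "no drift"
        if recent_drift_count >= recur_limit:   # repeated detections in a short window
            return "cyclic drift"
        if r < 0.3:
            return "sudden drift"
        if r < 0.6:
            return "gradual drift"
        return "incremental drift"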
The feature drift is divided into two cases: virtual feature drift and real feature drift. Virtual feature drift does not cause the decision boundary to change; it is caused by the incompleteness of the data distribution, so the algorithm does not need to be retrained and only a small amount of new data needs to be added to improve the model. Real feature drift, by contrast, causes the decision boundary to change, and the model must be retrained to adapt to the current dynamic environment.
To judge whether the drift is virtual feature drift or real feature drift, data with supervised information must be introduced for assistance; the classification error rate is effective supervised information. The classification error rate E_i of data block B_i is defined as follows:

E_i = (1/d_i) * Σ_{j=1..d_i} ε_j

where ε_j indicates the classification result of the jth data point in data block B_i, y_j represents the predicted classification result of the data point and label_j represents the true result of the data point; ε_j = 0 if the classification result is correct, otherwise ε_j = 1. Likewise, the confidence interval of E_i can be calculated;
where σ is an interval parameter used to calculate the confidence interval of E and is set according to the drift condition of the data stream: σ is set to 1 when drift is slow and to 3 when drift is fast; Mean() refers to the mean and Var() refers to the variance. Based on two adjacent data blocks, if the classification error rate of the new data block falls outside the confidence interval of E, real feature drift is considered to have occurred at this time; if it falls within the confidence interval, only virtual feature drift is considered to have occurred at this time.
And S3, drift adaptation, wherein after the concept drift is detected, an algorithm needs to be adjusted so as to adapt to the concept drift.
The adjustment is mainly divided into two cases: virtual concept drift and real concept drift. Virtual concept drift does not change the decision boundary, so the content of the current data block B_i is supplemented into the existing algorithm, that is, B_i is added to the original training set and the algorithm continues to be trained. Real concept drift has changed the decision boundary, which means that the previously obtained decision function is no longer suited to the current data stream environment, so the previous decision function and the original training data are discarded, and the algorithm is retrained with the data of B_i.
Through the above three steps, the method can effectively resolve the imbalance phenomenon in the data stream and can detect various concept drift phenomena, including sudden drift, gradual drift, incremental drift and cyclic drift.
The above example is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited to this example; any other change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the present invention is an equivalent replacement and is included in the protection scope of the present invention.

Claims (6)

1. An unbalanced data stream mining method based on active drift detection is characterized by comprising the following steps:
S1, acquiring a data stream S, dividing the data stream S into data blocks B_1, B_2, ..., B_n of equal size, and processing the unbalanced data in the data stream with the data block as the unit; specifically comprising the following steps:
S1.1, sampling an original data block to obtain majority class data and minority class data;
S1.2, forming a plurality of balanced data sub-blocks from the majority class data and the minority class data obtained in step S1.1;
S1.3, combining the plurality of balanced data sub-blocks obtained in step S1.2 into a final balanced data block; the ratio of majority class data to minority class data in the original data block is set as IR1, the ratio of majority class data to minority class data after the sampling operation is set as IR2, and the sampling operation on the original data block in step S1.1 is performed according to the relation between IR1 and IR2, specifically as follows:
increasing the data volume of the minority class data and reducing the data volume of the majority class data, wherein the data volume of the minority class data is increased using the SMOTE oversampling method, and the data volume of the majority class data is reduced by identifying noise points or boundary points in the majority class data with a clustering method and then removing the identified noise points and boundary points;
S2, detecting the concept drift phenomenon existing in the data stream in real time; whether feature drift occurs is judged through the supervised information and the unsupervised information in the data stream, wherein the supervised information comprises the classification error rate and the unsupervised information comprises the sample mean and the sample variance;
each data block divided in step S1 is set to contain d_i data points, and the sample mean M_i of data block B_i is defined as follows:

M_i = (1/d_i) * Σ_{j=1..d_i} B_{i,j}

where B_{i,j} refers to the jth data point in the ith data block;
given d_i and M_i, the sample variance V_i of the ith data block B_i is defined as follows:

V_i = (1/(d_i - 1)) * Σ_{j=1..d_i} (B_{i,j} - M_i)^2;
the sample mean M and the sample variance V are used to represent the stability of the data stream S, and M and V follow a stable normal distribution when the data stream is in a stable state; based on the samples within data block B_i, the confidence intervals corresponding to M_i and V_i can be calculated, where α is the confidence level of the t distribution and the χ² (chi-square) distribution used to calculate the confidence intervals of M and V;
based on two adjacent data blocks, the interval overlap ratio R_M of the mean and the interval overlap ratio R_V of the variance of the two adjacent data blocks are calculated; the interval overlap ratio is the quotient of the intersection of the two adjacent intervals and their union, for example R_M = |CI(M_i) ∩ CI(M_{i+1})| / |CI(M_i) ∪ CI(M_{i+1})|, so its value lies between 0 and 1, and the interval overlap ratio of the mean and the interval overlap ratio of the variance are combined to obtain the internal stability measurement index R of the data stream;
R represents the internal stability of the current data stream; if R is smaller than the interval overlap threshold θ, the data stream is considered internally unstable at this time, and feature drift may occur;
the feature drift is classified into virtual feature drift and real feature drift; the classification error rate E_i of data block B_i is defined as follows:

E_i = (1/d_i) * Σ_{j=1..d_i} ε_j

where ε_j indicates the classification result of the jth data point in data block B_i, y_j represents the predicted classification result of the data point and label_j represents the true result of the data point; ε_j = 0 if the classification result is correct, otherwise ε_j = 1; likewise, the confidence interval of E_i can be calculated;
where σ is an interval parameter used to calculate the confidence interval of E and is set according to the drift condition of the data stream: σ is set to 1 when drift is slow and to 3 when drift is fast; Mean() refers to the mean and Var() refers to the variance;
based on two adjacent data blocks, if the classification error rate of the new data block falls outside the confidence interval of E, real feature drift is considered to have occurred at this time; if it falls within the confidence interval, only virtual feature drift is considered to have occurred at this time;
and S3, after the concept drift is detected, adjusting an algorithm to adapt to the concept drift.
2. The method of unbalanced data stream mining based on active drift detection of claim 1, wherein the SMOTE oversampling comprises the steps of:
S1.1.1, for each sample X_i, i = 1…I, in the minority class data, where I denotes the number of samples in the minority class data, calculate the Euclidean distance from X_i to all other minority class data samples and obtain the k nearest neighbour samples of X_i;
S1.1.2, determine a sampling rate based on the sample imbalance ratio IR1;
S1.1.3, for each minority class data sample X_i, randomly select M neighbour samples Y_k, k = 1…M, from the k nearest neighbours of X_i;
S1.1.4, for each neighbour sample Y_k, synthesize a new sample of the minority class data according to the following formula:
X_new = Y_k + rand(0,1) * |X_i - Y_k| (1).
3. The method according to claim 2, wherein in step S1.2, since the ratio between the majority class data and the minority class data is IR2, the majority class data is divided into N mutually disjoint majority class sub-blocks {MajorityData_j, j = 1, 2, ..., N}, where N = IR2, and each of the majority class sub-blocks is then combined with the new samples in the minority class data to form a new balanced data sub-block {BalancedData_j, j = 1, 2, ..., N}, respectively.
4. The method according to claim 1, wherein in step S1.3, data are randomly extracted from the plurality of balanced data sub-blocks {BalancedData_j, j = 1, 2, ..., N} formed in step S1.2 and sequentially combined into the final balanced data block.
5. The method according to claim 4, wherein the interval overlap thresholds θ are set to 0.3, 0.6 and 0.9; if R < 0.3, sudden drift occurs at this time; if 0.3 < R < 0.6, gradual drift occurs at this time; if 0.6 < R < 0.9, incremental drift occurs at this time; if R > 0.9, no feature drift occurs; and if drift repeatedly occurs within a short period of time, cyclic drift occurs at this time.
6. The method for mining unbalanced data streams based on active drift detection of claim 1, wherein in the step S3, the feature drift is divided into two cases of virtual feature drift and real feature drift,
when virtual concept drift is detected, the virtual concept drift does not change the decision boundary, so the content of the current data block B_i is supplemented into the existing algorithm, that is, B_i is added to the original training set and the algorithm continues to be trained;
when real concept drift is detected, the real concept drift changes the decision boundary and the previously obtained decision function is no longer suitable for the current data stream environment, so the previous decision function and the original training data are discarded, and the algorithm is retrained with the data of B_i.
CN202010239770.7A 2020-03-30 2020-03-30 Unbalanced data stream mining method based on active drift detection Active CN112000705B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010239770.7A CN112000705B (en) 2020-03-30 2020-03-30 Unbalanced data stream mining method based on active drift detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010239770.7A CN112000705B (en) 2020-03-30 2020-03-30 Unbalanced data stream mining method based on active drift detection

Publications (2)

Publication Number Publication Date
CN112000705A CN112000705A (en) 2020-11-27
CN112000705B true CN112000705B (en) 2024-04-02

Family

ID=73461718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010239770.7A Active CN112000705B (en) 2020-03-30 2020-03-30 Unbalanced data stream mining method based on active drift detection

Country Status (1)

Country Link
CN (1) CN112000705B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930856A (en) * 2016-03-23 2016-09-07 深圳市颐通科技有限公司 Classification method based on improved DBSCAN-SMOTE algorithm
CN107341497A (en) * 2016-11-11 2017-11-10 东北大学 The unbalanced weighting data streams Ensemble classifier Forecasting Methodology of sampling is risen with reference to selectivity
WO2019041629A1 (en) * 2017-08-30 2019-03-07 哈尔滨工业大学深圳研究生院 Method for classifying high-dimensional imbalanced data based on svm

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A Novel Ensemble Classification for Data Streams with Class Imbalance and Concept Drift; Yange Sun et al.; International Journal of Performability Engineering; 2017-10-31; Vol. 13, No. 6; pp. 945-955 *
Selection-based Resampling Ensemble Algorithm for Nonstationary Imbalanced Stream Data Learning; Siqi Ren et al.; Knowledge-Based Systems; 2018-08-09; pp. 1-35 *
An ensemble classification model for imbalanced data streams; 欧阳震诤; 罗建书; 胡东敏; 吴泉源; 电子学报 (Acta Electronica Sinica); 2010-01-15 (01); pp. 184-189 *
An ensemble classification algorithm for imbalanced data streams; 孙艳歌; 王志海; 白洋; 小型微型计算机系统 (Journal of Chinese Computer Systems); 2018-06-15 (06); pp. 60-65 *
Application of mean shift in background pixel mode detection; 梁英宏; 王知衍; 曹晓叶; 许晓伟; 计算机科学 (Computer Science); 2008-04-25 (04); pp. 228-230 *

Also Published As

Publication number Publication date
CN112000705A (en) 2020-11-27

Similar Documents

Publication Publication Date Title
CN109639739B (en) Abnormal flow detection method based on automatic encoder network
CN111562108A (en) Rolling bearing intelligent fault diagnosis method based on CNN and FCMC
CN110070060B (en) Fault diagnosis method for bearing equipment
CN109218223B (en) Robust network traffic classification method and system based on active learning
CN111159243B (en) User type identification method, device, equipment and storage medium
CN112134862B (en) Coarse-fine granularity hybrid network anomaly detection method and device based on machine learning
CN113516228B (en) Network anomaly detection method based on deep neural network
EP4209959A1 (en) Target identification method and apparatus, and electronic device
Zheng Intrusion detection based on convolutional neural network
CN112101765A (en) Abnormal data processing method and system for operation index data of power distribution network
CN114492642A (en) Mechanical fault online diagnosis method for multi-scale element depth residual shrinkage network
CN108763926B (en) Industrial control system intrusion detection method with safety immunity capability
CN117170979B (en) Energy consumption data processing method, system, equipment and medium for large-scale equipment
CN114285587B (en) Domain name identification method and device and domain name classification model acquisition method and device
CN112000705B (en) Unbalanced data stream mining method based on active drift detection
CN113098848A (en) Flow data anomaly detection method and system based on matrix sketch and Hash learning
CN110826062B (en) Malicious software detection method and device
CN111738290A (en) Image detection method, model construction and training method, device, equipment and medium
CN116383645A (en) Intelligent system health degree monitoring and evaluating method based on anomaly detection
CN112014821B (en) Unknown vehicle target identification method based on radar broadband characteristics
CN111860661B (en) Data analysis method and device based on user behaviors, electronic equipment and medium
CN117151745B (en) Method and system for realizing marketing event data real-time processing based on data stream engine
CN114884896B (en) Mobile application flow sensing method based on feature expansion and automatic machine learning
CN115208703B (en) Industrial control equipment intrusion detection method and system of fragment parallelization mechanism
CN117395080B (en) Encryption system scanner detection method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant