CN112000705B - Unbalanced data stream mining method based on active drift detection - Google Patents

Unbalanced data stream mining method based on active drift detection

Info

Publication number
CN112000705B
CN112000705B (application CN202010239770.7A)
Authority
CN
China
Prior art keywords
data
drift
data stream
sample
majority
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010239770.7A
Other languages
Chinese (zh)
Other versions
CN112000705A (en)
Inventor
张平
邵亨康
李方
陈昕叶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
South China University of Technology SCUT
Original Assignee
South China University of Technology SCUT
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by South China University of Technology SCUT filed Critical South China University of Technology SCUT
Priority to CN202010239770.7A priority Critical patent/CN112000705B/en
Publication of CN112000705A publication Critical patent/CN112000705A/en
Application granted granted Critical
Publication of CN112000705B publication Critical patent/CN112000705B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24568 Data stream processing; Continuous queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F2216/00 Indexing scheme relating to additional aspects of information retrieval not explicitly covered by G06F16/00 and subgroups
    • G06F2216/03 Data mining

Abstract

Aiming at the concept drift characteristic and the imbalance characteristic of data streams, the invention provides an unbalanced data stream mining method based on active drift detection, which comprises the following three steps: (1) unbalanced data processing: some form of imbalance may exist in the data stream, and the unbalanced data needs to be handled efficiently; (2) active concept drift detection: the concept drift phenomenon present in the data stream needs to be detected in real time; (3) drift adaptation: after concept drift is detected, the algorithm needs to be adjusted to adapt to the concept drift. The invention can effectively resolve the imbalance phenomenon existing in the data stream, flexibly cope with various concept drift scenarios including sudden drift, gradual drift, incremental drift and cyclic drift, and effectively detect these drift scenarios and respond in time, thereby improving the accuracy and efficiency of data stream information mining.

Description

Unbalanced data stream mining method based on active drift detection
Technical Field
The invention belongs to the field of intelligent manufacturing and data mining, and particularly relates to an unbalanced data stream mining method based on active drift detection.
Background Art
With the continuing development of computer technology, people now have the ability to store large amounts of data, and computer network technology has brought the collection speed, propagation speed and scale of data to an unprecedented level, enabling globalized information sharing. Traditional data mining techniques mainly study batch processing algorithms, in which a model is trained on all of the data, the data needs to be scanned multiple times, the data come from the same distribution, and the data volume is limited. However, in many practical application scenarios, on the one hand, all the data cannot be acquired at once for model training, and on the other hand, the feature distribution of the data changes continuously over time, as with stock exchange data, medical data, mobile data and sensor network data; such data is called a data stream.
In intelligent manufacturing, since the environment in which a data stream is generated is unstable, the data stream may undergo concept drift, that is, its feature distribution may change continuously as the environment changes. Among the methods for handling concept drift, active drift detection methods are widely used: they first monitor the stability of the data distribution with an explicit concept drift detection mechanism, and then trigger a drift adaptation mechanism to adjust the model so that it adapts to the new environment. For example, the method for classifying stream data based on concept drift (publication number: CN108764322A) handles drift through ensemble learning; the concept drift detection method for multi-label data streams based on class and feature distributions (publication number: CN106934035B) judges the type of drift by the degree of difference between class and feature distributions; and the double-window concept drift detection method based on sample distribution statistical tests (publication number: CN110717543A) combines a fixed window with support vector regression for drift detection. However, each of these algorithms can cope with only a certain type of drift, cannot properly identify the various concept drift situations, and does not consider the imbalance characteristic of the data stream, so their application prospects are limited.
Disclosure of Invention
The invention provides an unbalanced data stream mining method based on active drift detection, which can not only handle the imbalance phenomenon existing in data streams but also detect various concept drift phenomena in real time, and therefore has broad application prospects. The method can efficiently mine data streams in intelligent manufacturing scenarios. It mainly comprises three steps: unbalanced data processing, active concept drift detection, and drift adaptation.
The object of the invention is achieved by at least one of the following technical solutions.
An unbalanced data stream mining method based on active drift detection comprises the following steps:
S1, data imbalance processing: a data stream S is acquired and divided into data blocks B_1, B_2, ..., B_n of equal size, and the unbalanced data in the data stream is processed with the data block as the unit;
S2, drift detection: the concept drift phenomenon existing in the data stream is detected in real time;
S3, drift adaptation: after concept drift is detected, the algorithm is adjusted to adapt to the concept drift.
Further, the idea of sampling is used to process the unbalanced data in the data stream. Sampling methods are divided into oversampling and undersampling, but oversampling may cause the data to be overfitted, while undersampling easily loses some important information. Effectively combining oversampling and undersampling therefore achieves the best sampling effect; the step S1 includes the following steps:
S1.1, sampling the original data block to obtain majority class data and minority class data;
S1.2, forming a plurality of balanced data sub-blocks from the majority class data and the minority class data obtained in step S1.1;
S1.3, combining the plurality of balanced data sub-blocks obtained in step S1.2 into a final balanced data block.
Further, the ratio of the majority data to the minority data in the original data block is set to be IR1, the ratio of the majority data to the minority data after the sampling operation is set to be IR2, and the sampling operation of the original data block in step S1.1 is performed according to the relation between IR1 and IR2, specifically as follows:
increasing the data amount of the minority class data using the SMOTE oversampling method, and decreasing the data amount of the majority class data by identifying noise points or boundary points among the majority class data by a clustering method and then removing the identified noise points and boundary points.
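For illustration, a minimal Python sketch of the majority-class reduction described above is given below. The patent does not name a specific clustering algorithm, so DBSCAN from scikit-learn is used here purely as an example, and the function name reduce_majority and its eps/min_samples parameters are illustrative rather than part of the invention.

    import numpy as np
    from sklearn.cluster import DBSCAN

    def reduce_majority(majority_x, eps=0.5, min_samples=5):
        # Cluster the majority-class points; DBSCAN labels outliers as -1.
        labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(majority_x)
        keep = labels != -1  # drop points flagged as noise
        return majority_x[keep]

Removing boundary points would require an additional rule on top of this noise-removal step; it is omitted here for brevity.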
Further, the SMOTE oversampling method is an improvement on the random oversampling technique that synthesizes new samples from existing samples; the SMOTE algorithm includes the following steps:
S1.1.1, for each sample X_i, i = 1…I, in the minority class data, where I denotes the number of samples in the minority class data, calculate the Euclidean distance from X_i to all other minority class data samples and obtain the k nearest neighbour samples of X_i;
S1.1.2, determine a sampling rate based on the sample imbalance ratio IR1;
S1.1.3, for each minority class data sample X_i, randomly select M neighbour samples Y_k, k = 1…M, from the k nearest neighbours of X_i;
S1.1.4, for each neighbour sample Y_k, synthesize a new sample of the minority class data according to the following formula:
X_new = Y_k + rand(0,1) * |X_i - Y_k| (1)
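A minimal Python sketch of steps S1.1.1 to S1.1.4 using formula (1) is shown below; the parameter names k and m, the use of scikit-learn's NearestNeighbors, and the handling of the sampling rate are illustrative assumptions and not prescribed by the patent.

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote_like_oversample(minority_x, k=5, m=2, rng=None):
        # For each minority sample X_i, find its k nearest minority neighbours,
        # pick M of them at random, and apply formula (1):
        #     X_new = Y_k + rand(0,1) * |X_i - Y_k|
        rng = np.random.default_rng() if rng is None else rng
        nn = NearestNeighbors(n_neighbors=k + 1).fit(minority_x)  # +1: the sample itself
        _, idx = nn.kneighbors(minority_x)
        synthetic = []
        for i, x_i in enumerate(minority_x):
            neighbours = minority_x[rng.choice(idx[i][1:], size=m, replace=False)]
            for y_k in neighbours:
                synthetic.append(y_k + rng.random() * np.abs(x_i - y_k))
        return np.vstack(synthetic)

Calling smote_like_oversample(minority_x) returns the synthetic samples, which are appended to the original minority class data so that the majority-to-minority ratio approaches IR2.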
Further, in step S1.2, since the ratio between the majority class data and the minority class data is now IR2, the majority class data is divided, by random sampling without replacement, into N = IR2 mutually disjoint majority class sub-blocks {MajorityData_j, j = 1, 2, ..., N}, and each of the majority class sub-blocks is then combined with the new samples in the minority class data to form a new balanced data sub-block {BalancedData_j, j = 1, 2, ..., N}, respectively.
Further, in step S1.3, data are randomly extracted from the N = IR2 balanced data sub-blocks {BalancedData_j, j = 1, 2, ..., N} generated in step S1.2 and sequentially combined into the final balanced data block.
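The sub-block splitting and recombination of steps S1.2 and S1.3 can be sketched as follows; rounding N = IR2 to an integer and omitting the class labels are simplifications made here for illustration only.

    import numpy as np

    def build_balanced_block(majority_x, minority_x, ir2, rng=None):
        # Split the majority data into N = IR2 disjoint sub-blocks (sampling without
        # replacement), pair each with the (oversampled) minority data, then recombine
        # everything into one final balanced data block. Labels are omitted for brevity.
        rng = np.random.default_rng() if rng is None else rng
        n = max(int(round(ir2)), 1)                 # N = IR2, rounded for illustration
        perm = rng.permutation(len(majority_x))     # random split without replacement
        majority_subblocks = np.array_split(majority_x[perm], n)
        balanced_subblocks = [np.vstack([sub, minority_x]) for sub in majority_subblocks]
        # draw the sub-blocks' rows in random order and concatenate into the final block
        final = np.vstack([sub[rng.permutation(len(sub))] for sub in balanced_subblocks])
        return final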
Further, since the information of the data stream S is limited, this limited information must be used to determine whether feature drift occurs; both the supervised information (with labels) and the unsupervised information (without labels) in the data stream can be used. The supervised information includes the classification error rate, and the unsupervised information includes the sample mean and variance. In step S2, each data block divided in step S1 contains d_i data points, and the sample mean M_i of data block B_i is defined as follows:

M_i = (1/d_i) * Σ_{j=1..d_i} B_{i,j}

where B_{i,j} refers to the jth data point in the ith data block;
Given d_i and M_i, the sample variance V_i of the ith data block B_i is defined as follows:

V_i = (1/(d_i - 1)) * Σ_{j=1..d_i} (B_{i,j} - M_i)^2
The sample mean M and the sample variance V are unsupervised information of the data stream and can be used to characterize the stability of the data stream S; when the data stream is stable, M and V should follow a stable normal distribution. Based on the samples within data block B_i, the confidence intervals corresponding to M_i and V_i can be calculated, where α is the confidence level of the t distribution and the χ² (chi-square) distribution used to calculate the confidence intervals of M and V;
Because the data stream is continuously input, the interval overlap ratio R_M of the mean and the interval overlap ratio R_V of the variance can be calculated based on two adjacent data blocks.
The interval overlap ratio is the quotient of the intersection of the two adjacent intervals and their union, for example R_M = |CI(M_i) ∩ CI(M_{i+1})| / |CI(M_i) ∪ CI(M_{i+1})|, so its value lies in [0, 1], and the closer the interval overlap ratio is to 1, the closer the two intervals are. Combining the interval overlap ratio of the mean with the interval overlap ratio of the variance gives the internal stability measurement index R of the data stream.
R represents the internal stability of the current data stream; if R is less than the interval overlap threshold θ, the data stream is considered internally unstable at this time, and feature drift may occur.
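The stability check described above can be sketched in Python as follows. The patent names the t distribution and the chi-square distribution, but the exact interval formulas and the way R_M and R_V are combined are not reproduced in this text, so the standard t-based and chi-square-based intervals and a simple average of R_M and R_V are assumptions of this sketch; it also assumes one-dimensional data blocks.

    import numpy as np
    from scipy import stats

    def interval_overlap(a, b):
        # Overlap ratio of two intervals: |intersection| / |union|, in [0, 1].
        lo, hi = max(a[0], b[0]), min(a[1], b[1])
        inter = max(0.0, hi - lo)
        union = max(a[1], b[1]) - min(a[0], b[0])
        return inter / union if union > 0 else 1.0

    def block_intervals(block, alpha=0.95):
        # t-based interval for the mean and chi-square-based interval for the
        # variance of one data block (assumed forms; see note above).
        d = len(block)
        m, v = block.mean(), block.var(ddof=1)
        t = stats.t.ppf((1 + alpha) / 2, d - 1)
        ci_m = (m - t * np.sqrt(v / d), m + t * np.sqrt(v / d))
        chi_hi = stats.chi2.ppf((1 + alpha) / 2, d - 1)
        chi_lo = stats.chi2.ppf((1 - alpha) / 2, d - 1)
        ci_v = ((d - 1) * v / chi_hi, (d - 1) * v / chi_lo)
        return ci_m, ci_v

    def stability_index(prev_block, cur_block, alpha=0.95):
        # R from two adjacent blocks; averaging R_M and R_V is an illustrative choice.
        ci_m1, ci_v1 = block_intervals(prev_block, alpha)
        ci_m2, ci_v2 = block_intervals(cur_block, alpha)
        r_m = interval_overlap(ci_m1, ci_m2)
        r_v = interval_overlap(ci_v1, ci_v2)
        return (r_m + r_v) / 2.0

A drift alarm would then be raised whenever stability_index(...) falls below the threshold θ.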
Further, the feature drift is divided into two cases: virtual feature drift and real feature drift. Virtual feature drift does not cause the decision boundary to change; it is caused by the incompleteness of the data distribution, so the algorithm does not need to be retrained and only a small amount of new data needs to be added to improve the model. Real feature drift causes the decision boundary to change, and the model must be retrained to adapt to the current dynamic environment;
To judge whether the drift is virtual feature drift or real feature drift, data with supervised information must be introduced for assistance; the classification error rate is effective supervised information. The classification error rate E_i of data block B_i is defined as follows:

E_i = (1/d_i) * Σ_{j=1..d_i} ε_j

where ε_j indicates the classification result of the jth data point in data block B_i, y_j represents the predicted classification result of the data point and label_j represents the true result of the data point; ε_j = 0 if the classification result is correct, otherwise ε_j = 1. Likewise, the confidence interval of E_i can be calculated;
where σ is an interval parameter used to calculate the confidence interval of E and is set according to the drift condition of the data stream: σ is set to 1 when drift is slow and to 3 when drift is fast; Mean() refers to the mean and Var() refers to the variance. Based on two adjacent data blocks, if the classification error rate of the new data block falls outside the confidence interval of E, real feature drift is considered to have occurred at this time; if it falls within the confidence interval, only virtual feature drift is considered to have occurred at this time.
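The real-versus-virtual decision can be sketched as follows; the interval Mean(E) ± σ·sqrt(Var(E)) over the recent blocks' error rates, and the rule that an error rate above the upper bound signals real drift, are assumed readings of the comparison described above.

    import numpy as np

    def drift_type(error_history, new_error, sigma=1.0):
        # error_history: classification error rates E_1..E_i of previous blocks.
        # The interval Mean(E) +/- sigma*sqrt(Var(E)) is an assumed form; sigma=1
        # for slow drift, sigma=3 for fast drift, as described above.
        e = np.asarray(error_history, dtype=float)
        upper = e.mean() + sigma * np.sqrt(e.var())
        return "real" if new_error > upper else "virtual"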
Further, in the step S3, after the concept drift is detected, an algorithm needs to be adjusted to adapt to the concept drift;
the adjustment is mainly divided into two parts, namely virtual concept drift and real concept drift. Virtual concept drift does not change decision boundaries, so the current data block B is i Supplementing the content of B into the existing algorithm i Adding the training set to the original training set, and training the algorithm. The true conceptual drift has changed the decision boundary, which means that the previously obtained decision function has not been suitable for the current data stream environment, so the previous decision function needs to be discarded, the original training data is discarded, and B is utilized i Is a data retraining algorithm.
Compared with the prior art, the invention has the beneficial effects that:
(1) By combining the oversampling technique with the undersampling technique, the data volume of the minority class data is increased on the one hand, and the data volume of the majority class data is reduced on the other hand. This effectively alleviates the imbalance phenomenon in the data stream and thus improves the accuracy of data stream mining.
(2) Meanwhile, the internal stability measurement index R is calculated using the supervised information (classification error rate) and the unsupervised information (sample mean and variance) of the data stream, and the type of drift can be judged according to the magnitude of R: if R < 0.3, sudden drift is occurring; if 0.3 < R < 0.6, gradual drift is occurring; if 0.6 < R < 0.9, incremental drift is occurring; if R > 0.9, no feature drift has occurred; and if drift repeatedly occurs within a short time, cyclic drift is occurring.
Drawings
FIG. 1 is a schematic diagram of an unbalanced data stream mining method based on active drift detection in an embodiment of the present invention;
FIG. 2 is a flowchart of an unbalanced data stream mining method based on active drift detection in an embodiment of the present invention;
FIG. 3 is a schematic diagram of an unbalanced data flow process according to an embodiment of the present invention.
Detailed Description
Specific implementations of the present invention will be described in further detail below with reference to examples and drawings, but embodiments of the present invention are not limited to this example.
Examples:
the invention comprises the following steps:
an unbalanced data stream mining method based on active drift detection, as shown in fig. 1, comprises the following steps:
S1, data imbalance processing. As shown in FIG. 1, the data stream S can be acquired from an external sensor; following a data-block-based mining method, it is divided into data blocks B_1, B_2, ..., B_n of equal size, and the unbalanced data in the data stream is processed with the data block as the unit;
unbalanced data in a data stream is processed mainly by using a sampling idea, and a sampling method is divided into over sampling and under sampling, but the over sampling can cause data to be over-fitted, and the under sampling can easily lose some important information. So that the combination of oversampling and undersampling is effective in achieving an optimal sampling result, said step S1 comprises the steps of:
S1.1, sampling the original data block to obtain majority class data MajorityData and minority class data MinorityData;
First, the ratio IR1 of the majority class data MajorityData to the minority class data MinorityData in the original data block is determined, and the ratio of majority class data to minority class data after the sampling operation is set as IR2; to ensure that the sampled data are balanced, IR2 is usually chosen between 0.5 and 1. The original data block is then sampled according to the relation between IR1 and IR2, as follows:
The data volume of the minority class data MinorityData is increased and the data volume of the majority class data MajorityData is reduced: the increase uses the SMOTE oversampling method, and the reduction is achieved by identifying noise points or boundary points in the majority class data MajorityData with a clustering method and then removing the identified noise points and boundary points;
It must be ensured that, after the sampling operation, the ratio between the data volume of the majority class data MajorityData and the data volume of the minority class data MinorityData is IR2.
The SMOTE oversampling method is an improvement on the random oversampling technique that synthesizes new samples from existing samples; it comprises the following steps:
S1.1.1, for each sample X_i, i = 1…I, in the minority class data, where I denotes the number of samples in the minority class data, calculate the Euclidean distance from X_i to all other minority class data samples and obtain the k nearest neighbour samples of X_i;
S1.1.2, determine a sampling rate based on the sample imbalance ratio IR1;
S1.1.3, for each minority class data sample X_i, randomly select M neighbour samples Y_k, k = 1…M, from the k nearest neighbours of X_i;
S1.1.4, for each neighbour sample Y_k, synthesize a new sample of the minority class data according to the following formula:
X_new = Y_k + rand(0,1) * |X_i - Y_k| (1)
S1.2, since the ratio between the majority class data MajorityData and the minority class data MinorityData is now IR2, the majority class data MajorityData is divided, by random sampling without replacement, into N = IR2 mutually disjoint majority class sub-blocks {MajorityData_j, j = 1, 2, ..., N}, and each majority class sub-block is then combined with the new samples in the minority class data MinorityData to form a new balanced data sub-block {BalancedData_j, j = 1, 2, ..., N}.
S1.3, data are randomly extracted from the N = IR2 balanced data sub-blocks {BalancedData_j, j = 1, 2, ..., N} generated in step S1.2 and sequentially combined into the final balanced data block.
S2, drift detection: the concept drift phenomenon existing in the data stream is detected in real time.
Since the data stream S has limited information, this limited information must be used to determine whether feature drift occurs; both the supervised information (with labels) and the unsupervised information (without labels) in the data stream can be used. The supervised information includes the classification error rate, and the unsupervised information includes the sample mean and variance. Each data block divided in step S1 contains d_i data points, and the sample mean M_i of data block B_i is defined as follows:

M_i = (1/d_i) * Σ_{j=1..d_i} B_{i,j}

where B_{i,j} refers to the jth data point in the ith data block;
Given d_i and M_i, the sample variance V_i of the ith data block B_i is defined as follows:

V_i = (1/(d_i - 1)) * Σ_{j=1..d_i} (B_{i,j} - M_i)^2
The sample mean M and the sample variance V are unsupervised information of the data stream and can be used to characterize the stability of the data stream S; when the data stream is stable, M and V should follow a stable normal distribution. Based on the samples within data block B_i, the confidence intervals corresponding to M_i and V_i can be calculated, where α is the confidence level of the t distribution and the χ² (chi-square) distribution used to calculate the confidence intervals of M and V;
Because the data stream is continuously input, the interval overlap ratio R_M of the mean and the interval overlap ratio R_V of the variance can be calculated based on two adjacent data blocks.
The interval overlap ratio is the quotient of the intersection of the two adjacent intervals and their union, for example R_M = |CI(M_i) ∩ CI(M_{i+1})| / |CI(M_i) ∪ CI(M_{i+1})|, so its value lies in [0, 1], and the closer the interval overlap ratio is to 1, the closer the two intervals are. Combining the interval overlap ratio of the mean with the interval overlap ratio of the variance gives the internal stability measurement index R of the data stream.
R represents the internal stability condition of the current data stream; if R is smaller than the interval overlap threshold θ, the data stream is considered internally unstable at this time and feature drift may occur. The interval overlap thresholds θ are generally set to 0.3, 0.6 and 0.9: if R < 0.3, sudden drift is occurring; if 0.3 < R < 0.6, gradual drift is occurring; if 0.6 < R < 0.9, incremental drift is occurring; if R > 0.9, no feature drift has occurred; and if drift repeatedly occurs within a short period of time, cyclic drift is occurring.
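For illustration, the mapping from R to a drift category can be written as the small helper below; the bookkeeping for "drift repeatedly occurring within a short time" (recent_drift_count and recur_limit) is an assumption, since the patent does not fix a window or count.

    def classify_by_r(r, recent_drift_count=0, recur_limit=3):
        # Map the stability index R onto the drift categories listed above.
        if r > 0.9:
            return "no drift"
        if recent_drift_count >= recur_limit:   # repeated detections in a short window
            return "cyclic drift"
        if r < 0.3:
            return "sudden drift"
        if r < 0.6:
            return "gradual drift"
        return "incremental drift"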
The feature drift is divided into two cases: virtual feature drift and real feature drift. Virtual feature drift does not cause the decision boundary to change; it is caused by the incompleteness of the data distribution, so the algorithm does not need to be retrained and only a small amount of new data needs to be added to improve the model. Real feature drift, by contrast, causes the decision boundary to change, and the model must be retrained to adapt to the current dynamic environment.
To judge whether the drift is virtual feature drift or real feature drift, data with supervised information must be introduced for assistance; the classification error rate is effective supervised information. The classification error rate E_i of data block B_i is defined as follows:

E_i = (1/d_i) * Σ_{j=1..d_i} ε_j

where ε_j indicates the classification result of the jth data point in data block B_i, y_j represents the predicted classification result of the data point and label_j represents the true result of the data point; ε_j = 0 if the classification result is correct, otherwise ε_j = 1. Likewise, the confidence interval of E_i can be calculated;
where σ is an interval parameter used to calculate the confidence interval of E and is set according to the drift condition of the data stream: σ is set to 1 when drift is slow and to 3 when drift is fast; Mean() refers to the mean and Var() refers to the variance. Based on two adjacent data blocks, if the classification error rate of the new data block falls outside the confidence interval of E, real feature drift is considered to have occurred at this time; if it falls within the confidence interval, only virtual feature drift is considered to have occurred at this time.
And S3, drift adaptation, wherein after the concept drift is detected, an algorithm needs to be adjusted so as to adapt to the concept drift.
The adjustment is mainly divided into two cases: virtual concept drift and real concept drift. Virtual concept drift does not change the decision boundary, so the content of the current data block B_i is supplemented into the existing algorithm, that is, B_i is added to the original training set and the algorithm continues to be trained. Real concept drift has changed the decision boundary, which means that the previously obtained decision function is no longer suited to the current data stream environment, so the previous decision function and the original training data are discarded, and the algorithm is retrained with the data of B_i.
Through the above three steps, the method can effectively resolve the imbalance phenomenon in the data stream and can detect various concept drift phenomena, including sudden drift, gradual drift, incremental drift and cyclic drift.
The above example is a preferred embodiment of the present invention, but the embodiments of the present invention are not limited to this example; any other change, modification, substitution, combination or simplification that does not depart from the spirit and principle of the present invention is an equivalent replacement and is included in the protection scope of the present invention.

Claims (6)

1. An unbalanced data stream mining method based on active drift detection is characterized by comprising the following steps:
S1, acquiring a data stream S, dividing the data stream S into data blocks B_1, B_2, ..., B_n of equal size, and processing the unbalanced data in the data stream with the data block as the unit; specifically comprising the following steps:
S1.1, sampling an original data block to obtain majority class data and minority class data;
S1.2, forming a plurality of balanced data sub-blocks from the majority class data and the minority class data obtained in step S1.1;
S1.3, combining the plurality of balanced data sub-blocks obtained in step S1.2 into a final balanced data block; the ratio of majority class data to minority class data in the original data block is set as IR1, the ratio of majority class data to minority class data after the sampling operation is set as IR2, and the sampling operation on the original data block in step S1.1 is performed according to the relation between IR1 and IR2, specifically as follows:
increasing the data volume of the minority class data and reducing the data volume of the majority class data, wherein the data volume of the minority class data is increased using the SMOTE oversampling method, and the data volume of the majority class data is reduced by identifying noise points or boundary points in the majority class data with a clustering method and then removing the identified noise points and boundary points;
S2, detecting the concept drift phenomenon existing in the data stream in real time; whether feature drift occurs is judged through the supervised information and the unsupervised information in the data stream, wherein the supervised information comprises the classification error rate and the unsupervised information comprises the sample mean and the sample variance;
each data block divided in step S1 is set to contain d_i data points, and the sample mean M_i of data block B_i is defined as follows:

M_i = (1/d_i) * Σ_{j=1..d_i} B_{i,j}

where B_{i,j} refers to the jth data point in the ith data block;
given d_i and M_i, the sample variance V_i of the ith data block B_i is defined as follows:

V_i = (1/(d_i - 1)) * Σ_{j=1..d_i} (B_{i,j} - M_i)^2;
the sample mean M and the sample variance V are used to represent the stability of the data stream S, and M and V follow a stable normal distribution when the data stream is in a stable state; based on the samples within data block B_i, the confidence intervals corresponding to M_i and V_i can be calculated, where α is the confidence level of the t distribution and the χ² (chi-square) distribution used to calculate the confidence intervals of M and V;
based on two adjacent data blocks, the interval overlap ratio R_M of the mean and the interval overlap ratio R_V of the variance of the two adjacent data blocks are calculated; the interval overlap ratio is the quotient of the intersection of the two adjacent intervals and their union, for example R_M = |CI(M_i) ∩ CI(M_{i+1})| / |CI(M_i) ∪ CI(M_{i+1})|, so its value lies between 0 and 1, and the interval overlap ratio of the mean and the interval overlap ratio of the variance are combined to obtain the internal stability measurement index R of the data stream;
R represents the internal stability of the current data stream; if R is smaller than the interval overlap threshold θ, the data stream is considered internally unstable at this time, and feature drift may occur;
the feature drift is classified into virtual feature drift and real feature drift; the classification error rate E_i of data block B_i is defined as follows:

E_i = (1/d_i) * Σ_{j=1..d_i} ε_j

where ε_j indicates the classification result of the jth data point in data block B_i, y_j represents the predicted classification result of the data point and label_j represents the true result of the data point; ε_j = 0 if the classification result is correct, otherwise ε_j = 1; likewise, the confidence interval of E_i can be calculated;
where σ is an interval parameter used to calculate the confidence interval of E and is set according to the drift condition of the data stream: σ is set to 1 when drift is slow and to 3 when drift is fast; Mean() refers to the mean and Var() refers to the variance;
based on two adjacent data blocks, if the classification error rate of the new data block falls outside the confidence interval of E, real feature drift is considered to have occurred at this time; if it falls within the confidence interval, only virtual feature drift is considered to have occurred at this time;
and S3, after the concept drift is detected, adjusting an algorithm to adapt to the concept drift.
2. The method of unbalanced data stream mining based on active drift detection of claim 1, wherein the SMOTE oversampling comprises the steps of:
S1.1.1, for each sample X_i, i = 1…I, in the minority class data, where I denotes the number of samples in the minority class data, calculate the Euclidean distance from X_i to all other minority class data samples and obtain the k nearest neighbour samples of X_i;
S1.1.2, determine a sampling rate based on the sample imbalance ratio IR1;
S1.1.3, for each minority class data sample X_i, randomly select M neighbour samples Y_k, k = 1…M, from the k nearest neighbours of X_i;
S1.1.4, for each neighbour sample Y_k, synthesize a new sample of the minority class data according to the following formula:
X_new = Y_k + rand(0,1) * |X_i - Y_k| (1).
3. The method according to claim 2, wherein in step S1.2, since the ratio between the majority class data and the minority class data is IR2, the majority class data is divided into N mutually disjoint majority class sub-blocks {MajorityData_j, j = 1, 2, ..., N}, where N = IR2, and each of the majority class sub-blocks is then combined with the new samples in the minority class data to form a new balanced data sub-block {BalancedData_j, j = 1, 2, ..., N}, respectively.
4. The method according to claim 1, wherein in step S1.3, data are randomly extracted from the plurality of balanced data sub-blocks {BalancedData_j, j = 1, 2, ..., N} formed in step S1.2 and sequentially combined into the final balanced data block.
5. The method according to claim 4, wherein the interval overlap thresholds θ are set to 0.3, 0.6 and 0.9; if R < 0.3, sudden drift occurs at this time; if 0.3 < R < 0.6, gradual drift occurs at this time; if 0.6 < R < 0.9, incremental drift occurs at this time; if R > 0.9, no feature drift occurs; and if drift repeatedly occurs within a short period of time, cyclic drift occurs at this time.
6. The method for mining unbalanced data streams based on active drift detection of claim 1, wherein in the step S3, the feature drift is divided into two cases of virtual feature drift and real feature drift,
when virtual concept drift is detected, the virtual concept drift does not change the decision boundary, so the content of the current data block B_i is supplemented into the existing algorithm, that is, B_i is added to the original training set and the algorithm continues to be trained;
when real concept drift is detected, the real concept drift changes the decision boundary and the previously obtained decision function is no longer suitable for the current data stream environment, so the previous decision function and the original training data are discarded, and the algorithm is retrained with the data of B_i.
CN202010239770.7A 2020-03-30 2020-03-30 Unbalanced data stream mining method based on active drift detection Active CN112000705B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010239770.7A CN112000705B (en) 2020-03-30 2020-03-30 Unbalanced data stream mining method based on active drift detection

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010239770.7A CN112000705B (en) 2020-03-30 2020-03-30 Unbalanced data stream mining method based on active drift detection

Publications (2)

Publication Number Publication Date
CN112000705A CN112000705A (en) 2020-11-27
CN112000705B true CN112000705B (en) 2024-04-02

Family

ID=73461718

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010239770.7A Active CN112000705B (en) 2020-03-30 2020-03-30 Unbalanced data stream mining method based on active drift detection

Country Status (1)

Country Link
CN (1) CN112000705B (en)

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105930856A (en) * 2016-03-23 2016-09-07 深圳市颐通科技有限公司 Classification method based on improved DBSCAN-SMOTE algorithm
CN107341497A (en) * 2016-11-11 2017-11-10 东北大学 The unbalanced weighting data streams Ensemble classifier Forecasting Methodology of sampling is risen with reference to selectivity
WO2019041629A1 (en) * 2017-08-30 2019-03-07 哈尔滨工业大学深圳研究生院 Method for classifying high-dimensional imbalanced data based on svm

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
A Novel Ensemble Classification for Data Streams with Class Imbalance and Concept Drift; Yange Sun et al.; International Journal of Performability Engineering; 2017-10-31; Vol. 13, No. 6; pp. 945-955 *
Selection-based Resampling Ensemble Algorithm for Nonstationary Imbalanced Stream Data Learning; Siqi Ren et al.; Knowledge-Based Systems; 2018-08-09; pp. 1-35 *
An ensemble classification model for imbalanced data streams; 欧阳震诤; 罗建书; 胡东敏; 吴泉源; 电子学报 (Acta Electronica Sinica); 2010-01-15 (01); pp. 184-189 *
An ensemble classification algorithm for imbalanced data streams; 孙艳歌; 王志海; 白洋; 小型微型计算机系统 (Journal of Chinese Computer Systems); 2018-06-15 (06); pp. 60-65 *
Application of mean shift in background pixel mode detection; 梁英宏; 王知衍; 曹晓叶; 许晓伟; 计算机科学 (Computer Science); 2008-04-25 (04); pp. 228-230 *

Also Published As

Publication number Publication date
CN112000705A (en) 2020-11-27

Similar Documents

Publication Publication Date Title
CN109639739B (en) Abnormal flow detection method based on automatic encoder network
CN111562108A (en) Rolling bearing intelligent fault diagnosis method based on CNN and FCMC
CN110070060B (en) Fault diagnosis method for bearing equipment
CN109218223B (en) Robust network traffic classification method and system based on active learning
CN111159243B (en) User type identification method, device, equipment and storage medium
CN112134862B (en) Coarse-fine granularity hybrid network anomaly detection method and device based on machine learning
CN113516228B (en) Network anomaly detection method based on deep neural network
EP4209959A1 (en) Target identification method and apparatus, and electronic device
Zheng Intrusion detection based on convolutional neural network
CN112101765A (en) Abnormal data processing method and system for operation index data of power distribution network
CN114492642A (en) Mechanical fault online diagnosis method for multi-scale element depth residual shrinkage network
CN108763926B (en) Industrial control system intrusion detection method with safety immunity capability
CN117170979B (en) Energy consumption data processing method, system, equipment and medium for large-scale equipment
CN114285587B (en) Domain name identification method and device and domain name classification model acquisition method and device
CN112000705B (en) Unbalanced data stream mining method based on active drift detection
CN113098848A (en) Flow data anomaly detection method and system based on matrix sketch and Hash learning
CN110826062B (en) Malicious software detection method and device
CN111738290A (en) Image detection method, model construction and training method, device, equipment and medium
CN116383645A (en) Intelligent system health degree monitoring and evaluating method based on anomaly detection
CN112014821B (en) Unknown vehicle target identification method based on radar broadband characteristics
CN111860661B (en) Data analysis method and device based on user behaviors, electronic equipment and medium
CN117151745B (en) Method and system for realizing marketing event data real-time processing based on data stream engine
CN114884896B (en) Mobile application flow sensing method based on feature expansion and automatic machine learning
CN115208703B (en) Industrial control equipment intrusion detection method and system of fragment parallelization mechanism
CN117395080B (en) Encryption system scanner detection method, device, electronic equipment and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant