CN108170695A

CN108170695A - An adaptive ensemble classification method for data streams based on information entropy

Info

Publication number: CN108170695A
Application number: CN201611158475.9A
Authority: CN
Inventors: 孙艳歌; 卲罕; 刘宏兵; 冯岩; 王淑礼; 姚建峰
Original assignee: Xinyang Normal University
Current assignee: Xinyang Normal University
Priority date: 2016-12-07
Filing date: 2016-12-07
Publication date: 2018-06-15

Abstract

The invention discloses an adaptive ensemble classification method for data streams based on information entropy. The method can both detect concept drift and identify recurring concepts. Within the system, a new classifier is built and placed in the classifier pool only when a new concept is detected. This avoids the repeated training that recurring concepts would otherwise cause, reduces the frequency of model updates, and improves the real-time classification ability and classification quality of the model. Performance was compared with classical data stream algorithms on synthetic and real datasets. The experiments show that the method copes with multiple types of concept drift, improves the noise resistance of the classification model, and consumes less time while maintaining a relatively high classification accuracy. The method can be applied to many practical problems such as sensor network anomaly detection, credit card fraud detection, weather forecasting and electricity price prediction.

Description

An adaptive ensemble classification method for data streams based on information entropy

Technical field

The invention belongs to the field of data mining and machine learning. It relates to an ensemble classification method for data streams in environments with concept drift, and in particular proposes a detection system that can handle recurring concepts. Experimental results show that the proposed method has a clear advantage in average classification accuracy, consumes less time than other ensemble algorithms, suits environments with multiple types of concept drift, and has relatively high noise resistance. The system can be applied to many practical problems such as sensor network anomaly detection, credit card fraud detection, weather forecasting and electricity price prediction.

Background art

In many real-world applications, data is generated continuously in streaming form. Such a fast-arriving, real-time, continuous and unbounded sequence of data is called a data stream. In a real data stream environment, the data distribution often changes over time, which reflects the fact that a data stream may be non-stationary. For example, the rules underlying weather forecasting may change with the seasons; methods for analysing customers' online shopping preferences may change with factors such as customer interest, merchant reputation and service type; industrial electricity consumption may vary periodically with the seasons. In general, the phenomenon in which the data distribution of a data stream changes in some way over time is called concept drift. For many practical applications it is therefore necessary to develop learning mechanisms tailored to the changing nature of data streams, so that these problems can be handled quickly and in real time.

According to the speed of change, concept drift can be divided into abrupt concept drift and gradual concept drift. If, within a relatively short time, the data distribution of the stream is suddenly replaced by a completely different one, abrupt concept drift is said to have occurred. This type of drift usually occurs without warning (for example, when a sensor suddenly fails) and can cause accuracy to drop sharply or even make the model fail completely. Gradual concept drift, in contrast, is a slow change (for example, the gradual failure of a sensor). It can usually be observed only after a relatively long period, and the concepts before and after the drift are more or less similar. In real environments, the recurrence of concepts in a data stream is also widespread. Recurring concept drift is a special type of concept drift: in addition to having the characteristics of the two types above, a concept may reappear regularly or irregularly, so that the classification model has to be retrained again and again to adapt. For example, electricity consumption data changes periodically with the seasons throughout the year, and a topic in a social network may appear periodically at fixed times (such as holidays or elections).

Concept drift is a challenge in data stream mining. In recent years, researchers in China and abroad have done a great deal of work on the problem, which falls mainly into three kinds of methods: instance selection, instance weighting and ensemble learning. Most of these algorithms handle only one type of concept drift and do not fully consider the case where concepts recur. For this type of drift, the model should be able to use historical data and, when a recurring concept appears, classify with a previously trained model, thereby avoiding repeated training. An ideal classification model should learn incrementally and adapt to multiple types of change. Designing a classification algorithm that can cope with multiple types of concept drift is therefore of significant research value. Ensemble methods retain historical concepts by training individual classifiers on data from different periods, and are therefore an effective way to handle concept drift. The present work focuses on how to build an ensemble classification model for data streams whose data distribution changes over time.

Most current concept drift detection methods detect changes in the data distribution from the classification error rate of the model. For example, "Learning with drift detection" (Gama, J., Medas, P., Castillo, G., et al. Learning with drift detection. In: Proceedings of the 17th Brazilian Symposium on Artificial Intelligence. Berlin: Springer-Verlag, 2004, pp. 286-295) proposed the DDM (Drift Detection Method) algorithm, which detects change by monitoring the error rate of the current model but cannot effectively detect gradual concept drift. Subsequently, "Learning from time-changing data with adaptive windowing" (Bifet, A., and Gavalda, R. Learning from time-changing data with adaptive windowing. In: Apte, C., Skillicorn, D., Liu, B., et al. (eds.). Proceedings of the 7th SIAM International Conference on Data Mining (SDM 2007). Philadelphia, PA: SIAM, 2007, pp. 443-448) proposed EDDM (Early Drift Detection Method), a concept drift detection method based on the Bernoulli distribution, which improves the detection of gradual concept drift while preserving the detection of abrupt concept drift. Nishida et al. proposed the STEPD algorithm, which detects concept drift by comparing the classification accuracy on recent training samples with the classification accuracy on all training samples. Bifet et al. proposed the adaptive sliding window algorithm ADWIN (ADaptive WINdowing), which decides whether concept drift has occurred by comparing the mean error rates of different sub-windows. Ross et al. proposed the ECDD (EWMA for Concept Drift Detection) algorithm, which monitors the error rate with an exponentially weighted moving average (EWMA) control chart; when the error rate exceeds a certain threshold, concept drift is signalled.

However, most of the algorithms above do not consider recurring concepts. The problem of recurring concepts was raised by Widmer as early as 1996, but it has attracted the attention of the research community only in recent years. Widmer et al. proposed the FLORA3 algorithm, which stores descriptions of historical concepts and reactivates the stored classifier when a concept reappears. Nishida et al. proposed an online ensemble algorithm, ACE (Adaptive Classifier Ensemble), to cope with recurring concepts. A similar approach is EB (Ensemble Building), an ensemble learning method proposed by Ramamurthy et al. EB builds a set of global classifiers over a sequence of data blocks; it does not delete historical classifiers, but selectively picks relevant classifiers from the global set. Katakis et al. proposed a method based on a concept representation model that uses clustering to discover new concepts. Yang et al. treat concepts as states of a Markov chain, learn the pattern of concept drift from the transitions between concepts, describe these transitions with a Markov model, and select the concept most similar to the current one. Gama et al. used a two-layer classifier model, in which the first layer trains a classifier on the current concept and the second layer holds classifiers created from existing concepts; when concept drift is detected, the second-layer classifiers are reused. RCD (Recurring Concept Drifts), a general framework for handling recurring concept drift, uses multivariate non-parametric statistical methods to decide whether a new concept and an old concept come from the same distribution.

Summary of the invention

Technical problem: For concept drift, two problems need to be solved. First, how to detect concept drift quickly and accurately. Second, after drift is detected, how to revise the model according to the type of change so that it adapts. To this end, the present invention designs and implements an adaptive ensemble classification method and system that can cope with multiple kinds of concept change. The main contributions are as follows:

(1) For the first problem, a concept drift detection method based on information entropy is proposed. From the viewpoint of information entropy, the Jensen-Shannon divergence is used to measure the distance between the data distributions of the old and new windows. This not only detects concept drift but also identifies recurring concepts effectively.

(2) For the second problem, a classifier pool mechanism is designed. After concept drift is detected, a classifier for a new concept is added to the pool, whereas for a recurring concept an existing classifier is reused.

(3) An ensemble system is proposed that detects concept drift and exploits recurring concepts at the same time. Experiments on synthetic and real data streams examine classification accuracy, running time and noise resistance, and verify the feasibility and effectiveness of the proposed method.

Technical solution: Because concepts recur, and identifying a historical concept costs much less than building a model for a new concept, it is necessary to store the essential information of the historical concepts in the data stream. A classifier pool is used to preserve historical concepts, with each classifier representing one concept. When a recurring concept is detected, the relevant information is recalled quickly and used, which reduces unnecessary repeated training. An internal concept detection method is therefore added to improve the algorithm's adaptability to concept drift, giving the proposed adaptive ensemble classification algorithm with a concept drift detection mechanism (Ensemble with Internal Change Detection, ECD). A new classifier is built and placed in the classifier pool only when a new concept is detected. This avoids the repeated training that recurring concepts would otherwise cause, reduces the frequency of model updates, and improves the real-time classification ability and classification quality of the model. The proposed adaptive ensemble algorithm consists of two stages: a concept detection stage and an ensemble classification stage.

The present invention proposes an adaptive ensemble classification method for data streams based on information entropy. Its specific steps are as follows:

Step 1: Initialize the ensemble classifier and the buffer.

Step 2: Move examples into the sliding window one by one.

Step 3: Apply the proposed two-window detection model, described as follows. Let $W_1 = \{x_{t+1}, x_{t+2}, \dots, x_{t+n}\}$ and $W_2 = \{x_{t+n+1}, \dots, x_{t+2n}\}$ denote two consecutive, equal-sized windows at time $t$, where $W_1$ is the reference window and $W_2$ is the current window. $JSD(W_1 \| W_2)$ measures the distance between the distributions of the two windows. When this value is less than or equal to $10^{-5}$ (very close to zero), the data distributions of the two windows are taken to be identical, that is, a recurring concept is found. When it is greater than $10^{-5}$ but less than the threshold $\tau$, the distributions of the two windows are considered not significantly different. When it exceeds the threshold, concept drift has occurred. The threshold is computed by the bootstrap method. Because the window slides forward one example at a time, abrupt concept drift can be detected promptly.

Step 4: When concept drift is detected, the current data distribution is compared with the distributions of the data used to build the classifiers in the classifier pool. If the concept is new, a classifier is created and added to the pool, and the corresponding data is placed in the buffer. If the concept is a recurring one, the existing classifier is reused. Classifiers are sorted from high to low by reuse frequency; when the number of classifiers stored in the pool reaches the maximum, the least frequently used classifier is replaced.

Step 5: Each example is predicted by weighted voting, with weights based on the classification error rate of each base classifier on the examples in the newest window.

Description of the drawings

Fig. 1 Comparison of classification accuracy for different window sizes.

Fig. 2 Comparison of classification accuracy on the SEA dataset.

Fig. 3 Comparison of classification accuracy on the Elist dataset.

Detailed description of embodiments

The technical solution of the present invention is further described below with reference to the drawings and embodiments.

(1) Concept detection algorithm based on information entropy

In information theory, relative entropy, also known as the Kullback-Leibler divergence, measures the difference between two probability distributions over the same event space $X$. The relative entropy of two probability distributions $p(x)$ and $q(x)$ is defined as:

$$D_{KL}(p \| q) = \sum_{x \in X} p(x) \log \frac{p(x)}{q(x)} \qquad (1)$$

However, the Kullback-Leibler divergence is not symmetric, so it is not a distance in the strict sense. The Jensen-Shannon divergence is a distance measure built on the Kullback-Leibler divergence that removes this asymmetry. Because the Jensen-Shannon divergence expresses the relationship between two data distributions well, the present invention proposes a concept detection algorithm based on it, which detects concept drift by checking whether the data distributions in two adjacent windows differ significantly. The Jensen-Shannon divergence is defined as shown in formula (2), where $D_{KL}$ denotes the Kullback-Leibler divergence:

$$JSD(p \| q) = \frac{1}{2} D_{KL}(p \| m) + \frac{1}{2} D_{KL}(q \| m), \quad m = \frac{1}{2}(p + q) \qquad (2)$$
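The two divergences can be illustrated with a minimal Python sketch. It assumes discrete distributions given as dictionaries and base-2 logarithms; the function names are illustrative and do not come from the patent.

```python
import math
from collections import Counter


def kl_divergence(p, q):
    """Relative entropy D(p || q) in bits for dict-based discrete distributions.

    Assumes q[x] > 0 wherever p[x] > 0.
    """
    return sum(px * math.log2(px / q[x]) for x, px in p.items() if px > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL divergence to the mixture m."""
    support = set(p) | set(q)
    m = {x: 0.5 * (p.get(x, 0.0) + q.get(x, 0.0)) for x in support}
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


def empirical_distribution(values):
    """Relative frequencies of the (hashable) values in a window."""
    counts = Counter(values)
    n = len(values)
    return {v: c / n for v, c in counts.items()}


if __name__ == "__main__":
    p = empirical_distribution(["a", "a", "b", "b"])
    q = empirical_distribution(["a", "b", "b", "b"])
    print(js_divergence(p, q), js_divergence(q, p))  # symmetric by construction
```

Unlike the Kullback-Leibler divergence, the mixture $m$ is positive wherever either distribution is, so the value stays finite even when the two windows do not share the same support.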

In the proposed two-window detection model, $JSD(W_1 \| W_2)$ measures the distance between the distributions of the two windows. When this value is less than or equal to $10^{-5}$ (very close to zero), the data distributions of the two windows are taken to be identical, that is, a recurring concept is found. When it is greater than $10^{-5}$ but less than the threshold $\tau$, the distributions of the two windows are considered not significantly different. When it exceeds the threshold, concept drift has occurred. The threshold is computed by the bootstrap method. Because the window slides forward one example at a time, abrupt concept drift can be detected promptly. The pseudocode is shown in Algorithm 1.

Algorithm 1: Concept detection algorithm based on information entropy
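Since the pseudocode itself is not reproduced in the text, the following Python sketch illustrates one way the two-window test could look. Several points are assumptions rather than details from the patent: examples are treated as discrete, hashable values (continuous attributes would first need binning), the bootstrap threshold is taken as a quantile of JSD values between resamples of the reference window, and a value exactly equal to $\tau$ is treated as "no significant difference". All function names are hypothetical.

```python
import math
import random
from collections import Counter, deque


def jsd(w1, w2):
    """Jensen-Shannon divergence (bits) between the empirical distributions of two windows."""
    p, q = Counter(w1), Counter(w2)
    n1, n2 = len(w1), len(w2)
    total = 0.0
    for x in set(p) | set(q):
        px, qx = p[x] / n1, q[x] / n2
        mx = 0.5 * (px + qx)
        if px > 0:
            total += 0.5 * px * math.log2(px / mx)
        if qx > 0:
            total += 0.5 * qx * math.log2(qx / mx)
    return total


def bootstrap_threshold(reference, n, n_resamples=200, quantile=0.95, seed=0):
    """Assumed bootstrap scheme: JSD between pairs of resamples drawn from one concept."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_resamples):
        a = [rng.choice(reference) for _ in range(n)]
        b = [rng.choice(reference) for _ in range(n)]
        stats.append(jsd(a, b))
    stats.sort()
    return stats[min(len(stats) - 1, int(quantile * len(stats)))]


def classify_windows(w1, w2, tau, eps=1e-5):
    """Three-way decision described in the text for reference window w1 and current window w2."""
    d = jsd(w1, w2)
    if d <= eps:
        return "identical"
    if d <= tau:  # boundary case d == tau is an assumption
        return "no_significant_difference"
    return "drift"


def detect(stream, n, tau):
    """Slide two adjacent windows of size n forward one example at a time."""
    buf = deque(maxlen=2 * n)
    for t, x in enumerate(stream):
        buf.append(x)
        if len(buf) == 2 * n:
            items = list(buf)
            yield t, classify_windows(items[:n], items[n:], tau)
```

For instance, `dict(detect([0] * 20 + [1] * 20, n=10, tau=0.05))` reports "drift" at the position where the reference window holds only zeros and the current window only ones.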

(2) Adaptive ensemble classification system based on information entropy

Specifically, let $E = \{C_1, C_2, \dots, C_k\}$ denote the classifier pool consisting of $k$ classifiers; each classifier also carries a variable that records how many times it has been reused. $B = \{B_1, B_2, \dots, B_k\}$ denotes the corresponding stored data, and $C'$ denotes a newly built classifier. A sliding window model maintains the most recent data. For the continually arriving examples $(x_i, y_i)$, $W_1$ denotes the reference (old) window and $W_2$ denotes the current window. Concept drift is detected by comparing the distance between the distributions of the data in the old and new windows. When concept drift is detected, the current data distribution is compared with the distributions of the data used to build the classifiers in the pool. If the concept is new, a classifier is created and added to the pool, and the corresponding data is placed in the buffer. If the concept is a recurring one, the existing classifier is reused. Classifiers are sorted from high to low by reuse frequency; when the number of classifiers stored in the pool reaches the maximum, the least frequently used classifier is replaced. Each base classifier $C_i$ ($i = 1, 2, \dots, k$) is then weighted by formula (4) according to its classification error rate on the examples in the newest window, and each example is predicted by weighted voting.

$$Weight(C_i) = MSE_r - MSE_{ij} \qquad (4)$$

Here $MSE_r$ is the mean squared error of a classifier that predicts at random, and $MSE_{ij}$ is the mean squared error of base classifier $C_i$ on the current window. These quantities are computed from the probability, given by classifier $C_i$, that an example with attribute values $x_i$ has class $y_i$, and from $p(y_i)$, the prior probability of $y_i$. In this way, the sub-model representing the current concept is looked up in the classifier pool, which reduces the computational cost of learning a new model and also improves adaptation to concept drift. The pseudocode is shown in Algorithm 2.
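The weighting of formula (4) can be sketched in Python as follows. The text does not spell out how the two mean squared errors are computed, so the sketch assumes the definitions commonly used in accuracy-weighted ensembles: $MSE_r = \sum_c p(c)(1 - p(c))^2$ and the classifier's error as the mean of $(1 - f_y(x))^2$ over the window, where $f_y(x)$ is the probability the classifier gives to the true class. The function names and the rule of ignoring non-positive weights are likewise assumptions.

```python
def mse_random(class_priors):
    """Assumed MSE of a classifier that predicts according to the class priors."""
    return sum(p * (1.0 - p) ** 2 for p in class_priors.values())


def mse_classifier(predict_proba, window):
    """Assumed MSE of one classifier on labelled examples (x, y) of the newest window.

    predict_proba(x) returns a dict mapping class label to probability.
    """
    errors = [(1.0 - predict_proba(x).get(y, 0.0)) ** 2 for x, y in window]
    return sum(errors) / len(errors)


def classifier_weight(predict_proba, window, class_priors):
    """Formula (4): weight = MSE_r - MSE of the classifier on the current window."""
    return mse_random(class_priors) - mse_classifier(predict_proba, window)


def weighted_vote(members, x):
    """Combine (predict_proba, weight) pairs; non-positive weights are skipped (assumption)."""
    scores = {}
    for predict_proba, weight in members:
        if weight <= 0:
            continue
        for label, prob in predict_proba(x).items():
            scores[label] = scores.get(label, 0.0) + weight * prob
    return max(scores, key=scores.get) if scores else None


if __name__ == "__main__":
    window = [((0,), "a"), ((1,), "b")]
    priors = {"a": 0.5, "b": 0.5}
    perfect = lambda x: {"a": 1.0, "b": 0.0} if x == (0,) else {"a": 0.0, "b": 1.0}
    coin = lambda x: {"a": 0.5, "b": 0.5}
    members = [(perfect, classifier_weight(perfect, window, priors)),
               (coin, classifier_weight(coin, window, priors))]
    print(weighted_vote(members, (1,)))
```

Under these assumed definitions, a classifier that does no better than random guessing receives weight zero and therefore contributes nothing to the vote.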

Algorithm 2: ECD pseudocode

Simulation results of the present invention

The system was tested on a PC with a 2.8 GHz CPU, 8 GB of memory and the Windows 7 operating system. Three synthetic datasets were chosen to validate the proposed model, as shown in Table 1.
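The classifier pool behaviour described above (reuse on a recurring concept, create on a new concept, replace the least reused entry when the pool is full) can be sketched in Python as follows. This is an illustrative sketch, not the patent's Algorithm 2: the class and method names, the `distance` and `train` callables, and the matching threshold parameter are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PoolEntry:
    model: object         # classifier representing one concept
    data: list            # buffered data the classifier was built from
    reuse_count: int = 0  # how many times this classifier has been reused


class ClassifierPool:
    def __init__(self, max_size, distance, match_threshold=1e-5):
        self.max_size = max_size
        self.distance = distance              # e.g. a Jensen-Shannon divergence between windows
        self.match_threshold = match_threshold
        self.entries = []

    def on_drift(self, window, train):
        """Called when drift is detected; returns (model, is_new_concept)."""
        best = min(self.entries,
                   key=lambda e: self.distance(e.data, window),
                   default=None)
        if best is not None and self.distance(best.data, window) <= self.match_threshold:
            best.reuse_count += 1             # recurring concept: reuse the stored classifier
            return best.model, False
        if len(self.entries) >= self.max_size:
            least = min(self.entries, key=lambda e: e.reuse_count)
            self.entries = [e for e in self.entries if e is not least]
        entry = PoolEntry(model=train(window), data=list(window))
        self.entries.append(entry)            # new concept: train and store a classifier
        return entry.model, True

    def ranked(self):
        """Entries sorted from most to least frequently reused."""
        return sorted(self.entries, key=lambda e: e.reuse_count, reverse=True)
```

Keeping the buffered data next to each classifier is what allows a later window to be compared against the distribution that classifier was trained on.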

Table 1. Characteristics of the synthetic datasets

Three types of concept drift were generated with data stream generators: abrupt, gradual and recurring concept drift.

HyperPlane is the most commonly used data stream dataset. Its generator simulates concept drift by changing the weights of the attributes of the data samples. In the experiments, the HyperplaneGenerator data stream generator in MOA was used to generate a gradual concept drift dataset with a change probability of 0.001.

SEA, proposed by Street in 2001, contains only continuous attributes and is a classical abrupt concept drift dataset. Its basic structure is $\langle f_1, f_2, f_3, C \rangle$, where $f_1$, $f_2$ and $f_3$ are condition attributes and $C$ is the class attribute; only $f_1$ and $f_2$ are relevant to $C$. An example belongs to the first class when its attributes satisfy $f_1 + f_2 \le \theta$, and to the second class otherwise. In the experiments, a data stream generator was first used to produce a dataset with 3 abrupt changes, occurring at 250K, 500K and 750K, and a generator capable of producing recurring concepts was then used to produce a dataset with 3 recurring concepts.

The Waveform dataset is built from 3 base waveforms (each described by 21 numeric attributes); each class is a combination of two or three of them. In the experiments, a data stream generator was used to produce a Waveform data stream with 40 numeric attributes, 19 of which are irrelevant.
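The SEA labelling rule can be illustrated with a small Python generator. This is a sketch under assumptions: attribute values are drawn uniformly from [0, 10], the threshold values passed in are arbitrary, and labels are flipped with a given noise probability. The function name and parameters are hypothetical and do not reproduce the MOA generator used in the experiments.

```python
import random


def sea_stream(n_examples, thetas, drift_points, noise=0.1, seed=0):
    """Yield ((f1, f2, f3), label) pairs with abrupt changes of theta.

    thetas must contain one more value than drift_points; the concept
    switches to the next theta at each drift point.
    """
    rng = random.Random(seed)
    concept = 0
    for i in range(n_examples):
        if concept < len(drift_points) and i == drift_points[concept]:
            concept += 1                      # abrupt drift: switch threshold
        f1, f2, f3 = (rng.uniform(0, 10) for _ in range(3))
        label = 1 if f1 + f2 <= thetas[concept] else 2  # f3 is irrelevant
        if rng.random() < noise:
            label = 3 - label                 # flip the class label as noise
        yield (f1, f2, f3), label


if __name__ == "__main__":
    stream = sea_stream(1000, thetas=[8.0, 9.0, 7.0, 9.5],
                        drift_points=[250, 500, 750])
    for features, label in list(stream)[:3]:
        print(features, label)
```

Reusing an earlier threshold value later in `thetas` would give a stream in which a concept recurs.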

Emailing list (Elist for short) is a dataset containing abrupt concept drift and recurring concepts, while Spam Filtering (Spam for short) is a dataset containing gradual concept drift. Both datasets are represented with a Boolean bag-of-words model. The datasets can be downloaded from http://mlkd.csd.auth.gr/concept_driff.html, and the ArffFileStream generator in MOA was used to turn the static data into a simulated data stream.

Elist simulates a continuous stream of e-mail messages from different topics, which the user labels as junk or interesting according to personal interest. It contains 1500 examples and 913 attributes in total. The data is divided into 5 stages, and concept drift is simulated by changing the user's interests. Table 2 shows which types of mail are regarded as interesting or as junk in each stage, where (+) denotes interesting mail and (-) denotes junk mail. Let $C_1$ denote the concept in which the user is interested only in mail about medicine, and $C_2$ the concept in which the user is interested in aviation and baseball; the data stream then follows the concept sequence $C_1, C_2, C_1, C_2, C_1$.

Table 2. Characteristics of the Emailing list dataset

Spam contains 9324 examples and 500 attributes. Each example represents one e-mail message and belongs to one of two types: spam (only 20% of the examples) or legitimate mail. The characteristics of the spam in the dataset change slowly over time.

The experiments use the prequential evaluation strategy: each example is first used as test data and then as training data, so the accuracy is updated incrementally. This strategy needs no held-out test set, which makes maximum use of the information in every example and keeps the accuracy curve smooth over time.

Experiments were run on the SEA dataset with the sliding window size n taken in the range [500, 2000], to examine how the window size setting affects the performance of the algorithm. As Fig. 1 shows, at first a larger window provides more data for building classifiers, and the classification accuracy rises accordingly. As the window keeps growing, however, concept drift is discovered later and the classification accuracy of the classifiers reaches a plateau, so the average classification accuracy decreases slightly. Table 3 shows that the proposed method is affected very little by the window size setting, and that the algorithm achieves a relatively high classification accuracy when the window size is 1000.
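The test-then-train loop can be sketched in Python as follows. The `predict` and `learn` callables stand for any incremental classifier, and the majority-class learner is only a hypothetical stand-in used to make the example runnable.

```python
from collections import Counter


def prequential_accuracy(stream, predict, learn):
    """Test each example first, then train on it; return the running accuracy."""
    correct = 0
    seen = 0
    history = []
    for x, y in stream:
        if predict(x) == y:   # test on the example before the model has seen it
            correct += 1
        seen += 1
        learn(x, y)           # then use the same example for training
        history.append(correct / seen)
    return history


class MajorityClass:
    """Toy incremental learner: always predicts the most frequent label so far."""

    def __init__(self):
        self.counts = Counter()

    def predict(self, x):
        return self.counts.most_common(1)[0][0] if self.counts else None

    def learn(self, x, y):
        self.counts[y] += 1


if __name__ == "__main__":
    model = MajorityClass()
    stream = [(None, "a"), (None, "a"), (None, "b"), (None, "a")]
    print(prequential_accuracy(stream, model.predict, model.learn))
```

Because every example is scored before the learner is updated on it, the accuracy is always measured on data the model has not yet seen.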

Table 3. Classification accuracy for different window sizes

The proposed algorithm was then compared with the following 3 algorithms: Hoeffding Tree (HT for short), RCD and Accuracy Update Ensemble (AUE for short). HT and AUE are implemented in MOA, and RCD can be obtained from https://sites.google.com/site/moaextensions/. For ease of comparison, the number of classifiers in the classifier pool is k = 15. The base classifier of the ensemble algorithms being compared is the Hoeffding Tree, whose leaf nodes use Adaptive Bayes to predict the class value, with n_min = 100, confidence δ = 0.01 and τ = 0.05. The algorithms are compared in terms of classification accuracy and running time.

Table 4 shows the classification accuracy of the algorithms on the 5 datasets. Overall, the ensemble algorithms achieve higher classification accuracy than the single-classifier algorithm. HT performs worst on the datasets containing concept drift, because it has no mechanism for handling concept drift and is therefore unsuited to such environments. On Waveform, a dataset without concept drift, the data distribution is relatively stable, so the algorithms differ little and the proposed algorithm has no clear advantage. On the gradual drift dataset HyperPlane, AUE achieves the highest accuracy, followed by the proposed algorithm. The reason is that AUE continually trains new classifiers on the latest data blocks and can therefore cope with gradual concept drift. On SEA, a dataset containing recurring concepts, the proposed algorithm achieves the highest accuracy, because the added detection mechanism for recurring concept drift allows a new classifier to be built in time to adapt to sudden concept changes. On the real dataset Elist, the proposed algorithm performs best. On Spam, RCD performs best, followed by the proposed algorithm.

Running times are shown in Table 5. The comparison shows that HT has the shortest running time, followed by the proposed algorithm, while AUE takes the longest. The reason is that the proposed algorithm is a detection algorithm based on the data distribution: it detects concept drift and identifies recurring concepts quickly and selects an existing model, which avoids repeated training and gives it an advantage in time. Although the single-classifier algorithm HT is the fastest, it performs worst in classification accuracy.

Table 4. Comparison of classification accuracy (%)

Table 5. Comparison of running time (seconds)

Three abrupt changes occur in the SEA dataset. Fig. 2 shows that the trends of all the algorithms are broadly the same. In the initial stage the data is relatively stable, all the algorithms maintain a high and stable accuracy, and the proposed algorithm has no clear advantage. As the amount of data grows and the number of concept drifts increases, the classification accuracy of all the algorithms declines; the decline is most severe for HT, which also fluctuates considerably. When the concept changes abruptly at 250K, 500K and 750K, the accuracy of the other algorithms drops sharply, whereas the proposed algorithm maintains a relatively high and stable accuracy. The average accuracy of the proposed algorithm is about 20% higher than that of HT. This is because the proposed algorithm quickly captures the concept change and builds a new classifier, and so responds to the change in time. Since 10% noise was added to the data, the results also show that the proposed algorithm is resistant to noise.

In a real data stream environment, concept changes are unpredictable and uncertain, so such an environment is a better test of the generalization ability of an algorithm. The accuracy on Elist is shown in Fig. 3. The accuracy curves of all the algorithms fluctuate to varying degrees, which indicates that concept drift is present in the dataset. The accuracy curve of the proposed algorithm is relatively steady: it makes full use of the knowledge held by the historical classifiers to handle the periodic nature of concept drift in the data stream. This shows that the proposed method is the least affected by concept drift in the data and adapts well to real data environments.

The experimental comparison leads to the following conclusions: (1) the algorithm has a clear advantage on datasets containing recurring concepts; (2) it consumes relatively little time while maintaining a relatively high classification accuracy; (3) it has a certain robustness to noise.

Claims

1. An adaptive ensemble classification method for data streams based on information entropy, characterized in that the adaptive ensemble classification method consists of two stages, a concept detection stage and an ensemble classification stage, and its specific steps are as follows:

Step 1: Initialize the ensemble classifier and the buffer;

Step 2: Move examples into the sliding window one by one;

Step 3: Apply the proposed two-window detection model, described as follows. Let $W_1 = \{x_{t+1}, x_{t+2}, \dots, x_{t+n}\}$ and $W_2 = \{x_{t+n+1}, \dots, x_{t+2n}\}$ denote two consecutive, equal-sized windows at time $t$, where $W_1$ is the reference window and $W_2$ is the current window. $JSD(W_1 \| W_2)$ measures the distance between the distributions of the two windows. When this value is less than or equal to $10^{-5}$ (very close to zero), the data distributions of the two windows are taken to be identical, that is, a recurring concept is found. When it is greater than $10^{-5}$ but less than the threshold $\tau$, the distributions of the two windows are considered not significantly different. When it exceeds the threshold, concept drift has occurred. The threshold is computed by the bootstrap method. Because the window slides forward one example at a time, abrupt concept drift can be detected promptly;

Step 4: When concept drift is detected, the current data distribution is compared with the distributions of the data used to build the classifiers in the classifier pool. If the concept is new, a classifier is created and added to the pool, and the corresponding data is placed in the buffer. If the concept is a recurring one, the existing classifier is reused. Classifiers are sorted from high to low by reuse frequency; when the number of classifiers stored in the pool reaches the maximum, the least frequently used classifier is replaced;

Step 5: Each example is predicted by weighted voting, with weights based on the classification error rate of each base classifier on the examples in the newest window.