CN109816010A - CART incremental learning classification method based on selective ensemble for flight delay prediction - Google Patents


Info

Publication number
CN109816010A
CN109816010A (application CN201910052118.1A)
Authority
CN
China
Prior art keywords
classifier
base classifier
CART
kappa
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910052118.1A
Other languages
Chinese (zh)
Inventor
王丹
王萌
赵文兵
杜金莲
付利华
杜晓琳
苏航
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Technology
Original Assignee
Beijing University of Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Technology filed Critical Beijing University of Technology
Priority to CN201910052118.1A
Publication of CN109816010A
Legal status: Pending

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The present invention discloses a CART incremental learning classification method based on selective ensemble for flight delay prediction. Flight delay prediction models face two drawbacks: the model cannot be updated effectively when new flight data arrive, and the excessive scale of the ensemble classifier hurts prediction performance. By combining the CART decision tree algorithm with the Learn++ incremental learning framework, the I-CART method is proposed, realizing incremental learning on new data and updating the prediction model efficiently. Using the kappa coefficient as the voting weight of each base classifier further reduces the classification error rate. The relationship between diversity and accuracy among base classifiers is investigated, and two selection schemes for the ensemble classifier, VS (vertical-line selection) and HS (horizontal-line selection), are designed to reduce the ensemble size. The invention improves the learning efficiency and classification performance of flight delay prediction models on new data, and the proposed selective ensemble schemes greatly reduce the scale of the final ensemble classifier and improve the performance of the flight delay prediction classifier.

Description

CART incremental learning classification method based on selective ensemble for flight delay prediction
Technical field
The invention belongs to the field of computer software, and in particular relates to a CART incremental learning classification method based on selective ensemble for flight delay prediction.
Background technique
According to the 2018 worldwide airline delay-rate rankings published by the US aviation data website FlightStats, among the 40 lowest-ranked airlines, Chinese carriers ranked 25th on average, with an average flight punctuality rate of about 71.13% and an average delay of about 62.4 minutes. Statistics on 2017 flight data published by Chinese civil aviation companies show an average flight punctuality rate of about 71.67% and an average delay of about 24 minutes; for 2016 the average punctuality rate was about 76.54% and the average delay about 16 minutes. Flight delays in China are thus a serious problem, and the situation worsens year by year. Flight delay prediction allows the relevant departments to know in advance how long a flight may be delayed, to take countermeasures, and to optimize flight planning; travelers can also adjust their itineraries according to the prediction results, alleviating the pressures that flight delays bring to all parties. Research on flight delay prediction therefore has important practical significance.
A large number of scholars at home and abroad have studied flight delay prediction. Existing work includes research on machine learning algorithm selection, comparing Bayesian methods and decision tree algorithms on flight delay prediction performance; research that fuses flight data with meteorological data to extract more features and further improve prediction accuracy; and the design of big data processing systems for better analysis of flight data. Regarding further research on flight delay prediction, scholars have proposed that incorporating the newest information into the prediction can bring a larger improvement in accuracy, because in flight information, newer information is more valuable, for example the delay of the preceding flight or the current airport weather conditions.
In the flight information domain, not only is a large amount of new data generated every day, but the historical data run to tens of millions of records. If a conventional machine learning algorithm is used to train the model, then whenever new data arrive, the new data and the historical data must be put together and the learning algorithm re-executed to train a new model. On the one hand, the large data volume makes learning difficult; on the other hand, much time and storage space is wasted: to learn the new information, the data already learned must be learned again, wasting time, and a large amount of storage is needed to keep the historical data, because a conventional machine learning algorithm has to revisit the historical data when learning new information. Compared with conventional machine learning, incremental learning has clear advantages: it retains the knowledge learned before, saving a great deal of relearning time, and since it does not need to revisit the historical data when learning new information, the historical data need not be stored, saving much storage space. This invention therefore proposes an incremental learning algorithm based on the CART decision tree, and investigates and proposes innovative solutions to two problems in incremental learning algorithms: the voting weights of the base classifiers are not assigned objectively and fairly, and as data keep growing the final ensemble classifier becomes too large.
Incremental learning is an intelligent knowledge discovery and data mining technique that has become popular in many areas. Faced with new data, conventional machine learning can only discard the existing model and retrain on a static historical data set, whereas incremental learning can learn new knowledge without forgetting previously learned knowledge, progressively updating and iterating its knowledge structure like a human learner, so that classifier performance keeps improving. Compared with conventional machine learning, the advantages of classification methods with incremental learning ability lie mainly in two aspects, saving time and saving storage: on the one hand, the model retains what was learned from the historical data and only the newly added samples need to be learned, which saves a great deal of time; on the other hand, the historical data already learned need not be saved again, which saves a large amount of storage space.
There are two ways to realize an incremental learning algorithm. The first is to transform an algorithm that originally cannot process data incrementally so that it is able to handle newly arriving data, for example transformations of neural network algorithms, of the support vector machine (SVM) algorithm, or of the K-nearest-neighbor (KNN) algorithm. The second way is to realize incremental learning through the idea of multi-classifier ensembles, giving the algorithm the ability of incremental learning. Representative algorithms of the ensemble style of incremental learning include Learn++, an ensemble incremental learning framework based on the AdaBoost (adaptive boosting) idea and supervised learning; the Evolved Neural Network (ENN) algorithm; and the SONG algorithm, an incremental ensemble method based on the SGNT (Self-Generating Neural Networks) algorithm.
The Learn++ algorithm has the following advantages. First, it provides a mechanism for combining with conventional machine learning and does not depend on a specific classification algorithm. Second, it is not prone to overfitting. Third, its parameters are simple to set, making good classification results easy to achieve.
The voting-weight problem of the Learn++ algorithm is analyzed as follows. Learn++ is an incremental learning framework based on the AdaBoost (Adaptive Boosting) algorithm idea. In the final AdaBoost model, the voting weight of a base classifier is computed from the weights of the samples it misclassifies. After many iterations of the algorithm, the weights of hard-to-classify samples are amplified, which degrades the classification performance of the ensemble classifier; the same problem appears in Learn++. Because the weights of hard samples become excessive, a base classifier's performance on those sample regions becomes the deciding factor of its voting weight, causing two weighting errors. First, a base classifier that classifies the hard samples correctly is given a large voting weight, even though equally good classification performance on the regions outside the hard samples cannot be guaranteed; it still carries a large weight in the final weighted ensemble, leading to wrong final decisions. Second, when a base classifier fails to classify a few heavily weighted hard samples correctly, it is given a small voting weight; since the weights of the hard samples dominate, its classification performance on the other samples is ignored, and a classifier with good overall performance receives a small voting weight and cannot play its role in the final weighted ensemble.
The kappa coefficient is a statistic commonly used in medical testing. In diagnostic tests, researchers verify whether the results of different diagnostic methods are consistent, and this consistency is usually measured effectively by the kappa coefficient. Since the kappa coefficient is a statistic that measures the consistency of measurement results, it can measure the consistency between classification results and the true class labels, evaluating classifier performance objectively and fairly.
The problem that the ensemble classifier generated by the incremental learning algorithm grows too large is analyzed as follows. Learn++ realizes incremental learning through the idea of multi-classifier ensembles, so as data keep growing and the incremental learning algorithm keeps running, the number of base classifiers inevitably increases. The final ensemble classifier becomes too large, occupying excessive storage space and slowing prediction; some redundant or poorly performing base classifiers even harm classification performance. For this situation, following the "selective ensemble" idea proposed by the Chinese scholar Zhou et al., only the base classifiers with better performance are selected for the ensemble, yielding better classification performance. Given that the diversity and accuracy of the ensembled classifiers can strongly influence the final result, the relationship between base-classifier diversity and accuracy in incremental ensemble learning algorithms still leaves much room for research.
In summary, to solve the problems that, facing massive flight data, a flight delay prediction model wastes a great deal of time relearning data it has already learned and occupies a great deal of memory storing historical data, the present invention proposes I-CART (Incremental Classification And Regression Decision Tree), which combines the CART decision tree with the Learn++ incremental learning framework, to improve the efficiency with which the flight delay prediction model learns new data and to improve its prediction performance. Further, because the voting weights of the base classifiers in the I-CART incremental learning algorithm can lead to classification errors of the final ensemble classifier and reduce classification performance, the present invention proposes the I-CART.kappa method, which uses the kappa coefficient as the voting weight of each base classifier, assigning voting weights objectively and fairly and improving classification performance. For the problem of an oversized ensemble classifier, following the idea of selective ensemble and fully considering the relationship between diversity and accuracy among base classifiers, two major classes of selection schemes are invented, yielding the final CART incremental learning classification method based on selective ensemble, the I-CART.kappaS method, which effectively reduces the scale of the ensemble classifier and improves the prediction performance of the model.
Summary of the invention
The contents of the present invention are as follows:
1. A CART incremental learning classification method based on selective ensemble is proposed, which not only effectively improves the learning efficiency of the flight delay prediction model on new data, but also greatly reduces the scale of the ensemble classifier, improving the prediction performance and prediction efficiency of the classifier.
2. The CART decision tree algorithm is combined with the Learn++ incremental learning framework into the I-CART method, realizing incremental learning on new data and improving the learning efficiency and classification performance of the flight delay prediction model on new data.
3. In the I-CART method, the kappa coefficient replaces the sum of the weights of misclassified samples as the voting weight of each base classifier. This avoids excessive misclassified-sample weights compromising the objectivity of the voting weights, assigns voting weights objectively and fairly, reduces the classification error rate, and further improves the I-CART method.
4. Based on the relationship between diversity and accuracy among base classifiers, base classifiers with high diversity and high accuracy are selectively ensembled. The invention proposes two classes of selection schemes, VS (vertical-line selection) and HS (horizontal-line selection), applied to the improved I-CART method, which significantly reduce the scale of the final ensemble classifier and improve its classification performance.
To reach the above goals, after repeated study, discussion, and practice, the final scheme of this method is determined as follows:
The CART decision tree algorithm is combined with the Learn++ incremental learning framework to construct the I-CART method, realizing incremental learning on new data, saving model training time, efficiently using new information, and improving classification performance. When setting the voting weights of the base classifiers in the ensemble, the kappa coefficient replaces the sum of the weights of misclassified samples as the voting weight of each base classifier in the I-CART method, avoiding excessive misclassified-sample weights compromising the objectivity of the voting weights, assigning weights objectively and fairly, and reducing the classification error rate. From the ensemble classifier finally generated, base classifiers with high diversity and high accuracy are selected according to a suitable one of the proposed selective ensemble schemes and combined into the final classifier; that is, a selective ensemble scheme is added to the improved I-CART algorithm, significantly reducing the scale of the final ensemble classifier and improving its classification performance.
To achieve the above object, the present invention adopts the following technical scheme that:
A CART incremental learning classification method based on selective ensemble for flight delay prediction comprises the following steps:
Step 1. Divide the flight data set into K sub-data sets (usually equal parts) as the data set of each iteration; each sub-data set can be regarded as a newly added data set. Set the number of iterations Tk to improve the generalization of the ensemble classifier; the final number of base classifiers is N = K*Tk. Set the number SN of base classifiers to select, according to actual needs and the observed classification effect, generally 1/3N to 1/2N; this is a requirement of the selective ensemble scheme. Use the CART decision tree algorithm as the base learning algorithm.
Step 2. Execute the improved I-CART method, iteratively calling the base learning algorithm to generate base classifiers. After each base classifier is generated, compute its kappa coefficient and save it as the classifier's voting weight. The kappa coefficient evaluates the classification effect of a base classifier objectively and fairly, and using it as the voting weight further improves classification performance.
Step 3. After the iterations complete, all N base classifiers have been generated. Using a suitable selective ensemble scheme, VS (vertical-line selection) or HS (horizontal-line selection), select SN base classifiers from all base classifiers to compose the final ensemble classifier. Selective ensembling greatly reduces the number of base classifiers in the ensemble, reducing the scale of the ensemble classifier and improving classification performance.
Step 4. Combine the selected base classifiers by weighted voting, with the kappa coefficients as the voting weights; the class with the most votes becomes the final classification, yielding the flight delay prediction result.
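The weighted voting in step 4 can be sketched as follows, assuming each base classifier has already produced a class label for the sample and its saved kappa coefficient is used as its weight (the function and variable names are illustrative, not from the patent):

```python
from collections import defaultdict

def weighted_vote(predictions, weights):
    """Combine base-classifier predictions for one sample by weighted voting.

    predictions: list of class labels, one per base classifier.
    weights: matching list of voting weights (here, kappa coefficients).
    Returns the class with the largest total weight.
    """
    totals = defaultdict(float)
    for label, w in zip(predictions, weights):
        totals[label] += w
    return max(totals, key=totals.get)

# Three base classifiers vote on one flight: 1 = delayed, 0 = on time.
# The single high-kappa vote for "delayed" outweighs two weaker votes.
print(weighted_vote([1, 0, 0], [0.9, 0.4, 0.3]))  # → 1
```

Note that a high-kappa minority can outvote a low-kappa majority, which is exactly the intended effect of weighting by classifier quality rather than counting heads.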
Detailed description of the invention
Fig. 1: Overall scheme of the CART incremental learning classification method based on selective ensemble
Fig. 2: Flow chart of the I-CART method
Fig. 3: Flow chart of the CART incremental learning classification method based on selective ensemble
Fig. 4: Schematic diagram of selective ensemble scheme VS (vertical-line selection)
Fig. 5: Schematic diagram of selective ensemble scheme HS (horizontal-line selection)
Specific embodiment
The present invention combines the CART decision tree algorithm with the Learn++ algorithm into the I-CART method that realizes incremental learning, efficiently learning new flight information and improving delay prediction accuracy; uses the kappa coefficient as the voting weight within the ensemble classifier, optimizing the incremental learning algorithm; and finally selectively ensembles the base classifiers, realizing the CART incremental learning classification method based on selective ensemble, which reduces the ensemble size and improves prediction speed and classification accuracy. An ensemble learning method with strong classification performance is thus invented for the flight delay prediction classification model.
Fig. 1 decomposes the invention into the following steps.
Step 1: Divide the flight data set into K sub-data sets, which can be regarded as continually arriving new data;
Step 2: Input the K sub-data sets into the method and execute the improved I-CART method. Using the CART decision tree algorithm, iterate Tk times (Tk > 0, usually set to 3-5) on each sub-data set to generate Tk base classifiers, computing and saving the kappa coefficient of each. Since there are K sub-data sets, iterated Tk times each, the final I-CART algorithm generates N (N = K*Tk) base classifiers.
Step 3: After the iterations complete, all N base classifiers have been generated, and the selective ensemble method begins. Using the invented selection schemes based on the relationship between diversity and accuracy among base classifiers, SN (SN > 0 and SN ≤ N; SN is usually set to 1/3N to 1/2N, since an SN that is too large or too small is meaningless) base classifiers are selected from the N base classifiers and added to the ensemble classifier.
Step 4: Combine the base classifiers with their kappa coefficients as the voting weights to obtain the final classification result; the class with the most votes is the sample's classification.
The CART incremental learning classification method based on selective ensemble can also be divided into two parts. One part is the improved I-CART incremental learning algorithm, i.e. the I-CART incremental learning algorithm that uses the kappa coefficient as the voting weight; the other part is the selective ensemble schemes VS (vertical-line selection) and HS (horizontal-line selection). Adding a selective ensemble scheme to the improved I-CART algorithm constitutes the present invention. The implementations of these two parts are therefore introduced separately below.
One. The I-CART incremental learning algorithm and the improvement of the voting weights
1.1 The I-CART incremental learning algorithm
The I-CART incremental learning approach combines the CART decision tree algorithm with the Learn++ incremental learning framework so that the CART algorithm gains the ability of incremental learning. CART performs outstandingly in flight delay prediction classification but does not have the ability of incremental learning. The I-CART incremental learning algorithm, when facing newly added data sets, can learn the new data on the basis of existing knowledge without forgetting what has already been learned, improving learning efficiency and enhancing classification performance.
I-CART realizes incremental learning through the Learn++ incremental learning algorithm. Learn++ keeps the knowledge already learned by converting historical data into retained base classifiers, and learns new knowledge by training new base classifiers on new data: not only learning from new data, but, more importantly, learning new classes. Learn++ maintains a group of sample weights w for each sub-data set; a sample's weight determines its probability of being selected into the training set. After each iteration, the current ensemble classifier is evaluated on the data set and the sample weights are updated, so that the weights of misclassified samples increase and the weights of correctly classified samples decrease. This increases the probability that misclassified samples are chosen for the training set, making the classifier focus more on the samples that are hard to classify. If a new class appears, its samples are bound to be misclassified by the existing classifiers and become misclassified samples, so the learning algorithm is made to pay close attention to these samples carrying the new class, realizing learning of the new class.
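Assuming the conventional Learn++/AdaBoost-style multiplicative update, the sample-weight adjustment described above can be sketched as follows (the fixed beta factor and all names are illustrative, not from the patent):

```python
def update_weights(weights, y_true, y_pred, beta):
    """Learn++-style weight update: multiply the weight of every correctly
    classified sample by beta (0 < beta < 1) and renormalize, so that
    misclassified samples, including any sample of a brand-new class,
    become more likely to be drawn into the next training set.
    """
    new_w = [w * (beta if t == p else 1.0)
             for w, t, p in zip(weights, y_true, y_pred)]
    total = sum(new_w)
    return [w / total for w in new_w]

# Four samples, equal initial weights; the ensemble misclassifies sample 3.
w = update_weights([0.25] * 4, [1, 0, 1, 1], [1, 0, 0, 1], beta=0.5)
print([round(x, 3) for x in w])  # the misclassified sample now dominates
```

After renormalization the misclassified sample's weight doubles relative to the others, which is the mechanism that lets new classes (always misclassified at first) pull the next base classifier toward themselves.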
The variables of the I-CART incremental learning algorithm and their settings are: (1) the base learning algorithm is set to the CART algorithm, from which the base classifiers are generated; (2) K sub-data sets are input, where k denotes the k-th sub-data set, k = 1, 2, ..., K; (3) Tk is the number of iterations per sub-data set, Tk > 0, usually set to 3-10, and t denotes each iteration, t = 1, 2, ..., Tk; (4) count records the number of base classifiers generated, initialized to count = 0; (5) wt denotes the group of sample weights; (6) Hcount denotes the ensemble classifier possessing count base classifiers; (7) kappa denotes the kappa coefficient of a base classifier; (8) betak denotes the voting weight of a base classifier; (9) Hfinal denotes the final ensemble classifier.
The specific steps are as follows. The method has two nested loops: the outer loop traverses the sub-data sets, which can be regarded as a continually growing data set, guaranteeing that the algorithm can receive newly added data sets; the inner loop generates base classifiers with diversity.
First, start the outer loop, traversing the K sub-data sets:
Step 1: When k = 1, initialize the sample weights w1 = 1/m, where m is the number of samples in the sub-data set. When k > 1, i.e. from the second sub-data set onward (when k = 1 no base classifiers yet compose Hcount), the sample weights w1 can be updated according to the evaluation of the new data by Hcount, so that the training samples include the samples that are hard to classify, pushing the new base classifiers to learn these samples more.
Start the inner loop, iterating Tk times over each sub-data set, with t denoting the t-th iteration:
Step 2: To guarantee that wt can serve as a distribution, normalize it so that its entries sum to 1, and construct the training data set by sampling according to the sample weights wt.
Step 3: Call the CART learning algorithm and train it on the training data set generated in step 2 to obtain base classifier Ct.
Step 4: Compute the voting weight beta of base classifier Ct.
Steps 5, 6: Combine the existing base classifiers into Hcount, and update the sample weights wt according to the evaluation of the sub-data set by Hcount.
Step 7: Integrate the final ensemble classifier Hfinal by weighted voting.
As shown in Fig. 2, the module "Hcount evaluates the data set and updates the sample weights w" is the core of incremental learning. Each iteration over sub-data set k can be regarded as the addition of new data, meaning the existing model can learn new data. The "k > 1" judgment then routes execution into "Hcount evaluates the data set and updates the sample weights w" whenever k > 1. This operation is the key to Learn++ being able to learn new classes: the newly added data are evaluated with the existing classifiers, the weights of misclassified samples increase while the weights of correctly classified samples decrease, and samples containing a new class are bound to be among the misclassified samples, so they have a greater probability of being selected into the training set.
The CART algorithm is input into the Learn++ algorithm as the base learning algorithm, constructing the I-CART incremental learning algorithm and realizing incremental learning. The specific steps are given in Method 1.
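The Method 1 listing itself is not reproduced in this text. The following is a minimal sketch of the two-loop structure under stated simplifying assumptions: a trivial majority-class learner stands in for CART, training accuracy stands in for the kappa voting weight, and the Learn++ weight-update factor is fixed at 0.5; all names are illustrative.

```python
import random
from collections import Counter, defaultdict

def majority_learner(X, y):
    # Stand-in for CART: always predicts the majority class of its training set.
    majority = Counter(y).most_common(1)[0][0]
    return lambda x: majority

def i_cart(subsets, Tk, base_learner=majority_learner, seed=0):
    """Skeleton of the I-CART loop: for each incoming sub-data set, run Tk
    iterations, each training a base classifier on a weighted resample and
    storing it with its voting weight."""
    rng = random.Random(seed)
    ensemble = []                             # (classifier, voting weight) pairs
    for X, y in subsets:                      # outer loop: each new sub-data set
        m = len(y)
        w = [1.0 / m] * m                     # initialize sample weights
        for _ in range(Tk):                   # inner loop: Tk iterations
            idx = rng.choices(range(m), weights=w, k=m)
            clf = base_learner([X[i] for i in idx], [y[i] for i in idx])
            acc = sum(clf(x) == t for x, t in zip(X, y)) / m
            ensemble.append((clf, acc))       # the patent stores kappa here
            w = [wi * (0.5 if clf(x) == t else 1.0)   # Learn++-style update
                 for wi, x, t in zip(w, X, y)]
            s = sum(w)
            w = [wi / s for wi in w]
    return ensemble

def predict(ensemble, x):
    votes = defaultdict(float)
    for clf, weight in ensemble:
        votes[clf(x)] += weight
    return max(votes, key=votes.get)

subsets = [([0, 1, 2, 3], [0, 0, 0, 1]), ([4, 5, 6, 7], [0, 1, 1, 1])]
model = i_cart(subsets, Tk=3)
print(len(model))  # → 6, i.e. K * Tk = 2 * 3 base classifiers
```

The point of the sketch is structural: the ensemble only ever grows by appending, so arrival of a new sub-data set never requires retraining on old data.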
1.2 The improvement of the voting weights
Step 4 of the I-CART incremental learning algorithm computes the voting weight of the base classifier; the formula is log((1 - εt)/εt), where εt is the sum of the weights of the samples in the sub-data set misclassified by base classifier Ct. Using this as the base classifier's voting weight causes the problems interpreted above in the analysis of the Learn++ voting-weight problem, which are not repeated here. The improvement is to use the kappa coefficient of the base classifier as the voting weight: the voting weight betak in step 4 of the I-CART incremental learning algorithm is changed to the kappa coefficient calculation, and the voting weight in the final integration of step 7 is changed to the kappa value, as shown in the parts marked * in Method 1. The calculation of the kappa coefficient of a base classifier is given in Method 2.
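The Method 2 listing is not reproduced here. A minimal sketch of the kappa computation for a single base classifier, measuring agreement between its predictions and the true labels (function and variable names are illustrative), might look like:

```python
from collections import Counter

def kappa_score(y_true, y_pred):
    """Cohen's kappa between predicted labels and the true labels.

    kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement
    (plain accuracy here) and p_e is the agreement expected by chance
    from the marginal label frequencies.
    """
    n = len(y_true)
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)
    labels = set(true_counts) | set(pred_counts)
    p_e = sum(true_counts[c] * pred_counts[c] for c in labels) / (n * n)
    if p_e == 1.0:           # degenerate case: one label on both sides
        return 1.0
    return (p_o - p_e) / (1 - p_e)

# Toy check: a classifier that is right 4 times out of 5 on a binary task.
print(round(kappa_score([1, 0, 1, 1, 0], [1, 0, 1, 0, 0]), 3))  # → 0.615
```

Unlike raw accuracy, kappa discounts agreement that would occur by chance given the label distribution, which is why the text calls it an objective and fair basis for the voting weight.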
Two. Selective ensemble schemes
For the problem that the final ensemble classifier is too large, following the idea of selective ensemble, the relationship between diversity and accuracy among base classifiers is investigated. The evaluation index of diversity is the kappa coefficient between base classifiers; the evaluation index of accuracy is the majority-vote error rate between base classifiers. A smaller kappa coefficient indicates greater diversity, and a lower error indicates higher accuracy. The present invention fully considers the relationship between diversity and accuracy among base classifiers, trades off and balances the two from different directions, and designs two selection schemes, selective ensemble scheme VS (vertical-line selection) and selective ensemble scheme HS (horizontal-line selection). These are two parallel methods; execution requires choosing only one of them. Selective integration can greatly reduce the number of base classifiers while improving the classification performance of the ensemble. The calculation of the kappa coefficient between two base classifiers is given in Method 3.
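The Method 3 listing is not reproduced in the text. Pairwise kappa between two base classifiers is the same statistic applied to their two prediction vectors instead of predictions versus truth; a sketch under illustrative names:

```python
from collections import Counter

def pairwise_kappa(pred_a, pred_b):
    """Kappa agreement between the outputs of two base classifiers on the
    same samples: low kappa means high diversity between the pair."""
    n = len(pred_a)
    p_o = sum(a == b for a, b in zip(pred_a, pred_b)) / n
    ca, cb = Counter(pred_a), Counter(pred_b)
    p_e = sum(ca[c] * cb[c] for c in set(ca) | set(cb)) / (n * n)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)

# Two classifiers that disagree on half the samples: kappa near zero,
# i.e. maximal diversity under these marginals.
print(round(pairwise_kappa([1, 1, 0, 0], [1, 0, 1, 0]), 3))  # → 0.0
```

Note that no true labels enter this computation; it measures only how differently the two classifiers behave, which is what the selection schemes below trade off against the pair's error rate.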
2.1 Selective ensemble scheme VS (vertical-line selection)
Fig. 4 is obtained by plotting the kappa value and the error generated between every pair of distinct base classifiers, with the kappa value on the x-axis and the error on the y-axis. If the relationship between diversity and accuracy were not considered, selection would start from the most diverse base classifiers; but the figure shows that the most diverse base classifier pairs have high error rates, which does not fit the idea of selective ensemble. Therefore, to take both into account and also guarantee the accuracy of the base classifiers, note that to the right of a line perpendicular to the x-axis the error values of the corresponding base classifiers are generally smaller than on the left; selective integration thus proceeds rightward from that line, guaranteeing the accuracy of the base classifiers.
The selective ensemble scheme is the "selective ensemble" part of Fig. 1; this part is carried out only after all base classifiers have been trained. Parameter meanings and settings: this selection scheme needs an additional parameter begin, which guarantees the accuracy of the base classifiers. N denotes the number of base classifiers generated in the I-CART algorithm, i.e. K*Tk. SN denotes the number of base classifiers to select, SN > 0 and SN ≤ N, usually set to 1/3N to 1/2N: an SN too close to N has no obvious effect on reducing the ensemble scale, while one too small cannot satisfy the purpose of ensembling. HfinalS denotes the ensemble classifier composed of the selected base classifiers. hi and hj denote distinct base classifiers, i ≠ j, i, j = 1, 2, ..., N. The execution process is as follows:
Step 1: Without repetition, compute the kappa coefficient and the majority-vote error rate between every pair of base classifiers, and store them in array kappa and array error. Step 2: Sort the kappa array in ascending order and obtain the corresponding base classifier serial numbers, guaranteeing that selection starts from the most diverse base classifiers. Step 3: Set the ordinal position begin at which selection starts, generally at n times the number of kappa values, with n usually taking the value 1/2 or 1/3; not starting the integration from the smallest kappa values gives up some diversity in exchange for guaranteed higher accuracy. Step 4: Starting from begin, successively select base classifiers into the ensemble until the number of base classifiers reaches SN. Step 5: Integrate the SN selected base classifiers into the final ensemble classifier HfinalS by weighted voting. The specific selection method is given in Method 4.
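The Method 4 listing is not reproduced here. One reading of steps 1-4, in which each sorted kappa value indexes a classifier pair and both members of a pair are taken (an assumption, since the text does not spell out how a pair maps to individual classifiers), can be sketched as:

```python
def vs_select(pair_kappa, sn, frac=1/3):
    """VS (vertical-line) selection sketch: rank classifier pairs by
    pairwise kappa (ascending, i.e. most diverse first), skip the first
    `frac` of the ranking to keep accuracy up, then take classifiers
    from successive pairs until `sn` have been selected.

    pair_kappa: dict mapping (i, j) classifier-index pairs to kappa.
    Returns the selected classifier indices, in selection order.
    """
    ranked = sorted(pair_kappa, key=pair_kappa.get)   # ascending kappa
    begin = int(frac * len(ranked))                   # the "vertical line"
    selected = []
    for i, j in ranked[begin:]:
        for idx in (i, j):
            if idx not in selected:
                selected.append(idx)
                if len(selected) == sn:
                    return selected
    return selected

# 4 classifiers, 6 pairwise kappa values (toy numbers, not patent data).
pairs = {(0, 1): 0.1, (0, 2): 0.3, (0, 3): 0.5,
         (1, 2): 0.7, (1, 3): 0.8, (2, 3): 0.9}
print(vs_select(pairs, sn=2))  # → [0, 3]: skips the most diverse third
```

With frac = 1/3 the two lowest-kappa (highest-error, per the figure) pairs are passed over, matching the scheme's stated trade of some diversity for accuracy.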
2.2 Selective ensemble scheme HS (lateral scoring method)
Fig. 4 is obtained by plotting the kappa value and the error produced by every pair of base classifiers, with error on the x-axis and kappa on the y-axis. If the relationship between diversity and accuracy were ignored and selection started from the base classifiers with the lowest error rate, the figure shows that low-error base classifiers have minimal diversity, which contradicts the idea of selective ensembling. To balance the two, the diversity of the base classifiers must be controlled: among the base classifiers whose kappa value falls below a specified threshold, those with the highest accuracy are selected for the ensemble. As shown in Fig. 4, selection is restricted to the region below a line perpendicular to the y-axis, which guarantees diversity among the base classifiers.
The selection scheme likewise corresponds to the "selective ensemble" part of Fig. 1; either it or VS may be chosen. Parameter meanings and settings are as follows. The scheme introduces a parameter threshold representing the diversity threshold, which controls diversity. N denotes the number of base classifiers generated by the I-CART algorithm. SN denotes the number of base classifiers to select, with 0 < SN ≤ N; SN is usually set between 1/3N and 1/2N, for the same reasons as in VS. HfinalS denotes the ensemble classifier composed of the selected base classifiers. IndexH denotes a base classifier index. hi1 and hj1 denote distinct base classifiers, i1 ≠ j1, i1, j1 = 1, 2, ..., N. The procedure is as follows:
Step 1: compute, without duplication, the kappa coefficient and the pairwise error rate between every two base classifiers, and store them in arrays kappaH and errorH. Step 2: set the threshold to the average of all kappa coefficients; the average is more dynamic than a fixed threshold and adapts better to the data at hand. Step 3: sort the errorH array in ascending order and obtain the corresponding base classifier indices, which guarantees that selection starts from the most accurate base classifiers. Step 4: set beginH, the position at which selection starts, to the position of the first index satisfying kappaH(IndexH) < threshold; this saves selection time, since no further one-by-one comparison from the start is needed. Step 5: starting from beginH, add each base classifier satisfying kappaH(IndexH) < threshold to the ensemble until the number of base classifiers reaches SN. Step 6: combine the SN selected base classifiers by weighted voting into the final ensemble classifier HfinalS.
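A sketch of the HS steps. The patent does not define the pairwise error exactly; here it is taken as the mean of the two classifiers' individual error rates against held-out labels y_true, which is an assumption, as are the function names:

```python
from collections import Counter
from itertools import combinations

def kappa(pa, pb):
    # Cohen's kappa between two prediction sequences (diversity measure)
    n = len(pa)
    p_o = sum(a == b for a, b in zip(pa, pb)) / n
    ca, cb = Counter(pa), Counter(pb)
    p_e = sum(ca[l] * cb[l] for l in set(ca) | set(cb)) / (n * n)
    return 1.0 if p_e == 1.0 else (p_o - p_e) / (1 - p_e)

def hs_select(preds, y_true, sn):
    """Select sn base-classifier indices by the HS scheme."""
    # individual error rate of each base classifier
    err = [sum(p != y for p, y in zip(pr, y_true)) / len(y_true)
           for pr in preds]
    # step 1: pairwise error (assumed: mean of the pair's errors) and kappa
    pairs = [(0.5 * (err[i] + err[j]), kappa(preds[i], preds[j]), i, j)
             for i, j in combinations(range(len(preds)), 2)]
    # step 2: dynamic threshold = average of all pairwise kappa values
    threshold = sum(k for _, k, _, _ in pairs) / len(pairs)
    chosen = []
    for _, k, i, j in sorted(pairs):   # step 3: error ascending
        if k >= threshold:             # steps 4-5: kappa must stay below it
            continue
        for idx in (i, j):
            if idx not in chosen:
                chosen.append(idx)
            if len(chosen) == sn:
                return chosen
    return chosen
```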
The detailed selection procedure is given as Method 5.
A suitable selection scheme (cf. Fig. 3 and Fig. 4) is chosen and incorporated into the improved I-CART method, realizing the CART incremental learning classification method based on selective ensemble.
Experiments and results:
The experiments use the Airline On-Time Performance Data (AOTP) provided by the U.S. Bureau of Transportation Statistics (BTS): all 45,853 flight records for New York in August 2017. After rejecting 573 unusable samples, 45,280 samples remain for the experiments.
Experimental setup: 43,016 training samples and 2,264 test samples; K = 20 and Tk = 5, so N = K*Tk = 100. The selective ensemble scheme is HS (lateral scoring method) with SN = 50.
Experimental results: the times (in seconds) for the C4.5, CART, and I-CART methods to learn new data on this dataset are shown in the following table:
The I-CART method saves about 70% of the time compared with the CART method. The accuracy of I-CART is 88.2% and that of the improved I-CART is 89.5%, i.e. the improved I-CART lowers the error rate by 1.3%. The selective-ensemble-based CART method uses 50 fewer base classifiers than I-CART, so the ensemble size shrinks by 50%, effectively reducing the scale of the ensemble classifier, while its accuracy reaches 90.2%, a 2.0% improvement in classification performance. The present invention thus realizes incremental learning of flight information and further improves the classification performance of the flight delay prediction model.
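Throughout the method, the selected base classifiers are combined by weighted voting with their kappa coefficients as weights. For a single sample this amounts to the following sketch (function name and label strings are illustrative):

```python
from collections import defaultdict

def weighted_vote(preds, weights):
    """Weighted voting over one sample: each base classifier's vote
    counts with its kappa-based weight; the class with the largest
    weighted total is the final prediction.

    preds:   one predicted label per base classifier
    weights: the corresponding kappa voting weights (betak)
    """
    tally = defaultdict(float)
    for label, w in zip(preds, weights):
        tally[label] += w
    return max(tally, key=tally.get)
```

For example, three classifiers voting 'delayed', 'on-time', 'delayed' with weights 0.6, 0.9, 0.5 yield totals 1.1 vs. 0.9, so the weighted result is 'delayed' even though the single most confident classifier voted 'on-time'.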

Claims (4)

1. A CART incremental learning classification method based on selective ensemble for flight delay prediction, characterized by comprising the following steps:
Step 1. Divide the flight dataset into K sub-datasets, one serving as the dataset of each iteration; set the number of iterations Tk and the number of base classifiers N, N = K*Tk; set the number SN of base classifiers to select, with SN between 1/3N and 1/2N;
Step 2. Iteratively call the base learning algorithm to generate base classifiers; after each base classifier is generated, compute its kappa coefficient and save it as that classifier's voting weight;
Step 3. When iteration completes and all base classifiers have been generated, select SN base classifiers from them using the longitudinal scoring method VS or the lateral scoring method HS to form the final ensemble classifier;
Step 4. Combine the selected base classifiers by weighted voting, with the kappa coefficients as voting weights; the class receiving the most votes is taken as the final class, yielding the flight delay prediction result.
2. The learning classification method according to claim 1, characterized in that the improved I-CART method is specifically:
Variable meanings and settings of the improved I-CART method: (1) the base learning algorithm is set to the CART algorithm, which generates the base classifiers; (2) K sub-datasets are input, with k denoting the k-th sub-dataset, k = 1, 2, ..., K; (3) Tk is the number of iterations per sub-dataset, Tk > 0, set to 3~10, with t denoting an iteration, t = 1, 2, ..., Tk; (4) count records the number of generated base classifiers, initialized to count = 0; (5) wt denotes the weights of a group of samples; (6) Hcount denotes the ensemble classifier holding count base classifiers; (7) kappa denotes a base classifier's kappa coefficient; (8) betak denotes a base classifier's voting weight; (9) Hfinal denotes the final ensemble classifier.
The specific steps comprise two nested loops: the outer loop traverses the sub-datasets, which are regarded as a continually growing dataset; the inner loop generates diverse base classifiers.
First, the outer loop traverses the K sub-datasets:
Step 1: when k = 1, initialize the sample weights w1 = 1/m, where m is the number of samples in the sub-dataset; when k > 1, i.e. from the second sub-dataset onward, update the sample weights w1 according to Hcount's evaluation of the new data, so that the training samples emphasize hard-to-classify samples and the new base classifiers learn more from them. Then start the inner loop, iterating Tk times over each sub-dataset, with t denoting the t-th iteration:
Step 2: to guarantee that wt can serve as a distribution, normalize it so that its entries sum to 1; construct the training dataset by sampling according to the sample weights wt;
Step 3: call the CART learning algorithm and train base classifier Ct on the training dataset constructed in Step 2;
Step 4: compute base classifier Ct's voting weight betak: compute Ct's kappa coefficient κ1 and set betak = κ1;
Steps 5, 6: combine the existing base classifiers into Hcount, and update the sample weights wt according to Hcount's evaluation of the sub-dataset;
Step 7: combine by weighted voting into the final ensemble classifier Hfinal.
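The two-layer procedure of claim 2 can be sketched as follows. The weighted resampling, the halving of correctly classified samples' weights, the kappa-vs-truth voting weight, and the use of scikit-learn's DecisionTreeClassifier as the CART learner are illustrative assumptions about details the claim leaves open:

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score
from sklearn.tree import DecisionTreeClassifier  # CART-style trees

def ensemble_predict(ensemble, X):
    # Weighted-vote prediction of the current ensemble H_count
    # (binary labels {0, 1} assumed for this sketch).
    score = np.zeros(len(X))
    for clf, betak in ensemble:
        score += betak * (2 * clf.predict(X) - 1)
    return (score > 0).astype(int)

def i_cart(subsets, Tk, seed=0):
    """Outer loop over the K incremental sub-datasets; inner loop
    produces Tk diverse CART base classifiers per subset via
    weighted resampling (Learn++-style)."""
    rng = np.random.default_rng(seed)
    ensemble = []                                  # (classifier, betak)
    for X, y in subsets:                           # outer loop, k = 1..K
        m = len(y)
        w = np.full(m, 1.0 / m)                    # step 1: w1 = 1/m
        if ensemble:                               # k > 1: re-weight by H_count
            w[ensemble_predict(ensemble, X) == y] *= 0.5
        for _ in range(Tk):                        # inner loop, t = 1..Tk
            w /= w.sum()                           # step 2: keep w a distribution
            idx = rng.choice(m, size=m, p=w)       # sample training set by w
            clf = DecisionTreeClassifier(random_state=0).fit(X[idx], y[idx])
            k1 = cohen_kappa_score(clf.predict(X), y)  # step 4: betak = kappa
            ensemble.append((clf, max(k1, 1e-6)))  # floor keeps weights positive
            # steps 5-6: down-weight samples H_count already classifies correctly
            w[ensemble_predict(ensemble, X) == y] *= 0.5
    return ensemble                                # step 7: vote via ensemble_predict
```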
3. The learning classification method according to claim 1, characterized in that the longitudinal scoring method is specifically:
N denotes the number of base classifiers generated by the I-CART algorithm, i.e. K*Tk; SN denotes the number of base classifiers to select, 0 < SN ≤ N, set between 1/3N and 1/2N; HfinalS denotes the ensemble classifier composed of the selected base classifiers; hi and hj denote distinct base classifiers, i ≠ j, i, j = 1, 2, ..., N; the procedure is as follows:
Step 1: compute, without duplication, the kappa coefficient and the pairwise error rate between every two base classifiers, and store them in arrays kappa and error; Step 2: sort the kappa array in ascending order and obtain the corresponding base classifier indices, guaranteeing selection from the most diverse base classifiers; Step 3: set begin, the ordinal position at which selection starts, to n times the number of kappa values, with n generally 1/2 or 1/3; Step 4: starting from begin, add base classifiers to the ensemble one by one until their number reaches SN; Step 5: combine the SN selected base classifiers by weighted voting into the final ensemble classifier HfinalS.
4. The learning classification method according to claim 1, characterized in that the lateral scoring method is specifically:
N denotes the number of base classifiers generated by the improved I-CART algorithm; SN denotes the number of base classifiers to select, 0 < SN ≤ N, set between 1/3N and 1/2N; HfinalS denotes the ensemble classifier composed of the base classifiers selected by scheme HS; hi1 and hj1 denote distinct base classifiers, i1 ≠ j1, i1, j1 = 1, 2, ..., N; IndexH denotes a base classifier index; the procedure is as follows:
Step 1: compute, without duplication, the kappa coefficient and the pairwise error rate between every two base classifiers, and store them in arrays kappaH and errorH; Step 2: set the threshold to the average of all kappa coefficients; Step 3: sort the errorH array in ascending order and obtain the corresponding base classifier indices, guaranteeing selection from the most accurate base classifiers; Step 4: set beginH, the position at which selection starts, to the position of the first index satisfying kappaH(IndexH) < threshold; Step 5: starting from beginH, add each base classifier satisfying kappaH(IndexH) < threshold to the ensemble until the number of base classifiers reaches SN; Step 6: combine the SN selected base classifiers by weighted voting into the final ensemble classifier HfinalS.
CN201910052118.1A 2019-01-21 2019-01-21 A kind of CART increment study classification method based on selective ensemble for flight delay prediction Pending CN109816010A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910052118.1A CN109816010A (en) 2019-01-21 2019-01-21 A kind of CART increment study classification method based on selective ensemble for flight delay prediction


Publications (1)

Publication Number Publication Date
CN109816010A true CN109816010A (en) 2019-05-28

Family

ID=66604669

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910052118.1A Pending CN109816010A (en) 2019-01-21 2019-01-21 A kind of CART increment study classification method based on selective ensemble for flight delay prediction

Country Status (1)

Country Link
CN (1) CN109816010A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111160439A (en) * 2019-12-24 2020-05-15 西北工业大学 Unmanned aerial vehicle system autonomous capability evaluation method and system and readable storage medium
CN112115829A (en) * 2020-09-09 2020-12-22 贵州大学 Expression recognition method based on classifier selective integration
CN112115829B (en) * 2020-09-09 2023-02-28 贵州大学 Expression recognition method based on classifier selective integration
CN112651951A (en) * 2020-12-30 2021-04-13 深圳高性能医疗器械国家研究院有限公司 DCE-MRI-based breast cancer classification method
CN116362430A (en) * 2023-06-02 2023-06-30 中国民航大学 Flight delay prediction method and system based on online increment MHHA-SRU
CN116362430B (en) * 2023-06-02 2023-08-01 中国民航大学 Flight delay prediction method and system based on online increment MHHA-SRU


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20190528