CN107273387A - Ensemble classification for high-dimensional and imbalanced data - Google Patents

Ensemble classification for high-dimensional and imbalanced data

Info

Publication number
CN107273387A
CN107273387A (application number CN201610218160.2A)
Authority
CN
China
Prior art keywords
feature
data
sampling
class
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201610218160.2A
Other languages
Chinese (zh)
Inventor
李臻
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shanghai wind newspaper Mdt InfoTech Ltd
Original Assignee
Shanghai Boson Data Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shanghai Boson Data Technology Co Ltd
Priority to CN201610218160.2A
Publication of CN107273387A
Legal status: Pending


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases

Abstract

The present invention proposes ensemble classification for high-dimensional and imbalanced data, characterised in that: preprocessing strategies are reduced to two classes according to the order of dimensionality reduction and sampling; based on the principle of reproducibility of experimental conclusions, several standard data-mining and machine-learning data sets are chosen as experimental data; wrapper feature selection methods and over-sampling methods are added to the selection of preprocessing methods; and the influence of preprocessing methods on the classification performance of high-dimensional imbalanced data is studied with respect to both the number of attributes and the degree of imbalance. Using a more complete preprocessing experiment strategy, a different conclusion is obtained: before classifying high-dimensional imbalanced data, first reducing the features and then rebalancing the data yields better average AUC performance. The method has a high degree of automation, and the influence of high dimensionality and imbalance on classification is mitigated with different combinations of preprocessing strategies.

Description

Ensemble classification for high-dimensional and imbalanced data
Technical field
The present invention relates to the field of data processing, and in particular to ensemble classification for high-dimensional and imbalanced data.
Background art
Research in data mining faces challenges from all kinds of data, and data with different characteristics add to the complexity of algorithm research. Among these, classifying data that are both high-dimensional and imbalanced has been a research focus in recent years. Existing methods consider only one of the two characteristics, high dimensionality or imbalance, yet a large amount of real data exhibits both at once. When such data are classified, algorithms designed solely for high-dimensional data or solely for imbalanced data run into performance bottlenecks, so how to classify high-dimensional and imbalanced data effectively is a problem that applied research urgently needs to solve. There are two approaches to classifying high-dimensional imbalanced data: preprocessing (feature selection and sampling) followed by classification, and direct classification. Preprocessed data can be fed directly to existing classification algorithms, but preprocessing loses part of the feature and instance information, and its effect influences classification performance. Direct classification retains all of the data information, but the classification algorithm must handle high dimensionality and imbalance at the same time, which increases design complexity. This work studies both aspects. For preprocessing high-dimensional and imbalanced data, the question of whether feature selection or sampling should come first is answered by experimental comparison, which shows that feature selection prior to sampling is
the better choice. For the data imbalance faced when feature selection comes first, an imbalanced-data feature selection algorithm, BRFVS, is proposed. To address the possible loss of features or instances caused by preprocessing, two algorithms are proposed under a feature-based ensemble learning framework, from the randomization side (random forests) and the selection side (ensemble feature selection): the cost-sensitive random forest algorithm CSRF and the classification algorithm IEFS based on ensemble feature selection. The specific work is as follows:
1) The influence of the order of feature selection and sampling on classification performance is compared. Experimental studies in a specific domain (software defect detection) showed better classification when sampling is done first and feature selection second; because the experimental data were from a single source, that conclusion does not generalize. Validation studies across several other domains showed instead that the order of feature selection and sampling is not a key factor for classification performance; but because artificial noise factors were introduced, that conclusion does not apply to noise-free settings. Here, 12 experimental data sets were screened from the UCI repository according to application domain, dimensionality and degree of imbalance. With AUC as the evaluation criterion, combinations of filter and wrapper feature selection methods with sampling methods were tested for their influence on classification performance after preprocessing. Contrary to the conclusions above, the average AUC of feature selection followed by sampling over the 12 data sets is better than that of sampling followed by feature selection. This conclusion provides practical guidance for preprocessing high-dimensional imbalanced data.
2) The imbalanced feature selection algorithm BRFVS is proposed. Algorithms for feature selection on imbalanced data are at present relatively scarce.
The existing EFSBS algorithm is a filter method and does not make full use of feedback from the classification algorithm; the PREE algorithm uses classification performance feedback but cannot handle discrete features. BRFVS is a feature selection algorithm that can handle both discrete and continuous features while making full use of classifier feedback. BRFVS borrows the idea of random forests: under-sampling is used to produce multiple balanced data sets, feature importance measures are computed on each data set with random forest variable selection, and the final measure is obtained as a weighted sum of the per-data-set measures, where the weight of a data set is determined by the consistency of its predictions with the ensemble prediction. Experiments compare the influence of the random forest hyperparameter K on algorithm performance and show that when K equals M, classification after BRFVS feature selection and sampling is better than classification after ordinary feature selection and sampling, further confirming that feature selection before sampling is the better order. 3) The cost-sensitive random forest algorithm CSRF is proposed. Although direct classification is not affected by preprocessing, existing high-dimensional classification algorithms cannot classify imbalanced data effectively, and imbalanced-data classification algorithms do not consider data with high-dimensional characteristics. CSRF introduces test cost and misclassification cost into the attribute split measure of the random forest decision trees; both costs are tied to the minority class, and raising the overall attention paid to the minority class improves its recognition rate. Experiments compare CSRF with the original random forest algorithm and with a random forest that introduces only misclassification cost: CSRF has a clear advantage in AUC performance and especially in minority-class recognition rate, and the classification performance of CSRF is also clearly higher than that of classification after preprocessing.
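For illustration only, the following Python sketch captures the BRFVS idea described above, with scikit-learn's RandomForestClassifier standing in for the variable-importance step; the helper name brfvs_sketch is ours, and the K balanced sets are combined with uniform weights for brevity, whereas the algorithm itself weights each set by the consistency of its predictions with the ensemble prediction.

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    def brfvs_sketch(X, y, minority_label, k=10, seed=0):
        # Build k balanced sets by under-sampling the majority class and
        # average random forest feature importances over them.
        rng = np.random.default_rng(seed)
        minority = np.flatnonzero(y == minority_label)
        majority = np.flatnonzero(y != minority_label)
        scores = np.zeros(X.shape[1])
        for _ in range(k):
            keep = rng.choice(majority, size=len(minority), replace=False)
            idx = np.concatenate([minority, keep])
            rf = RandomForestClassifier(n_estimators=100).fit(X[idx], y[idx])
            scores += rf.feature_importances_ / k
        return scores  # larger score = more important feature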
4) The high-dimensional imbalanced data classification algorithm IEFS, based on ensemble feature selection, is proposed. The objective function of existing ensemble feature selection algorithms considers only a weighted sum of diversity and accuracy, not imbalance, and is therefore unsuited to imbalanced data classification. IEFS selects the Kohavi-Wolpert variance as the diversity measure, introduces a reward-penalty factor into it to increase attention to the minority class, and searches the solution space by hill climbing, so that diversity, accuracy and imbalance are considered together. Experimental results show that the method is slightly worse than CSRF in AUC classification performance but clearly better than C4.5 and random forests in AUC and minority-class recognition. Although feature selection first faces data imbalance, whether BRFVS or an ordinary feature selection algorithm is used, preprocessing the high-dimensionality problem first and the imbalance problem afterwards produces better classification performance. Comparing direct classification with classification after preprocessing shows that direct classification is better in AUC and minority-class recognition rate, but its time cost is higher, so it is suited to offline processing. IEFS, limited by its search method, performs slightly worse than CSRF.
In summary, in view of the defects of the prior art, ensemble classification for high-dimensional and imbalanced data is especially needed to remedy the deficiencies of the prior art.
The content of the invention
It is an object of the present invention to provide ensemble classification for high-dimensional and imbalanced data, which has a high degree of automation and mitigates the influence of high dimensionality and imbalance on classification with different combinations of preprocessing strategies.
The technical scheme adopted by the present invention to solve its technical problem is:
Ensemble classification for high-dimensional and imbalanced data: preprocessing strategies are reduced to two classes according to the order of dimensionality reduction and sampling; based on the principle of reproducibility of experimental conclusions, several standard data-mining and machine-learning data sets are chosen as experimental data; wrapper feature selection methods and over-sampling methods are added to the selection of preprocessing methods; and the influence of preprocessing methods on the classification performance of high-dimensional imbalanced data is studied with respect to both the number of attributes and the degree of imbalance;
Dimensionality reduction methods fall into two classes: feature selection and feature transformation. Feature selection methods are divided into filter and wrapper methods according to whether they are independent of the subsequent learning algorithm. Filter methods are independent of the subsequent learning algorithm and usually assess features directly from statistical properties of all the training data; they are fast, but their assessments can deviate considerably from the performance of the subsequent learning algorithm. Wrapper methods assess feature subsets with the training accuracy of the subsequent learning algorithm; their bias is small, but they are computationally expensive and unsuited to large data volumes. Feature transformation differs from feature selection in that its output is not a set of original attributes but new attributes produced by some transformation principle. Because the transformed attributes change the physical meaning of the original attributes, and most feature transformation methods only apply to continuous attributes, feature transformation is not considered here. Sampling methods comprise two kinds: under-sampling and over-sampling. Preprocessing uses dimensionality reduction methods together with sampling methods;
The assessment used by a dimensionality reduction method depends directly on the data set itself. It is generally believed that features or feature subsets with higher class relevance yield higher classification accuracy; common filter feature-selection assessment measures include between-class distance, information gain, degree of association and inconsistency. Kohavi, however, pointed out that although assessment measures that consider only the data set are computationally efficient, finding the features or feature subsets most relevant to the class and selecting the features or feature subsets that optimize classification accuracy are two different problems;
Sampling methods are a common class of preprocessing techniques that can alleviate imbalance in the data by balancing it through sampling. By sampling direction they divide into two classes: over-sampling (Over Sampling), which adds minority-class instances, and under-sampling (Under Sampling), which removes majority-class instances. By sampling strategy they divide into random and algorithmic classes: random sampling removes or adds instances on a random basis, while algorithmic sampling follows some principle, for example removing instances near the majority-class boundary or generating arbitrary new minority-class instances. In general, random sampling is the more common sampling means; algorithmic sampling can impose a certain bias on the change of the instance set, so random sampling is used to keep the problem simple.
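As an illustration of the two sampling directions, the following minimal Python sketch (the function name random_resample is ours) balances a binary data set by random under-sampling or random over-sampling:

    import numpy as np

    def random_resample(X, y, minority_label, method="under", seed=0):
        rng = np.random.default_rng(seed)
        minority = np.flatnonzero(y == minority_label)
        majority = np.flatnonzero(y != minority_label)
        if method == "under":
            # remove majority-class instances at random until the classes match
            keep = rng.choice(majority, size=len(minority), replace=False)
            idx = np.concatenate([minority, keep])
        else:
            # duplicate minority-class instances at random until the classes match
            extra = rng.choice(minority, size=len(majority) - len(minority), replace=True)
            idx = np.concatenate([majority, minority, extra])
        return X[idx], y[idx]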
Further, the focus here is the combined experimental effect of feature selection and sampling, so on the principle of simplicity, simple, general and efficient algorithms are selected. Among the filter feature selection algorithms, the information gain algorithm and the Relief algorithm are chosen: the former because the subsequent classification algorithm is planned to be a decision tree, and information gain is itself the attribute selection measure of decision trees; the latter because Relief is generally acknowledged to be among the more effective filter algorithms at present. For wrapper algorithms, different search strategies construct different algorithms. Because Kohavi's experimental study showed that best-first search is better than greedy search (hill climbing), best-first search is selected here; in addition, random search can provide more accurate search results, so the genetic search of the basic genetic algorithm is considered at the same time;
Information gain (IG) is a measure commonly used in machine learning and information theory. When a class prediction is made and the value of a feature is known, IG measures the number of bits of information supplied about the class prediction. Information gain can be defined as the difference between the prior uncertainty and the expected posterior uncertainty. To compute the IG of an attribute X with respect to the class attribute Y, two quantities must be known: the uncertainty of the class label Y itself, and the uncertainty of Y once attribute X is considered. These two uncertainties are the entropy H(Y) and the conditional entropy H(Y|X):
H(Y) = -Σi p(yi) log2 p(yi)
H(Y|X) = Σj=1..r p(xj) H(Y | X = xj)
where r represents the number of values of attribute X. The IG of feature X is then defined as:
IG(X) = H(Y) - H(Y|X)
Here H(Y) represents the purity of the class attribute Y when feature X is not considered, and H(Y|X) the purity of Y after feature X is considered. If considering attribute X makes the division of Y purer, the feature is held to distinguish the classes effectively: the smaller the entropy, the higher the purity, so the attribute with the maximum information gain should be selected.
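The definitions above translate directly into a short Python sketch for a discrete attribute; the helper names entropy and information_gain are illustrative:

    import numpy as np

    def entropy(labels):
        # H(Y) = -sum_i p(y_i) * log2 p(y_i)
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def information_gain(x, y):
        # IG(X) = H(Y) - H(Y|X), with H(Y|X) the value-weighted entropy
        values, counts = np.unique(x, return_counts=True)
        cond = sum((c / len(x)) * entropy(y[x == v]) for v, c in zip(values, counts))
        return entropy(y) - cond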
Further, the Relief algorithm evaluates features by their ability to discriminate between nearby instances: a good feature should bring same-class instances close together and keep different-class instances far apart. Circles and triangles represent the two classes of instances. The algorithm randomly selects an instance R from the training set D, then finds the nearest-neighbor instance H among the instances of the same class, called the Nearest Hit, and the nearest-neighbor instance M among the instances of the other class, called the Nearest Miss. Then, for each feature dimension, if the distance between R and H on that feature is smaller than the distance between R and M, the feature is helpful for separating same-class and different-class nearest neighbors, and its weight is increased; conversely, the feature hinders that separation, and its weight is decreased. The weight update formula is as follows:
Weight[A] = Weight[A] - diff(A, R, H)/m + diff(A, R, M)/m
where A = 1...N, N is the number of attributes, m is the number of iterations, and diff(A, R, H) represents the distance between instances R and H on attribute A;
The above process is repeated m times, and what is finally obtained is the average weight of each feature: the larger a feature's weight, the stronger its classification ability; conversely, the weaker its classification ability.
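A minimal Python sketch of the two-class Relief loop described above, under the assumption that the features are numeric and scaled to [0, 1] so that diff reduces to an absolute difference:

    import numpy as np

    def relief(X, y, m=100, seed=0):
        rng = np.random.default_rng(seed)
        n, num_feats = X.shape
        w = np.zeros(num_feats)
        for _ in range(m):
            r = rng.integers(n)                      # random instance R
            dist = np.abs(X - X[r]).sum(axis=1)
            dist[r] = np.inf                         # exclude R itself
            h = np.where(y == y[r], dist, np.inf).argmin()   # nearest hit H
            mi = np.where(y != y[r], dist, np.inf).argmin()  # nearest miss M
            w += (-np.abs(X[r] - X[h]) + np.abs(X[r] - X[mi])) / m
        return w  # larger weight = stronger classification ability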
Further, the wrapper method was proposed by Kohavi. It treats the learning algorithm as a black box and selects feature subsets according to the learning algorithm's results; the learning algorithm is itself the evaluation function for feature subset selection, and different search strategies produce different feature subsets. A search comprises four elements: state space, initial state, termination condition and search strategy. Kohavi represents each state of the search space as a feature subset: for n features, each state has n bits, each bit indicating whether one feature is present, with 1 meaning present and 0 meaning absent. Operators determine the partial order between states; the operators Kohavi chose are adding or deleting an attribute. With n features, the search space is O(2^n), so searching the whole space exhaustively is impractical, and different search strategies are therefore needed.
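The wrapper idea can be sketched in Python with greedy (hill-climbing) forward selection; a decision tree and 5-fold cross-validated accuracy stand in for the black-box learner and its evaluation function, and best-first search would additionally keep a queue of visited states to back up to:

    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    def wrapper_forward_select(X, y, max_feats=10):
        selected, best_score = [], 0.0
        remaining = list(range(X.shape[1]))
        while remaining and len(selected) < max_feats:
            # evaluate every one-feature extension of the current subset
            scores = {f: cross_val_score(DecisionTreeClassifier(),
                                         X[:, selected + [f]], y, cv=5).mean()
                      for f in remaining}
            f, score = max(scores.items(), key=lambda kv: kv[1])
            if score <= best_score:
                break  # no improvement: local optimum reached
            selected.append(f)
            remaining.remove(f)
            best_score = score
        return selected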
Further, accuracy and error rate are the usual classifier performance measures, but both are sensitive to class imbalance and overly biased towards the majority class. With a positive-to-negative class ratio of 5%:95%, even a classifier that assigns every instance to the negative class reaches 95% accuracy, while every positive instance is misclassified. Accuracy (Acc) and error rate (Err) are expressed as follows (with TP, TN, FP, FN the entries of the confusion matrix):
Acc = (TP + TN) / (TP + FN + FP + TN)
Err = (FP + FN) / (TP + FN + FP + TN) = 1 - Acc
From the confusion matrix, precision, recall (the true positive rate) and other measures can also be computed. The F-measure combines precision and recall; a higher F-measure means the classifier performs better on the positive class;
The G-mean proposed by Kubat et al. is the geometric mean of the prediction accuracies on the positive and negative classes, G-mean = sqrt(TPrate × TNrate); G-mean is an important measure for avoiding overfitting to the negative class;
An ROC curve is the locus of points (FPrate, TPrate); each point on the ROC curve corresponds to a classifier model. The point (0,0) represents a model that predicts every instance as negative; the point (1,1) a model that predicts every instance as positive; and the point (0,1) the ideal model, which classifies all positive instances as positive and all negative instances as negative. When an ROC curve is drawn, the y-axis represents the true positive rate and the x-axis the false positive rate. A classifier applied to a test set produces a confusion matrix and hence a true positive rate (TPrate) and a false positive rate (FPrate), which correspond to one point in ROC space; different classifiers correspond to a set of points in ROC space, and connecting these points gives an ROC curve. The area under this curve is the AUC (area under ROC curve). If the classifier outputs a score f(x) on instance x, the corresponding AUC can be computed with the following formula:
AUC = ( Σ over positive instances x+ Σ over negative instances x- I( f(x+) > f(x-) ) ) / (N+ × N-)
where I(·) is the indicator function, N+ represents the number of positive instances, and N- the number of negative instances.
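The pairwise formula translates into a short Python sketch; ties are counted as 1/2, a common convention assumed here:

    import numpy as np

    def auc_from_scores(scores, y):
        # AUC = P(f(x+) > f(x-)) over all positive/negative instance pairs
        pos = scores[y == 1]
        neg = scores[y == 0]
        greater = (pos[:, None] > neg[None, :]).sum()
        ties = (pos[:, None] == neg[None, :]).sum()
        return (greater + 0.5 * ties) / (len(pos) * len(neg))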
The advantages of the invention are:
1) Based on UCI data sets, the influence of the order of feature selection and sampling on classification performance is compared. Preprocessing is one method of solving the high-dimensional imbalanced classification problem, and whether to reduce the features first (feature selection) or balance the data first (sampling) is the primary question it faces. Studies in a specific application area showed that classification after sampling first and feature selection second is better; because the experimental data used were from a single source, this conclusion cannot be generalized. Studies across multiple application domains then indicated that the processing order is not a key factor; but because artificial noise interference was introduced, that conclusion does not apply to noise-free settings. Here, on 12 UCI data sets from different application domains, using a more complete preprocessing experiment strategy, a different conclusion is obtained: before classifying high-dimensional imbalanced data, reducing the features first and rebalancing the data afterwards yields better average AUC performance. This conclusion provides practical guidance for applied research.
2) Borrowing the idea of random forest variable selection, a new imbalanced-data feature selection algorithm, BRFVS, is designed here. Feature selection first necessarily faces data imbalance, and algorithms for feature selection on imbalanced data are currently scarce. The existing EFSBS algorithm does not make full use of classifier feedback; PREE uses classification performance feedback but cannot handle discrete features. The BRFVS algorithm proposed here can handle both discrete and continuous features while making full use of classifier feedback. A performance study with different values of the BRFVS hyperparameter K shows that when K equals M, classification after BRFVS feature selection and sampling is substantially better than classification after ordinary feature selection and sampling, confirming once more that feature selection first is the better order.
3) Considering the dual costs of misclassification and testing, a cost-sensitive random forest algorithm, CSRF, is presented here; adjusting the attention paid to the minority class through both the testing and the misclassification costs improves the minority-class recognition rate. Preprocessing may cause the loss of features or instances, whereas direct classification retains all of the data information; but existing high-dimensional classification algorithms cannot classify imbalanced data effectively, and imbalanced-data classification algorithms do not consider data with high-dimensional characteristics. Here the advantage of random forests in handling high-dimensional data is exploited, and the dual costs are introduced into the attribute split measure of its decision trees to handle imbalance better. Compared with the random forest that considers no cost and the one that considers only misclassification cost, CSRF has a clear advantage in AUC performance and especially in minority-class recognition rate; moreover, direct classification with CSRF also clearly outperforms classification after preprocessing.
4) Taking diversity, accuracy and imbalance into account, the Kohavi-Wolpert variance is used here as the diversity measure in ensemble feature selection, a reward-penalty factor is introduced into it to increase attention to the minority class, and the IEFS algorithm for direct classification of high-dimensional imbalanced data is proposed. The objective function of existing ensemble feature selection algorithms considers only diversity and accuracy, not imbalance. IEFS considers ensemble performance and imbalance handling in the ensemble feature selection objective function, designs a new objective function, and searches the feature space by hill climbing. Experimental results show that the method is slightly worse than CSRF in AUC classification performance but clearly better than C4.5 and random forests in AUC and minority-class recognition.
Brief description of the drawings
The present invention is described in detail below with reference to the accompanying drawings and specific embodiments:
Fig. 1 is a schematic diagram of the Relief method of the present invention;
Fig. 2 is a diagram of the state search space of the present invention;
Fig. 3 is a three-fold cross-validation accuracy assessment diagram of the present invention;
Fig. 4 is a diagram of the feature selection and data sampling scenarios of the present invention.
Embodiment
In order that the technical means, inventive features, objects and advantages of the present invention may be easily understood, the present invention is further expounded below with reference to the drawings and specific embodiments.
Ensemble classification for high-dimensional and imbalanced data: preprocessing strategies are reduced to two classes according to the order of dimensionality reduction and sampling; based on the principle of reproducibility of experimental conclusions, several standard data-mining and machine-learning data sets are chosen as experimental data; wrapper feature selection methods and over-sampling methods are added to the selection of preprocessing methods; and the influence of preprocessing methods on the classification performance of high-dimensional imbalanced data is studied with respect to both the number of attributes and the degree of imbalance;
There are two approaches to classifying high-dimensional imbalanced data: classification after preprocessing, and direct classification. At present the influence of high dimensionality and imbalance on classification is mainly mitigated with different combinations of preprocessing strategies. Khoshgoftaar and Shanab studied the influence of preprocessing on the classification performance of high-dimensional imbalanced data for, respectively, a specific domain (software quality) and data from multiple application domains (UCI). The research on the software quality data was based on a single kind of data set, so its conclusions do not generalize; the experimental study on the UCI data sets artificially introduced noise, so its conclusions do not apply to noise-free settings. The introduction of the noise factor prevents the experimental results from truly reflecting the actual effect of preprocessing on high-dimensional imbalanced classification.
Dimensionality reduction methods fall into two classes: feature selection and feature transformation. Feature selection methods are divided into filter and wrapper methods according to whether they are independent of the subsequent learning algorithm. Filter methods are independent of the subsequent learning algorithm and usually assess features directly from statistical properties of all the training data; they are fast, but their assessments can deviate considerably from the performance of the subsequent learning algorithm. Wrapper methods assess feature subsets with the training accuracy of the subsequent learning algorithm; their bias is small, but they are computationally expensive and unsuited to large data volumes. Feature transformation differs from feature selection in that its output is not a set of original attributes but new attributes produced by some transformation principle. Because the transformed attributes change the physical meaning of the original attributes, and most feature transformation methods only apply to continuous attributes, feature transformation is not considered here. Sampling methods comprise two kinds: under-sampling and over-sampling. Preprocessing uses dimensionality reduction methods together with sampling methods;
The assessment used by a dimensionality reduction method depends directly on the data set itself. It is generally believed that features or feature subsets with higher class relevance yield higher classification accuracy; common filter feature-selection assessment measures include between-class distance, information gain, degree of association and inconsistency. Kohavi, however, pointed out that although assessment measures that consider only the data set are computationally efficient, finding the features or feature subsets most relevant to the class and selecting the features or feature subsets that optimize classification accuracy are two different problems;
The core idea of wrapper feature selection is that filter feature evaluations, being independent of the learning algorithm, can deviate considerably from the subsequent classification algorithm. Different learning algorithms prefer different feature subsets, and the subset obtained by feature selection will ultimately be used by the subsequent learning algorithm, so the performance of that learning algorithm is the best evaluation criterion. Choosing different classification algorithms and feature-space search strategies produces a variety of wrapper feature selection algorithms; common search methods include best-first search, random search and heuristic search.
Sampling methods are a common class of preprocessing techniques that can alleviate imbalance in the data by balancing it through sampling. By sampling direction they divide into two classes: over-sampling (Over Sampling), which adds minority-class instances, and under-sampling (Under Sampling), which removes majority-class instances. By sampling strategy they divide into random and algorithmic classes: random sampling removes or adds instances on a random basis, while algorithmic sampling follows some principle, for example removing instances near the majority-class boundary or generating arbitrary new minority-class instances. In general, random sampling is the more common sampling means; algorithmic sampling can impose a certain bias on the change of the instance set, so random sampling is used to keep the problem simple.
The focus here is the combined experimental effect of feature selection and sampling, so on the principle of simplicity, simple, general and efficient algorithms are selected. Among the filter feature selection algorithms, the information gain algorithm and the Relief algorithm are chosen: the former because the subsequent classification algorithm is planned to be a decision tree, and information gain is itself the attribute selection measure of decision trees; the latter because Relief is generally acknowledged to be among the more effective filter algorithms at present. For wrapper algorithms, different search strategies construct different algorithms. Because Kohavi's experimental study showed that best-first search is better than greedy search (hill climbing), best-first search is selected here; in addition, random search can provide more accurate search results, so the genetic search of the basic genetic algorithm is considered at the same time;
Information gain (IG) is a measure commonly used in machine learning and information theory. When a class prediction is made and the value of a feature is known, IG measures the number of bits of information supplied about the class prediction. Information gain can be defined as the difference between the prior uncertainty and the expected posterior uncertainty. To compute the IG of an attribute X with respect to the class attribute Y, two quantities must be known: the uncertainty of the class label Y itself, and the uncertainty of Y once attribute X is considered. These two uncertainties are the entropy H(Y) and the conditional entropy H(Y|X):
H(Y) = -Σi p(yi) log2 p(yi)
H(Y|X) = Σj=1..r p(xj) H(Y | X = xj)
where r represents the number of values of attribute X. The IG of feature X is then defined as:
IG(X) = H(Y) - H(Y|X)
Here H(Y) represents the purity of the class attribute Y when feature X is not considered, and H(Y|X) the purity of Y after feature X is considered. If considering attribute X makes the division of Y purer, the feature is held to distinguish the classes effectively: the smaller the entropy, the higher the purity, so the attribute with the maximum information gain should be selected.
Referring to Fig. 1, the Relief algorithm evaluates features by their ability to discriminate between nearby instances: a good feature should bring same-class instances close together and keep different-class instances far apart. Circles and triangles represent the two classes of instances. The algorithm randomly selects an instance R from the training set D, then finds the nearest-neighbor instance H among the instances of the same class, called the Nearest Hit, and the nearest-neighbor instance M among the instances of the other class, called the Nearest Miss. Then, for each feature dimension, if the distance between R and H on that feature is smaller than the distance between R and M, the feature is helpful for separating same-class and different-class nearest neighbors, and its weight is increased; conversely, the feature hinders that separation, and its weight is decreased. The weight update formula is as follows:
Weight[A] = Weight[A] - diff(A, R, H)/m + diff(A, R, M)/m
where A = 1...N, N is the number of attributes, m is the number of iterations, and diff(A, R, H) represents the distance between instances R and H on attribute A;
The above process is repeated m times, and what is finally obtained is the average weight of each feature: the larger a feature's weight, the stronger its classification ability; conversely, the weaker its classification ability.
Referring to Fig. 2, the wrapper method was proposed by Kohavi. It treats the learning algorithm as a black box and selects feature subsets according to the learning algorithm's results; the learning algorithm is itself the evaluation function for feature subset selection, and different search strategies produce different feature subsets. A search comprises four elements: state space, initial state, termination condition and search strategy. Kohavi represents each state of the search space as a feature subset: for n features, each state has n bits, each bit indicating whether one feature is present, with 1 meaning present and 0 meaning absent. Operators determine the partial order between states; the operators Kohavi chose are adding or deleting an attribute. With n features, the search space is O(2^n), so searching the whole space exhaustively is impractical, and different search strategies are therefore needed.
The target of the search is to find, via the evaluation function, the state with the highest evaluation value. Because the true accuracy of the classifier is unknown, Kohavi estimates the classifier accuracy by 5-fold cross-validation as the evaluation function; Fig. 3 shows a three-fold cross-validation method for estimating accuracy.
Kohavi sets the initial state of the search to the empty set, i.e., forward selection. Experimental results show that best-first search is better than greedy search, so the experiments here are intended to use best-first search, whose essential idea is to always expand the currently most promising result. The genetic algorithm (Genetic Algorithm, GA) was first proposed by Holland and is an evolutionary learning method that simulates biological selection and reproduction.
A genetic algorithm encodes the problem to be solved as chromosomes composed of multiple genes and designs a fitness function to judge the quality of a chromosome and its survival probability: the higher a chromosome's fitness, the greater its chance of being selected for reproduction. The next generation of the population is obtained through selection, crossover and mutation, iterating until the termination condition is met. GA-Wrapper uses binary encoding: an individual is encoded as a binary string, each bit representing one feature, with 0 meaning the feature is not selected and 1 meaning it is selected. An individual thus represents one feature selection scheme, and the classifier's accuracy serves as the fitness function.
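A compact Python sketch of the GA-Wrapper loop just described, with binary-encoded individuals and cross-validated classifier accuracy as the fitness function; the population size, generation count and mutation rate are illustrative choices:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    def ga_wrapper(X, y, pop=20, gens=30, seed=0):
        rng = np.random.default_rng(seed)
        n = X.shape[1]
        P = rng.integers(0, 2, size=(pop, n))  # one bit per feature

        def fitness(ind):
            cols = np.flatnonzero(ind)
            if cols.size == 0:
                return 0.0
            return cross_val_score(DecisionTreeClassifier(), X[:, cols], y, cv=3).mean()

        for _ in range(gens):
            fit = np.array([fitness(ind) for ind in P])
            prob = fit / fit.sum() if fit.sum() > 0 else np.full(pop, 1.0 / pop)
            parents = P[rng.choice(pop, size=pop, p=prob)]  # roulette-wheel selection
            for i in range(0, pop, 2):                      # one-point crossover
                c = rng.integers(1, n)
                a, b = parents[i].copy(), parents[i + 1].copy()
                parents[i, c:], parents[i + 1, c:] = b[c:], a[c:]
            flip = rng.random(parents.shape) < 0.01         # bit-flip mutation
            P = np.where(flip, 1 - parents, parents)
        return np.flatnonzero(max(P, key=fitness))          # best feature subset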
Traditional classification algorithms assume a balanced class distribution, but real data often exhibit class imbalance, i.e., a skewed class distribution [153]. When classifying class-imbalanced data, the dominance of the majority class shifts the classification boundary towards it, and traditional algorithms face declining predictive ability on the minority class, which degrades overall prediction performance. In this situation, assessing classifier performance by plain accuracy or error rate is likely to be biased. Current evaluation methods for imbalanced-data classification algorithms include: accuracy, precision, recall, F-measure, G-mean, AUC, ROC curves, precision-recall curves and cost curves. The confusion matrix expresses the distribution of instance classes and is the basis for computing classifier performance measures, as shown in Table 1, where the positive class represents the minority class and the negative class the majority class.
Table 1
                    Predicted positive       Predicted negative
Actual positive     True Positives (TP)      False Negatives (FN)
Actual negative     False Positives (FP)     True Negatives (TN)
Further, accuracy and error rate are the usual classifier performance measures, but both are sensitive to class imbalance and overly biased towards the majority class. With a positive-to-negative class ratio of 5%:95%, even a classifier that assigns every instance to the negative class reaches 95% accuracy, while every positive instance is misclassified. Accuracy (Acc) and error rate (Err) are expressed as follows:
Acc = (TP + TN) / (TP + FN + FP + TN)
Err = (FP + FN) / (TP + FN + FP + TN) = 1 - Acc
From the confusion matrix, precision, recall (the true positive rate) and other measures can also be computed. The F-measure combines precision and recall; a higher F-measure means the classifier performs better on the positive class;
The G-mean proposed by Kubat et al. is the geometric mean of the prediction accuracies on the positive and negative classes, G-mean = sqrt(TPrate × TNrate); G-mean is an important measure for avoiding overfitting to the negative class;
An ROC curve is the locus of points (FPrate, TPrate); each point on the ROC curve corresponds to a classifier model. The point (0,0) represents a model that predicts every instance as negative; the point (1,1) a model that predicts every instance as positive; and the point (0,1) the ideal model, which classifies all positive instances as positive and all negative instances as negative. When an ROC curve is drawn, the y-axis represents the true positive rate and the x-axis the false positive rate. A classifier applied to a test set produces a confusion matrix and hence a true positive rate (TPrate) and a false positive rate (FPrate), which correspond to one point in ROC space; different classifiers correspond to a set of points in ROC space, and connecting these points gives an ROC curve. The area under this curve is the AUC (area under ROC curve). If the classifier outputs a score f(x) on instance x, the corresponding AUC can be computed with the following formula:
AUC = ( Σ over positive instances x+ Σ over negative instances x- I( f(x+) > f(x-) ) ) / (N+ × N-)
where I(·) is the indicator function, N+ represents the number of positive instances, and N- the number of negative instances.
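The confusion-matrix measures discussed above can be computed as follows; imbalance_metrics is an illustrative helper, with the positive class taken as the minority class as in Table 1 (each denominator is assumed non-zero):

    import math

    def imbalance_metrics(tp, fn, fp, tn, beta=1.0):
        precision = tp / (tp + fp)
        recall = tp / (tp + fn)           # TPrate
        tnrate = tn / (tn + fp)           # TNrate
        f_measure = ((1 + beta**2) * precision * recall
                     / (beta**2 * precision + recall))
        g_mean = math.sqrt(recall * tnrate)
        return {"precision": precision, "recall": recall,
                "F-measure": f_measure, "G-mean": g_mean}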
Preprocessing the data is the intuitive way to face the high-dimensional imbalance problem, but the influences of high dimensionality and imbalance interpenetrate in a data set. Should feature selection come first, or sampling? Is the order of preprocessing related to the dimensionality and imbalance characteristics of the data itself? Is preprocessing the optimal treatment at all? This series of questions drives researchers to confirmatory experiments on preprocessing. Many studies have separately verified the effect of various preprocessing methods on high-dimensional or on imbalanced classification, but only the Khoshgoftaar team has considered experiments on the treatment effect of preprocessing on data that are both high-dimensional and imbalanced. Analysis of their experimental content, however, reveals several problems: there is a certain redundancy in the experimental strategy; the selection of preprocessing methods is not representative enough; and noise is artificially introduced into the data sets, which, although it makes them resemble real data sets, introduces impure information into what is purely a high-dimensional imbalance problem, so that the treatment effect of preprocessing on this problem cannot be shown properly. On this basis, the experimental strategy, data sets and methods of the Khoshgoftaar team are improved here, in the expectation that a reasonable experimental setup will produce valid experimental results whose analysis can help the design of subsequent algorithms.
Khoshgoftaar addressed the problems of feature selection and class imbalance in software defect detection data with various combinations of feature selection and sampling techniques. According to the order of feature selection and sampling, and the training data set on which classification is based, four application scenarios were designed:
(1) feature selection based on the original data, model built on the original data;
(2) feature selection based on the original data, model built on the sampled data;
(3) feature selection based on the sampled data, model built on the original data;
(4) feature selection based on the sampled data, model built on the sampled data.
These four scenarios are shown in Fig. 4, and the experimental strategy used is shown in Table 2; a sketch of the four scenario pipelines follows the table:
Table 2
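A schematic Python sketch of the four scenarios, assuming placeholder callables select_features (returns column indices), sample (returns a rebalanced copy of the data) and train_model (fits a classifier) for the concrete methods compared in the experiments:

    def run_scenario(scenario, X, y, select_features, sample, train_model):
        Xs, ys = sample(X, y)                  # sampled version of the data
        if scenario in (1, 2):                 # S1/S2: select on original data
            cols = select_features(X, y)
        else:                                  # S3/S4: select on sampled data
            cols = select_features(Xs, ys)
        if scenario in (1, 3):                 # S1/S3: model on original data
            return train_model(X[:, cols], y)
        return train_model(Xs[:, cols], ys)    # S2/S4: model on sampled data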
The experimental results show:
(1) feature selection based on the sampled data is clearly better than feature selection based on the original data, i.e., S3 and S4 are significantly stronger than S1 and S2;
(2) whether the training data set uses the original data or the sampled data has little influence on the performance of the prediction model, i.e., the performance differences between S1 and S2 and between S3 and S4 are small. Accordingly, Khoshgoftaar holds that selecting the correct attribute set is extremely important in defect prediction.
Shanab then carried out a more generally applicable experiment on this basis. Among the scenarios, Shanab considered the first equivalent to performing feature selection alone and removed it, keeping the other three scenarios unchanged. The feature selection methods were expanded from 6 to 9, the sampling methods remained unchanged, the classification algorithms used were expanded from two to five, and the validation data sets were no longer limited to the software defect detection field but expanded to 7, covering gene expression data, internet data and image recognition data. The specific experimental strategy is shown in Table 3:
Table 3
The experimental results differ from Khoshgoftaar's conclusions. Scenarios S2 and S3 perform relatively better, and which of the two is higher depends on the evaluation index and classification algorithm used: by the PRC index, or with the MLP or LR algorithms, S3 is better; by the AUC index, or with the 5-NN or SVM algorithms, S2 is better. Analysing both experimental strategies and their results reveals the following problems:
1) The combination strategy of feature selection and sampling is unreasonable. Of the four scenarios, the training data sets obtained by S2 and S4 are low-dimensional balanced data sets, while those obtained by S1 and S3 are low-dimensional imbalanced data sets, yet the classification algorithms used are traditional ones. Thus the results obtained by S1 and S3 do not meet the final goal of preprocessing. The key of a combination strategy does not lie in whether the final data set is formed from the original data set or the sampled data set; the key should be the order of the feature selection method and the sampling method, whether feature selection is based on the original data set or the sampled data set, and whether sampling is based on the original data set or the data set after feature selection, or alternatively whether the two are carried out as two independent processes. Whichever strategy is used, since the classification algorithm is a traditional one, the goal of preprocessing should be to convert a data set with high-dimensional imbalanced characteristics into a low-dimensional balanced data set; yet of the four scenarios, only two end with a low-dimensional balanced data set.
2) The feature selection and sampling methods are incomplete.
The feature selection methods used in the above experiments are filter methods, i.e., feature selection is based on the data set alone and is unrelated to the chosen classification algorithm, and the sampling methods consider only under-sampling. Filter methods consider only the data set itself, but their effect varies across different classification algorithms; and under-sampling is a way of reducing the amount of data, so when the training data, or in other words the minority class, is small, such small samples will affect the classification. Therefore the selection of feature selection and sampling methods should be more complete.
3) The analysis of the experimental results is insufficient.
Khoshgoftaar's results show that sampling-first scenarios are better than feature-selection-first scenarios, and that whether the final training data set is formed on the original data or the sampled data matters little. This result can be understood as follows: if feature selection is performed first, the data set is still imbalanced while the feature selection method is a conventional one, which will affect the final result; if sampling is performed first, the imbalance is eliminated first, so the feature selection method can work at its full effect. Its conclusion therefore has a certain reasonableness. Shanab's results, however, show that S2 and S3 are better than S4, which is somewhat at odds with Khoshgoftaar's result: that is, sampling first or feature selection first makes little difference, but training on the original data set is better than training on the sampled data set. Why do similar strategies give somewhat different conclusions? Khoshgoftaar's experimental data sets come only from the software defect detection field, so their generality cannot be shown; and although Shanab's experimental scheme is more general, it considered noise in the experimental process, introducing more uncertainty. The reason training on the original data set is better than on the sampled data set is mainly that under-sampling reduces the size of the training data set; given that the data sets selected in Shanab's experiments and their minority classes are relatively small, classification on more data will be better.
The data sets used in the experiments all come from the public data-mining repository maintained by the machine learning group of the University of California, Irvine [159]. 142 data sets from classification tasks were screened according to the following principles:
1) screening by three attribute-count intervals: [25,100], [100,1000] and above 1000;
2) selecting data sets with an imbalanced class distribution;
3) selecting multi-class data sets with a balanced class distribution within the attribute intervals. The UCI data sets were chosen and screened by these three principles for the following reasons:
1) the UCI data sets are the recognized standard benchmark data sets of machine learning and data mining; the data sets used by a large number of mining algorithm studies and applications all originate here, and choosing UCI data sets allows later researchers to replay the experimental process;
2) the attribute intervals are set to compare the performance of data sets of different dimensionality under the various methods and to analyse how strongly the algorithms are affected by dimensionality. For data sets above 1000 dimensions, only those around 1000 dimensions were selected, because the preprocessing methods being compared are implemented on Weka, the algorithms in Weka have certain performance requirements, and data of too high a dimensionality belong to ultra-high-dimensional data, which lies outside the scope of this work;
3) class imbalance is on the one hand determined by the intrinsic rarity of some classes in the data, and on the other hand caused artificially for algorithm-processing reasons: for example, some algorithms suit only binary classification problems, so multi-class problems must be converted into binary ones, which is very likely to create an artificial class imbalance. Multi-class data sets that meet the attribute-selection intervals therefore also fall within the scope of the experimental data. Among the UCI classification data sets, 53 have more than 25 attributes, and according to the screening principles seven data sets were obtained from four application domains: image recognition, spam filtering, web applications and fault diagnosis. Some attributes in these data sets have missing values and some are multi-class problems, so the original data need corresponding treatment. The data sets and their treatment are explained below.
1) Steel plate defect data set (Steel Plates Faults): provided by the Semeion Research Centre of communication science, Italy. Each instance represents a surface defect of a stainless steel plate. There are seven different types of defect, each described by 27 attributes that capture the geometry and outline of the defect; there are 1941 instances [160]. Steel is a multi-class problem and was converted, according to its defect types, into seven binary data sets, one class being a specific defect and the other class all remaining defects, so that the classification goal becomes distinguishing a specific type of defect from the others. The first five classes are specific defect types, while the seventh denotes the other defects not among these five and occupies roughly more than 30% of the whole data set; treating it as a classification problem is of little significance, so it was deleted, giving 6 binary data sets. Because the imbalance factors of each pair among the 6 data sets are similar, namely (Dirtiness, Stains), (Pastry, Z_Scratch) and (K_Scratch, Bumps), it suffices to select 3 representative data sets. From each of the three pairs, taking C4.5 as the reference algorithm, the data set whose positive-class TPrate is slightly lower was selected as a final experimental data set: Dirtiness, Pastry and Bumps respectively.
2) Landsat satellite data set (Statlog Landsat Satellite): the Statlog data set is a sub-area of a satellite image containing 82*100 pixels. Each row of the data set corresponds to a 3*3 neighbourhood within this sub-area and holds the pixel values of the four spectral bands for the 9 pixels of the neighbourhood, forming 36 attributes. A pixel value is an 8-bit string, with 0 corresponding to black and 255 to white. The class label represents the category of the central pixel (7 categories): numerically, 1 represents red soil, 2 cotton crop, 3 grey soil, 4 damp grey soil, 5 soil with vegetation stubble, 6 a mixture class, and 7 very damp grey soil. Because class 6 has no instances, the data set actually has six classes. The rows are ordered randomly and some data lines have been removed, so the image cannot be reconstructed from the data. Similarly to the Steel data, the Statlog data were converted into binary data sets, forming 6 data sets in which the imbalance factors of each triple, (1,2,3) and (4,5,6), are similar. Following the same selection principle as for Steel, one data set was selected from each group, giving the final experimental data sets (3, 4).
3) Spam data set (Spambase): each row of the Spam data set indicates whether an e-mail is spam. Most attributes indicate whether particular words or characters occur frequently in the mail. The data set has 57 attributes: 48 are continuous attributes with range [0,100] recording the frequency with which particular words occur in the mail, 6 continuous attributes in [0,100] record the frequency with which particular characters occur, and the remaining three continuous attributes record different measures of the lengths of runs of consecutive capital letters in the mail (average, maximum and total). The last attribute is the class label stating whether the mail is spam: 0 means not spam and 1 means spam. The data set has 4601 instances, but some attributes have missing values; since the experimental data are abundant, instances with missing data were deleted, yielding an instance set with 1813 spam and 2788 non-spam e-mails.
4) Musk dataset (Musk): the Musk dataset describes a collection of 102 molecules, of which 39 were judged by human experts to be musks and the remaining 63 to be non-musks. The classification goal is to learn to predict whether a new molecule is a musk. Because chemical bonds can rotate, a single molecule can take on different shapes; generating all low-energy conformations of the molecules yields 6598 instances. Since the features depend on the exact shape or conformation of a molecule, 166 features were extracted to describe each conformation. In addition to the 166 features there are two further attributes, the molecule name and the conformation name; these two attributes cannot be used for classification and are therefore deleted.
5) Internet advertisement dataset (Internet Advertisements): this dataset contains 3279 instances representing images and text embedded in Web pages. The features include the dimensions of an image, phrases in the URL of the document or image, and the text occurring in or near the image's anchor tag; the class label indicates whether an image is an advertisement. The original class distribution is 2821 non-advertisement instances and 458 advertisement instances. However, some attributes have missing values in about 28% of the data; after deleting instances with missing data, 1978 non-advertisement and 381 advertisement instances remain.
6) Optical recognition of handwritten digits dataset (Optdigits): the Optdigits data come from the digits 0 to 9 handwritten by 43 people; the handwritten digits of 30 people form the training set and those of the remaining 13 form the test set. Preprinted handwritten digit images were extracted into bitmap form by the preprocessing programs provided by NIST. Each 32*32 bitmap is divided into non-overlapping 4*4 blocks and the number of set pixels in each block is counted, producing an 8*8 input matrix whose elements are integers in 0-16, and hence 64 attribute dimensions. The class label is the handwritten digit. Because the digits are evenly distributed, the problem is first converted into two-class problems in one of two ways: (1) take one digit as the positive class and the remaining digits as the negative class; (2) since the digits 2, 3, 5, 7 and 8 are easily confused, first delete all instances other than 2, 3, 5, 7 and 8, then take one of these digits as the positive class and the rest as the negative class. This would produce 15 datasets; following the earlier principle of selecting the dataset with the lower TPrate, the first scheme selects digit 8 as the positive class, forming dataset Optdigits A-8, and the second scheme selects digit 3 as the positive class, forming dataset Optdigits P-3.
7) Handwritten digit dataset (Semeion Handwritten Digit): the Semeion data come from 1593 handwritten digits written by about 80 people. The digits were stretched into 16*16 rectangles with a 256-level gray scale, and each pixel of each image was scanned with a fixed threshold into a Boolean value (pixel values up to 127 become 0, values above 127 become 1). Each person wrote the ten digits 0 to 9 twice, the first time in a normal style and the second time quickly. Each record of the dataset represents one handwritten digit, with the pixels as attributes. Following the same selection principle as for Optdigits, two datasets are selected, Semeion A-8 and Semeion P-3. After selection and processing, 12 datasets are finally obtained, and all the studies in the paper are based on these 12 datasets. The reasons for the selection can be summarized as follows: (1) multiple fields are involved; (2) both naturally imbalanced and artificially imbalanced cases are present, matching the characteristics of real-world data; (3) the UCI datasets are the generally acknowledged experimental benchmarks of data mining, so comparative experimental studies are reproducible and comparable; (4) the attribute counts of the datasets span three ranges and the imbalance degrees also cover different levels, meeting the need to analyze how changes of dimensionality and imbalance degree affect the algorithms. For these reasons, the 12 datasets are selected as the basis of the experimental study.
In view of the problems with the experiments discussed above, a more reasonable preprocessing strategy for high-dimensional imbalanced data is designed here. The choice of datasets fully accounts for different dimensionalities and imbalance degrees and covers several fields where high-dimensional imbalanced data commonly arise. Compared with the earlier experiments, the experimental scheme is more complete, with the aim of gaining a deeper understanding of how preprocessing methods help in classifying high-dimensional and imbalanced data.
There are two kinds of feature selection methods for high-dimensional data: Filter and Wrapper. Imbalance preprocessing methods generally refer to sampling methods, including under-sampling and over-sampling. There are therefore only two strategies for handling high dimensionality and imbalance: reduce the dimensionality first and rebalance afterwards, or rebalance first and reduce the dimensionality afterwards.
The strategies are divided according to the order of dimensionality reduction and sampling and the combinations of the various dimensionality reduction and sampling methods. For the specific methods, several effective and widely used techniques are adopted.
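To make the two orderings concrete, the following minimal Python sketch composes a Filter-style selector with random under-sampling in either order. It is an illustration only, not the patent's code: the mutual-information scorer from scikit-learn (an information-gain-style Filter criterion), the value k=10 and the binary 0/1 labels are assumptions chosen for brevity.

    import numpy as np
    from sklearn.feature_selection import SelectKBest, mutual_info_classif

    def random_undersample(X, y, rng):
        # Randomly delete majority-class instances until the classes balance.
        pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
        mino, majo = (pos, neg) if len(pos) < len(neg) else (neg, pos)
        kept = rng.choice(majo, size=len(mino), replace=False)
        idx = np.concatenate([mino, kept])
        return X[idx], y[idx]

    def preprocess(X, y, order="S1", k=10, seed=0):
        rng = np.random.default_rng(seed)
        selector = SelectKBest(score_func=mutual_info_classif, k=k)
        if order == "S1":                      # dimensionality reduction first
            X = selector.fit_transform(X, y)
            X, y = random_undersample(X, y, rng)
        else:                                  # rebalancing first
            X, y = random_undersample(X, y, rng)
            X = selector.fit_transform(X, y)
        return X, y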
Because decision tree algorithms are widely applicable in practice, and the base algorithm used in the paper's subsequent research is a decision tree, this experiment considers only the C4.5 classification algorithm. The main purpose of the combined feature selection and sampling experiments is to answer the following questions: (1) when classifying high-dimensional and imbalanced data, are preprocessing methods always effective? (2) when preprocessing high-dimensional imbalanced data, how should feature selection and sampling be ordered, and do different orders affect the classification results? (3) are the preprocessing methods affected by the number of attributes and the degree of imbalance?
Because filter-based attribute selection requires specifying the number of attributes to keep, this number is set following the choice of the parameter K proposed for the random forest algorithm: if the original number of attributes is M, the number of attributes retained after feature selection is √M. Wrapper methods select the corresponding subset according to their search termination condition. For classifier evaluation, the AUC commonly used in imbalanced data classification is chosen as the criterion. Because random sampling is used, the dataset obtained after each sampling differs, and so do the resulting classifiers; to obtain more reliable experimental conclusions, sampling is repeated five times to obtain five different sampled datasets, and the results produced on these datasets are averaged as the final experimental result.
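As an illustration of this repeated-sampling protocol, the sketch below averages the AUC over five sampled datasets. The holdout split and scikit-learn's entropy-based decision tree (standing in for C4.5, which scikit-learn does not provide) are assumptions; preprocess is any routine with the signature of the sketch above.

    from sklearn.metrics import roc_auc_score
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    def mean_auc(X, y, preprocess, n_repeats=5):
        aucs = []
        for seed in range(n_repeats):
            Xp, yp = preprocess(X, y, seed=seed)   # a different sample each run
            Xtr, Xte, ytr, yte = train_test_split(
                Xp, yp, test_size=0.3, stratify=yp, random_state=seed)
            clf = DecisionTreeClassifier(criterion="entropy").fit(Xtr, ytr)
            aucs.append(roc_auc_score(yte, clf.predict_proba(Xte)[:, 1]))
        return sum(aucs) / len(aucs)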
The header of the resulting experimental table contains three pieces of information: the dataset name, the minority-class proportion, and the AUC obtained by C4.5 without preprocessing. The leftmost column gives the preprocessing strategy used, where F1 denotes the IG method, F2 the Relief method, W1 best-first search, and W2 genetic search.
The results in Table 4 are analyzed from the following aspects:
1) Sample first or reduce dimensionality first? The performance of scenario S1 and scenario S2 is compared on the different datasets: for each dataset, the average AUC is recorded in three cases, namely no preprocessing, scenario S1 and scenario S2. Both S1 and S2 outperform classification on the raw data; that is, preprocessing high-dimensional imbalanced data does lift the performance of the classification algorithm. Only a few anomalies appear. In the Spam data, the class ratio is about 4:6, not perfectly balanced but also not markedly imbalanced, so preprocessing may have no positive effect and may even degrade classification performance. Another anomaly appears in the Opt data, where the average AUC actually declines after preprocessing. The Opt and Sta data use independent test sets, whereas the results on the other datasets are obtained with ten-fold cross-validation. Because the training data are preprocessed, the test data must be preprocessed correspondingly; the differences introduced by preprocessing may cause the test distribution to deviate from the training distribution, producing the anomalous results.
The order of sampling and dimensionality reduction can be considered from two aspects: the running efficiency of the algorithms, and the influence on the classification algorithm. Regarding efficiency, if over-sampling is used, reducing dimensionality after sampling greatly increases the running time of the program, especially when the Wrapper algorithms are executed; the Filter-class algorithms are markedly faster by comparison. Regarding the influence on classification, the results do not agree with Khoshgoftaar's report that sampling first is better than feature selection first: comparing scenarios S1 and S2 shows that on 6 datasets the average AUC of S1 is better than that of S2. Although the two differ little, the overall trend is that feature selection first outperforms sampling first. The difference is small because the classification accuracy under over-sampling is so high that the gap in average accuracy is minor.
2) Do different sampling and dimensionality reduction methods affect the subsequent classification algorithm? The experiments show that over-sampling generally raises the classification AUC, with a maximum AUC of 0.99, especially on the more imbalanced datasets. Over-sampling thus presents a very optimistic estimate of performance, but this may indicate overfitting: when the class ratio is severely imbalanced, over-sampling makes the classification algorithm appear almost perfect. The main reason is that sampling the minority class with replacement generates a large number of duplicated instances, and the high accuracy on these duplicates inflates the overall performance evaluation. For datasets whose imbalance ratio is not large, such as the Spam dataset, over-sampling may fail to deliver the expected performance and may instead degrade the classifier. Therefore, when the imbalance ratio of the data is large, over-sampling is not recommended. The classification performance with filter-based feature selection is generally better than with wrapper-based feature selection, while the performance differences among methods of the same type are small. It is worth noting that although best-first search and genetic search perform similarly, in most cases the feature subsets found by genetic search are no better than those found by best-first search, and genetic search is sometimes slightly worse.
3) How do the dimensionality of the data and the imbalance degree of the classes affect the preprocessing methods? In scenario S2, the processing time of the high-dimensional data is far higher than that of the low-dimensional data, especially when feature selection is performed after over-sampling. With filter-based feature selection, the result is a ranking of the features, and a fixed number of features determined from the original feature count is kept as the final feature count; wrapper-based feature selection instead determines the final feature count directly from the selected subset. The feature count of wrapper-based selection is therefore markedly larger than that of filter-based selection, and the experimental data show that wrapper-based feature selection outperforms filter-based feature selection in both scenario S1 and scenario S2. When the imbalance degree of the data is high, over-sampling easily causes overfitting; when the data are only slightly imbalanced, the influence of preprocessing is small, and preprocessing may even degrade classification performance. Summarizing the experimental results on the 12 datasets yields the following conclusions:
1) Performing feature selection first is slightly better than sampling first;
2) When the imbalance ratio of the data is large, under-sampling is recommended;
3) When the imbalance ratio of the data is small, preprocessing is not recommended;
4) In wrapper-based feature selection, more complex methods do not necessarily yield better results; for example, the results of GA search are not necessarily better than those of best-first search.
These conclusions are not absolute; they are simply the objective experimental results under the current settings. Different datasets and different algorithm parameter settings could lead to different conclusions. However, the experimental scheme of the paper was designed with care, and under similar settings these four conclusions retain definite value for concrete practice.
The general principles, principal features and advantages of the present invention have been shown and described above. Those skilled in the art should understand that the present invention is not limited to the above embodiments; the above embodiments and the description merely illustrate the principles of the invention, and various changes and improvements are possible without departing from the spirit and scope of the invention, all of which fall within the protection scope of the claimed invention. The protection scope of the invention is defined by the appended claims and their equivalents.

Claims (5)

1. An ensemble method for classifying high-dimensional and imbalanced data, characterized in that the preprocessing strategies are reduced to two classes according to the order of dimensionality reduction and sampling; based on the principle that experimental conclusions should be reproducible, several standard datasets of data mining and machine learning are chosen as experimental data; in the selection of preprocessing methods, wrapper-type (Wrapper) feature selection methods and over-sampling methods are added; and the influence of the preprocessing methods on the classification performance for high-dimensional imbalanced data is studied from the two aspects of attribute number and imbalance degree;
Dimensionality reduction methods fall into two classes, feature selection and feature transformation; feature selection methods divide into filter (Filter) and wrapper (Wrapper) types according to whether they are independent of the subsequent learning algorithm. Filter methods are independent of the subsequent learning algorithm and usually evaluate features directly from statistical properties of all the training data; they are fast, but their evaluation can deviate considerably from the performance of the subsequent learning algorithm. Wrapper methods evaluate feature subsets by the training accuracy of the subsequent learning algorithm; their bias is small but the computation is heavy, so they are unsuitable for large data volumes. Feature transformation differs from feature selection in that its output is not the original attributes but new attributes produced by some transformation principle; because the transformed attributes change the physical meaning of the original attributes, and some feature transformation methods apply only to continuous attributes, feature transformation methods are not considered here. Sampling methods include two kinds, under-sampling and over-sampling; the preprocessing uses both dimensionality reduction methods and sampling methods;
The evaluation of dimensionality reduction methods depends directly on the dataset itself. It is generally held that features or feature subsets with larger class correlation achieve higher classification accuracy, and common Filter feature selection evaluation measures include inter-class distance, information gain, degree of association and inconsistency; however, as Kohavi pointed out, although evaluation measures that consider only the dataset are computationally efficient, finding the features or feature subsets correlated with the class and selecting the features or feature subsets that optimize classification accuracy are two different problems;
Sampling methods are a class of commonly used preprocessing techniques; sampling can balance the data and thereby alleviate the imbalance problem. Sampling methods divide into two classes according to the sampling direction: over-sampling (Over Sampling) and under-sampling (Under Sampling); over-sampling adds minority-class instances while under-sampling removes majority-class instances. According to the sampling strategy they divide into random and algorithmic classes: random sampling deletes or adds instances at random, whereas algorithmic sampling follows some principle, such as deleting majority-class instances near the class boundary or generating minority-class instances arbitrarily; in general, random sampling is the more common means of sampling, and algorithmic sampling, which may offer some guidance for how the instance set is changed, serves to simplify the problem.
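As a minimal sketch of the random over-sampling direction named in this claim (under-sampling is symmetric, deleting majority-class instances instead of duplicating minority-class ones), with binary 0/1 labels assumed for illustration:

    import numpy as np

    def random_oversample(X, y, seed=0):
        # Duplicate minority-class instances, sampling with replacement,
        # until both classes have the same size.
        rng = np.random.default_rng(seed)
        pos, neg = np.where(y == 1)[0], np.where(y == 0)[0]
        mino, majo = (pos, neg) if len(pos) < len(neg) else (neg, pos)
        extra = rng.choice(mino, size=len(majo) - len(mino), replace=True)
        idx = np.concatenate([majo, mino, extra])
        return X[idx], y[idx]

The duplicated instances produced by such sampling with replacement are exactly what the experimental analysis above identifies as a source of over-optimistic accuracy estimates.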
2. The ensemble method for classifying high-dimensional and imbalanced data according to claim 1, characterized in that, since the study of dimensionality reduction focuses on the combined experimental effect of feature selection and sampling, simple, general and efficient algorithms are selected on the principle of simplicity. Among the Filter feature selection algorithms, the information gain algorithm and the Relief algorithm are selected: the former because the subsequent classification algorithm is planned to be a decision tree, and information gain is itself the attribute selection method of decision trees; the latter because the Relief algorithm is generally acknowledged to be among the more effective current Filter feature selection algorithms. The Wrapper algorithms are constructed by selecting different search strategies: best-first search is selected because Kohavi's experimental study showed it to be better than greedy search (hill climbing); in addition, since random search can provide accurate search results, the basic genetic algorithm is considered at the same time as the genetic search mode;
Information gain (IG) is a measure commonly used in machine learning and information theory: when predicting the class, given the value of a feature, IG measures the number of information bits provided about the class prediction. Information gain can be defined as the difference between the prior uncertainty and the expected posterior uncertainty. Computing the IG of a given attribute X with respect to the class attribute Y requires two quantities: the uncertainty of the values of the class label Y itself, and the uncertainty when attribute X is considered; these two uncertainties can be expressed as the entropy H(Y) and the conditional entropy H(Y|X) of Y:
H(Y) = -\sum_{i=1}^{k} P(Y = Y_i) \log_2 P(Y = Y_i)
H(Y \mid X) = \sum_{j=1}^{r} P(X = X_j) \, H(Y \mid X = X_j)
where r is the number of distinct values of attribute X, and the IG of feature X is defined as:
IG(X) = H(Y) - H(Y \mid X)
H(Y) represents the purity of attribute Y when feature X is not considered, and H(Y|X) represents the purity of Y after X is considered; if considering attribute X makes the partition of Y purer, the feature is deemed able to discriminate the classes effectively. The smaller the entropy, the higher the purity, so the attribute with the maximum information gain should be selected.
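The entropy and information-gain formulas above translate directly into a short numeric helper; treating the attribute as discrete-valued is the only assumption of this sketch:

    import numpy as np

    def entropy(y):
        _, counts = np.unique(y, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    def info_gain(x, y):
        # IG(X) = H(Y) - H(Y|X) for a discrete attribute x and class labels y.
        h_cond = sum(np.mean(x == v) * entropy(y[x == v]) for v in np.unique(x))
        return entropy(y) - h_cond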
3. The ensemble method for classifying high-dimensional and imbalanced data according to claim 1, characterized in that the Relief algorithm evaluates features according to their ability to distinguish nearby instances, holding that a good feature should bring similar instances close together and keep instances of different classes far apart. With circles and triangles representing the two classes of instances, the algorithm randomly selects an instance R from the training set D, then finds its nearest-neighbor instance H among the instances of the same class, called the Nearest Hit, and its nearest-neighbor instance M among the instances of the other class, called the Nearest Miss. Then, for each feature dimension, if the distance between R and H on that dimension is smaller than the distance between R and M, the feature is beneficial for distinguishing same-class and different-class nearest neighbors, and its weight is increased; conversely, the feature is unhelpful for distinguishing same-class and different-class nearest neighbors, and its weight is decreased. The weight update formula is as follows:
Weight[A] = Weight[A] - diff(A, R, H)/m + diff(A, R, M)/m
where A = 1...N, N is the number of attributes, m is the number of iterations, and diff(A, R, H) is the distance between instances R and H on attribute A;
The above process is repeated m times, finally yielding the average weight of each feature; the larger a feature's weight, the stronger its classification ability, and conversely, the weaker.
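A compact illustrative rendering of the Relief update follows, assuming numeric features scaled to [0, 1] (so that diff is the absolute difference) and at least two instances of each class:

    import numpy as np

    def relief(X, y, m=100, seed=0):
        rng = np.random.default_rng(seed)
        n, d = X.shape
        w = np.zeros(d)
        for _ in range(m):
            i = rng.integers(n)
            r = X[i]
            same = (y == y[i]); same[i] = False   # same class, excluding R
            diff = (y != y[i])                    # the other class
            dist = np.abs(X - r).sum(axis=1)      # L1 distance to R
            hit = X[np.where(same)[0][np.argmin(dist[same])]]    # Nearest Hit
            miss = X[np.where(diff)[0][np.argmin(dist[diff])]]   # Nearest Miss
            w += (np.abs(r - miss) - np.abs(r - hit)) / m
        return w   # larger weight = stronger discriminative power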
4. The ensemble method for classifying high-dimensional and imbalanced data according to claim 1, characterized in that the Wrapper method was proposed by Kohavi: the learning algorithm is treated as a black box, and feature subsets are selected using the results of the learning algorithm, so the learning algorithm is itself the evaluation function of feature subset selection, and different search strategies produce different feature subsets. A search comprises four elements: the state space, the initial state, the termination condition and the search strategy. In Kohavi's formulation, each state of the search space represents a feature subset; for n features each state has n bits, each bit representing whether a feature is present, with 1 meaning present and 0 meaning absent. Operators determine the partial order between the states, and the operators Kohavi selected are adding or deleting an attribute. With n features the search space is O(2^n), so searching the whole space exhaustively is impractical, and different search strategies are therefore needed.
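As a simple concrete instance of such a wrapper search (greedy forward selection rather than the best-first or genetic strategies used in the paper), the learner is queried as a black-box evaluator; the AUC scorer and 5-fold evaluation are illustrative assumptions:

    import numpy as np
    from sklearn.model_selection import cross_val_score
    from sklearn.tree import DecisionTreeClassifier

    def greedy_wrapper(X, y, max_features=10):
        learner = DecisionTreeClassifier(criterion="entropy")
        selected, best = [], -np.inf
        while len(selected) < max_features:
            # Score every state reachable by adding one attribute.
            candidates = [
                (cross_val_score(learner, X[:, selected + [j]], y,
                                 scoring="roc_auc", cv=5).mean(), j)
                for j in range(X.shape[1]) if j not in selected
            ]
            if not candidates:
                break
            score, j = max(candidates)
            if score <= best:          # terminate when no attribute improves
                break
            best = score
            selected.append(j)
        return selected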
5. The ensemble method for classifying high-dimensional and imbalanced data according to claim 1, characterized in that accuracy and error rate are the customary classifier performance measures, but both measures are sensitive to class imbalance and overly biased toward the majority class: with a positive-to-negative class ratio of 5%:95%, even if all instances are assigned to the negative class, the accuracy of the classifier reaches 95%, while every positive instance is misclassified. Accuracy (Acc) and error rate (Err) are expressed as follows:
Acc = \frac{TP + TN}{TP + FN + TN + FP}
Err = \frac{FP + FN}{TP + FN + TN + FP}
From the confusion matrix, precision and recall (the true positive rate) and other measures can also be computed; the F-measure combines precision and recall, and a higher F-measure means the classifier performs better on the positive class;
F\text{-}measure = \frac{(1 + \beta^2) \times recall \times precision}{\beta^2 \times recall + precision}
The G-mean proposed by Kubat et al. is the geometric mean of the prediction accuracies on the positive and negative classes; G-mean is an important measure for avoiding overfitting to the negative class;
G\text{-}mean = \sqrt{TPrate \times TNrate}
An ROC curve is the locus of points (FPrate, TPrate), and each point on an ROC curve corresponds to a classifier model: the point (0, 0) represents the model that predicts every instance as the negative class; the point (1, 1) represents the model that predicts every instance as the positive class; and the point (0, 1) is the ideal model, classifying all positive instances as positive and all negative instances as negative. When drawing an ROC curve, the y-axis represents the true positive rate and the x-axis the false positive rate. Applying a classifier to a test set produces a confusion matrix, and from it the true positive rate (TPrate) and false positive rate (FPrate), which correspond to one point in ROC space; different classifiers correspond to a set of points in ROC space, and connecting these points yields an ROC curve, the area under which is the AUC (area under ROC curve). Assuming the classifier outputs a score f(x) on instance x, the corresponding AUC can be computed by the following formula:
AUC = \frac{1}{N_+ N_-} \sum_{i=1}^{N_+} \sum_{j=1}^{N_-} I\left( f(x_i^+) > f(x_j^-) \right)
where I(·) is the indicator function, N+ is the number of positive instances and N- is the number of negative instances.
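The claim's G-mean and pairwise AUC formulas have a direct numpy rendering; 0/1 labels and real-valued scores are assumed:

    import numpy as np

    def g_mean(tp, fn, tn, fp):
        # Geometric mean of the positive- and negative-class accuracies.
        return np.sqrt((tp / (tp + fn)) * (tn / (tn + fp)))

    def auc_pairwise(scores, y):
        pos, neg = scores[y == 1], scores[y == 0]
        # Count the positive/negative pairs where the positive outscores.
        wins = (pos[:, None] > neg[None, :]).sum()
        return wins / (len(pos) * len(neg))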
CN201610218160.2A 2016-04-08 2016-04-08 Towards higher-dimension and unbalanced data classify it is integrated Pending CN107273387A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201610218160.2A CN107273387A (en) 2016-04-08 2016-04-08 Towards higher-dimension and unbalanced data classify it is integrated


Publications (1)

Publication Number Publication Date
CN107273387A true CN107273387A (en) 2017-10-20

Family

ID=60052504

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201610218160.2A Pending CN107273387A (en) 2016-04-08 2016-04-08 Towards higher-dimension and unbalanced data classify it is integrated

Country Status (1)

Country Link
CN (1) CN107273387A (en)


Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109800884B (en) * 2017-11-14 2023-05-26 阿里巴巴集团控股有限公司 Model parameter processing method, device, equipment and computer storage medium
CN109800884A (en) * 2017-11-14 2019-05-24 阿里巴巴集团控股有限公司 Processing method, device, equipment and the computer storage medium of model parameter
CN108182347A (en) * 2018-01-17 2018-06-19 广东工业大学 A kind of extensive cross-platform Classification of Gene Expression Data method
CN108231201B (en) * 2018-01-25 2020-12-18 华中科技大学 Construction method, system and application method of disease data analysis processing model
CN108231201A (en) * 2018-01-25 2018-06-29 华中科技大学 A kind of construction method, system and the application of disease data analyzing and processing model
CN108319987A (en) * 2018-02-20 2018-07-24 东北电力大学 A kind of filtering based on support vector machines-packaged type combined flow feature selection approach
CN108319987B (en) * 2018-02-20 2021-06-29 东北电力大学 Filtering-packaging type combined flow characteristic selection method based on support vector machine
CN108647138A (en) * 2018-02-27 2018-10-12 中国电子科技集团公司电子科学研究院 A kind of Software Defects Predict Methods, device, storage medium and electronic equipment
CN108509982A (en) * 2018-03-12 2018-09-07 昆明理工大学 A method of the uneven medical data of two classification of processing
CN108921222A (en) * 2018-07-05 2018-11-30 四川泰立智汇科技有限公司 A kind of air-conditioning energy consumption feature selection approach based on big data
CN109509014B (en) * 2018-09-06 2021-07-27 微梦创科网络科技(中国)有限公司 Media information delivery method and device
CN109509014A (en) * 2018-09-06 2019-03-22 微梦创科网络科技(中国)有限公司 A kind of put-on method and device of media information
CN109800790A (en) * 2018-12-24 2019-05-24 厦门大学 A kind of feature selection approach towards high dimensional data
US11475335B2 (en) 2019-04-24 2022-10-18 International Business Machines Corporation Cognitive data preparation for deep learning model training
CN111382273A (en) * 2020-03-09 2020-07-07 西安理工大学 Text classification method based on feature selection of attraction factors
CN111382273B (en) * 2020-03-09 2023-04-14 广州智赢万世市场管理有限公司 Text classification method based on feature selection of attraction factors
CN113852612A (en) * 2021-09-15 2021-12-28 桂林理工大学 Network intrusion detection method based on random forest
CN113852612B (en) * 2021-09-15 2023-06-27 桂林理工大学 Network intrusion detection method based on random forest
CN114882273A (en) * 2022-04-24 2022-08-09 电子科技大学 Visual identification method, device, equipment and storage medium applied to narrow space

Similar Documents

Publication Publication Date Title
CN107273387A (en) Towards higher-dimension and unbalanced data classify it is integrated
CN108898479B (en) Credit evaluation model construction method and device
Beheshtipour et al. Deep learning for clustering of continuous gravitational wave candidates
CN104834940A (en) Medical image inspection disease classification method based on support vector machine (SVM)
CN105760889A (en) Efficient imbalanced data set classification method
CN105069470A (en) Classification model training method and device
Basha et al. A review on imbalanced data classification techniques
CN106599935A (en) Three-decision unbalanced data oversampling method based on Spark big data platform
CN109255029A (en) A method of automatic Bug report distribution is enhanced using weighted optimization training set
CN105046323B (en) Regularization-based RBF network multi-label classification method
CN113537807B (en) Intelligent wind control method and equipment for enterprises
CN104850868A (en) Customer segmentation method based on k-means and neural network cluster
CN112102322A (en) Fault identification method based on multi-mode U-Net
CN108647729A A kind of user's portrait acquisition methods
CN110288460A (en) Collection prediction technique, device, equipment and storage medium based on propagated forward
Pang et al. Improving deep forest by screening
CN114037001A (en) Mechanical pump small sample fault diagnosis method based on WGAN-GP-C and metric learning
CN111292182A (en) Credit fraud detection method and system
CN116168270A (en) Light gangue detection model and method based on parallel depth residual error network
CN111144453A (en) Method and equipment for constructing multi-model fusion calculation model and method and equipment for identifying website data
Chen Credit default risk prediction of lenders with resampling methods
CN103207893A (en) Classification method of two types of texts on basis of vector group mapping
CN113792141A (en) Feature selection method based on covariance measurement factor
Xiao et al. Cost-sensitive semi-supervised ensemble model for customer churn prediction
WO1992017853A2 (en) Direct data base analysis, forecasting and diagnosis method

Legal Events

Date Code Title Description
PB01 Publication
TA01 Transfer of patent application right

Effective date of registration: 20180420

Address after: 200000 Pudong New Area, Shanghai, China (Shanghai) free trade pilot area, 707 Zhang Yang road two West.

Applicant after: Shanghai wind newspaper Mdt InfoTech Ltd

Address before: 200000 F East 2-G365 room, 310 Yue Luo Road, Baoshan District, Shanghai.

Applicant before: SHANGHAI BOSON DATA TECHNOLOGY CO., LTD.

WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20171020