CN107608938B - Factor screening method for binary classification based on enhanced regression tree algorithm

Info

Publication number: CN107608938B
Application number: CN201710670847.4A
Authority: CN
Inventors: 支俊俊
Original assignee: Anhui Normal University
Current assignee: Anhui Normal University
Priority date: 2017-08-08
Filing date: 2017-08-08
Publication date: 2020-12-08
Anticipated expiration: 2037-08-08
Also published as: CN107608938A

Abstract

The invention discloses a factor screening method for binary classification based on an enhanced regression tree algorithm, comprising the following steps: (1) collecting data and establishing a target variable-prediction factor data set; (2) building a model with the enhanced regression tree algorithm from the target variable and all factors, then calculating and ranking the importance of the factors; (3) performing correlation analysis on all factors, analyzing the Pearson correlation matrix and screening the factors; (4) building a new model with the enhanced regression tree algorithm from the target variable and the retained factors, calculating the prediction deviation and the importance of the factors, ranking the factors, and eliminating the least important factor, repeating until the number of retained factors is less than or equal to 2; (5) comparing the prediction deviations of the enhanced regression tree models built in step (4), and taking all factors used by the model with the minimum prediction deviation as the optimal factor combination. The invention establishes a quantitative factor selection system, gives reliable results and has a wide field of application.

Description

Factor screening method for binary classification based on enhanced regression tree algorithm

Technical Field

The invention relates to the technical field of factor screening, and in particular to a factor screening method for binary classification based on an enhanced regression tree algorithm. The method is suitable for fields such as agriculture, environment, ecology, hydrology, medical geography (for example epidemiology), disaster early warning and forecasting, and meteorology (for example weather forecasting).

Background

Factor screening is the first problem to be solved when studying binary classification target variables in fields such as agriculture, environment, ecology, hydrology, medical geography (for example epidemiology), disaster early warning and forecasting, and meteorology (for example weather forecasting). In the past, the correlation coefficient method and stepwise regression analysis were most commonly used. The correlation coefficient method performs correlation analysis on all factors and eliminates highly correlated factors, but the choice of which factor to eliminate from a highly correlated group is entirely subjective. One limitation of stepwise regression analysis is that it assumes in advance that a single optimal factor subset exists, whereas often there is no unique optimal subset; another limitation is that unreasonable subsets may be obtained when the factors are highly correlated. In recent years, researchers in China and abroad have tried a number of newer factor screening methods, mainly principal component analysis, cluster analysis, factor analysis, discriminant analysis and methods based on fuzzy mathematics.
However, these methods have their own limitations. Principal component analysis requires that the cumulative contribution rate of the first few extracted principal components reach a high level; the extracted principal components are often hard to name and interpret clearly; and when the factor loadings of a principal component have mixed positive and negative signs, the meaning of the comprehensive evaluation function is ambiguous. Cluster analysis places high demands on the multivariate normality and homogeneity of variance of the variables, and a clustering conclusion is difficult to obtain when the sample size is large. Factor analysis has specific requirements on the amount and composition of the data, and it uses the least squares method to calculate factor scores, which may fail in some cases. Discriminant analysis is not suited to cases where multicollinearity exists among the factors. Methods based on fuzzy mathematics involve a degree of subjectivity in determining the index weight vector. The common shortcoming of the existing methods is that none of them provides a quantitative factor screening method that is suitable for various data types while ensuring that the information content of the original factors is not lost.

Disclosure of Invention

The invention aims to provide a quantitative factor screening method for binary classification based on an enhanced regression tree algorithm that is suitable for various data types, ensures that the information content of the original factors is not lost, and effectively handles multicollinearity among the factors.

The technical scheme of the invention is as follows:

A factor screening method for binary classification based on an enhanced regression tree algorithm, characterized in that the method comprises the following steps:

(1) collecting target variables and prediction factors for binary classification, and establishing a target variable-prediction factor data set;

(2) establishing an enhanced regression tree model with the enhanced regression tree algorithm based on the target variable and all the prediction factors, and calculating and ranking the importance of each prediction factor;

(3) performing correlation analysis on all the prediction factors, analyzing the Pearson correlation matrix and screening: for each factor combination in which the absolute value of the Pearson correlation coefficient is greater than or equal to 0.80, retaining the factor with the greatest importance as calculated in step (2) and rejecting all other factors in the combination;

(4) establishing a new enhanced regression tree model with the enhanced regression tree algorithm based on the target variable and the retained factors, calculating the prediction deviation, calculating and ranking the importance of the factors, and eliminating the factor with the minimum importance; if more than 2 factors remain after the elimination, repeating this step with the target variable and the remaining factors until the number of retained factors is less than or equal to 2;

(5) comparing the prediction deviations of the enhanced regression tree models from step (4) (a new enhanced regression tree model is established each time a factor is removed in step (4), so there are several models), and taking all prediction factors used by the enhanced regression tree model with the minimum prediction deviation as the optimal prediction factor combination.

As a further improvement of the technical scheme of the invention:

The enhanced regression tree model established in step (2) is run 100 times, and the importance of each prediction factor is the mean of the 100 calculated results.

The prediction deviation of each enhanced regression tree model established in step (4) is calculated by ten-fold cross-validation; the model is run 100 times, and the mean of the 100 calculated results is taken as the prediction deviation of the model.
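The repeated ten-fold cross-validation described above can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes that "prediction deviation" refers to the mean Bernoulli deviance, which the text does not spell out, and `fit_predict` is a hypothetical callable standing in for the enhanced regression tree model. All function names are illustrative.

```python
import math
import random


def bernoulli_deviance(y_true, p_pred, eps=1e-12):
    """Mean Bernoulli deviance: -2 * mean(y*log(p) + (1-y)*log(1-p))."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -2.0 * total / len(y_true)


def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def repeated_cv_deviance(fit_predict, X, y, k=10, repeats=100, seed=0):
    """Average the k-fold cross-validated deviance over several repeats.

    fit_predict(X_train, y_train, X_test) -> list of probabilities.
    It is a hypothetical stand-in for fitting and applying the model.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        fold_scores = []
        for test_idx in kfold_indices(len(y), k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            p = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in test_idx])
            fold_scores.append(
                bernoulli_deviance([y[i] for i in test_idx], p))
        scores.append(sum(fold_scores) / k)
    return sum(scores) / repeats
```

Averaging over repeats smooths out the randomness of both the fold assignment and the stochastic subsampling inside the model, which is the stability argument the text makes.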

The invention has the beneficial effects that:

1. On the basis of correlation analysis, highly correlated factors are removed according to the importance of each factor calculated by the enhanced regression tree algorithm, and the factors contributing least to the model are then removed step by step using the same algorithm. This effectively overcomes the subjectivity of existing factor screening methods, handles multicollinearity among the factors without loss of information, and identifies the key factors influencing the target variable. Various data types (both continuous and discrete) can be used, and the data are not required to follow a normal distribution.

2. The method further improves the stability of the algorithm by running the enhanced regression tree model repeatedly and taking the mean. It is accurate, quantitative, easy to operate and widely applicable, and can be used for factor screening for binary classification in fields such as agriculture, environment, ecology, hydrology, medical geography (for example epidemiology), disaster early warning and forecasting, and meteorology (for example weather forecasting).

Drawings

Fig. 1 is a schematic flow chart of a basic implementation of the embodiment of the invention.

Fig. 2 is a schematic diagram of the Pearson correlation matrix in an embodiment of the invention.

Fig. 3 is a diagram comparing the prediction deviations of the enhanced regression tree models in an embodiment of the invention.

Detailed Description

The enhanced regression tree algorithm (boosted regression trees) in this embodiment is implemented with the widely used gbm software package on the R software platform (https://www.r-project.org/). The method is explained in detail using, as an example, turf layer data from the Qilian Mountains region (point data, used as the target variable; the turf layer, also translated as grass mat layer, is one of the diagnostic horizons in the Chinese soil taxonomic classification) and environmental factor data (raster data, used as the prediction factors).

Referring to Fig. 1, this embodiment provides a factor screening method for binary classification based on an enhanced regression tree algorithm, which comprises the following steps:

1. Collecting the target variable and the prediction factors for the binary classification of the turf layer, and establishing a target variable-prediction factor data set.

The turf layer data (target variable) of this embodiment come from the key project of the National Natural Science Foundation of China "Digital mapping of key soil properties in the Heihe River Basin" (41130530). There are 128 turf layer samples in total, of which 54 are turf layer (value 1 in the binary classification) and 74 are non-turf layer (value 0 in the binary classification). The prediction factor data come from the "Heihe River Basin eco-hydrological process integrated remote sensing observation joint experiment" (http://westdc.westgis.ac.cn) and comprise remote sensing data (30 m resolution, Landsat 5 TM), terrain data (30 m resolution, ASTER GDEM) and climate data (spatial distribution maps at 1 km resolution, including temperature and precipitation). The remote sensing, terrain and climate layers (the climate data resampled to 30 m resolution) were geometrically corrected on the ArcMap 9.3 software platform using the Georeferencing tool, and the remote sensing and climate prediction factors were extracted; the terrain prediction factors were extracted using SAGA (System for Automated Geoscientific Analyses). A total of 26 prediction factors were extracted and numbered V1 to V26. Using the ArcMap spatial analysis function (the Extract Values to Points tool), the 26 prediction factor values corresponding to the 128 sampling points were extracted, and the turf layer values at the sampling points (0 or 1) and the corresponding 26 prediction factor values were combined into one csv file.

2. Modeling with the enhanced regression tree algorithm, and calculating and ranking the importance of each prediction factor.

A model is built with the enhanced regression tree algorithm from the turf layer sample values (target variable) and all prediction factors (the csv file from step 1). The parameters of the enhanced regression tree model are set as follows: the data distribution type (distribution) is set to "bernoulli" in this embodiment (for binary classification); the tree complexity is generally ≥ 2 and is set to 3 in this embodiment; the sampling rate (bagging fraction) is generally 0.50 to 0.75 and is set to 0.50 in this embodiment; the learning rate is adjusted so that the number of trees is at least 1000, and is set to 0.001 in this embodiment. The importance of each prediction factor is then calculated. The model is run 100 times, and the factors are ranked by the mean of the 100 calculated importance values of each factor.
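The averaging and ranking of factor importance over repeated runs can be sketched as below. This is only an illustration: the function name is hypothetical, the importance values are made up rather than taken from the embodiment, and each run is assumed to yield a mapping from factor name to importance in percent.

```python
def average_importance(runs):
    """Average per-factor importance over repeated runs, ranked descending.

    runs: list of dicts mapping factor name -> importance (%) for one run.
    """
    names = runs[0].keys()
    mean_imp = {name: sum(run[name] for run in runs) / len(runs)
                for name in names}
    return sorted(mean_imp.items(), key=lambda item: item[1], reverse=True)


# Made-up importances from three hypothetical runs (not data from the source).
runs = [
    {"V1": 10.0, "V2": 60.0, "V3": 30.0},
    {"V1": 14.0, "V2": 58.0, "V3": 28.0},
    {"V1": 12.0, "V2": 62.0, "V3": 26.0},
]
ranking = average_importance(runs)
```

In the embodiment the list would hold 100 runs rather than three.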

3. Performing correlation analysis, and removing factors that carry duplicated information according to the Pearson correlation coefficients and the factor importance.

Pearson correlation analysis was performed on all prediction factors using SPSS software; see Fig. 2 for the Pearson correlation matrix. For each factor combination in which the absolute value of the Pearson correlation coefficient is greater than or equal to 0.80, the factor with the greatest importance as calculated in step 2 is retained and all other factors in the combination are rejected. In this embodiment, the absolute values of the pairwise Pearson correlation coefficients among the four factors V4 (importance 17.9%), V13 (importance 15.2%), V14 (importance 0.1%) and V22 (importance 0.1%) are all greater than or equal to 0.80, so only V4, the most important of the four, is retained, and V13, V14 and V22 are removed. This step removes 11 factors and retains 15 factors whose pairwise Pearson correlation coefficients all have absolute values below 0.80.
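One way to express this screening rule in code is shown below. It is a sketch under an assumption: factors are visited in descending order of importance, and a factor is kept only if its absolute Pearson correlation with every factor already kept is below the threshold. The text does not state how overlapping combinations are resolved, so this greedy reading is only one interpretation. The function names and the toy data are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (otherwise the denominator is 0).
    """
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def screen_correlated(columns, importance, threshold=0.80):
    """Keep the most important factor of each highly correlated group.

    columns: dict factor name -> list of values
    importance: dict factor name -> importance from the full model
    """
    kept = []
    for name in sorted(columns, key=lambda n: importance[n], reverse=True):
        # Keep the factor only if it is weakly correlated with all kept ones.
        if all(abs(pearson(columns[name], columns[k])) < threshold
               for k in kept):
            kept.append(name)
    return kept


# Toy data (made up): B is nearly a multiple of A, C is unrelated.
columns = {
    "A": [1, 2, 3, 4, 5, 6],
    "B": [2, 4, 6, 8, 10, 13],
    "C": [3, 1, 4, 1, 5, 2],
}
importance = {"A": 17.9, "B": 15.2, "C": 5.0}
kept = screen_correlated(columns, importance)
```

Here A and B are strongly correlated, so only the more important A survives alongside C.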

4. Eliminating the least important factor step by step based on the enhanced regression tree algorithm, and calculating the prediction deviation.

Based on the turf layer sample values (target variable) and the 15 retained factors, a new model is built with the enhanced regression tree algorithm (with the same parameter settings as in step 2), and the factor importance and the prediction deviation are calculated. The model is run 100 times, and the factors are ranked by the mean of the 100 calculated importance values of each factor. The factor with the minimum importance is then removed, and this step is repeated with the turf layer sample values (target variable) and the remaining factors until only 2 factors are retained.
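The stepwise elimination loop can be sketched as follows. `evaluate` is a hypothetical callable that stands in for fitting the enhanced regression tree model on a factor subset and returning the averaged importances and prediction deviation. Whether the final two-factor subset is itself modeled is not entirely clear from the text; this sketch evaluates it as well, which is an assumption.

```python
def backward_eliminate(factors, evaluate, min_factors=2):
    """Drop the least important factor step by step, recording each model.

    evaluate(subset) -> (importance dict, prediction deviation).
    It is a hypothetical stand-in for fitting and scoring the model.
    Returns a list of (factor subset, prediction deviation) pairs.
    """
    current = list(factors)
    history = []
    while True:
        importance, deviation = evaluate(current)
        history.append((list(current), deviation))
        # Assumption: the smallest subset is evaluated before stopping.
        if len(current) <= min_factors:
            break
        weakest = min(current, key=lambda name: importance[name])
        current.remove(weakest)
    return history


# Toy evaluator with made-up numbers, only to exercise the loop.
TOY_IMPORTANCE = {"a": 5.0, "b": 3.0, "c": 2.0, "d": 1.0}


def toy_evaluate(subset):
    importance = {name: TOY_IMPORTANCE[name] for name in subset}
    deviation = 1.0 + abs(len(subset) - 3)  # smallest for three factors
    return importance, deviation


history = backward_eliminate(["a", "b", "c", "d"], toy_evaluate)
```

Each entry of `history` corresponds to one of the models whose prediction deviations are later compared.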

5. Comparing the prediction deviations to determine the optimal combination of prediction factors.

The mean prediction deviations over the 100 runs of each enhanced regression tree model from step 4 are compared (see Fig. 3), and all prediction factors used by the model with the minimum prediction deviation are taken as the optimal prediction factor combination. In this example, the prediction deviation of the enhanced regression tree model is smallest when the number of factors is 6, so those 6 factors are used as the optimal prediction factor combination for turf layer prediction (binary classification).
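The selection rule amounts to taking the minimum over the recorded models, as in the short sketch below. The history values are made up for illustration, and the function name is hypothetical. The text does not discuss ties; Python's `min` returns the first minimum it encounters, which here would be the larger subset if the list is in elimination order.

```python
def best_subset(history):
    """Return the (factor subset, mean prediction deviation) pair with
    the smallest deviation.

    history: list of (factor subset, mean prediction deviation) pairs.
    """
    return min(history, key=lambda entry: entry[1])


# Made-up example: one entry per model built during stepwise elimination.
history = [
    (["a", "b", "c", "d"], 1.20),
    (["a", "b", "c"], 1.05),
    (["a", "b"], 1.30),
]
best_factors, best_deviation = best_subset(history)
```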

This embodiment uses the screening of prediction factors for turf layer prediction (a binary classification target variable) only as an example; the method is equally applicable to screening prediction factors for binary classification target variables in other fields (such as agriculture, environment, ecology, hydrology, medical geography, disaster early warning and weather forecasting). By combining Pearson correlation analysis with a stepwise screening procedure based on the enhanced regression tree algorithm, the embodiment removes duplicated and redundant information. It thereby addresses problems common to existing factor screening methods: loss of information from the original prediction factors, specific requirements on the type of the original data (such as continuity or normal distribution), reliance on subjective human judgment, and difficulty in handling multicollinearity among the prediction factors. In addition, the importance of each factor and the prediction deviation of each model are obtained as means over repeated runs of the enhanced regression tree algorithm, so the factor screening result is stable and reliable.

The above description is only a preferred embodiment of the present invention. The protection scope of the invention is not limited to this embodiment, and all technical solutions that follow the principle of the invention fall within its protection scope. Those skilled in the art may make various modifications and adaptations without departing from the principles of the invention, and these are also intended to fall within its protection scope.

Claims

1. A factor screening method for binary classification based on an enhanced regression tree algorithm, characterized in that the method comprises the following steps:

(1) collecting the target variable and the prediction factors for the binary classification of the turf layer, and establishing a target variable-prediction factor data set for the turf layer; the target variable of the binary classification of the turf layer comprises turf layer and non-turf layer, wherein turf layer takes the value 1 in the binary classification and non-turf layer takes the value 0; the prediction factors comprise remote sensing prediction factors, climate prediction factors and terrain prediction factors;

(2) establishing an enhanced regression tree model with the enhanced regression tree algorithm based on the target variable and all the prediction factors of the turf layer, and calculating and ranking the importance of each prediction factor of the turf layer;

(3) performing correlation analysis on all prediction factors of the turf layer, analyzing the Pearson correlation matrix and screening: for each factor combination in which the absolute value of the Pearson correlation coefficient is greater than or equal to 0.80, retaining the factor with the greatest importance as calculated in step (2) and rejecting all other factors in the combination;

(4) establishing a new enhanced regression tree model with the enhanced regression tree algorithm based on the target variable and the retained factors of the turf layer, calculating the prediction deviation, calculating and ranking the importance of the factors, and removing the factor with the minimum importance; if more than 2 factors remain after the removal, repeating this step with the target variable and the remaining factors of the turf layer until the number of retained factors is less than or equal to 2;

(5) comparing the prediction deviations of the enhanced regression tree models from step (4), and taking all prediction factors of the turf layer used by the enhanced regression tree model with the minimum prediction deviation as the optimal prediction factor combination;

wherein the enhanced regression tree model established in step (2) is run 100 times, and the importance of each prediction factor of the turf layer is the mean of the 100 calculated results;