CN107608938B - Factor screening method for binary classification based on enhanced regression tree algorithm - Google Patents


Info

Publication number: CN107608938B
Authority: CN (China)
Prior art keywords: factors, factor, prediction, regression tree, importance
Legal status: Active
Application number: CN201710670847.4A
Other languages: Chinese (zh)
Other versions: CN107608938A (en)
Inventor: 支俊俊
Current assignee: Anhui Normal University
Original assignee: Anhui Normal University
Priority date: 2017-08-08
Filing date: 2017-08-08
Publication date: 2020-12-08

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a factor screening method for binary classification based on the enhanced regression tree algorithm, comprising the following steps: (1) collecting data and establishing a target variable-prediction factor data set; (2) building a model with the enhanced regression tree algorithm from the target variable and all factors, then calculating and ranking the importance of the factors; (3) performing correlation analysis on all factors, analyzing the Pearson correlation matrix, and screening the factors; (4) building a new model with the enhanced regression tree algorithm from the target variable and the retained factors, calculating its prediction deviation, calculating and ranking the importance of the factors, and eliminating the least important factor, repeating until the number of retained factors is less than or equal to 2; (5) comparing the prediction deviations of the enhanced regression tree models built in step (4) and taking the factors used by the model with the minimum prediction deviation as the optimal factor combination. The invention establishes a quantitative factor selection system, gives reliable results, and applies to a wide range of fields.

Description

Factor screening method for binary classification based on enhanced regression tree algorithm
Technical Field
The invention relates to the technical field of factor screening, and in particular to a factor screening method for binary classification based on the enhanced regression tree algorithm. It is applicable to fields such as agriculture, environment, ecology, hydrology, medical geography (e.g. epidemiology), disaster early warning and forecasting, and meteorology (e.g. weather forecasting).
Background
Factor screening is the first problem to solve when studying binary classification target variables in fields such as agriculture, environment, ecology, hydrology, medical geography (e.g. epidemiology), disaster early warning and forecasting, and meteorology (e.g. weather forecasting). In the past, the correlation coefficient method and stepwise regression analysis were most commonly used. The correlation coefficient method performs correlation analysis on all factors and eliminates highly correlated ones, but the choice of which factor to eliminate from a highly correlated group is entirely subjective. One limitation of stepwise regression analysis is that it assumes in advance that a single optimal factor subset exists, whereas often there is no unique optimal subset; another is that it may yield unreasonable subsets when the factors are highly correlated. In recent years, researchers in China and abroad have tried many new factor screening methods, mainly principal component analysis, cluster analysis, factor analysis, discriminant analysis, and methods based on fuzzy mathematics.
However, these methods all have limitations. Principal component analysis requires the cumulative contribution rate of the first few extracted principal components to reach a high level, the extracted components are hard to name and interpret clearly, and when the factor loadings of a principal component have mixed positive and negative signs, the meaning of the comprehensive evaluation function becomes ambiguous. Cluster analysis places high demands on the multivariate normality and homogeneity of variance of the variables, and a clustering conclusion is hard to reach when the sample size is large. Factor analysis has specific requirements on the amount and composition of the data, and the least squares method it uses to calculate factor scores may fail in some cases. Discriminant analysis is unsuitable when multicollinearity exists among the factors. Methods based on fuzzy mathematics involve some subjectivity in determining the index weight vector. The common shortcoming of the existing methods is that none provides a quantitative factor screening method that suits various data types while ensuring that the information in the original factors is not lost.
Disclosure of Invention
The invention aims to provide a quantitative factor screening method for binary classification based on the enhanced regression tree algorithm that suits various data types, ensures that the information in the original factors is not lost, and effectively resolves multicollinearity among the factors.
The technical scheme of the invention is as follows:
A factor screening method for binary classification based on the enhanced regression tree algorithm, comprising the following steps:
(1) collecting the target variable and prediction factors for the binary classification, and establishing a target variable-prediction factor data set;
(2) building an enhanced regression tree model with the enhanced regression tree algorithm from the target variable and all prediction factors, then calculating and ranking the importance of each prediction factor;
(3) performing correlation analysis on all prediction factors, analyzing the Pearson correlation matrix, and screening: for any group of factors whose absolute Pearson correlation coefficient is greater than or equal to 0.80, retaining the factor with the greatest importance as calculated in step (2) and removing all other factors in the group;
(4) building a new enhanced regression tree model with the enhanced regression tree algorithm from the target variable and the retained factors, calculating its prediction deviation, calculating and ranking the importance of the factors, and removing the least important factor; if more than 2 factors remain after the removal, repeating this step with the target variable and the remaining factors until the number of retained factors is less than or equal to 2;
(5) comparing the prediction deviations of the enhanced regression tree models from step (4) (a new model is built in step (4) each time a factor is removed, so there are several models), and taking the prediction factors used by the model with the minimum prediction deviation as the optimal prediction factor combination.
As a further improvement of the technical scheme of the invention:
The enhanced regression tree model established in step (2) is run repeatedly 100 times, and the importance of each prediction factor is the average of the values calculated in the 100 runs.
The prediction deviation of each enhanced regression tree model established in step (4) is calculated by ten-fold cross-validation; the model is run repeatedly 100 times, and the average of the 100 results is taken as the prediction deviation of the model.
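The repeat-and-average scheme described here can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes that the "prediction deviation" corresponds to the Bernoulli deviance, and `fit_predict` is a hypothetical hook standing in for whatever boosted regression tree fit is used. The function names and the `base_rate_model` stand-in are invented for the example.

```python
import math
import random
from statistics import mean


def bernoulli_deviance(y_true, p_pred, eps=1e-12):
    """Mean Bernoulli deviance: -2 times the average log-likelihood of 0/1 labels."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -2.0 * total / len(y_true)


def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]


def repeated_cv_deviance(fit_predict, X, y, k=10, repeats=100, seed=0):
    """Average the k-fold cross-validated deviance over `repeats` runs."""
    rng = random.Random(seed)
    run_scores = []
    for _ in range(repeats):
        fold_scores = []
        for test_idx in kfold_indices(len(y), k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            probs = fit_predict(
                [X[i] for i in train_idx],
                [y[i] for i in train_idx],
                [X[i] for i in test_idx],
            )
            fold_scores.append(bernoulli_deviance([y[i] for i in test_idx], probs))
        run_scores.append(mean(fold_scores))
    return mean(run_scores)


def base_rate_model(train_X, train_y, test_X):
    """Hypothetical stand-in model: predict the training share of 1s for every row."""
    rate = sum(train_y) / len(train_y)
    return [rate] * len(test_X)


# Usage with placeholder data: 54 positive and 74 negative labels, dummy predictors.
y = [1] * 54 + [0] * 74
X = [[0.0] for _ in y]
score = repeated_cv_deviance(base_rate_model, X, y, k=10, repeats=5)
```

Averaging over repeated runs matters because each run reshuffles the folds (and a stochastic model would also resample its training data), so a single run gives a noisy estimate.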
The invention has the beneficial effects that:
1. Building on correlation analysis, the method removes highly correlated factors according to the importance of each factor calculated by the enhanced regression tree algorithm, and then uses the algorithm to remove, step by step, the factor contributing least to the model. This effectively avoids the subjectivity of existing factor screening methods, resolves multicollinearity among the factors without losing information, and identifies the key factors influencing the target variable. The method can use various data types (both continuous and discrete) and does not require the data to be normally distributed.
2. Averaging over repeated runs of the enhanced regression tree model further improves the stability of the algorithm. The method is accurate, quantitative, easy to operate, and widely applicable, and can be used for factor screening for binary classification in fields such as agriculture, environment, ecology, hydrology, medical geography (e.g. epidemiology), disaster early warning and forecasting, and meteorology (e.g. weather forecasting).
Drawings
Fig. 1 is a schematic flow chart of the basic implementation of the embodiment of the invention.
Fig. 2 is a schematic diagram of the Pearson correlation matrix in the embodiment of the invention.
Fig. 3 is a diagram comparing the prediction deviations of the enhanced regression tree models in the embodiment of the invention.
Detailed Description
The enhanced regression tree algorithm (boosted regression trees) in this embodiment uses the widely used gbm package on the R software platform (https://www.r-project.org/). The method is explained in detail using, as an example, grass mat layer data from the Qilian Mountains region (point data, used as the target variable; the grass mat layer is one of the diagnostic horizons in Chinese Soil Taxonomy) and environmental factor data (areal raster data, used as the prediction factors).
Referring to Fig. 1, the embodiment of the invention provides a factor screening method for binary classification based on the enhanced regression tree algorithm, which comprises the following steps:
1. Collect the target variable and prediction factors for the binary classification of the grass mat layer, and establish a target variable-prediction factor data set.
The grass mat layer data (target variable) of this embodiment come from the Key Project of the National Natural Science Foundation of China "Digital mapping of key soil properties in the Heihe River Basin" (41130530). There are 128 grass mat layer samples in total, of which 54 are grass mat layers (value 1 in the binary classification) and 74 are non-grass mat layers (value 0 in the binary classification). The prediction factor data come from the "Heihe River Basin joint experiment on integrated remote sensing observation of eco-hydrological processes" (http://westdc.westgis.ac.cn) and comprise remote sensing data (30 m resolution, Landsat 5 TM), terrain data (30 m resolution, ASTER GDEM), and climate data (spatial distribution maps at 1 km resolution, covering temperature and precipitation). On the ArcMap 9.3 software platform, the remote sensing, terrain, and climate data layers (resampled to 30 m resolution) were geometrically corrected with the Georeferencing tool, and the remote sensing and climate prediction factors were extracted; the terrain prediction factors were extracted with the System for Automated Geoscientific Analyses (SAGA). A total of 26 prediction factors were extracted, numbered V1 through V26. Using the ArcMap spatial analysis function (the Extract Values to Points tool), the 26 prediction factor values corresponding to the 128 sampling points were extracted, and the sampling point values of the grass mat layer (0 or 1) and the corresponding 26 prediction factor values were combined into one csv file.
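A combined csv file of this kind can be read back into a target vector and a predictor matrix as sketched below. The column name `target`, the function name `load_dataset`, and the sample values are assumptions made for illustration; the source does not state the file layout beyond one 0/1 value plus 26 factor values per sampling point.

```python
import csv
import io


def load_dataset(handle, target_column="target"):
    """Read a CSV with one 0/1 target column and numeric predictor columns."""
    reader = csv.DictReader(handle)
    factor_names = [name for name in reader.fieldnames if name != target_column]
    y, X = [], []
    for row in reader:
        label = int(row[target_column])
        if label not in (0, 1):
            raise ValueError(f"target must be 0 or 1, got {label}")
        y.append(label)
        X.append([float(row[name]) for name in factor_names])
    return factor_names, X, y


# Tiny made-up sample; the file described in the text would hold
# 128 rows and the columns V1 through V26.
sample = io.StringIO("target,V1,V2\n1,0.42,1530\n0,0.17,2875\n1,0.38,1610\n")
factor_names, X, y = load_dataset(sample)
```

Validating the label on load keeps a mislabelled row from silently entering a model that assumes a binary target.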
2. Build a model with the enhanced regression tree algorithm, then calculate and rank the importance of each prediction factor.
The model is built with the enhanced regression tree algorithm from the sample values of the grass mat layer (target variable) and all prediction factors (the csv file from step 1). The parameter settings of the model are: the data distribution type (distribution), set to "bernoulli" in this embodiment (for binary classification); the tree complexity, generally ≥ 2 and set to 3 in this embodiment; the sampling rate (bagging fraction), generally 0.50 to 0.75 and set to 0.50 in this embodiment; and the learning rate, adjusted so that the number of trees is at least 1000 and set to 0.001 in this embodiment. The importance of each prediction factor is then calculated. The model is run repeatedly 100 times, and the factors are ranked by the average of their 100 calculated importance values.
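The parameter settings above can be captured in a small configuration object, sketched here in Python. The class and function names are hypothetical, and two points are assumptions of this sketch rather than statements from the source: the "generally" ranges are enforced as hard limits, and the learning rate is chosen as the largest candidate that still yields at least 1000 trees. `optimal_trees` is a placeholder for a routine that fits a model at a given rate and reports how many trees it ended up using.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BrtSettings:
    distribution: str = "bernoulli"  # binary (0/1) target
    tree_complexity: int = 3         # text: generally >= 2
    bag_fraction: float = 0.50       # text: generally 0.50 to 0.75
    learning_rate: float = 0.001     # text: tuned so that trees >= 1000
    min_trees: int = 1000

    def __post_init__(self):
        # Assumption: treat the "generally" ranges from the text as hard limits.
        if self.tree_complexity < 2:
            raise ValueError("tree complexity is expected to be >= 2")
        if not 0.50 <= self.bag_fraction <= 0.75:
            raise ValueError("bag fraction is expected to lie in [0.50, 0.75]")
        if self.learning_rate <= 0:
            raise ValueError("learning rate must be positive")


def pick_learning_rate(optimal_trees, candidates=(0.05, 0.01, 0.005, 0.001), min_trees=1000):
    """Return the largest candidate rate whose fitted model uses >= min_trees trees."""
    for rate in sorted(candidates, reverse=True):
        if optimal_trees(rate) >= min_trees:
            return rate
    raise ValueError("no candidate learning rate reaches the minimum tree count")


# Usage with a made-up relationship between learning rate and tree count.
settings = BrtSettings()
chosen = pick_learning_rate(lambda rate: round(2.5 / rate))
```

A smaller learning rate makes each tree contribute less, so more trees are needed to reach the same fit; that is why the rate is lowered until the tree count passes the threshold.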
3. Perform correlation analysis, and remove factors carrying duplicated information according to the Pearson correlation coefficients and the factor importance.
Pearson correlation analysis was performed on all prediction factors using SPSS software; see Fig. 2 for the Pearson correlation matrix. For any group of factors whose absolute Pearson correlation coefficients are greater than or equal to 0.80, the factor with the greatest importance as calculated in step 2 is retained and all other factors in the group are removed. In this embodiment, the absolute Pearson correlation coefficients between every pair of the four factors V4 (importance 17.9%), V13 (importance 15.2%), V14 (importance 0.1%), and V22 (importance 0.1%) are all greater than or equal to 0.80, so only V4, the most important, is retained, and V13, V14, and V22 are removed. This step removes 11 factors and retains 15 factors whose pairwise absolute Pearson correlation coefficients are all less than 0.80.
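The screening rule in this step can be sketched in plain Python as follows. The source uses SPSS for the correlation matrix; this sketch computes Pearson coefficients directly instead. It also assumes one particular reading of "factor combination": factors are visited in descending importance, and a factor is kept only if it is not correlated at |r| ≥ 0.80 with any factor already kept. The column values and the importance of V7 are made up; only the importances of V4, V13, and V14 come from the text.

```python
import math


def pearson(a, b):
    """Pearson correlation coefficient of two equal-length numeric sequences.

    Note: a constant column has zero variance and would raise ZeroDivisionError.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)


def screen_correlated(columns, importance, threshold=0.80):
    """Keep a factor only if no already-kept, more important factor has |r| >= threshold."""
    kept = []
    for name in sorted(columns, key=lambda f: importance[f], reverse=True):
        if all(abs(pearson(columns[name], columns[other])) < threshold for other in kept):
            kept.append(name)
    return kept


# Made-up columns: V13 and V14 track V4 closely, V7 does not.
columns = {
    "V4": [1, 2, 3, 4, 5, 6],
    "V13": [2, 4, 6, 8, 10, 13],
    "V14": [6, 5, 4, 3, 2, 1],
    "V7": [2, 5, 1, 6, 3, 4],
}
importance = {"V4": 17.9, "V13": 15.2, "V14": 0.1, "V7": 5.0}
kept = screen_correlated(columns, importance)
```

When every pair in a group is correlated above the threshold, as with V4, V13, V14, and V22 in the text, this greedy pass keeps only the most important member of the group.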
4. Eliminate the least important factor step by step with the enhanced regression tree algorithm, and calculate the prediction deviation.
Based on the sample values of the grass mat layer (target variable) and the 15 retained factors, a new model is built with the enhanced regression tree algorithm (with the same parameter settings as in step 2), and the factor importance and the prediction deviation are calculated. The model is run 100 times, and the factors are ranked by the average of their 100 calculated importance values. The least important factor is removed, and this step is repeated with the target variable and the remaining factors until the number of retained factors is 2.
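The stepwise elimination loop can be sketched as below. `evaluate` is a hypothetical hook that would fit the boosted model on a factor subset and return its averaged prediction deviation together with the averaged factor importances; here it is replaced by a toy function with made-up numbers. The sketch follows the literal wording of the step (fit, remove the weakest factor, stop once 2 factors remain), so the last model fitted uses 3 factors; whether a final model is also fitted on the last 2 factors is not spelled out in the text.

```python
def backward_eliminate(factors, evaluate, min_factors=2):
    """Fit, record, drop the least important factor, and repeat.

    `evaluate(subset)` must return (prediction_deviation, {factor: importance}).
    Returns the list of (subset, deviation) records and the final retained factors.
    """
    retained = list(factors)
    if len(retained) <= min_factors:
        raise ValueError("need more than min_factors factors to start")
    history = []
    while True:
        deviation, importance = evaluate(retained)
        history.append((list(retained), deviation))
        weakest = min(retained, key=lambda f: importance[f])
        retained.remove(weakest)
        if len(retained) <= min_factors:
            break
    return history, retained


# Toy stand-in for the model fit: fixed importances and a made-up
# deviation curve that happens to be lowest at 3 factors.
STRENGTH = {"A": 5.0, "B": 4.0, "C": 3.0, "D": 2.0, "E": 1.0}


def toy_evaluate(subset):
    deviation = 1.0 + 0.1 * abs(len(subset) - 3)
    return deviation, {f: STRENGTH[f] for f in subset}


history, retained = backward_eliminate(["A", "B", "C", "D", "E"], toy_evaluate)
```

Recording the subset alongside its deviation at every pass is what makes the later comparison across models possible.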
5. Compare the prediction deviations to determine the optimal combination of prediction factors.
The average prediction deviations over the 100 runs of each enhanced regression tree model from step 4 are compared (see Fig. 3), and the prediction factors used by the model with the minimum prediction deviation are taken as the optimal prediction factor combination. In this embodiment, the prediction deviation is smallest when the number of factors is 6, so these 6 factors are used as the optimal combination of prediction factors for predicting the grass mat layer (binary classification).
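Choosing the final subset is a simple minimum over the recorded models, sketched below. The factor names and deviation values are made up, and breaking ties in favour of the smaller subset is an assumption of this sketch; the text does not say how ties are handled.

```python
def best_subset(history):
    """Return the (factors, deviation) record with the smallest deviation.

    Ties go to the smaller subset, which is an assumption of this sketch.
    """
    return min(history, key=lambda record: (record[1], len(record[0])))


# Made-up records of (factor subset, averaged prediction deviation).
history = [
    (["A", "B", "C", "D"], 0.95),
    (["A", "B", "C"], 0.91),
    (["A", "B"], 0.91),
]
factors, deviation = best_subset(history)
```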
This embodiment describes only the screening of prediction factors for predicting the grass mat layer (a binary classification target variable), but the method applies equally to screening prediction factors for binary classification target variables in other fields (such as agriculture, environment, ecology, hydrology, medical geography, disaster early warning and forecasting, and weather forecasting). By combining Pearson correlation analysis with a stepwise screening procedure based on the enhanced regression tree algorithm, the embodiment removes duplicated and redundant information. It thereby addresses the problems common to conventional factor screening methods: loss of information from the original prediction factors, specific requirements on the type of the original data (such as continuity or normal distribution), reliance on subjective human judgment, and difficulty in resolving multicollinearity among the prediction factors. In addition, the importance of each factor and the prediction deviation of each model are obtained by averaging over repeated runs of the enhanced regression tree algorithm, so the factor screening result is stable and reliable.
The above is only a preferred embodiment of the invention. The protection scope of the invention is not limited to this embodiment; all technical solutions based on the principle of the invention fall within its protection scope. Modifications and adaptations that those skilled in the art can make without departing from the principle of the invention are also intended to fall within its protection scope.

Claims (1)

1. A factor screening method for binary classification based on the enhanced regression tree algorithm, comprising the following steps:
(1) collecting the target variable and prediction factors for the binary classification of the grass mat layer, and establishing a target variable-prediction factor data set for the grass mat layer; the target variable of the binary classification comprises grass mat layers and non-grass mat layers, wherein a grass mat layer takes the value 1 and a non-grass mat layer takes the value 0 in the binary classification; the prediction factors comprise remote sensing prediction factors, climate prediction factors, and terrain prediction factors;
(2) building an enhanced regression tree model with the enhanced regression tree algorithm from the target variable and all prediction factors of the grass mat layer, then calculating and ranking the importance of each prediction factor of the grass mat layer;
(3) performing correlation analysis on all prediction factors of the grass mat layer, analyzing the Pearson correlation matrix, and screening: for any group of factors whose absolute Pearson correlation coefficient is greater than or equal to 0.80, retaining the factor with the greatest importance as calculated in step (2) and removing all other factors in the group;
(4) building a new enhanced regression tree model with the enhanced regression tree algorithm from the target variable and the retained factors of the grass mat layer, calculating its prediction deviation, calculating and ranking the importance of the factors, and removing the least important factor; if more than 2 factors remain after the removal, repeating this step with the target variable and the remaining factors of the grass mat layer until the number of retained factors is less than or equal to 2;
(5) comparing the prediction deviations of the enhanced regression tree models from step (4), and taking the prediction factors of the grass mat layer used by the model with the minimum prediction deviation as the optimal prediction factor combination;
wherein the enhanced regression tree model established in step (2) is run repeatedly 100 times, and the importance of each prediction factor of the grass mat layer is the average of the values calculated in the 100 runs;
and wherein the prediction deviation of each enhanced regression tree model established in step (4) is calculated by ten-fold cross-validation, the model is run repeatedly 100 times, and the average of the 100 results is taken as the prediction deviation of the model.

Priority Applications (1)

Application Number: CN201710670847.4A
Priority Date: 2017-08-08
Filing Date: 2017-08-08
Title: Factor screening method for binary classification based on enhanced regression tree algorithm

Publications (2)

Publication Number: CN107608938A, Publication Date: 2018-01-19
Publication Number: CN107608938B, Publication Date: 2020-12-08




Legal Events

Code Title
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant