CN107608938B - Factor screening method for binary classification based on enhanced regression tree algorithm - Google Patents
- Publication number: CN107608938B
- Application number: CN201710670847.4A
- Authority: CN (China)
- Prior art keywords: factors, factor, prediction, regression tree, importance
- Legal status: Active (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Abstract
The invention discloses a factor screening method for binary classification based on an enhanced regression tree algorithm, comprising the following steps: (1) collecting data and establishing a target variable-prediction factor data set; (2) building a model from the target variable and all factors using the enhanced regression tree algorithm, then calculating and ranking the importance of the factors; (3) performing correlation analysis on all factors, analyzing the Pearson correlation matrix, and screening the factors; (4) building a new model from the target variable and the retained factors using the enhanced regression tree algorithm, calculating the prediction deviation, calculating and ranking the importance of the factors, and eliminating the factor with the least importance, repeating until the number of retained factors is less than or equal to 2; (5) comparing the prediction deviations of the enhanced regression tree models from step (4) and taking all factors used by the model with the smallest prediction deviation as the optimal factor combination. The invention establishes a quantitative factor selection system, gives reliable results, and is applicable to a wide range of fields.
Description
Technical Field
The invention relates to the technical field of factor screening, and in particular to a factor screening method for binary classification based on an enhanced regression tree algorithm. The method is suitable for fields such as agriculture, environment, ecology, hydrology, medical geography (e.g., epidemiology), disaster early warning and forecasting, and meteorology (e.g., weather forecasting).
Background
Factor screening is the first problem to be solved when studying binary classification target variables in fields such as agriculture, environment, ecology, hydrology, medical geography (e.g., epidemiology), disaster early warning and forecasting, and meteorology (e.g., weather forecasting). In the past, the correlation coefficient method and stepwise regression analysis were most commonly used. The correlation coefficient method performs correlation analysis on all factors and eliminates highly correlated ones, but the choice of which factor to eliminate from a highly correlated group is entirely subjective. Stepwise regression analysis has two limitations: it assumes in advance that a single optimal factor subset exists, although often there is no unique optimal subset; and it may produce unreasonable subsets when the factors are highly correlated. In recent years, researchers in China and abroad have tried a number of newer factor screening methods, mainly principal component analysis, cluster analysis, factor analysis, discriminant analysis, and methods based on fuzzy mathematics.
However, each of these methods has limitations. Principal component analysis requires that the cumulative contribution rate of the first few extracted principal components reach a high level, the extracted components are hard to name and interpret, and the comprehensive evaluation function becomes ambiguous when the factor loadings of a principal component have mixed signs. Cluster analysis places high demands on the multivariate normality and homogeneity of variance of the variables, and a clustering conclusion is difficult to reach when the sample size is large. Factor analysis has specific requirements on the amount and composition of the data, and the least squares method it uses to calculate factor scores may fail in some cases. Discriminant analysis is not suited to cases where multicollinearity exists among the factors. Methods based on fuzzy mathematics involve a degree of subjectivity in determining the index weight vector. The common shortcoming of the existing methods is that none provides a quantitative factor screening method that suits various data types while ensuring that the information in the original factors is not lost.
Disclosure of Invention
The invention aims to provide a quantitative factor screening method for binary classification based on an enhanced regression tree algorithm. The method is suitable for various data types, ensures that the information in the original factors is not lost, and effectively handles multicollinearity among the factors.
The technical scheme of the invention is as follows:
A factor screening method for binary classification based on an enhanced regression tree algorithm, comprising the following steps:
(1) collecting the target variable and prediction factors for binary classification, and establishing a target variable-prediction factor data set;
(2) establishing an enhanced regression tree model from the target variable and all the prediction factors using the enhanced regression tree algorithm, calculating the importance of each prediction factor, and ranking the factors by importance;
(3) performing correlation analysis on all the prediction factors, analyzing the Pearson correlation matrix, and screening: for each factor combination whose Pearson correlation coefficients have absolute values greater than or equal to 0.80, retaining the factor with the greatest importance as calculated in step (2) and rejecting all other factors in the combination;
(4) establishing a new enhanced regression tree model from the target variable and the retained factors using the enhanced regression tree algorithm, calculating the prediction deviation, calculating and ranking the importance of the factors, and eliminating the factor with the least importance; if more than 2 factors remain after the elimination, repeating this step with the target variable and the remaining factors until the number of retained factors is less than or equal to 2;
(5) comparing the prediction deviations of the enhanced regression tree models from step (4) (a new model is established each time a factor is eliminated in step (4), so there are several models), and taking all prediction factors used by the model with the smallest prediction deviation as the optimal prediction factor combination.
As a further improvement of the technical scheme of the invention:
The enhanced regression tree model established in step (2) is run 100 times, and the importance of each prediction factor is the average of the 100 results.
The prediction deviation of each enhanced regression tree model established in step (4) is calculated by ten-fold cross-validation; the model is run 100 times, and the average of the 100 results is taken as the prediction deviation of the model.
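The repeated ten-fold cross-validation described above can be sketched in plain Python. This is only an illustration under stated assumptions: it takes "prediction deviation" to mean the mean Bernoulli deviance of held-out predictions (the source does not give a formula), and `fit_predict` and `prevalence_model` are hypothetical stand-ins for the enhanced regression tree model, not part of the patent or of any library.

```python
import math
import random
from statistics import mean

def bernoulli_deviance(y_true, p_pred, eps=1e-12):
    """Mean Bernoulli deviance: -2 * mean(y*log(p) + (1-y)*log(1-p))."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # clip to avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -2.0 * total / len(y_true)

def kfold_indices(n, k, rng):
    """Shuffle 0..n-1 and deal the indices into k nearly equal folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def repeated_cv_deviance(X, y, fit_predict, k=10, repeats=100, seed=0):
    """Average the k-fold cross-validated deviance over several repeats.

    fit_predict(X_train, y_train, X_test) -> list of probabilities.
    It is a hypothetical hook standing in for the boosted-tree model.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        fold_scores = []
        for test_idx in kfold_indices(len(y), k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            p = fit_predict([X[i] for i in train_idx],
                            [y[i] for i in train_idx],
                            [X[i] for i in test_idx])
            fold_scores.append(bernoulli_deviance([y[i] for i in test_idx], p))
        scores.append(mean(fold_scores))
    return mean(scores)

def prevalence_model(X_train, y_train, X_test):
    """Toy stand-in: predict the training prevalence for every test row."""
    rate = sum(y_train) / len(y_train)
    return [rate] * len(X_test)
```

Replacing `prevalence_model` with a function that fits a real boosted-tree model on the training rows would give the averaged deviance that step (5) compares across models.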
The invention has the beneficial effects that:
1. On the basis of correlation analysis, highly correlated factors are removed according to the importance of each factor calculated by the enhanced regression tree algorithm, and the factors contributing least to the model are then removed step by step using the same algorithm. This effectively addresses the subjectivity of existing factor screening methods, handles multicollinearity among the factors without losing information, and identifies the key factors that influence the target variable. The method can use various data types (both continuous and discrete) and does not require the data to be normally distributed.
2. Running the enhanced regression tree model repeatedly and averaging the results further improves the stability of the algorithm. The method is accurate, quantitative, easy to operate, and widely applicable, and can be used for binary classification factor screening in fields such as agriculture, environment, ecology, hydrology, medical geography (e.g., epidemiology), disaster early warning and forecasting, and meteorology (e.g., weather forecasting).
Drawings
Fig. 1 is a schematic flow chart of a basic implementation of the embodiment of the invention.
Fig. 2 is a schematic diagram of the Pearson correlation matrix in an embodiment of the invention.
Fig. 3 is a diagram comparing the prediction deviations of the enhanced regression tree models in an embodiment of the invention.
Detailed Description
The enhanced regression tree algorithm (boosted regression trees) in the embodiment of the invention uses the common gbm software package on the R software platform (https://www.r-project.org/). The embodiment is explained in detail using grass mat layer data from the Qilian Mountains region (point data, used as the target variable; the grass mat layer is a diagnostic layer in the Chinese soil system classification) and environmental factor data (raster data, used as the prediction factors).
Referring to Fig. 1, the embodiment of the invention provides a factor screening method for binary classification based on an enhanced regression tree algorithm, which specifically includes the following steps:
1. Collect the target variable and prediction factors for binary classification of the grass mat layer, and establish a target variable-prediction factor data set.
The grass mat layer data (target variable) of the embodiment come from the key project of the National Natural Science Foundation of China 'Digital mapping of key soil properties in the Heihe River Basin' (41130530). There are 128 samples in total, of which 54 are grass mat layers (value 1 in the binary classification) and 74 are non-grass mat layers (value 0 in the binary classification). The prediction factor data come from the 'Heihe River Basin integrated remote sensing observation joint experiment on eco-hydrological processes' (http://westdc.westgis.ac.cn) and comprise remote sensing data (30 m resolution, Landsat 5 TM), terrain data (30 m resolution, ASTER GDEM), and climate data (spatial distribution maps at 1 km resolution, covering temperature and precipitation). The remote sensing, terrain, and climate data layers (resampled to 30 m resolution) were geometrically corrected on the ArcMap 9.3 software platform (using the Georeferencing tool), and the remote sensing and climate prediction factors were extracted; the terrain prediction factors were extracted using the System for Automated Geoscientific Analysis (SAGA). In total 26 prediction factors were extracted, numbered V1 through V26. Using the ArcMap spatial analysis function (the Extract Values to Points tool), the 26 prediction factor values at each of the 128 sampling points were extracted, and the grass mat layer values (0 or 1) of the sampling points were combined with the corresponding 26 prediction factor values into a single csv file.
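As a rough illustration of the resulting data set, the sketch below reads such a csv file into a 0/1 target list and one list of values per prediction factor. The column name `target`, the V1 and V2 headers, and the sample values are assumptions made for the example; the source does not describe the file layout.

```python
import csv
import io

def load_dataset(handle, target="target"):
    """Read a csv with one 0/1 target column and numeric predictor columns."""
    reader = csv.DictReader(handle)
    predictors = [name for name in reader.fieldnames if name != target]
    y = []
    columns = {name: [] for name in predictors}
    for row in reader:
        y.append(int(row[target]))
        for name in predictors:
            columns[name].append(float(row[name]))
    return y, columns

# Made-up rows for illustration only; the real file would hold 128 rows
# and 26 predictor columns.
sample = io.StringIO("target,V1,V2\n1,0.31,2450\n0,0.12,3100\n1,0.27,2600\n")
y, columns = load_dataset(sample)
```

Storing the predictors column by column keeps later steps, such as pairwise correlation, simple to express.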
2. Build a model with the enhanced regression tree algorithm, then calculate and rank the importance of each prediction factor.
The model is built with the enhanced regression tree algorithm from the grass mat layer sample values (target variable) and all prediction factors (the csv file from step 1). The parameters of the model are set as follows: the data distribution type (distribution) is set to "bernoulli" in this embodiment (for binary classification); the tree complexity is generally ≥ 2 and is set to 3 in this embodiment; the sampling rate (bagging fraction) is generally 0.50 to 0.75 and is set to 0.50 in this embodiment; the learning rate is adjusted so that the number of trees is at least 1000 and is set to 0.001 in this embodiment. The importance of each prediction factor is then calculated. The model is run 100 times, and the factors are ranked by the average of the 100 importance values calculated for each factor.
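The settings and the averaging of importance over repeated runs might be captured as follows. This is a sketch, not the embodiment's R code: the field names in `BrtSettings` are illustrative, and `fit_importance` and `toy_fit` are hypothetical hooks standing in for a single model fit that returns per-factor importance in percent.

```python
from dataclasses import dataclass
from statistics import mean

@dataclass(frozen=True)
class BrtSettings:
    """Settings reported in the embodiment (field names are illustrative)."""
    distribution: str = "bernoulli"   # binary (0/1) target
    tree_complexity: int = 3          # generally >= 2
    bagging_fraction: float = 0.50    # generally 0.50 to 0.75
    learning_rate: float = 0.001      # tuned so that at least 1000 trees are fitted
    min_trees: int = 1000
    repeats: int = 100

def mean_importance(fit_importance, factors, settings, seed=0):
    """Average per-factor importance over settings.repeats model runs.

    fit_importance(factors, settings, run_seed) -> {factor: importance in %}
    is a hypothetical hook for one boosted-tree fit.
    """
    runs = [fit_importance(factors, settings, seed + r)
            for r in range(settings.repeats)]
    averaged = {f: mean(run[f] for run in runs) for f in factors}
    # Rank from most to least important.
    return sorted(averaged.items(), key=lambda kv: kv[1], reverse=True)

def toy_fit(factors, settings, run_seed):
    """Deterministic stand-in that alternates between two importance profiles."""
    shift = 2.0 if run_seed % 2 == 0 else -2.0
    return {"V1": 60.0 + shift, "V2": 30.0 - shift, "V3": 10.0}

ranking = mean_importance(toy_fit, ["V1", "V2", "V3"], BrtSettings())
```

Averaging before ranking means a factor's position does not depend on the random subsampling of any single run, which is the stability argument the text makes.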
3. Perform correlation analysis and remove factors carrying duplicate information, according to the Pearson correlation coefficients and the factor importance.
Pearson correlation analysis was performed on all prediction factors using SPSS software; see Fig. 2 for the Pearson correlation matrix. For each factor combination whose Pearson correlation coefficients have absolute values greater than or equal to 0.80, the factor with the greatest importance as calculated in step 2 is retained and all other factors in the combination are rejected. In the embodiment, for example, the absolute values of the pairwise Pearson correlation coefficients among the four factors V4 (importance 17.9%), V13 (importance 15.2%), V14 (importance 0.1%), and V22 (importance 0.1%) are all greater than or equal to 0.80, so only V4, the most important, is retained, and V13, V14, and V22 are removed. This step removes 11 factors and retains 15 factors whose pairwise Pearson correlation coefficients all have absolute values below 0.80.
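One way to express this screening rule in code is a greedy pass over the factors in descending order of importance, keeping a factor only if it is not highly correlated with any factor already kept. This is an interpretation, not the patent's stated procedure: the source does not say how overlapping factor combinations are resolved. The column values and the importance of V7 below are made up for the example; the importances of V4 and V13 are the ones quoted in the text.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0  # correlation is undefined for a constant column; treat as 0
    return sxy / math.sqrt(sxx * syy)

def screen_by_correlation(columns, importance, threshold=0.80):
    """Keep a factor only if |r| < threshold against every factor already kept.

    Factors are visited from most to least important, so within a highly
    correlated group the most important member is the one that survives.
    """
    kept = []
    for name in sorted(columns, key=lambda f: importance[f], reverse=True):
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept

# Made-up values: V4 and V13 are nearly collinear, V7 is only weakly related.
columns = {
    "V4": [1, 2, 3, 4, 5],
    "V13": [2, 4, 6, 8, 11],
    "V7": [5, 1, 4, 2, 3],
}
importance = {"V4": 17.9, "V13": 15.2, "V7": 5.0}  # V7 value is hypothetical
kept = screen_by_correlation(columns, importance)
```

By construction, every pair of factors in the returned list has an absolute correlation below the threshold, which matches the property the text states for the 15 retained factors.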
4. Eliminate the least important factor step by step with the enhanced regression tree algorithm, and calculate the prediction deviation.
A new model is built with the enhanced regression tree algorithm from the grass mat layer sample values (target variable) and the 15 retained factors (with the same parameter settings as in step 2), and the factor importance and the prediction deviation are calculated. The model is run 100 times, and the factors are ranked by the average of the 100 importance values calculated for each factor. The factor with the least importance is eliminated, and this step is repeated with the grass mat layer sample values (target variable) and the remaining factors until the number of retained factors is 2.
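The elimination loop can be sketched as below. The `evaluate` hook is hypothetical: it stands for fitting the model on the current factors and returning the averaged prediction deviation together with the averaged importance of each factor. The sketch follows the literal order of the step (fit, record, eliminate, then check the count), so no model is fitted on the final two factors; the source does not say whether one should be.

```python
def backward_elimination(factors, evaluate, min_factors=2):
    """Repeatedly drop the least important factor, recording each model.

    evaluate(factors) -> (prediction_deviation, {factor: importance}) is a
    hypothetical hook for fitting and scoring one model.
    Returns (history, remaining), where history holds one
    (factors_used, prediction_deviation) pair per fitted model.
    """
    current = list(factors)
    if len(current) <= min_factors:
        raise ValueError("need more than min_factors factors to start")
    history = []
    while True:
        deviation, importance = evaluate(current)
        history.append((tuple(current), deviation))
        weakest = min(current, key=lambda f: importance[f])
        current.remove(weakest)
        if len(current) <= min_factors:
            break
    return history, current

def toy_evaluate(factors):
    """Toy stand-in: fixed importances, deviation lowest at 3 factors."""
    weights = {"F1": 40.0, "F2": 30.0, "F3": 15.0, "F4": 10.0, "F5": 5.0}
    deviation = 1.0 + 0.1 * abs(len(factors) - 3)
    return deviation, {f: weights[f] for f in factors}

history, remaining = backward_elimination(
    ["F1", "F2", "F3", "F4", "F5"], toy_evaluate
)
```

Starting from 15 factors, this order of operations would fit 13 models, on 15 down to 3 factors.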
5. Compare the prediction deviations and determine the optimal combination of prediction factors.
The average prediction deviations over the 100 runs of each enhanced regression tree model in step 4 are compared (see Fig. 3), and all prediction factors used by the model with the smallest prediction deviation are taken as the optimal prediction factor combination. In this example, the prediction deviation is smallest when the number of factors is 6, so these 6 factors are taken as the optimal combination of prediction factors for grass mat layer prediction (binary classification).
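Selecting the optimal combination then reduces to taking the minimum over the recorded models. In the sketch below the factor names and deviation values are invented purely to show the mechanics, and breaking ties in favor of fewer factors is an assumption, since the source does not address ties.

```python
def best_factor_set(history):
    """Return the (factors, deviation) pair with the smallest deviation.

    Ties are broken in favor of the smaller factor set (an assumption).
    """
    return min(history, key=lambda item: (item[1], len(item[0])))

# Hypothetical results for three models; the values are not from the patent.
history = [
    (("A", "B", "C", "D", "E", "F", "G"), 0.95),
    (("A", "B", "C", "D", "E", "F"), 0.90),
    (("A", "B", "C", "D", "E"), 0.93),
]
best_factors, best_deviation = best_factor_set(history)
```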
The present embodiment uses only the screening of prediction factors for grass mat layer prediction (a binary classification target variable) as an example, but the method applies equally to the screening of prediction factors for binary classification target variables in other fields (such as agriculture, environment, ecology, hydrology, medical geography, disaster early warning, and weather forecasting). The embodiment removes duplicate and redundant information by combining Pearson correlation analysis with a stepwise screening procedure based on the enhanced regression tree algorithm. This effectively addresses problems common to conventional factor screening methods: loss of information from the original prediction factors, specific requirements on the type of the original data (such as continuity or normal distribution), reliance on subjective human judgment, and difficulty in handling multicollinearity among the prediction factors. In addition, the importance of each factor and the prediction deviation of each model are obtained by running the enhanced regression tree algorithm repeatedly and averaging the results, so the factor screening result is stable and reliable.
The above description is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above embodiment, and all technical solutions belonging to the principle of the present invention belong to the protection scope of the present invention. It will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the invention and these are intended to be within the scope of the invention.
Claims (1)
1. A factor screening method for binary classification based on an enhanced regression tree algorithm, characterized by comprising the following steps:
(1) collecting the target variable and prediction factors for binary classification of the grass mat layer, and establishing a target variable-prediction factor data set for the grass mat layer; the target variable of the binary classification comprises grass mat layers and non-grass mat layers, a grass mat layer being given the value 1 and a non-grass mat layer the value 0 in the binary classification; the prediction factors comprise remote sensing prediction factors, climate prediction factors, and terrain prediction factors;
(2) establishing an enhanced regression tree model from the target variable and all the prediction factors of the grass mat layer using the enhanced regression tree algorithm, calculating the importance of each prediction factor of the grass mat layer, and ranking the factors by importance;
(3) performing correlation analysis on all prediction factors of the grass mat layer, analyzing the Pearson correlation matrix, and screening: for each factor combination whose Pearson correlation coefficients have absolute values greater than or equal to 0.80, retaining the factor with the greatest importance as calculated in step (2) and rejecting all other factors in the combination;
(4) establishing a new enhanced regression tree model from the target variable and the retained factors of the grass mat layer using the enhanced regression tree algorithm, calculating the prediction deviation, calculating and ranking the importance of the factors, and eliminating the factor with the least importance; if more than 2 factors remain after the elimination, repeating this step with the target variable and the remaining factors of the grass mat layer until the number of retained factors is less than or equal to 2;
(5) comparing the prediction deviations of the enhanced regression tree models from step (4), and taking all prediction factors of the grass mat layer used by the model with the smallest prediction deviation as the optimal prediction factor combination;
wherein the enhanced regression tree model established in step (2) is run 100 times, and the importance of each prediction factor of the grass mat layer is the average of the 100 results;
and the prediction deviation of each enhanced regression tree model established in step (4) is calculated by ten-fold cross-validation, the model being run 100 times and the average of the 100 results being taken as the prediction deviation of the model.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201710670847.4A CN107608938B (en) | 2017-08-08 | 2017-08-08 | Factor screening method for binary classification based on enhanced regression tree algorithm |
Publications (2)
Publication Number | Publication Date |
---|---|
CN107608938A (en) | 2018-01-19 |
CN107608938B (en) | 2020-12-08 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | GR01 | Patent grant | |