CN107608938A - Factor screening method for binary classification based on the boosted regression tree algorithm - Google Patents


Info

Publication number
CN107608938A
Authority
CN
China
Prior art keywords
factor
factors
enhancing
importance
target variable
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710670847.4A
Other languages
Chinese (zh)
Other versions
CN107608938B (en)
Inventor
支俊俊
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Anhui Normal University
Original Assignee
Anhui Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Anhui Normal University
Priority to CN201710670847.4A
Publication of CN107608938A
Application granted
Publication of CN107608938B
Legal status: Active
Anticipated expiration

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a factor screening method for binary classification based on the boosted regression tree algorithm, comprising: (1) collecting data and building a target variable-predictive factor data set; (2) building a boosted regression tree model from the target variable and all factors, then computing and ranking factor importance; (3) running a correlation analysis on all factors, examining the Pearson correlation matrix and screening the factors accordingly; (4) building a new boosted regression tree model from the target variable and the retained factors, computing the prediction deviation, computing and ranking factor importance, and removing the least important factor, repeating until the number of retained factors is ≤ 2; (5) comparing the prediction deviation of each boosted regression tree model from step (4) and taking all factors used by the model with the lowest prediction deviation as the optimal factor combination. The invention establishes a quantitative predictive factor selection scheme whose results are reliable and whose range of application is wide.

Description

Factor screening method for binary classification based on the boosted regression tree algorithm
Technical field
The present invention relates to the technical field of factor screening, and specifically to a factor screening method for binary classification based on the boosted regression tree algorithm, applicable to numerous fields such as agriculture, environment, ecology, hydrology, medical geography (e.g. epidemiology), disaster early warning and forecasting, and meteorology (e.g. weather forecasting).
Background art
Factor screening is the first problem to solve when studying binary target variables in numerous fields such as agriculture, environment, ecology, hydrology, medical geography (e.g. epidemiology), disaster early warning and forecasting, and meteorology (e.g. weather forecasting). Previous studies mostly used the correlation coefficient method and stepwise regression. The correlation coefficient method runs a correlation analysis on all factors and removes the highly correlated ones, but the choice of which factor to remove from a group of highly correlated factors is entirely subjective. One limitation of stepwise regression is that it presupposes a single best factor subset to be identified, whereas there is usually no unique optimal subset; another is that when factors are highly correlated it may yield an unreasonable subset.
In recent years, researchers in China and abroad have tried many new factor screening methods, mainly principal component analysis, cluster analysis, factor analysis, discriminant analysis, and methods based on fuzzy mathematics. However, each of these methods has limitations. Principal component analysis requires the cumulative contribution rate of the first few extracted principal components to reach a high level, the extracted components are hard to name and interpret clearly, and when the factor loadings of a component have both positive and negative signs the meaning of the composite evaluation function is unclear. Cluster analysis places high demands on the multivariate normality and homogeneity of variance of the variables, and with a large sample it is difficult to reach a clustering conclusion. Factor analysis has specific requirements on data volume and composition, and because it computes factor scores by least squares it may fail in some cases. Discriminant analysis is not suited to cases where multicollinearity exists among the factors. Methods based on fuzzy mathematics involve some subjectivity in determining the index weight vector. The shortcoming shared by the existing methods is that none provides a quantitative factor screening method that suits various data types while ensuring that the information content of the original factors is not lost.
Summary of the invention
The object of the invention is to provide a quantitative factor screening method for binary classification based on the boosted regression tree algorithm that suits various data types, ensures that the information content of the original factors is not lost, and effectively resolves multicollinearity among the factors.
The technical solution of the invention is as follows:
A factor screening method for binary classification based on the boosted regression tree algorithm, characterized in that it comprises the following steps:
(1) Collect the target variable and predictive factors for the binary classification, and build a target variable-predictive factor data set;
(2) Based on the target variable and all predictive factors, build a boosted regression tree model with the boosted regression tree algorithm, then compute and rank the importance of each predictive factor;
(3) Run a correlation analysis on all predictive factors, examine the Pearson correlation matrix and screen the factors: for each combination of factors whose Pearson correlation coefficient has an absolute value ≥ 0.80, retain the factor with the highest importance as computed in step (2) and remove all other factors in that combination;
(4) Based on the target variable and the retained factors, build a new boosted regression tree model with the boosted regression tree algorithm, compute the prediction deviation, compute and rank factor importance, and remove the least important factor; if the number of factors remaining after this removal is > 2, repeat this step with the target variable and the remaining factors until the number of retained factors is ≤ 2;
(5) Compare the prediction deviation of each boosted regression tree model in step (4) (a new model is built in step (4) each time a factor is removed, so there are several models), and take all predictive factors used by the model with the lowest prediction deviation as the optimal predictive factor combination.
As further improvements to the above technical solution:
The boosted regression tree model built in step (2) is run 100 times, and the importance of each predictive factor is the mean of the model's 100 computed results.
For the boosted regression tree model built in step (4), the prediction deviation is computed by ten-fold cross-validation and the model is run 100 times; the mean of the 100 computed results is taken as the prediction deviation of that model.
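The repeated ten-fold cross-validation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: it assumes the "prediction deviation" is measured as the mean Bernoulli deviance on held-out samples, which the text does not state explicitly. The function names and the `fit` callable (which trains on a subset and returns a prediction function) are hypothetical, and `fit_prevalence` is only a placeholder standing in for a real boosted regression tree fit.

```python
import math
import random
from statistics import mean

def bernoulli_deviance(y_true, p_pred, eps=1e-12):
    """Mean Bernoulli deviance: -2 times the mean log-likelihood."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -2.0 * total / len(y_true)

def k_fold_indices(n, k, rng):
    """Shuffle the indices 0..n-1 and deal them into k folds."""
    idx = list(range(n))
    rng.shuffle(idx)
    return [idx[i::k] for i in range(k)]

def cv_deviance(X, y, fit, k=10, rng=None):
    """One k-fold run: every sample is predicted by a model that never saw it."""
    rng = rng or random.Random()
    y_held, p_held = [], []
    for test_idx in k_fold_indices(len(y), k, rng):
        held = set(test_idx)
        train_idx = [i for i in range(len(y)) if i not in held]
        predict = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        for i in test_idx:
            y_held.append(y[i])
            p_held.append(predict(X[i]))
    return bernoulli_deviance(y_held, p_held)

def repeated_cv_deviance(X, y, fit, repeats=100, k=10, seed=0):
    """Average the cross-validated deviance over several reshuffled runs."""
    rng = random.Random(seed)
    return mean(cv_deviance(X, y, fit, k, rng) for _ in range(repeats))

def fit_prevalence(X_train, y_train):
    """Placeholder model (assumption): predicts the training prevalence."""
    p = sum(y_train) / len(y_train)
    return lambda row: p
```

In practice `fit_prevalence` would be replaced by a function that fits a boosted regression tree model on the training fold; the averaging structure stays the same.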
Beneficial effects of the present invention:
1. Building on correlation analysis, the invention removes highly correlated factors according to the factor importance computed by the boosted regression tree algorithm, and then uses the algorithm to remove, step by step, the factor that contributes least to the model. This effectively resolves the subjectivity present in existing factor screening methods, resolves multicollinearity among factors while ensuring that information is not lost, and identifies the key factors that influence the target variable. The method can use various data types (both continuous and discrete) and does not require the data to be normally distributed.
2. By running the boosted regression tree model repeatedly and averaging the results, the invention further improves the stability of the algorithm. The method is precise, quantitative, easy to operate and widely applicable: it can be used for binary-classification factor screening in numerous fields such as agriculture, environment, ecology, hydrology, medical geography (e.g. epidemiology), disaster early warning and forecasting, and meteorology (e.g. weather forecasting).
Brief description of the drawings
Fig. 1 is a flow chart of the basic implementation of the embodiment of the invention.
Fig. 2 is a schematic diagram of the Pearson correlation matrix in the embodiment of the invention.
Fig. 3 is a schematic diagram comparing the prediction deviation of each boosted regression tree model in the embodiment of the invention.
Detailed description of the embodiment
In this embodiment, the boosted regression tree algorithm (boosted regression trees) is run with the widely used gbm package on the R software platform (https://www.r-project.org/). The method is described in detail using, as an example, sod layer data (the sod layer is a diagnostic horizon in Chinese Soil Taxonomy; point data, used as the target variable) and environmental factor data (raster data, used as predictive factors) from the Qilian Mountains area.
Referring to Fig. 1, the factor screening method for binary classification based on the boosted regression tree algorithm in this embodiment proceeds as follows:
1. Collect the target variable and predictive factors for the binary classification of the sod layer, and build the target variable-predictive factor data set.
The sod layer data of this embodiment (target variable) come from the National Natural Science Foundation of China key project "Research on digital mapping of key soil attributes in the Heihe River basin" (41130530). There are 128 sod layer samples in total, of which 54 are sod layer (value 1 in the binary classification) and 74 are non-sod layer (value 0 in the binary classification). The predictive factor data come from the "Heihe River basin eco-hydrological process integrated remote sensing observation joint experiment" (http://westdc.westgis.ac.cn) and include remote sensing data (30 m resolution, Landsat 5 TM), terrain data (30 m resolution, ASTER GDEM) and climate data (spatial distribution maps at 1 km resolution, covering temperature and precipitation). On the ArcMap 9.3 software platform, the remote sensing, terrain and climate layers (resampled to 30 m resolution) were geometrically corrected (with the Georeferencing tool), and the remote sensing and climate predictive factors were extracted; the terrain predictive factors were extracted with the System for Automated Geoscientific Analyses. In total 26 predictive factors were extracted, numbered V1 to V26. Using the spatial analysis functions of ArcMap (the Extract Values to Points tool), the values of the 26 predictive factors at the 128 sampling points were extracted, and the sod layer sample values (0 or 1) together with the corresponding 26 predictive factor values were collected in one .csv file.
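As a rough illustration of the data set that results from this step, the sketch below reads a .csv file with one 0/1 target column and one column per predictive factor. The header name `target` and the function name `load_dataset` are hypothetical; the text does not give the actual column headers of the file.

```python
import csv
import io

def load_dataset(handle, target="target"):
    """Read a target/predictive-factor table from an open CSV file handle.

    Returns (y, columns): y is the list of 0/1 target values and columns
    maps each factor name to its list of numeric values, in row order.
    """
    y, columns = [], {}
    for row in csv.DictReader(handle):
        y.append(int(row[target]))
        for name, value in row.items():
            if name != target:
                columns.setdefault(name, []).append(float(value))
    return y, columns

# Tiny in-memory example with made-up values, only to show the expected layout.
example = io.StringIO("target,V1,V2\n1,0.5,10\n0,0.7,12\n")
y_example, columns_example = load_dataset(example)
```

Storing the factors column by column keeps the later correlation and elimination steps simple, since each of them works on whole factors rather than on individual samples.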
2. Build a model with the boosted regression tree algorithm, then compute and rank the importance of each predictive factor.
Based on the sod layer sample values (target variable) and all predictive factors (the .csv file from step 1), a model is built with the boosted regression tree algorithm. The model parameters are set as follows: the data distribution type (distribution) is set to "bernoulli" in this embodiment (for binary classification); the tree complexity, usually ≥ 2, is set to 3 in this embodiment; the bagging fraction, usually 0.50-0.75, is set to 0.50 in this embodiment; the learning rate is tuned so that the optimal number of trees is ≥ 1000, and is set to 0.001 in this embodiment. The importance of each predictive factor is then computed. The model is run 100 times, and the factors are ranked by the mean of their 100 computed importance values.
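The settings and the averaging of importance over repeated runs can be sketched as below. This is an illustrative outline only: the dictionary keys are descriptive labels rather than the argument names of any particular package, and `fit_once` is a hypothetical callable that fits one model and returns a mapping from factor name to relative importance.

```python
from statistics import mean

# Settings reported for this embodiment; the key names are illustrative only.
BRT_SETTINGS = {
    "distribution": "bernoulli",   # binary (0/1) target variable
    "tree_complexity": 3,          # usually >= 2
    "bagging_fraction": 0.50,      # usually 0.50-0.75
    "learning_rate": 0.001,        # tuned so the optimal tree count is >= 1000
    "min_optimal_trees": 1000,
    "repeats": 100,
}

def rank_by_mean_importance(fit_once, factors, repeats=BRT_SETTINGS["repeats"]):
    """Fit the model `repeats` times and rank factors by mean importance.

    fit_once: hypothetical callable returning {factor name: importance}
              for a single (stochastic) model fit.
    Returns a list of (factor, mean importance), most important first.
    """
    runs = [fit_once() for _ in range(repeats)]
    averaged = {name: mean(run[name] for run in runs) for name in factors}
    return sorted(averaged.items(), key=lambda item: item[1], reverse=True)
```

Averaging matters here because a bagging fraction below 1 makes each fit stochastic, so a single run can order weak factors differently from one execution to the next.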
3. Run the correlation analysis and, based on the Pearson correlation coefficients and factor importance, remove factors that carry duplicated information.
A Pearson correlation analysis is run on all predictive factors with SPSS; the Pearson correlation matrix is shown in Fig. 2. For each combination of factors whose Pearson correlation coefficient has an absolute value ≥ 0.80, the factor with the highest importance as computed in step 2 is retained and all other factors in that combination are removed. In this embodiment, the four factors V4 (importance 17.9%), V13 (importance 15.2%), V14 (importance 0.1%) and V22 (importance 0.1%) have pairwise Pearson correlation coefficients with absolute values ≥ 0.80; according to the importance computed in step 2, only V4, the most important, is retained, and V13, V14 and V22 are removed. This step removes 11 factors in total and retains 15 factors whose pairwise Pearson correlation coefficients all have absolute values < 0.80.
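One way to automate this screening is sketched below. It is an assumption about how the groups of correlated factors are resolved, not a procedure spelled out in the text: factors are visited from most to least important, and a factor is kept only if its absolute Pearson correlation with every factor already kept is below the threshold. This guarantees that the retained factors have pairwise absolute correlations below 0.80. The function names are hypothetical, and the sketch assumes that no factor is constant (a constant factor would make the correlation undefined).

```python
import math

def pearson(a, b):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    return cov / math.sqrt(var_a * var_b)

def filter_correlated(columns, importance, threshold=0.80):
    """Keep the most important factor of each highly correlated group.

    columns:    {factor name: list of values}
    importance: {factor name: importance from the full model}
    Returns the retained factor names, most important first.
    """
    kept = []
    for name in sorted(columns, key=lambda n: importance[n], reverse=True):
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept
```

Applied to a case like the one above, where V4, V13, V14 and V22 are mutually correlated, this ordering keeps V4 because it is visited first and drops the other three when they are compared against it.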
4. Use the boosted regression tree algorithm to remove the least important factor step by step, and compute the prediction deviation.
Based on the sod layer sample values (target variable) and the 15 retained factors, a new model is built with the boosted regression tree algorithm (with the same parameter settings as the model in step 2), and the factor importance and prediction deviation are computed. The model is run 100 times, and the factors are ranked by the mean of their 100 computed importance values. The least important factor is removed, and this step is repeated with the sod layer sample values (target variable) and the factors that remain after the removal, until the number of retained factors is 2.
5. Compare the prediction deviations and determine the optimal predictive factor combination.
The mean prediction deviation over the 100 runs of each boosted regression tree model in step 4 is compared (see Fig. 3), and all predictive factors used by the model with the lowest prediction deviation are taken as the optimal predictive factor combination. In this embodiment the prediction deviation of the boosted regression tree model is lowest when the number of factors is 6, so these 6 factors are taken as the optimal predictive factor combination for sod layer prediction (binary classification).
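Steps 4 and 5 together amount to a backward elimination loop followed by a minimum search, which can be sketched as follows. The names are hypothetical, and `evaluate` stands in for fitting the boosted regression tree model on a factor set and returning its averaged prediction deviation and factor importance. The sketch follows the literal wording of the claims: a model is built, the weakest factor is removed, and the step repeats only while more than 2 factors remain; whether the final 2-factor set is also modelled is not spelled out in the text.

```python
def backward_eliminate(factors, evaluate, min_factors=2):
    """Remove the least important factor one at a time.

    evaluate: hypothetical callable taking a list of factor names and
              returning (prediction deviation, {factor: importance}).
    Returns a list of (factor tuple, prediction deviation), one entry
    per model that was built.
    """
    history = []
    current = list(factors)
    while len(current) > min_factors:
        deviation, importance = evaluate(current)
        history.append((tuple(current), deviation))
        weakest = min(current, key=lambda name: importance[name])
        current.remove(weakest)
    return history

def best_combination(history):
    """Pick the factor set whose model had the lowest prediction deviation."""
    return min(history, key=lambda entry: entry[1])
```

The text does not say how ties in prediction deviation are broken; `min` returns the first of any tied entries, which here means the larger factor set.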
This embodiment illustrates the method only with predictive factor screening for sod layer prediction (a binary target variable), but it applies equally to predictive factor screening for binary target variables in other fields (such as agriculture, environment, ecology, hydrology, medical geography, disaster early warning and forecasting, and weather forecasting). By combining Pearson correlation analysis with a stepwise screening procedure based on the boosted regression tree algorithm to remove duplicated and redundant information, the embodiment effectively addresses problems common in current factor screening methods: loss of original predictive factor information, specific requirements on the type of the raw data (such as requiring continuous, normally distributed data), reliance on subjective human judgment, and difficulty in handling multicollinearity among predictive factors. In addition, because each factor importance and model prediction deviation is obtained by running the boosted regression tree algorithm repeatedly and averaging the results, the factor screening results are stable and reliable.
The above is only a preferred embodiment of the invention. The scope of protection of the invention is not limited to the above embodiment; every technical solution that falls under the principle of the invention belongs to its scope of protection. For those of ordinary skill in the art, improvements and modifications made without departing from the principle of the invention shall also be regarded as falling within the scope of protection of the invention.

Claims (3)

  1. A factor screening method for binary classification based on the boosted regression tree algorithm, characterized in that it comprises the following steps:
    (1) Collect the target variable and predictive factors for the binary classification, and build a target variable-predictive factor data set;
    (2) Based on the target variable and all predictive factors, build a boosted regression tree model with the boosted regression tree algorithm, then compute and rank the importance of each predictive factor;
    (3) Run a correlation analysis on all predictive factors, examine the Pearson correlation matrix and screen the factors: for each combination of factors whose Pearson correlation coefficient has an absolute value ≥ 0.80, retain the factor with the highest importance as computed in step (2) and remove all other factors in that combination;
    (4) Based on the target variable and the retained factors, build a new boosted regression tree model with the boosted regression tree algorithm, compute the prediction deviation, compute and rank factor importance, and remove the least important factor; if the number of factors remaining after this removal is > 2, repeat this step with the target variable and the remaining factors until the number of retained factors is ≤ 2;
    (5) Compare the prediction deviation of each boosted regression tree model in step (4), and take all predictive factors used by the model with the lowest prediction deviation as the optimal predictive factor combination.
  2. The factor screening method for binary classification based on the boosted regression tree algorithm according to claim 1, characterized in that: the boosted regression tree model built in step (2) is run 100 times, and the importance of each predictive factor is the mean of the model's 100 computed results.
  3. The factor screening method for binary classification based on the boosted regression tree algorithm according to claim 1, characterized in that: for the boosted regression tree model built in step (4), the prediction deviation is computed by ten-fold cross-validation and the model is run 100 times, and the mean of the 100 computed results is taken as the prediction deviation of that model.
CN201710670847.4A 2017-08-08 2017-08-08 Factor screening method for binary classification based on enhanced regression tree algorithm Active CN107608938B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710670847.4A CN107608938B (en) 2017-08-08 2017-08-08 Factor screening method for binary classification based on enhanced regression tree algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710670847.4A CN107608938B (en) 2017-08-08 2017-08-08 Factor screening method for binary classification based on enhanced regression tree algorithm

Publications (2)

Publication Number Publication Date
CN107608938A true CN107608938A (en) 2018-01-19
CN107608938B CN107608938B (en) 2020-12-08

Family

ID=61064801

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710670847.4A Active CN107608938B (en) 2017-08-08 2017-08-08 Factor screening method for binary classification based on enhanced regression tree algorithm

Country Status (1)

Country Link
CN (1) CN107608938B (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948948A (en) * 2019-03-29 2019-06-28 广东电网有限责任公司 A kind of bus load key index screening technique, system and relevant apparatus
CN110119568A (en) * 2019-05-09 2019-08-13 河海大学 A kind of riprap protection jackstone influential effect factor evaluation method
CN112149702A (en) * 2019-06-28 2020-12-29 北京百度网讯科技有限公司 Feature processing method and device

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106980603A (en) * 2017-02-23 2017-07-25 中国科学院南京土壤研究所 Soil sulphur element content prediction method based on soil types merger and multiple regression

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106980603A (en) * 2017-02-23 2017-07-25 中国科学院南京土壤研究所 Soil sulphur element content prediction method based on soil types merger and multiple regression

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
CHUNG,YI-SHIH: "Factor complexity of crash occurrence: An empirical demonstration using boosted regression trees", <ACCIDENT ANALYSIS AND PREVENTION> *
徐婷: "气候变化对东北丹顶鹤繁殖生境影响评价研究", 《中国优秀硕士毕业论文全文数据库》 *
焦琳琳等: "利用增强回归树分析中国野火空间分布格局的影响因素", 《生态学杂志》 *
葛跃等: "基于增强回归树的城市PM2.5日均值变化分析:以常州为例", 《环境科学》 *
谢晓文: "森林旅游地餐饮经营者的碳补偿意愿及其影响因素", 《中国优秀硕士毕业论文全文数据库》 *

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109948948A (en) * 2019-03-29 2019-06-28 广东电网有限责任公司 A kind of bus load key index screening technique, system and relevant apparatus
CN110119568A (en) * 2019-05-09 2019-08-13 河海大学 A kind of riprap protection jackstone influential effect factor evaluation method
CN110119568B (en) * 2019-05-09 2022-10-14 河海大学 Method for evaluating stone-throwing effect influence factors of riprap bank protection
CN112149702A (en) * 2019-06-28 2020-12-29 北京百度网讯科技有限公司 Feature processing method and device

Also Published As

Publication number Publication date
CN107608938B (en) 2020-12-08

Similar Documents

Publication Publication Date Title
Rafiei-Sardooi et al. Evaluating urban flood risk using hybrid method of TOPSIS and machine learning
CN106651211B (en) Flood disaster risk assessment method for different scale areas
Maddahi et al. Land Suitability Analysis for Rice Cultivation Using a GIS-based Fuzzy Multi-criteria Decision Making Approach: Central Part of Amol District, Iran.
Kisi et al. Precipitation forecasting using wavelet-genetic programming and wavelet-neuro-fuzzy conjunction models
Host et al. A quantitative approach to developing regional ecosystem classifications
CN106599601A (en) Remote sensing assessment method and system for ecosystem vulnerability
CN106780089B (en) Permanent basic farmland planning method based on neural network cellular automaton model
KR20170005553A (en) Floods, drought assessment and forecasting techniques development for intelligent service
Latt et al. Clustering hydrological homogeneous regions and neural network based index flood estimation for ungauged catchments: an example of the Chindwin River in Myanmar
CN108875242A (en) A kind of urban cellular automata Scene Simulation method, terminal device and storage medium
CN108304536A (en) A kind of geographical environmental simulation of the geographical environmental element of coupling and predicting platform
CN109657616A (en) A kind of remote sensing image land cover pattern automatic classification method
CN107608938A (en) The factor screening method towards two-value classification of tree algorithm is returned based on enhancing
CN113033081A (en) Runoff simulation method and system based on SOM-BPNN model
Fataei et al. Industrial state site selection using MCDM method and GIS in Germi, Ardabil, Iran
Wang et al. How do physical and social factors affect urban landscape patterns in intermountain basins in Southwest China?
CN111445087A (en) Flood prediction method based on extreme learning machine
Koolagudi Long-range prediction of Indian summer monsoon rainfall using data mining and statistical approaches
Gavsker Urban growth, changing relationship between biophysical factors and surface thermal characteristics: A geospatial analysis of Agra city, India
Yost Probabilistic modeling and mapping of plant indicator species in a Northeast Oregon industrial forest, USA
Mokarram et al. Identification of morphometric features of alluvial fan and basins in predicting the erosion levels using ANN
Keshavarzi et al. Fuzzy clustering analysis for modeling of soil cation exchange capacity
Richardson et al. Assessing watershed vulnerability in Bernalillo County, New Mexico using GIS-based fuzzy Inference
CN115393148A (en) Data monitoring system, monitoring method, device, medium and terminal for natural resources
CN113537793A (en) Method for ecological hydrological zoning of drainage basin

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant