CN108876487A

CN108876487A - Industrial land parcel valuation method based on big data and an intelligent decision mechanism

Info

Publication number: CN108876487A
Application number: CN201810992001.7A
Authority: CN
Inventors: 韦虎; 王洁微
Original assignee: Yingying (Hangzhou) Network Technology Co Ltd
Current assignee: Yingying (Hangzhou) Network Technology Co Ltd
Priority date: 2018-08-29
Filing date: 2018-08-29
Publication date: 2018-11-23

Abstract

The invention discloses an industrial land parcel valuation method based on big data and an intelligent decision mechanism. The method comprises the following steps: Step 1, identify the factors that influence land price; Step 2, collect, convert, quantify, and store the factor data; Step 3, build the model, iterate, and compute. The influencing factors identified in Step 1 fall into five major categories: regional factors, traffic factors, surrounding supporting-facility factors, policy factors, and historical transaction factors. Step 3 comprises establishing the feature engineering, establishing the model, and analyzing the land unit price with the XGBoost algorithm. The invention combines current big data with an intelligent valuation algorithm to assess land prices and achieves good results.

Description

Industrial land parcel valuation method based on big data and an intelligent decision mechanism

Technical field

The invention relates to an industrial land parcel valuation method based on big data and an intelligent decision mechanism.

Background art

With the continuous development of the social economy and the steady advance of urbanization and industrialization, industrial land transactions on the secondary market have increased accordingly. However, compared with the relatively mature valuation systems for commercial and residential land, the valuation of industrial land has always been a pain point in the industry.

The valuation of industrial land is characterized by "three manys, two highs, and two fews": many individual factors, many disposal methods, and many development and construction modes; high transaction amounts and high risk; few comparable reference objects and few reference standards. At present, industrial land valuation therefore relies mainly on manual work combined with a few traditional appraisal algorithms, roughly as follows:

1. Cost approach: the price of the appraised land is determined from the sum of all necessary expenses for developing and building the appraised object, plus normal profit and taxes payable. The drawback of the cost approach is its narrow scope: it is suitable only for real estate that generates no income and is seldom traded.

2. Income capitalization approach: the expected normal future income of the appraised land in each period is discounted at an appropriate capitalization rate to its present value at the valuation date, and the sum of these present values determines the land price. The drawback is that it applies only to real estate that produces income or has potential income, and it requires careful selection of the rate of return and determination of the capitalization rate.

3. Market comparison approach: the appraised parcel is compared with similar parcels that were traded in the recent period, and the known transaction prices of those similar parcels are adjusted to obtain the price of the appraised land. This is a commonly used valuation method.

When land is valued manually with these methods, the following problems often arise:

1. Different appraisers introduce human error because they differ in focus, in the factors they consider, and in experience.

2. Land price influencing factors and market information cannot be obtained quickly, accurately, and comprehensively within a short time, so the medium- and long-term price trend of the real estate cannot be judged.

3. Each project must be fully appraised by hand from start to finish, which consumes substantial resources, takes a long time, and is inefficient.

4. Decision information cannot be presented intuitively in numerical form. Mathematical models are seldom used, and the calculations are subject to considerable human interference, so decision-makers cannot be given a comprehensive basis for their decisions.

5. Valuation algorithms such as the cost approach, the income capitalization approach, and the market comparison approach each have defects when applied to land valuation, and their scope of use is limited.

6. These algorithms were all developed for Western land markets and are not necessarily suited to the actual conditions of China's land market.

Summary of the invention

In view of the deficiencies of the prior art, the object of the invention is to provide an industrial land parcel valuation method based on big data and an intelligent decision mechanism. Building on land transaction information collected from across the web, the invention extracts, converts, and qualitatively and quantitatively evaluates all factors that may influence land price, then combines factor analysis (FA) with the XGBoost algorithm to develop an industrial land valuation model and validates it on real cases.

The technical solution adopted by the invention to solve the technical problem comprises the following steps:

Step 1: identify the factors that influence land price;

Step 2: collect, convert, quantify, and store the factor data;

Step 3: build the model, iterate, and compute;

The influencing factors identified in Step 1 fall into five major categories: regional factors, traffic factors, surrounding supporting-facility factors, policy factors, and historical transaction factors;

Regional factors refer to the development status and trends of the economy, industry, employment, and urban construction in the region where the parcel is located. Surrounding traffic factors refer to the traffic convenience of the parcel's location. Surrounding supporting-facility factors refer to the general, universal, and common factors that influence land price, including public transport, cultural and sports facilities, commerce, hospitals, schools, catering, hotels, and government agencies. Policy factors refer to government policies, planning restrictions, the land registration system, and the benchmark land price. Historical transaction factors refer to transactions of land around a given area, including transaction time, transaction area, premium rate, and transaction speed;

The collection, conversion, quantization, and storage of factor data in Step 2 are implemented as follows:

Factor data are collected from the various independent data sources, and the collected data are converted, quantified, and stored;

Conversion means manually extracting and organizing the data on surrounding traffic factors and surrounding supporting-facility factors; quantization means assigning grades to the policy-factor data;

The model building, iteration, and computation in Step 3 are implemented as follows:

3-1. Establish the feature engineering;

3-2. Establish the model;

3-3. Analyze the land unit price with the XGBoost algorithm;

The feature engineering in Step 3-1 is established as follows:

The features that influence industrial land price fall roughly into two major classes: first, information specific to the industrial parcel; second, statistical indicators of the administrative region to which the parcel belongs, as follows:

(1) Parcel-specific information

Parcel-specific information is obtained from the geographic location and the latitude and longitude of the industrial parcel. It covers four blocks of factors (surrounding supporting facilities, surrounding traffic, policy, and historical transactions) and amounts to 18 features, including government agencies, public transport, hospitals, schools, restaurants, shopping malls, hotels, expressway entrances, subways, railway stations, the land registration system, the benchmark land price, policy planning, the time of the transaction, and the number of months since the transaction;

(2) Statistical indicators of the region where the parcel is located

Public data and statistical yearbooks are organized through ETL and, according to the administrative division to which the parcel belongs, aggregated at district, city, and provincial level into statistical indicators in seven major classes: population, national economy, industrial development, talent market, urban construction, public utilities, and science, education, culture and health. The concrete indicators under these seven classes are numerous, so the feature count is large. To overcome the correlation and overlap among these indicators, factor analysis (FA) is applied to the fine-grained indicators within each second-level feature to reduce dimensionality and simplify the data, so that many original features are represented by fewer features. The common factors in factor analysis cannot be observed directly, but they are jointly influencing factors that objectively exist. Each variable can be expressed as the sum of a linear function of the common factors and a specific factor, i.e.

$$X_i = a_{i1}F_1 + a_{i2}F_2 + \dots + a_{im}F_m + \varepsilon_i, \quad i = 1, 2, \dots, p$$

where $F_1, F_2, \dots, F_m$ are the common factors, $\varepsilon_i$ is the specific factor of $X_i$, $a_{i1}, a_{i2}, \dots, a_{im}$ are the linear combination coefficients of the common factors, and $X_i$ is a fine-grained statistical indicator.

Further, the provincial, city, and district-level regional information of the parcel comprises 155 fine-grained statistical indicators. Factor analysis is applied to the fine-grained indicators under each second-level feature, and the number of common factors to extract for each second-level feature is judged from its scree plot, which achieves dimensionality reduction and simplification of the features. The common factors extracted for each second-level feature are then used to compute the corresponding factor scores, which serve as newly generated third-level features. In this way a smaller number of third-level features represents the fine-grained indicators within each second-level feature; see the third-level features in Table 2. The cumulative variance contribution of the third-level feature factors within each second-level feature is 80% or more. The regional information of the parcel therefore finally yields 35 new third-level features that retain most of the original indicator information;

Table 2: Regional information of the industrial parcel

Factor analysis of the 20 statistical indicators under the public utilities second-level feature yields 4 optimal common factors. The specific steps are as follows:

1. The 20 fine-grained statistical indicators under the public utilities second-level feature are listed in Table 3:

Table 3: Statistical indicators under the public utilities second-level feature

2. A correlation test is run on the raw data of the 20 statistical indicators; Table 4 gives the KMO test statistic and the result of Bartlett's test of sphericity. The KMO statistic lies between 0 and 1: the closer it is to 0, the weaker the correlation among the original variables, and the closer to 1, the stronger. The usual KMO scale is: above 0.9, very suitable for factor analysis; above 0.8, fairly suitable; 0.7, average; 0.6, not very suitable; below 0.5, extremely unsuitable. The null hypothesis of Bartlett's test of sphericity is that the correlation matrix of the original variables is the identity matrix, i.e. the main diagonal elements are 1 and all other elements are 0. In the test results the KMO statistic is 0.864 and the p-value of Bartlett's test is 0.000, which shows that the raw data of the 20 statistical indicators are suitable for factor analysis;

Table 4: KMO test and Bartlett's test of sphericity (KMO and Bartlett's Test)

3. From the scree plot of the factors, the eigenvalues of the first 4 factors are generally high and fall steeply, whereas the eigenvalues after the 4th factor change gently, so extracting 4 common factors is appropriate;

4. Table 5 gives the eigenvalues and variance contributions of the 4 common factors from the factor analysis. The eigenvalues of all 4 common factors are greater than 1, and together they explain 80.480% of the variance of the original variables, retaining most of the information;

Table 5: Eigenvalues and variance contributions (Total Variance Explained; extraction method: principal component analysis)

5. The common factors are named according to the factor loading matrix, i.e. the differences in each factor's loadings on the original variables. This gives four third-level features: posts and telecommunications, transportation, water/electricity/gas, and infrastructure support. The fine-grained indicators corresponding to each third-level feature are shown in Table 6;

Table 6: Third-level features of the public utilities second-level feature

6. The factor score coefficient matrix W (Table 7) is obtained from the factor analysis and the raw data of the 20 statistical indicators. The score of each factor for each observation is computed from the factor score coefficients in Table 7 and the standardized values of the original variables. The score expressions F1, F2, F3, F4 of the four common factors in the public utilities second-level feature (posts and telecommunications, transportation, water/electricity/gas, and infrastructure support) are:

$$
\begin{aligned}
F_1 &= 0.029X_1 + 0.023X_2 + 0.061X_3 + \dots + 0.074X_{19} + 0.013X_{20} \\
F_2 &= 0.257X_1 + 0.485X_2 + 0.055X_3 + \dots + 0.008X_{19} - 0.240X_{20} \\
F_3 &= -0.026X_1 + 0.078X_2 + 0.367X_3 + \dots - 0.240X_{19} + 0.355X_{20} \\
F_4 &= 0.749X_1 + 0.145X_2 - 0.078X_3 + \dots - 0.049X_{19} + 0.161X_{20}
\end{aligned}
$$

The scores of the four common factors thus serve as four third-level features and enter the subsequent model as newly generated features;

Table 7: Factor score coefficient matrix W (Component Score Coefficient Matrix; extraction method: principal component analysis; component scores)

Further, the statistical indicators of the population, national economy, industrial development, talent market, urban construction, and science, education, culture and health classes are computed in the same way as those of the public utilities class, giving the following statistical indicator features for the region where the parcel is located:

Population information: 4 features; national economy: 7 features; industrial development: 6 features; talent market: 6 features; urban construction: 4 features; public utilities: 4 features; science, education, culture and health: 4 features.

Further, the model in Step 3-2 is established as follows:

(1) Based on the factor-analysis feature engineering, the features entering the model fall into two major classes: 18 parcel-specific features and 35 regional statistical indicator features, for a total of 53 feature variables;

(2) 80% of the industrial parcel samples form the training set, used to train the model; 20% form the test set, used to evaluate the training result; hyperparameters are tuned by 5-fold cross-validation to select the optimal model;

(3) During model training, the land unit price of the industrial parcel is obtained by conversion using the improved cost restoration method;

(4) The objective function is the root-mean-square error (RMSE), calculated as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(y_i - \hat{y}_i\right)^2}$$

where $N$ is the sample size of the training set, $y_i$ is the true land unit price of the input industrial parcel, and $\hat{y}_i$ is the model's predicted land unit price for that parcel.
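The root-mean-square error described above can be sketched in a few lines of plain Python. The function name `rmse` and the sample prices are illustrative assumptions, not values from the invention.

```python
import math


def rmse(y_true, y_pred):
    """Root-mean-square error between true and predicted land unit prices."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


# Hypothetical unit prices (yuan/m^2), for illustration only.
actual = [500.0, 620.0, 710.0]
predicted = [510.0, 600.0, 700.0]
print(round(rmse(actual, predicted), 3))
```

Because the errors are squared before averaging, a single badly mispriced parcel raises the RMSE more than several small misses of the same total size.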

Further, the analysis of the land unit price with the XGBoost algorithm in Step 3-3 is implemented as follows:

Important features are identified with the XGBoost model. The main parameters are tuned by 5-fold cross-validation to select the optimal model, with the following settings:

(1) Learning rate learning_rate: 0.1;

(2) Maximum tree depth max_depth: 4;

(3) Number of boosting rounds n_estimators: 1000;

(4) Post-pruning control parameter gamma: 0.1;

(5) L2 regularization parameter reg_lambda: 0.6;

(6) Fraction of the whole sample set used as the training subsample, subsample: 0.8;

(7) Column sampling ratio colsample_bytree: 0.1;
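As a minimal sketch, the first-round settings listed above map directly onto the scikit-learn style interface of the `xgboost` package. The helper name `build_regressor` and the choice of the squared-error objective are assumptions made here for illustration; the source states only the parameter values and that RMSE is the objective.

```python
# First-round hyperparameters exactly as listed in the text.
ROUND1_PARAMS = {
    "learning_rate": 0.1,
    "max_depth": 4,
    "n_estimators": 1000,
    "gamma": 0.1,
    "reg_lambda": 0.6,
    "subsample": 0.8,
    "colsample_bytree": 0.1,
}


def build_regressor(params):
    """Create an XGBoost regressor from a parameter dictionary.

    xgboost is imported lazily so the dictionary can be inspected
    even where the library is not installed.
    """
    from xgboost import XGBRegressor

    # Assumption: squared-error objective, chosen to match the RMSE criterion.
    return XGBRegressor(objective="reg:squarederror", **params)


# Typical use (X_train would hold the 53 feature columns):
#   model = build_regressor(ROUND1_PARAMS)
#   model.fit(X_train, y_train)
```

Keeping the parameters in a dictionary makes it easy to reuse the same helper for a later round with different values.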

According to the feature importance ranking, and taking the practical meaning of the features into account, the top 43 features with importance greater than 0.012 are selected as the features for the next round of modeling;

The 43 important features after screening are listed in Table 8. Features with names like 'C5_f_3' are new features generated by factor-analysis dimensionality reduction, i.e. they are among the 35 regional features, where Ci denotes the major class in the statistical indicators and f_j denotes the j-th common factor;

Table 8: The 43 important features after screening
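The threshold part of this screening step can be sketched as follows. The function name `select_features` and the importance scores are hypothetical; the manual check against the practical meaning of each feature, which the text also requires, is not represented in code.

```python
def select_features(importances, threshold=0.012):
    """Return feature names whose importance exceeds the threshold,
    ordered from most to least important."""
    ranked = sorted(importances.items(), key=lambda kv: kv[1], reverse=True)
    return [name for name, score in ranked if score > threshold]


# Hypothetical importance scores; real values would come from the trained
# first-round model, e.g. dict(zip(feature_names, model.feature_importances_)).
scores = {"subway": 0.051, "C5_f_3": 0.034, "hotel": 0.009, "C2_f_1": 0.012}
selected = select_features(scores)
print(selected)
```

The comparison is strict, so a feature whose importance equals the threshold exactly is dropped, matching the "greater than 0.012" wording.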

Using the 43 important features identified by the previous round, a second round of modeling is carried out with the XGBoost algorithm, with the following main parameter settings:

(1) Learning rate learning_rate: 0.1;

(2) Maximum tree depth max_depth: 4;

(3) Number of boosting rounds n_estimators: 1000;

(4) Post-pruning control parameter gamma: 0.1;

(5) L2 regularization parameter reg_lambda: 0.8;

(7) Column sampling ratio colsample_bytree: 0.2;

Using the features obtained after dimensionality reduction by factor analysis and screening by XGBoost, the XGBoost model gives the final test-set predictions shown in Table 9. The model using the 43 main features is clearly better than the other models, showing a definite improvement;

Table 9: Model performance

The beneficial effects of the invention are as follows:

The model of the invention goes beyond the traditional land valuation models that originated in the West. It combines current big data (large volumes of historical transaction data, the development of the parcel's area, the supporting facilities around the parcel, and other macro and micro influencing factors) with an intelligent valuation algorithm to assess land prices, and achieves good results.

The invention can obtain land price influencing factors and market information quickly, accurately, and comprehensively within a short time.

Brief description of the drawings

Fig. 1 is the factor scree plot of the invention;

Fig. 2 is the model building process of the invention;

Fig. 3 is the feature importance ranking of the invention;

Fig. 4 is the model fitting plot of the invention.

Detailed description of the embodiments

The invention is further explained below with reference to the drawings.

As shown in Figs. 1-4, the industrial land parcel valuation method based on big data and an intelligent decision mechanism comprises the following specific steps:

Step 1: identify the factors that influence land price;

Step 2: collect, convert, quantify, and store the relevant data;

Step 3: build the model, iterate, and compute;

Step 1: Identifying the factors that influence land price specifically includes the following:

Unlike the classical land valuation algorithms above, the invention holds that land valuation must take domestic conditions into account. Because actual conditions differ between regions, land valuation is also affected by many complex factors; it is the combined result of those factors rather than something pieced together from one or a few simple algorithms.

Land price can be assessed accurately only if the various factors that influence it are fully considered. Based on actual conditions, the invention divides these factors into five major categories: regional factors, traffic factors, surrounding supporting-facility factors, policy factors, and historical transaction factors.

(1) Regional factors

The development status and trends of the economy, industry, employment, and urban construction in the region where the parcel is located directly affect land price, and have an even greater effect on the parcel's medium- and long-term price trend.

(2) Surrounding traffic factors

Industrial land is mainly used for manufacturing of all kinds, so there is generally output of physical products in various forms and also input of raw materials. The traffic convenience of the parcel's location (for example, the number of expressway entrances and the number of provincial and national highways within a given range) therefore has a considerable effect on land price.

(3) Surrounding supporting-facility factors

Surrounding supporting-facility factors are the general, universal, and common factors that influence land price. They have a basic effect on the overall level of land prices and include public transport, cultural and sports facilities, commerce, hospitals, schools, catering, hotels, government agencies, and so on.

(4) Policy factors

Government policies, planning restrictions, the land registration system, and the benchmark land price reflect how governments at all levels exercise overall macro-level control of an industry or precise regulatory control of a specific area, and they directly affect the price of land for particular industries.

(5) Historical transaction factors

Transactions of land around a given area, including transaction time, transaction area, premium rate, and transaction speed, directly reflect the intensity of demand for land, the liquidity of land, and the price trend.

Step 2: The collection, conversion, quantization, and storage of the relevant data are implemented as follows:

Once the influencing factors and their data types have been determined in Step 1, data are collected from the various independent data sources (such as the major government open-data websites, land transaction websites, and GIS information), and the collected data are converted, quantified, and stored;

Conversion means manually extracting and organizing the data on surrounding traffic factors and surrounding supporting-facility factors, for example the number of, and distance to, public transport, cultural and sports facilities, commerce, hospitals, schools, catering, hotels, and government agencies within 1 kilometer;

Quantization means assigning grades to the policy-factor data;
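The 1-kilometer count-and-distance conversion described above can be sketched with a great-circle distance calculation. This is an illustrative assumption about how such features could be derived from coordinates; the source describes the step as a manual operation, and the names `haversine_km` and `poi_features` as well as all coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def poi_features(parcel, pois, radius_km=1.0):
    """For each category, count the points of interest within radius_km of
    the parcel and record the distance to the nearest one inside that radius."""
    features = {}
    for category, lat, lon in pois:
        d = haversine_km(parcel[0], parcel[1], lat, lon)
        if d <= radius_km:
            count, nearest = features.get(category, (0, float("inf")))
            features[category] = (count + 1, min(nearest, d))
    return features


# Hypothetical coordinates, for illustration only.
parcel = (30.2741, 120.1551)
pois = [
    ("hospital", 30.2760, 120.1560),
    ("school", 30.2790, 120.1600),
    ("hospital", 30.3500, 120.3000),  # well outside 1 km, so excluded
]
print(poi_features(parcel, pois))
```

Each category yields a (count, nearest distance) pair, which corresponds to the "number and distance" wording of the conversion step.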

Step 3: The model building, iteration, and computation are implemented as follows:

3-1. Establish the feature engineering

(1) Parcel-specific information

Parcel-specific information can be obtained from the geographic location and the latitude and longitude of the industrial parcel. It covers four blocks of factors (surrounding supporting facilities, surrounding traffic, policy, and historical transactions) and amounts to 18 features; part of the content is shown in Table 1 below.

Table 1: Parcel-specific information of the industrial parcel

(2) Statistical indicators of the region where the parcel is located

Public data and statistical yearbooks are organized through ETL and, according to the administrative division to which the parcel belongs, aggregated at district, city, and provincial level into statistical indicators in seven major classes: population, national economy, industrial development, talent market, urban construction, public utilities, and science, education, culture and health; see the first-level and second-level features in Table 2 below. However, the concrete indicators under these seven classes are numerous and the feature count is large, for example fine-grained indicators such as the average wage of on-the-job employees, the number of industrial enterprises above designated size, road freight volume, and industrial wastewater discharge. Some of these fine-grained indicators may be strongly correlated, so their information overlaps to some extent.

Table 2: Regional information of the industrial parcel

To overcome the correlation and overlap among these concrete indicators, factor analysis (FA) is applied to the fine-grained indicators within each second-level feature to reduce dimensionality and simplify the data, so that many original features are represented by fewer features that still reflect most of the original information. Factor analysis studies the internal dependence among many variables and represents their basic data structure with a few "abstract" variables. These abstract variables are called "factors" and reflect the main information of the original variables. The common factors in factor analysis cannot be observed directly, but they are jointly influencing factors that objectively exist. Each variable can be expressed as the sum of a linear function of the common factors and a specific factor, i.e.

$$X_i = a_{i1}F_1 + a_{i2}F_2 + \dots + a_{im}F_m + \varepsilon_i, \quad i = 1, 2, \dots, p$$

where $F_1, F_2, \dots, F_m$ are the common factors, $\varepsilon_i$ is the specific factor of $X_i$, $a_{i1}, a_{i2}, \dots, a_{im}$ are the linear combination coefficients of the common factors, and $X_i$ is a fine-grained statistical indicator;

The provincial, city, and district-level regional information of the parcel comprises 155 fine-grained statistical indicators. This project applies factor analysis to the fine-grained indicators under each second-level feature and judges the number of common factors to extract from the corresponding scree plot, which achieves dimensionality reduction and simplification of the features. The common factors extracted for each second-level feature are then used to compute the corresponding factor scores, which serve as newly generated third-level features. Finally, a smaller number of third-level features represents the fine-grained indicators within each second-level feature; see the third-level features in Table 2. The cumulative variance contribution of the third-level feature factors within each second-level feature is 80% or more. The regional information of the parcel therefore finally yields 35 new third-level features that retain most of the original indicator information.

Factor analysis is illustrated here with the 20 statistical indicators under the public utilities second-level feature, which yield 4 optimal common factors; the factor analysis for the indicators under the remaining second-level features is similar. The specific steps are as follows:

1. The 20 fine-grained statistical indicators under the public utilities second-level feature are as follows:

A correlation test is run on the raw data of the 20 statistical indicators. Table 4 gives the KMO test statistic and the result of Bartlett's test of sphericity. The KMO statistic lies between 0 and 1: the closer it is to 0, the weaker the correlation among the original variables, and the closer to 1, the stronger. The generally accepted KMO scale is: above 0.9, very suitable for factor analysis; above 0.8, fairly suitable; 0.7, average; 0.6, not very suitable; below 0.5, extremely unsuitable. The null hypothesis of Bartlett's test of sphericity is that the correlation matrix of the original variables is the identity matrix, i.e. the main diagonal elements are 1 and all other elements are 0. In this project the KMO statistic is 0.864 and the p-value of Bartlett's test is 0.000, which shows that the data are fairly suitable for factor analysis.

Table 4: KMO test and Bartlett's test of sphericity (KMO and Bartlett's Test)
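For reference, the two statistics are conventionally defined as below. These are the standard textbook forms, given here as an assumption about how the reported values were computed; the source states only the results. Here $r_{ij}$ is the simple correlation between indicators $i$ and $j$, $u_{ij}$ is their partial correlation given all other indicators, $R$ is the $p \times p$ correlation matrix, and $n$ is the number of observations (not stated in the source).

```latex
\mathrm{KMO} = \frac{\sum_{i \neq j} r_{ij}^{2}}
                    {\sum_{i \neq j} r_{ij}^{2} + \sum_{i \neq j} u_{ij}^{2}}
\qquad
\chi^{2} = -\left(n - 1 - \frac{2p + 5}{6}\right)\ln\lvert R \rvert,
\quad \mathrm{df} = \frac{p(p-1)}{2}
```

With $p = 20$ indicators, Bartlett's statistic has $20 \times 19 / 2 = 190$ degrees of freedom. A KMO near 1 means the partial correlations are small relative to the simple correlations, which is the situation in which common factors can account for the shared variance.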

2. Fig. 1 shows the scree plot of the factors. The horizontal axis is the factor number and the vertical axis is the corresponding eigenvalue. Fig. 1 shows that the eigenvalues of the first 4 factors are generally high and fall steeply, whereas the eigenvalues after the 4th factor change gently, so extracting 4 common factors is appropriate.

Table 5 gives the eigenvalues and variance contributions of the 4 common factors from the factor analysis. The eigenvalues of all 4 common factors are greater than 1, and together they explain 80.480% of the variance of the original variables, retaining most of the information.

Table 5: Eigenvalues and variance contributions (Total Variance Explained; extraction method: principal component analysis)
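The two checks used here (eigenvalues greater than 1, and cumulative variance contribution of at least 80%) can be sketched as follows. The function name `summarize_factors` and the eigenvalues are hypothetical; the actual eigenvalues are in Table 5, which is not reproduced in this text, and the visual elbow judgement from the scree plot is not automated.

```python
def summarize_factors(eigenvalues, target_share=0.80):
    """Count factors with eigenvalue > 1 and report the share of total
    variance they explain, plus whether that share meets the target."""
    ordered = sorted(eigenvalues, reverse=True)
    total = sum(ordered)
    k = sum(1 for ev in ordered if ev > 1.0)  # eigenvalue-greater-than-1 rule
    share = sum(ordered[:k]) / total
    return k, share, share >= target_share


# Hypothetical eigenvalues for 20 standardized indicators (they sum to 20).
eigenvalues = [8.0, 4.0, 2.5, 1.6] + [0.4] * 6 + [0.15] * 10
k, share, meets_target = summarize_factors(eigenvalues)
print(k, round(share, 3), meets_target)
```

For standardized indicators the eigenvalues of the correlation matrix sum to the number of indicators, so each eigenvalue divided by 20 is that factor's variance contribution.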

3. according to Factor load-matrix, load difference of each factor on different original variables so as to it is public because Son is named, to obtain four three-level features：Post and telecommunications, communications and transportation, water power coal gas, infrastructure guarantee, each The corresponding specific fine granularity Index Content of three-level feature is as shown in table 6 below.

4. according to factorial analysis FA and the initial data of 20 statistical indicators, acquisition table 7 gives factor score coefficient Matrix W can calculate each factor of each observation according to the standardized value of factor score coefficient and original variable in table 7 Score.Four common factor post and telecommunications, communications and transportation, water power coal gas, infrastructure guarantee in public utility secondary characteristics Score expression formula F1, F2, F3, F4 can be written as follow respectively：

F1=0.029X1+0.023X2+0.061X3+ ...+0.074X19+0.013X20

F2=0.257X1+0.485X2+0.055X3+ ...+0.008X19-0.240X20

F3=-0.026X1+0.078X2+0.367X3+ ... -0.240X19+0.355X20

F4=0.749X1+0.145X2-0.078X3+ ... -0.049X19+0.161X20

The scores of the four common factors obtained in this way, i.e. the four third-level features, can then enter the subsequent model as newly generated features.

Component Score Coefficient Matrix

Extraction Method:Principal Component Analysis.

Component Scores.

Table 7 Factor score coefficient matrix W
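The score expressions F1 to F4 above amount to multiplying the standardized indicator values by the score coefficient matrix W. A minimal sketch follows, assuming NumPy; `factor_scores`, the small `raw` matrix and the `weights` values are hypothetical stand-ins and are not the coefficients of Table 7.

```python
import numpy as np

def factor_scores(raw, weights):
    """Standardize each indicator column, then apply the score coefficients."""
    raw = np.asarray(raw, dtype=float)
    z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)  # standardized values
    return z @ np.asarray(weights, dtype=float)             # one column per factor

# Hypothetical example: 5 observations, 3 indicators, 2 factors.
raw = [[1, 10, 100],
       [2, 8, 120],
       [3, 12, 90],
       [4, 9, 150],
       [5, 11, 130]]
weights = [[0.3, -0.1],
           [0.2, 0.5],
           [0.4, 0.1]]
scores = factor_scores(raw, weights)
```

Each column of `scores` plays the role of one newly generated third-level feature; with the real data, `raw` would have 20 columns and `weights` would be the 20 x 4 matrix W.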

The statistical indicator information of the other major categories (population information, national economy, industrial development, talent market, urban construction, and science, education, culture and health) is calculated in the same way as the public utilities statistical indicators, giving the following regional information for industrial plots:

Population information, 3 features; national economy, 5 features; industrial development, 5 features; talent market, 6 features; urban construction, 4 features; public utilities, 4 features; science, education, culture and health, 4 features.

3-2. Establishing the model

As shown in Fig. 2, model training in this project proceeds as follows:

(1) On the basis of the feature engineering by factor analysis, the features entering the model fall into two major classes: 18 features of plot-specific information and 35 features of regional information, 53 feature variables in total.

(2) 80% of the industrial plot samples are used as the training set to train the model; the remaining 20% are used as the test set to evaluate the training result. Parameters are tuned by 5-fold cross-validation and the optimal model is selected.
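The 80/20 split and the 5-fold cross-validation over the training portion can be sketched with index lists alone. This is an illustrative sketch; the helper names, the fixed seed and the sample count of 1000 are assumptions, not values from the project.

```python
import random

def train_test_split_indices(n, test_ratio=0.2, seed=42):
    """Shuffle sample indices and split them into (train, test)."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    n_test = int(round(n * test_ratio))
    return idx[n_test:], idx[:n_test]

def k_folds(indices, k=5):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        valid = folds[i]
        train = [j for m, fold in enumerate(folds) if m != i for j in fold]
        yield train, valid

# Hypothetical sample count of 1000 industrial plots.
train_idx, test_idx = train_test_split_indices(1000)
splits = list(k_folds(train_idx, k=5))
```

Each candidate parameter setting would be fitted on `train` and scored on `valid` for all five splits; the test indices stay untouched until the final evaluation.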

(3) In the model training process, the land unit price of each industrial plot is obtained by conversion using an improved cost-restoration method. The building unit price on industrial land is usually 900 yuan/m², but because economic development differs between regions, the building unit price may float within a certain proportional range according to the regional GDP per capita of the city where the industrial plot is located, from which the land unit price of each industrial plot can be converted.

3-3. Analyzing the land unit price with the XGBoost algorithm

XGBoost stands for eXtreme Gradient Boosting; it is an improvement of the boosting algorithm built on the GBDT algorithm, and an efficient implementation of Gradient Boosting. Traditional GBDT uses CART as its base learner (GBDT here refers specifically to the gradient boosting decision tree algorithm), whereas XGBoost also supports a linear base learner (GBLinear), in which case XGBoost is equivalent to logistic regression (for classification problems) or linear regression (for regression problems) with L₁ and L₂ regularization terms.

The task of this project is to predict the land unit price of industrial plots, which is a regression problem in machine learning. The target variable, land unit price, approximately follows a log-normal distribution (Log-Normal Distribution), and the regression prediction can be carried out with the XGBoost algorithm.
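A log-normal target becomes approximately normal after taking logarithms. The sketch below only illustrates that property on simulated prices; whether the project actually fits the model on log prices is not stated in the text, so the transform is shown as an assumption. All names and the distribution parameters are hypothetical.

```python
import math
import random

def skewness(values):
    """Sample skewness (third standardized moment)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

def to_log(prices):
    """Map prices to log space (possible regression target)."""
    return [math.log(p) for p in prices]

def from_log(log_prices):
    """Map log-space predictions back to prices."""
    return [math.exp(v) for v in log_prices]

# Simulated unit prices (hypothetical parameters, not project data).
rng = random.Random(0)
prices = [rng.lognormvariate(6.5, 0.6) for _ in range(2000)]
log_prices = to_log(prices)
skew_raw = skewness(prices)
skew_log = skewness(log_prices)
restored = from_log(log_prices)
```

On such data `skew_raw` is clearly positive while `skew_log` is close to zero, which is why a log transform is a common companion to a squared-error regression objective.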

An XGBoost model is first used to screen out the important features. Its main parameters are tuned by 5-fold cross-validation to select the optimal model; the specific parameter settings are as follows:

(1) learning_rate (learning rate): 0.1,

(2) max_depth (maximum tree depth): 4,

(3) n_estimators (number of boosting rounds): 1000,

(4) gamma (parameter controlling whether post-pruning is applied): 0.1,

(5) reg_lambda (L2 regularization term of the model): 0.6,

(6) subsample (proportion of the whole sample set drawn as the subsample for training): 0.8,

(7) colsample_bytree (column sampling ratio): 0.1
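The first-round settings listed above map directly onto keyword arguments of the `XGBRegressor` class in the xgboost Python package. The sketch below is an assumption about how such a model could be configured; the text does not say which interface or objective was used, and `reg:squarederror` is chosen here only as a plausible default for regression.

```python
# First-round parameter values as listed in the text.
ROUND1_PARAMS = {
    "learning_rate": 0.1,
    "max_depth": 4,
    "n_estimators": 1000,
    "gamma": 0.1,
    "reg_lambda": 0.6,
    "subsample": 0.8,
    "colsample_bytree": 0.1,
}

def build_regressor(params=None):
    """Create an XGBoost regressor from a parameter dictionary."""
    # Imported lazily so the dictionary can be inspected without xgboost installed.
    from xgboost import XGBRegressor
    settings = dict(ROUND1_PARAMS if params is None else params)
    # The objective is an assumption; the source does not name one.
    return XGBRegressor(objective="reg:squarederror", **settings)
```

Calling `build_regressor().fit(X_train, y_train)` on the 53 feature columns would give the first-round model whose feature importances are ranked next.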

According to the feature importance ranking in Fig. 3, and taking the practical meaning of the features into account, the top 43 features, whose importance is greater than 0.012, are selected as the features entering the next round of modeling.

Table 8 The 43 important features after screening
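The threshold part of this screening step can be sketched as below. The importance values and most feature names are hypothetical; only the 0.012 cut-off and the `C5_f_3` naming pattern come from the text, and the additional judgement on practical meaning is not modeled.

```python
def select_features(importances, threshold=0.012):
    """Return feature names with importance above the threshold, most important first."""
    ranked = sorted(importances.items(), key=lambda item: item[1], reverse=True)
    return [name for name, score in ranked if score > threshold]

# Hypothetical importance scores for four features.
demo_importances = {
    "hotel": 0.005,
    "C5_f_3": 0.040,
    "benchmark_land_price": 0.150,
    "subway": 0.012,
}
selected = select_features(demo_importances)
```

A feature sitting exactly at 0.012 is excluded here because the text specifies importance greater than 0.012.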

Using the 43 important features screened out in the previous round, a second round of modeling is carried out with the XGBoost algorithm; the main parameters are set as follows:

(1) learning_rate (learning rate): 0.1;

(2) max_depth (maximum tree depth): 4;

(3) n_estimators (number of boosting rounds): 1000;

(4) gamma (parameter controlling whether post-pruning is applied): 0.1;

(5) reg_lambda (L2 regularization term of the model): 0.8;

(6) subsample (proportion of the whole sample set drawn as the subsample for training): 0.8;

(7) colsample_bytree (column sampling ratio): 0.2.

Using the features obtained after dimensionality reduction by factor analysis and screening by the XGBoost algorithm, an XGBoost model is fitted; the final test-set prediction results are shown in Table 9. The model using the 43 main features is clearly better than the other models, so the screening gives the model a measurable improvement. For the sake of visual clarity, Fig. 4 shows the model fit for 200 random samples.

Table 9 Model performance

Summary: Going beyond the traditional land price appraisal models that originated in the West, this algorithmic model combines the latest big data (large volumes of historical transaction data, the development of the plot's location, the various supporting facilities around the plot, and other macro- and micro-level factors) with an intelligent valuation algorithm to assess the price of land, and achieves good results.

Claims

1. An industrial plot estimation method based on big data and an intelligent decision mechanism, characterized by comprising the following steps:

Step 1: identifying the factors that influence land price;

Step 2: acquiring, converting, quantifying and storing the factor data;

Step 3: building a model and performing iteration and computation;

In the identification of land-price factors described in step 1, the influencing factors comprise five major categories: regional factors, traffic factors, surrounding supporting-facility factors, policy factors, and historical transaction factors;

The regional factors refer to the development status and trends of the economy, industry, employment, urban construction and the like in the region where the plot is located; the surrounding traffic factors refer to the traffic convenience of the land's location; the surrounding supporting-facility factors refer to the general, universal and common factors that influence land price, including public transport, cultural and sports facilities, commerce, hospitals, schools, restaurants, hotels and government agencies; the policy factors refer to government policies, planning restrictions, the land registration system and the benchmark land price; the historical transaction factors refer to the transactions of land around the specific region, including transaction time, transaction area, premium rate and transaction speed;

The factor data are acquired from the individual independent data sources, and the collected data are converted, quantified and stored;

The conversion refers to extracting and organizing, through manual operation, the data concerning the surrounding traffic factors and the surrounding supporting-facility factors; the quantification refers to grading, through operation, the data in the policy factors;

3-1. Establishing the feature engineering;

3-2. Establishing the model;

3-3. Analyzing the land unit price with the XGBoost algorithm;

The features that influence industrial land price are broadly divided into two major classes: first, the plot-specific information of the industrial plot; second, the statistical indicator information of the administrative region to which the industrial plot belongs, specifically as follows:

(1) Plot-specific information of the industrial plot

From the geographical location and the longitude and latitude of the industrial plot, the corresponding plot-specific information is obtained, covering four blocks of factors (surrounding supporting facilities, surrounding traffic, policy influence and historical transactions), 18 features in total, including government agencies, public transport, hospitals, schools, restaurants, shopping malls, hotels, expressway entrances, subway, railway stations, land registration system, benchmark land price, policy planning, number of transaction months, and transaction time;

(2) Statistical indicator information of the region where the industrial plot is located

By carrying out ETL processing on public data and statistical yearbooks, statistical indicator information in seven major categories (population information, national economy, industrial development, talent market, urban construction, public utilities, and science, education, culture and health) is aggregated at district, city and provincial level according to the administrative division to which the industrial plot belongs; however, the concrete statistical indicators under these seven categories are numerous and the number of features is large; to overcome the correlation and overlap among these concrete statistical indicators, factor analysis (FA) is applied to the multiple fine-grained statistical indicators in each second-level feature to reduce dimensionality and simplify the data, so that many original features are represented by fewer features; a common factor in factor analysis is not directly observable but is an objectively existing common influence, and each variable can be expressed as the sum of a linear function of the common factors and a specific factor, i.e.

$$X_i = a_{i1}F_1 + a_{i2}F_2 + \dots + a_{im}F_m + \varepsilon_i, \quad i = 1, 2, \dots, p$$

In the formula, $F_1, F_2, \dots, F_m$ are the common factors; $\varepsilon_i$ is the specific factor of $X_i$; $a_{i1}, a_{i2}, \dots, a_{im}$ are the linear combination coefficients of the common factors; and $X_i$ is a fine-grained statistical indicator.

2. The industrial plot estimation method based on big data and an intelligent decision mechanism according to claim 1, characterized in that the regional information at provincial, city and district level for an industrial plot comprises 155 fine-grained statistical indicators in total; factor analysis is performed separately on the fine-grained statistical indicators belonging to each second-level feature, and the number of common factors to extract for each second-level feature is judged from its scree plot, thereby achieving dimensionality reduction and simplification of the features; the factor scores of the common factors extracted in each second-level feature are then calculated and used as newly generated third-level features; finally, a small number of third-level features represent the series of fine-grained statistical indicator information in each second-level feature; the details are given by the third-level features in Table 2, and the cumulative variance contribution rate of all third-level feature factors in each second-level feature is above 80%; therefore, 35 third-level features are finally generated from the regional information of the industrial plot, and these contain most of the original statistical indicator information;

Table 2 Regional information of industrial plots

3. The industrial plot estimation method based on big data and an intelligent decision mechanism according to claim 2, characterized in that factor analysis is performed on the 20 statistical indicators under the public utilities second-level feature to obtain 4 optimal common factors, with the following specific steps:

Table 4 KMO test and Bartlett's test of sphericity

KMO and Bartlett's Test

3. From the scree plot of the factors: the eigenvalues of the first 4 factors are generally high and fall off steeply, whereas the eigenvalues after the 4th factor change only gently, so extracting 4 common factors is appropriate;

4. The eigenvalues and variance contributions of the 4 common factors of the factor analysis are examined with reference to Table 5; the eigenvalues of the 4 common factors are all greater than 1 and already explain 80.480% of the variance of the original variables, containing most of the information;

Table 5 Eigenvalues and variance contributions

Total Variance Explained

Extraction Method:Principal Component Analysis.

5. According to the factor loading matrix, the common factors are named on the basis of how the loadings of each factor differ across the original variables, which yields four third-level features: post and telecommunications, transportation, water/electricity/gas, and infrastructure support; the specific fine-grained indicators corresponding to each third-level feature are shown in Table 6 below;

6. From factor analysis (FA) and the raw data of the 20 statistical indicators, the factor score coefficient matrix W is obtained; the score of each observation on each factor is calculated from the factor score coefficients in Table 7 and the standardized values of the original variables; the score expressions F1, F2, F3, F4 of the four common factors in the public utilities second-level feature (post and telecommunications, transportation, water/electricity/gas, infrastructure support) are written respectively as follows:

F1=0.029X1+0.023X2+0.061X3+ ...+0.074X19+0.013X20

F2=0.257X1+0.485X2+0.055X3+ ...+0.008X19-0.240X20

F3=-0.026X1+0.078X2+0.367X3+ ... -0.240X19+0.355X20

F4=0.749X1+0.145X2-0.078X3+ ... -0.049X19+0.161X20

The scores of the four common factors obtained in this way, i.e. the four third-level features, can then enter the subsequent model as newly generated features;

Table 7 Factor score coefficient matrix W

Component Score Coefficient Matrix

Extraction Method:Principal Component Analysis.

Component Scores.

4. The industrial plot estimation method based on big data and an intelligent decision mechanism according to claim 3, characterized in that the statistical indicator information of the major categories of population information, national economy, industrial development, talent market, urban construction, and science, education, culture and health is calculated in the same way as the public utilities statistical indicator information, giving the following statistical indicator information for the region where the industrial plot is located:

Population information, 4 features; national economy, 7 features; industrial development, 6 features; talent market, 6 features; urban construction, 4 features; public utilities, 4 features; science, education, culture and health, 4 features.

5. The industrial plot estimation method based on big data and an intelligent decision mechanism according to claim 4, characterized in that establishing the model as described in step 3-2 is specifically as follows:

(1) On the basis of the feature engineering by factor analysis, the features entering the model fall into two major classes: 18 features of plot-specific information and 35 features of statistical indicator information of the region where the industrial plot is located, 53 feature variables in total;

(2) 80% of the industrial plot samples are used as the training set to train the model; 20% are used as the test set to evaluate the training result; parameters are tuned by 5-fold cross-validation and the optimal model is selected;

(3) In the model training process, the land unit price of each industrial plot is obtained by conversion using an improved cost-restoration method;

In the formula, N denotes the sample size of the training set and y_i denotes the true land unit price of the i-th input industrial plot; the predicted value is the land price output by the model for that plot.

6. The industrial plot estimation method based on big data and an intelligent decision mechanism according to claim 5, characterized in that analyzing the land unit price with the XGBoost algorithm as described in step 3-3 is implemented as follows:

Important features are screened out by an XGBoost model, whose main parameters are tuned by 5-fold cross-validation to select the optimal model; the specific parameter settings are as follows:

(1) learning rate learning_rate: 0.1;

(2) maximum tree depth max_depth: 4;

(3) number of boosting rounds n_estimators: 1000;

(4) parameter gamma, controlling whether post-pruning is applied: 0.1;

(5) L2 regularization term parameter reg_lambda of the model: 0.6;

(7) column sampling ratio colsample_bytree: 0.1;

According to the feature importance ranking, and taking the practical meaning of the features into account, the top 43 features, whose importance is greater than 0.012, are selected as the features entering the next round of modeling;

The 43 important features after screening are listed in Table 8, where a feature with a name of the form 'C5_f_3' is a feature newly generated by the dimensionality reduction of factor analysis, i.e. one of the 35 features; Ci refers to the major category in the statistical indicator information and f_j refers to the j-th common factor;

Table 8 The 43 important features after screening

Using the 43 important features screened out by the previous round of modeling, a second round of modeling is carried out with the XGBoost algorithm; the main parameters are set as follows:

(1) learning rate learning_rate: 0.1;

(2) maximum tree depth max_depth: 4;

(3) number of boosting rounds n_estimators: 1000;

(4) parameter gamma, controlling whether post-pruning is applied: 0.1;

(5) L2 regularization term parameter reg_lambda of the model: 0.8;

(7) column sampling ratio colsample_bytree: 0.2;

Using the features obtained after dimensionality reduction by factor analysis and screening by the XGBoost algorithm, an XGBoost model is fitted; the final test-set prediction results are shown in Table 9, from which it can be seen that the model using the 43 main features is clearly better than the other models, so the model is improved to a certain extent;

Table 9 Model performance