CN113240185A

CN113240185A - County carbon emission prediction method based on random forest

Info

Publication number: CN113240185A
Application number: CN202110570856.2A
Authority: CN
Inventors: 狄筝; 黄少远; 王晓飞; 张恒; 罗韬; 张赫; 王睿
Original assignee: Tianjin University
Current assignee: Tianjin University
Priority date: 2021-05-25
Filing date: 2021-05-25
Publication date: 2021-08-10

Abstract

The invention provides a county carbon emission prediction method based on random forests. A multi-feature random forest model is trained to predict carbon emissions at the county level. The model comprehensively extracts the multi-dimensional features of a county, supports parallel training on large volumes of county data, trains quickly, and is simple to implement. In addition, once the random forest model has been trained, the importance of each feature's influence on carbon emissions can be obtained, so that carbon emission pollution can be managed effectively.

Description

County carbon emission prediction method based on random forest

Technical Field

The invention relates to the field of artificial intelligence applications, and in particular to a method for predicting county carbon emissions by training a model on feature data obtained after multi-factor optimization of county data.

Background

At present, research on carbon emission prediction in China remains limited, so excessive emissions in a given area cannot be effectively prevented and the balance of carbon emissions between areas is lost. With the development of artificial intelligence, carbon emission features can be constructed through feature engineering by analyzing the factors that influence carbon emissions, and county carbon emissions can be predicted from these features, which improves the accuracy of carbon emission prediction. Through yearbook analysis, indexes related to economic development, transport and travel, residential life and ecological greening can be used as direct carbon emission features, while indexes related to scale structure and energy efficiency are used as indirect carbon emission features. A random forest algorithm combines the direct and indirect features so that county carbon emissions are predicted effectively.

Disclosure of Invention

To solve the above problems, the present invention provides a county carbon emission prediction method based on a random forest algorithm. In the prediction model, features are extracted from county data along three dimensions, namely production, residential life and road traffic, and carbon emissions are predicted from these three dimensions. The prediction method comprises the following steps:

Step 1: screening county data to form the initial data set required for training the model, forming the initial county and town carbon emission index elements, and dividing the carbon emission index elements into three categories: production, life and traffic;

Step 2: performing data cleaning and standardized data preprocessing on the data;

Step 3: forming a training data set; in each of the production, life and traffic categories, generating training subsets and decision trees by the Bootstrap method; when splitting each node of a decision tree, randomly selecting n carbon emission influence indexes from the N attributes as the candidate subset for the current node split, where n must be less than N; and combining all decision trees after splitting to form a random forest;

Step 4: inputting the feature parameter vector of the prediction set into the trained model, where each decision tree $T_m$ yields a predicted value; summing the predictions of all decision trees and taking their arithmetic mean to obtain, for each county, the predicted carbon emissions of the life, production and traffic categories; and adding the three category predictions to obtain the final predicted carbon emission value.

Further, the index elements of the production, life and traffic categories among the carbon emission index elements are each used as the N input variables, the measured carbon emissions of the current year are used as the output variable, and the input and output variables together form the training data set D.

Further, the data cleaning comprises cleaning the initial data set by a mean substitution method, including: cleaning missing values, cleaning format content, cleaning logic errors and cleaning non-required data. The standardized data preprocessing uses min-max standardization: if the set contains t elements, the set elements $x_1, x_2, \ldots, x_t$ are transformed into a new dimensionless sequence $y_1, y_2, \ldots, y_t \in [0, 1]$, where

$$y_i = \frac{x_i - \min_{1 \le j \le t} x_j}{\max_{1 \le j \le t} x_j - \min_{1 \le j \le t} x_j}.$$

Further, generating the training subsets and decision trees by the Bootstrap method comprises randomly sampling the training samples with replacement; the sampling is repeated m times, and the m training samples obtained together form a training data subset $D_m$ of the training data set D. A decision tree $T_m$ is trained on each training data subset, which serves as the sample set at the root node of that decision tree.

Further, splitting each node of a decision tree comprises using the classification and regression tree (CART) method to select, from the split subset, the one carbon emission influence index $X_k$ that gives the best result under the least squares error criterion as the splitting attribute of the node, and continuing until the decision tree can no longer be split; no pruning is performed during splitting, and the value of n remains unchanged.

Further, the feature parameter vector of the prediction set is defined from the collected feature indexes affecting the carbon emissions of county i in year t, grouped into the production, life and traffic categories:

$$X_{it}^{\text{production}} = (X_1, X_2, \ldots, X_{n_1}), \qquad X_{it}^{\text{life}} = (X_1, X_2, \ldots, X_{n_2}), \qquad X_{it}^{\text{traffic}} = (X_1, X_2, \ldots, X_{n_3}),$$

where $n_1$, $n_2$, $n_3$ are the numbers of feature categories for production, life and traffic, respectively, and X is a feature index of each category. The carbon emission prediction task is then cast as a multiple linear regression problem:

$$C = f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n + \varepsilon,$$

where β denotes the unknown parameters, ε is the random error, and f is the optimal function to be solved by the algorithm model, i.e. the parameters $\beta_0, \beta_1, \ldots, \beta_n$. The final predicted carbon emissions are therefore

$$\hat{C}_{it} = \hat{C}_{it}^{\text{production}} + \hat{C}_{it}^{\text{life}} + \hat{C}_{it}^{\text{traffic}},$$

where $\hat{C}_{it}^{\text{production}}$ is the carbon emissions caused by the features of the production-category elements in the county, $\hat{C}_{it}^{\text{life}}$ is the carbon emissions caused by the features of the life-category elements in the county, and $\hat{C}_{it}^{\text{traffic}}$ is the carbon emissions caused by the features of the traffic-category elements in the county.

The invention trains a multi-feature random forest model to predict county carbon emissions. The model comprehensively extracts the multi-dimensional features of a county, with no need to select among the many county features beforehand. Because the decision trees in a random forest are independent of one another, training can run in parallel on large volumes of county data, so training is fast and the implementation is simple. In addition, once the random forest model has been trained, the importance of each feature's influence on carbon emissions can be obtained, which helps enterprises and governments control carbon emissions and manage carbon emission pollution effectively.

Drawings

Fig. 1 shows an algorithm procedure of a Random Forest (RF) algorithm.

Fig. 2 shows the comparison of the RF algorithm of the present invention with the LR, LASSO and SVR algorithms for predicting county life-category carbon emissions.

Fig. 3 shows the comparison of the RF algorithm of the present invention with the LR, LASSO and SVR algorithms for predicting county production-category carbon emissions.

Fig. 4 shows the comparison of the RF algorithm of the present invention with the LR, LASSO and SVR algorithms for predicting county traffic-category carbon emissions.

Detailed Description

The following examples are presented to enable those skilled in the art to more fully understand the present invention and are not intended to limit the invention in any way.

The invention is built mainly on the Random Forest (RF) algorithm. A random forest is a model that trains on and predicts samples using multiple trees. RF is an ensemble learning method based on the bagging idea: the model consists of multiple decision trees that are built independently of one another. The algorithm flow is shown in Fig. 1. First, the bootstrap method, i.e. random sampling with replacement, draws n samples from the data set as a training set; a decision tree is trained on each training set, and the procedure is repeated until m decision trees have been built. The mean of the predictions of all decision trees in the random forest is then taken as the overall prediction. Random forests offer high prediction accuracy, run efficiently on large data sets, and are not prone to overfitting. In addition, because the model is composed of multiple decision trees, it can be trained in parallel, which speeds up training; random forests are insensitive to noise in the training set, and the combined decision of many trees is more stable than that of a single decision tree.

On this basis, the invention comprehensively extracts features from county data along three dimensions (production, residential life and road traffic) and builds a multi-feature random forest model to predict county carbon emissions, which yields more accurate predictions.

Description of the problem

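First, county carbon emission prediction is treated as a regression problem, and the collected features affecting a county in year t are divided into three categories: production, life and traffic. Let $C_{it}$ be the total carbon emissions of county i in year t. According to the classification of carbon emissions,

$$C_{it} = C_{it}^{\text{production}} + C_{it}^{\text{life}} + C_{it}^{\text{traffic}},$$

where $C_{it}^{\text{production}}$, $C_{it}^{\text{life}}$ and $C_{it}^{\text{traffic}}$ are the carbon emissions of county i in year t attributed to production, life and traffic, respectively.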

The collected feature indexes affecting the carbon emissions of county i in year t are defined by category (production, life and traffic) as follows:

$$X_{it}^{\text{production}} = (X_1, X_2, \ldots, X_{n_1}), \qquad X_{it}^{\text{life}} = (X_1, X_2, \ldots, X_{n_2}), \qquad X_{it}^{\text{traffic}} = (X_1, X_2, \ldots, X_{n_3}),$$

where $n_1$, $n_2$, $n_3$ are the numbers of feature categories for production, life and traffic, respectively, and X is a feature index of each category.

Pearson correlation analysis yields the average influence coefficient of each index and establishes the linear correlation between the features and carbon emissions. The carbon emission prediction task can therefore be cast as a multiple linear regression problem:

$$C = f(X) = \beta_0 + \beta_1 X_1 + \cdots + \beta_n X_n + \varepsilon,$$

where β denotes the unknown parameters and ε is the random error. f is the optimal function to be solved by the algorithm model of the present invention, i.e. the parameters $\beta_0, \beta_1, \ldots, \beta_n$.
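The Pearson analysis mentioned above can be illustrated with a minimal sketch. The function name `pearson_r` and the toy values are assumptions made for illustration; the patent does not specify how the coefficient is computed or which values were obtained.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Unnormalized covariance and variances; the 1/n factors cancel in the ratio.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy values, not data from the yearbooks.
gdp = [1.0, 2.0, 3.0, 4.0]
emissions = [2.1, 3.9, 6.2, 7.8]
r = pearson_r(gdp, emissions)
```

A coefficient close to +1 or -1 indicates a strong linear relationship between an index and the emissions, which is the property the regression formulation relies on.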

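The final predicted carbon emissions are

$$\hat{C}_{it} = \hat{C}_{it}^{\text{production}} + \hat{C}_{it}^{\text{life}} + \hat{C}_{it}^{\text{traffic}}.$$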

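The loss function is the Mean Squared Error (MSE), defined as

$$\mathrm{MSE} = \frac{1}{m} \sum_{j=1}^{m} \left( C_j - \hat{C}_j \right)^2,$$

where m is the number of observations, $C_j$ is the j-th observed value and $\hat{C}_j$ is the corresponding predicted value.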

Solving method

Based on random forests, the invention provides a method that predicts carbon emissions by considering the carbon emission structure of production, residential life and road traffic in a county. The specific steps of the prediction method are as follows:

Step 1: data acquisition

Data for N counties are obtained from statistical yearbooks, where N = 1814. First, the set of elements for the N counties is obtained: the county data are screened to form the initial data set required for training the model, forming the initial county and town carbon emission index elements. The elements comprise: built-up area, land urbanization rate, population size, county population size, built-up area population density, GDP, per capita GDP, added value of the primary industry, added value of the secondary industry, total fixed-asset investment of the whole society, gas supply coverage rate, heating pipeline density, heat supply volume rate, residential density, road density, number of public transport vehicles per 10,000 people, number of motor vehicles per 10,000 people, medical facility allocation rate, social welfare facility allocation rate, and the proportion of road area occupied by footpaths.

Second, to improve the accuracy and robustness of the model, the set elements are classified into three categories: production, life and traffic. The production category comprises built-up area, land urbanization rate, population size, county population size, built-up area population density, GDP, per capita GDP, added value of the primary industry and added value of the secondary industry. The life category comprises built-up area, land urbanization rate, population size, county population size, GDP, per capita GDP, added value of the primary industry, added value of the secondary industry, gas supply coverage rate, heating pipeline density, heat supply volume rate and residential density. The traffic category comprises built-up area, population size, county population size, GDP, per capita GDP, added value of the primary industry, added value of the secondary industry, road density, number of public transport vehicles per 10,000 people, number of motor vehicles per 10,000 people, number of parks per 10,000 people, medical facility allocation rate, social welfare facility allocation rate, and the proportion of road area occupied by footpaths.

Step 2: data pre-processing

The data are divided into the production, life and traffic categories. The data files are in HTML and Excel formats and contain 21 fields and 7612 records in total, covering 1905 counties over the period 2010 to 2018. The raw data set includes built-up area, land urbanization rate, population size, county population size, and so on. Because the yearbook data suffer from missing values, format errors and similar problems, data cleaning is required. The invention cleans the data set using the mean substitution method, including: cleaning missing values, cleaning format content, cleaning logic errors, and cleaning non-required data.
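Mean substitution for missing values can be sketched as follows. The patent does not state how the mean is scoped (for example, per field over all counties, or per county over years); this sketch assumes a per-field mean over all records, and the function name, field names and values are hypothetical.

```python
def fill_missing_with_mean(records, field):
    """Return copies of `records` where a missing (None) value in `field`
    is replaced by the mean of the observed values of that field."""
    observed = [r[field] for r in records if r[field] is not None]
    if not observed:
        raise ValueError(f"field {field!r} has no observed values")
    mean = sum(observed) / len(observed)
    return [
        {**r, field: mean if r[field] is None else r[field]}
        for r in records
    ]

# Toy records with made-up numbers; None marks a missing yearbook entry.
records = [
    {"county": "A", "gdp": 10.0},
    {"county": "B", "gdp": None},
    {"county": "C", "gdp": 30.0},
]
cleaned = fill_missing_with_mean(records, "gdp")
```

The input records are left unchanged, so the raw and cleaned data can be compared after cleaning.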

After cleaning, the data must be standardized, because the magnitudes of the fields span a wide range and depend on their units. Standardization scales the data proportionally into a small fixed interval and converts them into dimensionless pure numbers, so that indexes with different units can be compared and weighted. The invention uses min-max standardization: if the set contains t elements, the set elements $x_1, x_2, \ldots, x_t$ are transformed into a new dimensionless sequence $y_1, y_2, \ldots, y_t \in [0, 1]$, where

$$y_i = \frac{x_i - \min_{1 \le j \le t} x_j}{\max_{1 \le j \le t} x_j - \min_{1 \le j \le t} x_j}.$$
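A minimal sketch of min-max standardization for one field is given below. The function name is hypothetical, and the handling of a constant field (mapped to 0.0 to avoid division by zero) is an assumption, since the patent does not address that case.

```python
def min_max_scale(values):
    """Scale a sequence of numbers linearly into the interval [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # Assumption: a constant field carries no information, map it to 0.0.
        return [0.0 for _ in values]
    span = hi - lo
    return [(v - lo) / span for v in values]

# Made-up values for one field.
scaled = min_max_scale([2.0, 4.0, 6.0])
```

Here the smallest value maps to 0, the largest to 1, and the middle value to 0.5, so fields measured in different units become directly comparable.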

Step 3: algorithmic prediction

Forming a training data set: the feature indexes of the three categories are used as the input variables of the model. The production category comprises built-up area, land urbanization rate, population size, county population size, built-up area population density, GDP, per capita GDP, added value of the primary industry and added value of the secondary industry. The life category comprises built-up area, land urbanization rate, population size, county population size, GDP, per capita GDP, added value of the primary industry, added value of the secondary industry, gas supply coverage rate, heating pipeline density, heat supply volume rate and residential density. The traffic category comprises built-up area, population size, county population size, GDP, per capita GDP, added value of the primary industry, added value of the secondary industry, road density, number of public transport vehicles per 10,000 people, number of motor vehicles per 10,000 people, number of parks per 10,000 people, medical facility allocation rate, social welfare facility allocation rate, and the proportion of road area occupied by footpaths. The measured carbon emissions of the current year are used as the output variable, and the input and output variables together form the training data set D.
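How a per-category training data set D might be assembled is sketched below. All field names are hypothetical and abbreviated, the numbers are made up, and the use of a separate measured emission value per category as the output variable is an assumption.

```python
def build_training_set(records, feature_names, target_name):
    """Split county records into input vectors X and output values y."""
    X = [[record[name] for name in feature_names] for record in records]
    y = [record[target_name] for record in records]
    return X, y

# Hypothetical, abbreviated field names; categories may share features.
CATEGORY_FEATURES = {
    "production": ["built_up_area", "gdp", "secondary_industry_added_value"],
    "life": ["built_up_area", "gdp", "gas_supply_coverage"],
    "traffic": ["built_up_area", "gdp", "road_density"],
}

# Toy records with made-up, already standardized values.
records = [
    {"built_up_area": 0.2, "gdp": 0.5, "secondary_industry_added_value": 0.4,
     "gas_supply_coverage": 0.9, "road_density": 0.3, "emission_production": 12.0},
    {"built_up_area": 0.6, "gdp": 0.8, "secondary_industry_added_value": 0.7,
     "gas_supply_coverage": 0.5, "road_density": 0.1, "emission_production": 20.0},
]
X_prod, y_prod = build_training_set(
    records, CATEGORY_FEATURES["production"], "emission_production"
)
```

Keeping the feature lists in one mapping makes the overlap between categories explicit: shared indexes such as GDP appear in every list without being duplicated in the records.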

Generating training subsets and decision trees: in each category, the Bootstrap method randomly samples the training samples with replacement; the sampling is repeated m times, and the m training samples obtained together form a training data subset $D_m$ of the training data set D. A decision tree $T_m$ is trained on each training data subset, which serves as the sample set at the root node of that decision tree.
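Bootstrap sampling can be sketched as below. The sketch assumes that m subsets are drawn, one per tree, and that each subset has the same size as D; the patent text does not fix the subset size, and the function name and seed parameter are hypothetical.

```python
import random

def bootstrap_subsets(dataset, m, seed=0):
    """Draw m bootstrap subsets, each sampled with replacement from `dataset`
    and (by assumption) of the same size as `dataset`."""
    rng = random.Random(seed)  # seeded so that the draw is reproducible
    size = len(dataset)
    return [
        [dataset[rng.randrange(size)] for _ in range(size)]
        for _ in range(m)
    ]

# Toy (features, emission) pairs with made-up values.
D = [([0.1, 0.2], 5.0), ([0.4, 0.3], 7.0), ([0.9, 0.8], 12.0)]
subsets = bootstrap_subsets(D, m=4, seed=1)
```

Because sampling is with replacement, a subset usually contains some samples more than once and omits others, which is what makes the resulting trees differ from one another.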

Splitting a node: when each node of a decision tree is split, n carbon emission influence indexes are randomly selected from the N attributes as the candidate subset for the current node split, where n must be less than N. Using the Classification And Regression Tree (CART) method, the one carbon emission influence index $X_k$ that gives the best result under the least squares error criterion is selected from the split subset as the splitting attribute of the node, and this continues until the decision tree can no longer be split. No pruning is performed during splitting, and the value of n remains unchanged.
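One node split under these rules can be sketched as follows: n candidate features are drawn at random from the N available ones, and the feature and threshold with the lowest total squared error are kept. The function names, the threshold enumeration over observed values and the toy data are assumptions; the patent only names the CART least squares criterion.

```python
import random

def sse(values):
    """Sum of squared deviations from the mean (0.0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)

def best_split(rows, targets, n, rng):
    """Return (error, feature index, threshold) with the lowest total squared
    error among n features drawn at random from the N available ones, or None
    if none of the drawn features can separate the samples."""
    N = len(rows[0])
    if not 0 < n < N:
        raise ValueError("n must satisfy 0 < n < N")
    candidates = rng.sample(range(N), n)
    best = None
    for k in candidates:
        # Every observed value except the largest is a candidate threshold.
        for threshold in sorted({row[k] for row in rows})[:-1]:
            left = [y for row, y in zip(rows, targets) if row[k] <= threshold]
            right = [y for row, y in zip(rows, targets) if row[k] > threshold]
            error = sse(left) + sse(right)
            if best is None or error < best[0]:
                best = (error, k, threshold)
    return best

# Toy data: features 0 and 1 separate the targets, feature 2 is constant.
rows = [[1.0, 1.0, 5.0], [2.0, 2.0, 5.0], [3.0, 3.0, 5.0], [4.0, 4.0, 5.0]]
targets = [10.0, 10.0, 20.0, 20.0]
split = best_split(rows, targets, n=2, rng=random.Random(0))
```

Growing a full tree would apply `best_split` recursively to the left and right partitions until it returns None or a partition holds a single sample, with no pruning afterwards.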

Generating a random forest: all decision trees are combined after splitting to form a random forest.

Predicting carbon emissions: the feature parameter vector of the prediction set is input into the trained model, and each decision tree $T_m$ yields a predicted value. The predictions of all decision trees are summed and their arithmetic mean is taken, giving the predicted carbon emissions of each county for the life, production and traffic categories, $\hat{C}_{it}^{\text{life}}$, $\hat{C}_{it}^{\text{production}}$ and $\hat{C}_{it}^{\text{traffic}}$. Adding the three category predictions gives the final predicted carbon emission value

$$\hat{C}_{it} = \hat{C}_{it}^{\text{production}} + \hat{C}_{it}^{\text{life}} + \hat{C}_{it}^{\text{traffic}}.$$
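The aggregation described in this step can be sketched as follows: each category forest averages the predictions of its trees, and the three category predictions are summed. Trees are represented here as plain callables standing in for trained decision trees; the function names, category keys and numbers are hypothetical.

```python
def forest_predict(trees, features):
    """Arithmetic mean of the predictions of all trees for one feature vector."""
    if not trees:
        raise ValueError("the forest has no trees")
    return sum(tree(features) for tree in trees) / len(trees)

def total_emission(forests, features_by_category):
    """Sum the per-category forest predictions into one county-level value."""
    return sum(
        forest_predict(forests[category], features_by_category[category])
        for category in ("production", "life", "traffic")
    )

# Stand-in "trees" returning fixed made-up values instead of trained models.
forests = {
    "production": [lambda x: 10.0, lambda x: 14.0],
    "life": [lambda x: 3.0, lambda x: 5.0],
    "traffic": [lambda x: 2.0],
}
features = {"production": [0.5], "life": [0.2], "traffic": [0.7]}
prediction = total_emission(forests, features)
```

With these stand-in values the category means are 12.0, 4.0 and 2.0, so the county-level prediction is their sum.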

The method addresses county carbon emission prediction. Unlike traditional approaches to carbon emissions, it uses a multi-feature random forest prediction algorithm, which is faster, more accurate and generalizes better, and therefore helps prevent carbon emissions from exceeding limits. With the invention, counties with high carbon emissions can be identified, and the aspects responsible for high emissions can be managed effectively. The carbon emission prediction capability of the model is evaluated with the Mean Squared Error (MSE) and Mean Absolute Error (MAE), which are standard evaluation functions for regression tasks:

$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( O_i - T_i \right)^2, \qquad \mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| O_i - T_i \right|,$$

where $O_i$ is the carbon emission value predicted by the model, $T_i$ is the observed carbon emission value, and n is the number of observations.
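The two evaluation metrics can be computed as in the following minimal sketch, where `predicted` corresponds to the model outputs O_i and `observed` to the measured values T_i. The function names and the numbers are illustrative only and are not results from the patent.

```python
def mean_squared_error(predicted, observed):
    """MSE: mean of the squared differences between predictions and observations."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum((o - t) ** 2 for o, t in zip(predicted, observed)) / len(predicted)

def mean_absolute_error(predicted, observed):
    """MAE: mean of the absolute differences between predictions and observations."""
    if len(predicted) != len(observed) or not predicted:
        raise ValueError("need two non-empty sequences of equal length")
    return sum(abs(o - t) for o, t in zip(predicted, observed)) / len(predicted)

# Made-up values: the errors are 0, 1 and 2.
predicted = [1.0, 3.0, 5.0]
observed = [1.0, 2.0, 3.0]
mse = mean_squared_error(predicted, observed)
mae = mean_absolute_error(predicted, observed)
```

MSE penalizes large errors more heavily than MAE because the differences are squared, so reporting both gives a fuller picture of prediction quality.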

To verify the prediction capability of the model, the invention compares it with the Logistic Regression (LR) algorithm, the Least Absolute Shrinkage and Selection Operator (LASSO) algorithm, and the Support Vector Regression (SVR) algorithm. As shown in Table 1, the experimental results show that the RF algorithm achieves the best MSE and MAE among the compared methods.

TABLE 1 comparison of the results

Those skilled in the art will appreciate that the above embodiments are merely exemplary embodiments and that various changes, substitutions, and alterations can be made without departing from the spirit and scope of the application.

Claims

1. A county carbon emission prediction method based on random forests, wherein data in a prediction model is subjected to feature extraction according to three-dimensional data of county production, resident life and road traffic, and the carbon emission is predicted based on the three-dimensional data, and the prediction method comprises the following steps:

2. The county carbon emission prediction method according to claim 1, wherein the index elements of the production, life and traffic categories among the carbon emission index elements are each used as the N input variables, the measured carbon emissions of the current year are used as the output variable, and the input and output variables together form the training data set D.

3. The county carbon emission prediction method according to claim 1, wherein the data cleaning comprises cleaning the initial data set by a mean substitution method, including: cleaning missing values, cleaning format content, cleaning logic errors and cleaning non-required data; and the standardized data preprocessing uses min-max standardization: if the set contains t elements, the set elements $x_1, x_2, \ldots, x_t$ are transformed into a new dimensionless sequence $y_1, y_2, \ldots, y_t \in [0, 1]$, where

$$y_i = \frac{x_i - \min_{1 \le j \le t} x_j}{\max_{1 \le j \le t} x_j - \min_{1 \le j \le t} x_j}.$$

4. The county carbon emission prediction method according to claim 1, wherein generating the training subsets and decision trees by the Bootstrap method comprises randomly sampling the training samples with replacement, the sampling being repeated m times and the m training samples obtained together forming a training data subset $D_m$ of the training data set D; a decision tree $T_m$ is trained on each training data subset, which serves as the sample set at the root node of that decision tree.

5. The county carbon emission prediction method according to claim 1, wherein splitting each node of a decision tree comprises using the classification and regression tree (CART) method to select, from the split subset, the one carbon emission influence index $X_k$ that gives the best result under the least squares error criterion as the splitting attribute of the node, and continuing until the decision tree can no longer be split; no pruning is performed during splitting, and the value of n remains unchanged.

6. The county carbon emission prediction method according to claim 1, wherein the feature parameter vector of the prediction set is defined from the collected feature indexes affecting the carbon emissions of a county in year t, grouped into the production, life and traffic categories, as follows:

Wherein