CN109408774A - Method for predicting sewage effluent indices based on a random forest and a gradient boosted tree - Google Patents

Method for predicting sewage effluent indices based on a random forest and a gradient boosted tree

Info

Publication number
CN109408774A
Authority
CN
China
Prior art keywords
tree
sample
random forest
gradient
effluent index
Prior art date
Legal status
Granted
Application number
CN201811323416.1A
Other languages
Chinese (zh)
Other versions
CN109408774B (en)
Inventor
张天麟
高俊波
孙伟
赵友标
孙峰
Current Assignee
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date
Filing date
Publication date
Application filed by Shanghai Maritime University
Priority to CN201811323416.1A
Publication of CN109408774A
Application granted
Publication of CN109408774B
Active legal status
Anticipated expiration


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A20/00 Water conservation; Efficient water supply; Efficient water use
    • Y02A20/152 Water filtration

Abstract

The invention discloses a method for predicting sewage effluent indices based on a random forest and a gradient boosted tree, comprising the following steps: step 1: drawing samples with replacement from the original training data set to form several sample sets; step 2: constructing a random forest from the samples, computing feature importance from the random forest, and performing attribute selection; step 3: constructing a gradient boosted tree model from the samples formed by the attributes remaining after screening; step 4: feeding real-time monitoring data into the gradient boosted tree model to predict the sewage plant's effluent indices over a coming period of time. The present invention combines the random forest and the gradient boosted tree model to establish a relational model of the sewage effluent index data; through the dimensionality reduction of the random forest and the high-precision training of the gradient boosted tree, the sewage effluent index data over a coming period of time can be predicted relatively accurately.

Description

Method for predicting sewage effluent indices based on a random forest and a gradient boosted tree
Technical field
The present invention relates to the fields of sewage treatment and machine learning, and in particular to a method for predicting sewage effluent indices based on a random forest and a gradient boosted tree.
Background art
The urban wastewater treatment process is a complicated biochemical reaction process, accompanied by physical-chemical reactions, biochemical reactions, phase transitions, and the conversion and transfer of substances and energy. The process is complex, and traditional mathematical modeling of it is difficult. Many scholars have studied this problem using neural networks. Predicting sewage effluent indices with a neural network solves the problem to some extent, but training remains slow and model accuracy still needs to be improved. Moreover, such studies do not exclude factors that are irrelevant to the reaction process, which negatively affects the training speed and accuracy of the model.
Summary of the invention
The object of the present invention is to provide a method for predicting sewage effluent indices based on a random forest and a gradient boosted tree, the purpose of which is to establish a relational model between the main sewage effluent indices and the sewage quality index data, so that the main effluent indices can be predicted from the sewage data obtained by real-time monitoring.
In order to achieve the above object, the present invention provides a method for predicting sewage effluent indices based on a random forest and a gradient boosted tree, comprising the following steps:
Step 1: drawing samples with replacement from the original training data set to form several sample sets;
Step 2: constructing a random forest from the samples; computing feature importance from the random forest and performing attribute selection;
Step 3: constructing a gradient boosted tree model from the samples formed by the attributes remaining after screening;
Step 4: feeding real-time monitoring data into the gradient boosted tree model to predict the sewage plant's effluent indices over a coming period of time.
In the above method for predicting sewage effluent indices based on a random forest and a gradient boosted tree, step 1 further comprises the following steps: randomly drawing samples with replacement from the original training set to build the regression trees; the samples not drawn in each round form out-of-bag sample sets, one for each regression tree.
In the above method for predicting sewage effluent indices based on a random forest and a gradient boosted tree, step 2 specifically comprises the following steps:
Step 2.1: traversing the possible values under each characteristic attribute and selecting the point with the smallest sum of squared errors as the cut point;
Step 2.2: calculating the sum of squared errors for each attribute and selecting the attribute with the smallest error as the splitting attribute;
Step 2.3: constructing a regression tree for each partitioned sample set;
Step 2.4: assembling the multiple regression trees into a regression forest;
Step 2.5: training the assembled random forest on the training set; the random forest computes feature importance from the out-of-bag error of the out-of-bag samples;
Step 2.6: ranking the features by importance and retaining the important features.
In the above method for predicting sewage effluent indices based on a random forest and a gradient boosted tree, step 3 specifically comprises the following steps:
Step 3.1: building a new training sample set from the samples with the selected features;
Step 3.2: for each regression tree, using the negative gradient to approximate the loss value of the current iteration and thereby determine the optimal parameters of that regression tree; each regression tree updates the computed residual, and the updated residual is fed into the next regression tree;
Step 3.3: accumulating the multiple regression trees to form the gradient boosted tree model.
In the above method for predicting sewage effluent indices based on a random forest and a gradient boosted tree, the gradient boosted tree model is:
f_M(x) = Σ_{m=1..M} Σ_{j=1..J} c_{m,j} · I(x ∈ R_{m,j})
wherein J is the number of leaf nodes, I indicates whether the value c belongs to the j-th leaf node (R_{m,j} being the input region corresponding to that leaf), and f_M(x) is the predicted value of the final model.
Compared with the prior art, the present invention has the following advantages:
Rather than relying only on expert judgment at the sewage data stage, the present invention can use the random forest model to screen out, from the collected sewage data, the characteristic attributes suited to the data, thereby deleting redundant attributes, achieving feature dimensionality reduction, and improving the model training rate and the data quality. The present invention uses a gradient boosted tree model, a method that is more accurate than support vector machines, neural networks and the like, and can therefore improve the prediction accuracy for sewage. The present invention combines the random forest and the gradient boosted tree model to establish a relational model of the sewage effluent index data; through the dimensionality reduction of the random forest and the high-precision training of the gradient boosted tree, the sewage effluent index data over a coming period of time can be predicted relatively accurately, so that a sewage plant can predict its future effluent index data from the effluent index data detected in real time. In this way the plant can judge, from the predicted effluent index data, whether its effluent indices meet the national safety standard; further, on the premise that the effluent indices satisfy the national safety standard, the plant can control the amount of oxygen fed into the sewage, thereby saving plant costs. In short, after adopting the present invention, a sewage plant can achieve energy saving and emission reduction and can also reduce its treatment costs.
Detailed description of the invention
Fig. 1 is a flow chart of the method for predicting sewage effluent indices based on a random forest and a gradient boosted tree according to the present invention;
Fig. 2 is a flow chart of the attribute screening step performed by the random forest in the present invention;
Fig. 3 is a flow chart of the gradient boosted tree model construction step in the present invention.
Specific embodiment
The present invention is further described below through specific embodiments with reference to the accompanying drawings. These embodiments are merely illustrative and do not limit the scope of the invention.
The present invention provides a method for predicting sewage effluent indices based on a random forest and a gradient boosted tree, comprising the following steps:
Step 1: drawing samples with replacement from the original training data set to form several sample sets;
Step 1 further comprises the following steps: randomly drawing samples with replacement from the original training set to build the regression trees; the samples not drawn in each round form out-of-bag sample sets, one for each regression tree.
Step 2: constructing a random forest from the samples; computing feature importance from the random forest and performing attribute selection;
Step 2 specifically comprises the following steps:
Step 2.1: traversing the possible values under each characteristic attribute and selecting the point with the smallest sum of squared errors as the cut point;
Step 2.2: calculating the sum of squared errors for each attribute and selecting the attribute with the smallest error as the splitting attribute;
Step 2.3: constructing a regression tree for each partitioned sample set;
Step 2.4: assembling the multiple regression trees into a regression forest;
Step 2.5: training the assembled random forest on the training set; the random forest computes feature importance from the out-of-bag error of the out-of-bag samples;
Step 2.6: ranking the features by importance and retaining the important features.
Step 3: constructing a gradient boosted tree model from the samples formed by the attributes remaining after screening;
Step 3 specifically comprises the following steps:
Step 3.1: building a new training sample set from the samples with the selected features;
Step 3.2: for each regression tree, using the negative gradient to approximate the loss value of the current iteration and thereby determine the optimal parameters of that regression tree; each regression tree updates the computed residual, and the updated residual is fed into the next regression tree;
Step 3.3: accumulating the multiple regression trees to form the gradient boosted tree model.
Step 4: feeding real-time monitoring data into the gradient boosted tree model to predict the sewage plant's effluent indices over a coming period of time.
The gradient boosted tree model is:
f_M(x) = Σ_{m=1..M} Σ_{j=1..J} c_{m,j} · I(x ∈ R_{m,j})
wherein J is the number of leaf nodes, I indicates whether the value c belongs to the j-th leaf node (R_{m,j} being the input region corresponding to that leaf), and f_M(x) is the predicted value of the final model.
In a more specific embodiment, the method for predicting sewage effluent indices based on a random forest and a gradient boosted tree comprises the following steps:
Drawing samples randomly with replacement from the original training data set to construct several sample sets; the original training data set refers to the data on the various sewage indices collected by the sewage plant's sensors;
Selecting a splitting attribute from each sample set and constructing a regression tree for that sample set according to the splitting attribute (the prediction target is a continuous variable); the multiple regression trees constitute the random forest;
Feeding the training samples into the random forest for training; the random forest obtains the feature importance of the training data by computing the out-of-bag error;
The training samples refer to the data set drawn at random from the original training data and fed into the random forest for training;
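To make the sampling step concrete, the following minimal Python sketch (not part of the patent; names such as draw_bootstrap_sets and X_train are illustrative) draws one bootstrap sample per regression tree and records which rows were never drawn, i.e. the out-of-bag samples:

```python
import numpy as np

def draw_bootstrap_sets(X, n_trees, rng=None):
    """Draw one bootstrap sample per regression tree and record its out-of-bag rows."""
    rng = np.random.default_rng(rng)
    n = len(X)
    sets = []
    for _ in range(n_trees):
        in_bag = rng.integers(0, n, size=n)               # sampling with replacement
        out_of_bag = np.setdiff1d(np.arange(n), in_bag)   # rows never drawn in this round
        sets.append((in_bag, out_of_bag))
    return sets

# Example (placeholders): 500 monitoring records, 7 sewage indices, 100 trees.
# X_train, y_train = load_plant_records()   # hypothetical loader, not defined here
# bootstrap_sets = draw_bootstrap_sets(X_train, n_trees=100, rng=0)
```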
As an implementation, the sample feature importance is computed from the sample sets, comprising the following steps. Before describing the steps, the relationship between a sample and its attributes can be expressed as follows:
X = <x1, x2, x3, ..., xn>
wherein X denotes one sample in a sample set and x1, x2, ..., xn denote the attributes of that sample, assuming there are n attributes in total;
In this method an attribute refers to one of the sewage indices collected by the sewage plant's sensors, i.e., one of the aeration value (DO), influent pH, influent chemical oxygen demand (COD), influent total phosphorus (TP), influent ammonia nitrogen (NH3-N), suspended solids concentration (SS), mixed liquor suspended solids concentration (MLSS), and similar indices;
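Purely for illustration, one monitoring record X = <x1, ..., xn> could be laid out as follows; the column names mirror the indices listed above and the numeric values are invented placeholders, not plant data:

```python
import numpy as np

# One sample X = <x1, x2, ..., xn>: the n monitored sewage indices at a single time step.
FEATURE_NAMES = ["DO", "pH_in", "COD_in", "TP_in", "NH3N_in", "SS", "MLSS"]
sample = np.array([2.1, 7.3, 310.0, 4.2, 28.5, 180.0, 3500.0])   # placeholder values

record = dict(zip(FEATURE_NAMES, sample))   # e.g. {"DO": 2.1, "pH_in": 7.3, ...}
```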
Selecting, among the sample attributes, the attribute with the smallest sum of squared errors as the splitting attribute;
As an implementation, selecting the attribute with the smallest sum of squared errors in the training samples as the splitting attribute specifically comprises the following steps:
The predicted value given by the random forest is the mean of the output data in a subset;
wherein a subset is one of the different subsets into which each tree of the random forest divides the training samples, the output data are the true values of the target feature of the samples in a subset, and the predicted value is the average of those true values;
Traversing all possible values under each characteristic attribute and finally choosing the cut point that yields the smallest sum of squared errors;
wherein a value means a value taken by the characteristic attribute;
Comparing the sums of squared errors of the attributes and choosing the attribute with the smallest sum as the optimal splitting attribute, which divides the training subset into two child nodes;
As an implementation, the sum of squared errors within a training subset is computed as:
min_{j,s} [ min_{c1} Σ_{x_i ∈ R1(j,s)} (y_i − c1)² + min_{c2} Σ_{x_i ∈ R2(j,s)} (y_i − c2)² ]
wherein the regression tree chooses a cut point s that divides the attribute into two parts R1 and R2, y_i is the target value of a training sample, c1 and c2 are the averages of the target values in the two parts, s ranges over the values taken by the feature, and j indicates which feature is being split;
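A minimal sketch of this split search, written directly from the definitions above (function and variable names are illustrative): every feature j and every candidate cut point s are traversed, each part is predicted by its mean, and the (j, s) pair with the smallest summed squared error is kept:

```python
import numpy as np

def best_split(X, y):
    """Return the (feature j, threshold s, error) that minimises the summed squared error."""
    best_j, best_s, best_err = None, None, np.inf
    for j in range(X.shape[1]):                      # traverse every characteristic attribute
        for s in np.unique(X[:, j]):                 # traverse every possible value of attribute j
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:    # a one-sided cut is not a valid split
                continue
            # c1, c2 are the means of the target values in the two parts R1, R2
            err = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if err < best_err:
                best_j, best_s, best_err = j, s, err
    return best_j, best_s, best_err
```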
As an implementation, computing the feature importance specifically comprises the following steps:
For each regression tree in the random forest (a regression problem), computing its out-of-bag error on the corresponding out-of-bag samples, denoted err1;
Randomly adding noise interference to one feature of the out-of-bag samples and computing the out-of-bag error again, denoted err2;
wherein the out-of-bag samples are the data not used as training samples;
The importance of a given feature is then computed from these errors as follows:
wherein n is the number of out-of-bag samples, and f combines the out-of-bag sample error with the out-of-bag error obtained after the noise interference is added; the value of f is taken as the importance of that feature;
Ranking the features according to the computed importance and retaining the important features;
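The out-of-bag importance computation described above can be sketched as follows, reusing the hypothetical bootstrap sets from the earlier sketch and using scikit-learn's DecisionTreeRegressor as a stand-in for the patent's regression trees; err1 is each tree's out-of-bag error and err2 the error after one feature is permuted (the "noise interference"):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def oob_feature_importance(X, y, bootstrap_sets, rng=None):
    """Per-feature importance from out-of-bag error before (err1) and after (err2) permutation."""
    rng = np.random.default_rng(rng)
    # Fit one regression tree per bootstrap sample, remembering its out-of-bag rows.
    fitted = [(DecisionTreeRegressor(random_state=0).fit(X[in_bag], y[in_bag]), oob)
              for in_bag, oob in bootstrap_sets]
    importance = np.zeros(X.shape[1])
    for j in range(X.shape[1]):
        diffs = []
        for tree, oob in fitted:
            err1 = np.mean((tree.predict(X[oob]) - y[oob]) ** 2)      # out-of-bag error
            X_noisy = X[oob].copy()
            X_noisy[:, j] = rng.permutation(X_noisy[:, j])            # "noise": permute feature j
            err2 = np.mean((tree.predict(X_noisy) - y[oob]) ** 2)     # error after interference
            diffs.append(err2 - err1)
        importance[j] = np.mean(diffs)                                # average increase in error
    return importance

# Rank by importance and keep the top-k features (k is a placeholder choice):
# importance = oob_feature_importance(X_train, y_train, bootstrap_sets, rng=0)
# selected = np.argsort(importance)[::-1][:5]
```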
Building a new training sample set from the samples with the selected features;
Feeding the training samples into the first regression tree, predicting them, and computing the error values;
wherein an error value is the error between a predicted value and the corresponding true value;
Feeding the error values, as input, into the next regression tree and continuing to compute error values;
Iterating over m regression trees and accumulating the error values of each iteration to constitute the gradient boosted tree model;
As an implementation, constructing a regression tree specifically comprises the following steps:
Recursively constructing the regression tree according to the least squared error criterion;
wherein the least squared error criterion is determined by the following formula:
min_{j,s} [ min_{c1} Σ_{x_i ∈ R1(j,s)} (y_i − c1)² + min_{c2} Σ_{x_i ∈ R2(j,s)} (y_i − c2)² ]
Following the way a decision tree is built, at each decision node the j-th feature of sample x and a corresponding threshold s are selected as the splitting feature and splitting threshold, and the node is divided into two regions, according to the following formula:
R1(j, s) = {x | x[j] <= s} and R2(j, s) = {x | x[j] > s}
wherein sample x is one of the training samples, x[j] is the value of the j-th feature of sample x, and splitting on a feature means that the selected feature j of x is compared against the splitting threshold so as to divide the characteristic attribute;
As an implementation, the formula by which a node is divided into the two regions is as follows:
min_{c1} Σ_{x_i ∈ R1(j,s)} (y_i − c1)² + min_{c2} Σ_{x_i ∈ R2(j,s)} (y_i − c2)²
wherein y_i is the true value of a sample, and c1 and c2 are the predicted values of this regression tree in the two regions;
A regression tree divides the feature space (the input space) into M units {R1, R2, ..., RM}; each leaf node of the regression tree corresponds to one unit and has a fixed output value Cm. When the input feature vector is x, the regression tree determines a leaf node and takes the output value Cm of that leaf node as the output of the regression tree. The calculation formula of the regression tree is as follows:
f(x) = Σ_{m=1..M} Cm · I(x ∈ Rm)
As an implementation, constructing the gradient boosted tree model specifically comprises the following steps:
The gradient boosted tree model is formed by the iterative addition of M regression trees, with the calculation formula as follows:
f_m(x) = f_{m-1}(x) + T(x; θ_m), m = 1, ..., M,
wherein f_{m-1}(x) is the current boosted tree model, T(x; θ_m) is the newly generated regression tree, and θ_m is the parameter of that regression tree; this parameter is chosen to minimize the error of the m-th regression tree;
The gradient boosted tree needs to choose suitable regression tree parameters so that the loss function is minimized, with the calculation formula as follows:
θ_m = argmin_θ Σ_i L(y_i, f_{m-1}(x_i) + T(x_i; θ))
wherein y_i is the true value for the current m-th tree, and f_{m-1}(x_i) + T(x_i; θ_m) is the prediction of the current m-th tree added to the accumulated value computed by the preceding m−1 trees, i.e., the predicted value of the current m trees. This formula determines the parameters θ that minimize the loss function L. Unlike the preceding formula, which describes the additive model of the gradient boosted tree, the formula here serves to obtain the parameters of the optimal regression tree, i.e., to make the loss function error smallest.
The negative gradient of the loss function is used to fit an approximation of the loss in one round of the iteration, so as to determine the parameters θ that minimize the loss function; expressed with the quadratic (squared) loss function, this gives the following formula:
L(y_i, f_{m-1}(x_i) + T(x_i; θ_m)) = [y_i − f_{m-1}(x_i) − T(x_i; θ_m)]²
wherein L is the loss function appearing in the preceding formula for determining suitable regression tree parameters;
The expression for approximating the loss value of the iterative process using the negative gradient is as follows:
L(y_i, f_{m-1}(x_i) + T(x_i; θ_m)) = [y_i − f_{m-1}(x_i) − T(x_i; θ_m)]² = [r_{m,i} − T(x_i; θ_m)]²
wherein L is the loss function (the mean squared error is used as the loss function in this model), f(x_i) is the predicted value obtained by training for a training sample, and r_{m,i} = −[∂L(y_i, f(x_i)) / ∂f(x_i)], evaluated at f = f_{m-1}, is the negative gradient; the formula above is finally used to determine the parameters θ that minimize the loss function;
Every regression tree is trained according to the method of the previous step for determining the regression tree parameters; when the iterative training reaches the m-th tree, training is complete;
When the m-th regression tree is trained, the output of its leaf nodes can be expressed as follows:
c_{m,j} = argmin_c Σ_{x_i ∈ R_{m,j}} L(y_i, f_{m-1}(x_i) + c)
wherein c_{m,j} is the output value of the j-th leaf node of the m-th regression tree, and c is the predicted value (error value) obtained by training the m-th tree; this predicted value is accumulated with the predicted values obtained by the preceding m−1 trees to give the trained prediction of the first m trees, so that the loss function reaches its minimum; at this point the training of the first m regression trees is finished;
The model expression of the final gradient boosted tree is as follows:
f_M(x) = Σ_{m=1..M} Σ_{j=1..J} c_{m,j} · I(x ∈ R_{m,j})
wherein J is the number of leaf nodes, I indicates whether the value c belongs to the j-th leaf node (R_{m,j} being the input region corresponding to that leaf), and f_M(x) is the predicted value of the final model.
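Under the squared loss used here the negative gradient reduces to the residual r_{m,i} = y_i − f_{m-1}(x_i), so the boosting loop can be sketched as below; the learning-rate shrinkage, the tree depth, and the use of scikit-learn's DecisionTreeRegressor as the base regression tree are assumptions for illustration rather than part of the patent:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gradient_boosted_trees(X, y, n_trees=100, max_depth=3, learning_rate=0.1):
    """Fit M regression trees additively; with squared loss the negative gradient is the residual."""
    base = y.mean()
    prediction = np.full(len(y), base)               # f_0(x): initial constant model
    trees = []
    for _ in range(n_trees):
        residual = y - prediction                    # r_{m,i}: negative gradient of the squared loss
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        trees.append(tree)
        prediction = prediction + learning_rate * tree.predict(X)   # f_m(x) = f_{m-1}(x) + shrinkage * T(x)
    return base, trees, learning_rate

def predict_gbt(model, X):
    base, trees, learning_rate = model
    return base + learning_rate * sum(tree.predict(X) for tree in trees)
```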
Finally, the data from the sewage plant's real-time monitoring are fed into the gradient boosted tree model to predict the sewage effluent indices of the plant over the coming period of time.
Obviously, the method for predicting sewage effluent indices based on a random forest and a gradient boosted tree provided by the present invention can also be designed as a system combining a random forest and a gradient boosted tree, comprising a sample construction module, a random forest construction module, a random forest training module, a data screening module, a gradient boosted tree construction module, a gradient boosted tree training module and a prediction module;
The sample construction module draws samples randomly with replacement from the original training set and constructs several sample sets for the subsequent training of the random forest model;
The random forest construction module constructs and assembles the random forest from the multiple regression trees that are built;
The random forest training module trains the random forest on the constructed sample sets to obtain the importance of each characteristic attribute;
The data screening module ranks the importance of each characteristic attribute computed by the random forest training module, deletes the features of low importance, and finally retains the important features;
The gradient boosted tree construction module constructs the gradient boosted tree model by iterating over all the regression trees;
The gradient boosted tree training module trains the constructed gradient boosted tree model on the training set built from the selected features;
The prediction module takes the data to be predicted, feeds them, after feature screening, into the gradient boosted tree for prediction, and obtains, in an error-accumulating manner, the prediction of the sewage effluent index data over the coming period of time from the prediction set.
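As a rough end-to-end analogue of these modules, the following is a sketch assuming scikit-learn; the synthetic data, the impurity-based feature_importances_ used in place of the out-of-bag permutation importance, and every hyperparameter are placeholders, not values from the patent:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor

# Placeholder data standing in for the plant's monitoring records (7 sewage indices).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(500, 7))
y_train = 2 * X_train[:, 0] + X_train[:, 2] + rng.normal(scale=0.1, size=500)
X_realtime = rng.normal(size=(10, 7))

# Random forest construction / training modules: fit a forest and rank attribute importance.
forest = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
forest.fit(X_train, y_train)

# Data screening module: keep only the most important attributes (top-k is a placeholder choice).
selected = np.argsort(forest.feature_importances_)[::-1][:3]

# Gradient boosted tree construction / training modules: train on the screened features.
gbt = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=3)
gbt.fit(X_train[:, selected], y_train)

# Prediction module: real-time monitoring data in, predicted effluent index out.
y_future = gbt.predict(X_realtime[:, selected])
```

In a deployment the synthetic arrays would be replaced by the plant's historical sensor records and its real-time monitoring stream.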
In conclusion the present invention, which filters out key factor using random forest, further increases model training efficiency and accurate Degree, and use gradient more higher than neural network accuracy promotes tree-model and is predicted, so that sewage plant predicts future one Whether the sewage quality of section time is in national emission standard.The aeration value that final factory can control input reaches to save work The purpose of factory's operating cost and green safe discharge.
Although the contents of the present invention have been described in detail through the preferred embodiments above, it should be understood that the above description is not to be regarded as limiting the present invention. After those skilled in the art have read the above contents, various modifications and substitutions of the present invention will be apparent. Therefore, the protection scope of the present invention shall be defined by the appended claims.

Claims (5)

1. A method for predicting sewage effluent indices based on a random forest and a gradient boosted tree, characterized by comprising the following steps:
Step 1: drawing samples with replacement from the original training data set to form several sample sets;
Step 2: constructing a random forest from the samples; computing feature importance from the random forest and performing attribute selection;
Step 3: constructing a gradient boosted tree model from the samples formed by the attributes remaining after screening;
Step 4: feeding real-time monitoring data into the gradient boosted tree model to predict the sewage plant's effluent indices over a coming period of time.
2. The method for predicting sewage effluent indices based on a random forest and a gradient boosted tree according to claim 1, characterized in that step 1 further comprises the following steps: randomly drawing samples with replacement from the original training set to build the regression trees; the samples not drawn in each round form out-of-bag sample sets, one for each regression tree.
3. The method for predicting sewage effluent indices based on a random forest and a gradient boosted tree according to claim 2, characterized in that step 2 specifically comprises the following steps:
Step 2.1: traversing the possible values under each characteristic attribute and selecting the point with the smallest sum of squared errors as the cut point;
Step 2.2: calculating the sum of squared errors for each attribute and selecting the attribute with the smallest error as the splitting attribute;
Step 2.3: constructing a regression tree for each partitioned sample set;
Step 2.4: assembling the multiple regression trees into a regression forest;
Step 2.5: training the assembled random forest on the training set; the random forest computes feature importance from the out-of-bag error of the out-of-bag samples;
Step 2.6: ranking the features by importance and retaining the important features.
4. The method for predicting sewage effluent indices based on a random forest and a gradient boosted tree according to claim 3, characterized in that step 3 specifically comprises the following steps:
Step 3.1: building a new training sample set from the samples with the selected features;
Step 3.2: for each regression tree, using the negative gradient to approximate the loss value of the current iteration and thereby determine the optimal parameters of that regression tree; each regression tree updates the computed residual, and the updated residual is fed into the next regression tree;
Step 3.3: accumulating the multiple regression trees to form the gradient boosted tree model.
5. The method for predicting sewage effluent indices based on a random forest and a gradient boosted tree according to claim 4, characterized in that the gradient boosted tree model is:
f_M(x) = Σ_{m=1..M} Σ_{j=1..J} c_{m,j} · I(x ∈ R_{m,j})
wherein J is the number of leaf nodes, I indicates whether the value c belongs to the j-th leaf node (R_{m,j} being the input region corresponding to that leaf), and f_M(x) is the predicted value of the final model.
CN201811323416.1A 2018-11-07 2018-11-07 Method for predicting sewage effluent index based on random forest and gradient lifting tree Active CN109408774B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811323416.1A CN109408774B (en) 2018-11-07 2018-11-07 Method for predicting sewage effluent index based on random forest and gradient lifting tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811323416.1A CN109408774B (en) 2018-11-07 2018-11-07 Method for predicting sewage effluent index based on random forest and gradient lifting tree

Publications (2)

Publication Number Publication Date
CN109408774A true CN109408774A (en) 2019-03-01
CN109408774B CN109408774B (en) 2022-11-08

Family

ID=65472116

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811323416.1A Active CN109408774B (en) 2018-11-07 2018-11-07 Method for predicting sewage effluent index based on random forest and gradient lifting tree

Country Status (1)

Country Link
CN (1) CN109408774B (en)



Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106991437A (en) * 2017-03-20 2017-07-28 浙江工商大学 The method and system of sewage quality data are predicted based on random forest
CN108536650A (en) * 2018-04-03 2018-09-14 北京京东尚科信息技术有限公司 Generate the method and apparatus that gradient promotes tree-model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
梁喜旺 et al., "基于流形降维和梯度提升树的大气腐蚀速率预测模型" (Atmospheric corrosion rate prediction model based on manifold dimensionality reduction and gradient boosted tree), 《装备环境工程》 (Equipment Environmental Engineering) *

Cited By (23)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110308705A (en) * 2019-06-19 2019-10-08 上海华高汇元工程服务有限公司 A kind of apparatus control method based on big data and artificial intelligence water quality prediction
CN112348039B (en) * 2019-08-07 2023-04-07 中国移动通信集团上海有限公司 Training method of driving behavior analysis model, driving behavior analysis method and equipment
CN112348039A (en) * 2019-08-07 2021-02-09 中国移动通信集团上海有限公司 Training method of driving behavior analysis model, driving behavior analysis method and equipment
CN110795846A (en) * 2019-10-29 2020-02-14 东北财经大学 Construction method of boundary forest model, updating method of multi-working-condition soft computing model for complex industrial process and application of updating method
CN110956010A (en) * 2019-11-01 2020-04-03 国网辽宁省电力有限公司阜新供电公司 Large-scale new energy access power grid stability identification method based on gradient lifting tree
CN110956010B (en) * 2019-11-01 2023-04-18 国网辽宁省电力有限公司阜新供电公司 Large-scale new energy access power grid stability identification method based on gradient lifting tree
CN111429970B (en) * 2019-12-24 2024-03-22 大连海事大学 Method and system for acquiring multiple gene risk scores based on feature selection of extreme gradient lifting method
CN111429970A (en) * 2019-12-24 2020-07-17 大连海事大学 Method and system for obtaining multi-gene risk scores by performing feature selection based on extreme gradient lifting method
CN111260149A (en) * 2020-02-10 2020-06-09 北京工业大学 Method for predicting dioxin emission concentration
WO2021159585A1 (en) * 2020-02-10 2021-08-19 北京工业大学 Dioxin emission concentration prediction method
CN111260149B (en) * 2020-02-10 2023-06-23 北京工业大学 Dioxin emission concentration prediction method
CN111667107B (en) * 2020-05-29 2024-05-14 中国工商银行股份有限公司 Research and development management and control problem prediction method and device based on gradient random forest
CN111667107A (en) * 2020-05-29 2020-09-15 中国工商银行股份有限公司 Research and development management and control problem prediction method and device based on gradient random forest
CN112580703B (en) * 2020-12-07 2022-07-05 昆明理工大学 Method for predicting morbidity of panax notoginseng in high-incidence stage
CN112580703A (en) * 2020-12-07 2021-03-30 昆明理工大学 Method for predicting morbidity of panax notoginseng in high-incidence stage
CN112733903B (en) * 2020-12-30 2023-11-17 许昌学院 SVM-RF-DT combination-based air quality monitoring and alarming method, system, device and medium
CN112733903A (en) * 2020-12-30 2021-04-30 许昌学院 Air quality monitoring and alarming method, system, device and medium based on SVM-RF-DT combination
CN113361199A (en) * 2021-06-09 2021-09-07 成都之维安科技股份有限公司 Multi-dimensional pollutant emission intensity prediction method based on time series
CN113344130B (en) * 2021-06-30 2022-01-11 广州市河涌监测中心 Method and device for generating differentiated river patrol strategy
CN113344130A (en) * 2021-06-30 2021-09-03 广州市河涌监测中心 Method and device for generating differentiated river patrol strategy
CN113537585A (en) * 2021-07-09 2021-10-22 中海石油(中国)有限公司天津分公司 Oil field production increasing measure recommendation method based on random forest and gradient lifting decision tree
CN114462699A (en) * 2022-01-28 2022-05-10 无锡雪浪数制科技有限公司 Optical fiber production qualification index prediction method based on random forest
CN114913683A (en) * 2022-04-22 2022-08-16 星慧照明工程集团有限公司 Traffic signal lamp monitoring system

Also Published As

Publication number Publication date
CN109408774B (en) 2022-11-08

Similar Documents

Publication Publication Date Title
CN109408774A (en) The method of prediction sewage effluent index based on random forest and gradient boosted tree
CN111354423B (en) Method for predicting ammonia nitrogen concentration of effluent of self-organizing recursive fuzzy neural network based on multivariate time series analysis
Simonovic et al. A new modeling approach for water resources policy analysis
CN107358021A (en) DO prediction model establishment method based on BP neural network optimization
CN101533000B (en) Method for constructing water eutrophication risk analysis model
CN103839412B A kind of crossing dynamic steering ratio combination method of estimation based on Bayes's weighting
CN111967666B (en) Comprehensive cooperative scheduling system and scheduling method for river and lake water system
CN106802563B (en) A kind of sewage procedure optimization control method based on drosophila optimization and LSSVM
CN106991437A (en) The method and system of sewage quality data are predicted based on random forest
CN109828089A (en) A kind of on-line prediction method of the water quality parameter cultured water based on DBN-BP
CN110032755B (en) Multi-objective optimization method for urban sewage treatment process under multiple working conditions
CN103793604A (en) Sewage treatment soft measuring method based on RVM
CN112765902B (en) Soft measurement modeling method for COD concentration in rural domestic sewage treatment process based on TentFWA-GD RBF neural network
CN114037163A (en) Sewage treatment effluent quality early warning method based on dynamic weight PSO (particle swarm optimization) optimization BP (Back propagation) neural network
CN102262702A (en) Decision-making method for maintaining middle and small span concrete bridges
CN114690700B (en) PLC-based intelligent sewage treatment decision optimization method and system
CN105160422B (en) Sewage treatment based on self-organizing cascade neural network is discharged total phosphorus prediction technique
CN114707692A (en) Wetland effluent ammonia nitrogen concentration prediction method and system based on hybrid neural network
CN110991616B (en) Method for predicting BOD of effluent based on pruning feedforward small-world neural network
Bringer et al. Wastewater treatment systems selection inside watersheds by using multiobjective analysis
CN102855404B (en) Screening method of emergency management decision schemes for water blooms in lakes and reservoirs
CN110097757B (en) Intersection group critical path identification method based on depth-first search
Concepción et al. Control strategies and wastewater treatment plants performance: Effect of controllers parameters variation
CN102789546A (en) Reference lake quantitative determination method based on human disturbance intensity
CN102663562A (en) Method for grading land resource evaluation factors based on clonal selection algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant