CN109408774B - Method for predicting sewage effluent index based on random forest and gradient lifting tree

Method for predicting sewage effluent index based on random forest and gradient lifting tree

Info

Publication number
CN109408774B
CN109408774B (application CN201811323416.1A)
Authority
CN
China
Prior art keywords
sample
random forest
tree
gradient lifting
regression
Prior art date
Legal status
Active
Application number
CN201811323416.1A
Other languages
Chinese (zh)
Other versions
CN109408774A (en)
Inventor
张天麟 (Zhang Tianlin)
高俊波 (Gao Junbo)
孙伟 (Sun Wei)
赵友标 (Zhao Youbiao)
孙峰 (Sun Feng)
Current Assignee
Shanghai Maritime University
Original Assignee
Shanghai Maritime University
Priority date
Filing date
Publication date
Application filed by Shanghai Maritime University filed Critical Shanghai Maritime University
Priority to CN201811323416.1A priority Critical patent/CN109408774B/en
Publication of CN109408774A publication Critical patent/CN109408774A/en
Application granted granted Critical
Publication of CN109408774B publication Critical patent/CN109408774B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/241Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10Complex mathematical operations
    • G06F17/18Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02ATECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A20/00Water conservation; Efficient water supply; Efficient water use
    • Y02A20/152Water filtration

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Business, Economics & Management (AREA)
  • Mathematical Analysis (AREA)
  • Operations Research (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Mathematics (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Optimization (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Strategic Management (AREA)
  • Pure & Applied Mathematics (AREA)
  • Human Resources & Organizations (AREA)
  • Evolutionary Biology (AREA)
  • Economics (AREA)
  • Software Systems (AREA)
  • Biomedical Technology (AREA)
  • Databases & Information Systems (AREA)
  • Biophysics (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Development Economics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Game Theory and Decision Science (AREA)
  • Algebra (AREA)
  • Probability & Statistics with Applications (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Marketing (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)

Abstract

The invention discloses a method for predicting sewage effluent indexes based on random forests and gradient lifting trees, which comprises the following steps. Step 1: draw samples with replacement from the original training data set to form a plurality of sample sets. Step 2: construct a random forest from the sample sets, calculate feature importance from the random forest, and screen the attributes. Step 3: construct a gradient lifting tree model from the samples formed by the screened attributes. Step 4: feed real-time monitoring data into the gradient lifting tree model to predict the effluent indexes of the sewage plant over a future period. The invention combines the random forest and the gradient lifting tree model to establish a relational model of the sewage effluent index data; through the dimensionality reduction performed by the random forest and the high-precision training of the gradient lifting tree, the sewage effluent index data over a future period can be predicted more accurately.

Description

Method for predicting sewage effluent index based on random forest and gradient lifting tree
Technical Field
The invention relates to the technical field of sewage treatment and machine learning, in particular to a method for predicting a sewage effluent index based on a random forest and a gradient lifting tree.
Background
The municipal sewage treatment process is a complex biochemical reaction process accompanied by physicochemical reactions, biochemical reactions, phase changes, and the conversion and transfer of material and energy; the process is complex and difficult to model with traditional mathematical approaches. Many scholars have studied solving such problems with neural networks. Predicting sewage effluent indexes with a neural network solves the problem to a certain extent, but still suffers from slow training, and model accuracy needs to be improved. Moreover, such studies do not exclude irrelevant factors in the reaction process, which negatively affects both the training speed and the accuracy of the model.
Disclosure of Invention
The invention aims to provide a method for predicting sewage effluent indexes based on a random forest and a gradient lifting tree, so as to establish a relational model between the main sewage effluent index data and the sewage water quality index data and to obtain the main effluent index data from real-time monitoring.
In order to achieve the above aim, the invention provides a method for predicting sewage effluent indexes based on a random forest and a gradient lifting tree, comprising the following steps:
step 1: drawing samples with replacement from an original training data set to form a plurality of sample sets;
step 2: constructing a random forest from the sample sets; calculating feature importance from the random forest and screening the attributes;
step 3: constructing a gradient lifting tree model from the samples formed by the screened attributes;
step 4: feeding the real-time monitoring data into the gradient lifting tree model to predict the effluent indexes of the sewage plant over a future period.
In the method for predicting the sewage effluent index based on the random forest and the gradient lifting tree, step 1 further comprises the following steps: drawing samples randomly with replacement from the original training set to construct regression trees; the samples not drawn each time form out-of-bag sample sets equal in number to the regression trees.
In the method for predicting the sewage effluent index based on the random forest and the gradient lifting tree, step 2 comprises the following steps:
step 2.1: traversing the possible values under each feature attribute and finally selecting the point with the smallest sum of squared errors as the split point;
step 2.2: calculating the sum of squared errors of all attributes and selecting the attribute with the smallest error as the partition attribute;
step 2.3: constructing a regression tree for each partitioned sample set;
step 2.4: assembling the multiple regression trees into a random forest;
step 2.5: training the assembled random forest with the training set, and calculating the feature importance of the random forest by calculating the out-of-bag error on the out-of-bag samples;
step 2.6: sorting the features by importance and screening out the important features.
In the method for predicting the sewage effluent index based on the random forest and the gradient lifting tree, step 3 specifically comprises the following steps:
step 3.1: building a new training sample set from the samples with the screened features;
step 3.2: each regression tree approximating the loss of the current iteration with the negative gradient so as to determine its optimal parameters; the residual calculated by each regression tree being updated and passed to the next regression tree;
step 3.3: accumulating the multiple regression trees to form the gradient lifting tree model.
In the method for predicting the sewage effluent index based on the random forest and the gradient lifting tree, the gradient lifting tree model is:
f_M(x) = Σ_{m=1}^{M} Σ_{j=1}^{J} c_{m,j} · I(x ∈ R_{m,j})
where J is the number of leaf nodes, I(·) indicates whether x falls in the j-th leaf region R_{m,j} of the m-th tree, c_{m,j} is the output value of that leaf, and f_M(x) is the predicted value of the final model.
Compared with the prior art, the invention has the following beneficial effects:
in the sewage data link, not only expert opinions are referred to, but also the characteristic attributes suitable for the data can be screened out through the random forest model according to the collected sewage data, so that redundant characteristic attributes are deleted, the characteristic dimension reduction is realized, and the model training speed and the data quality are improved. The method uses a gradient lifting tree model method, which is higher in accuracy than methods such as a support vector machine and a neural network, and can improve the prediction accuracy of sewage; according to the method, a random forest and a gradient lifting tree model are combined to establish a relation model of sewage effluent index data, and the sewage effluent index data in a future period of time can be accurately predicted through dimensionality reduction of the random forest and high-precision training of the gradient lifting tree, so that a sewage plant can predict the sewage effluent index data in the future period of time according to the sewage effluent index data detected in real time, and then the sewage plant can judge that the effluent index of the sewage plant meets the national safety standard according to the predicted sewage effluent index data; furthermore, the sewage plant can control the amount of oxygen discharged into the sewage on the basis that the effluent index of the sewage meets the national safety standard, so that the aim of saving the cost of the plant is fulfilled; in a word, after the sewage plant uses the invention, the purposes of energy saving and emission reduction can be achieved, and the treatment cost of the sewage plant can also be reduced.
Drawings
FIG. 1 is a flow chart of the method for predicting sewage effluent indexes based on random forests and gradient lifting trees according to the present invention;
FIG. 2 is a flow chart of the steps of random forest screening attributes in the present invention;
FIG. 3 is a flowchart of the steps of constructing a gradient lifting tree model according to the present invention.
Detailed Description
The invention will be further described by means of specific examples in conjunction with the accompanying drawings, which are provided for illustration only and are not intended to limit the scope of the invention.
The invention provides a method for predicting a sewage effluent index based on a random forest and a gradient lifting tree, which comprises the following steps:
Step 1: drawing samples with replacement from the original training data set to form a plurality of sample sets;
step 1 further comprises the following steps: drawing samples randomly with replacement from the original training set to construct regression trees; the samples not drawn each time form out-of-bag sample sets equal in number to the regression trees.
Step 2: constructing a random forest from the sample sets; calculating the feature importance from the random forest and screening the attributes;
step 2 specifically comprises the following steps:
step 2.1: traversing the possible values under each feature attribute and finally selecting the point with the smallest sum of squared errors as the split point;
step 2.2: calculating the sum of squared errors of all attributes and selecting the attribute with the smallest error as the partition attribute;
step 2.3: constructing a regression tree for each partitioned sample set;
step 2.4: assembling the multiple regression trees into a random forest;
step 2.5: training the assembled random forest with the training set, and calculating the feature importance of the random forest by calculating the out-of-bag error on the out-of-bag samples;
step 2.6: sorting the features by importance and screening out the important features.
Step 3: constructing a gradient lifting tree model from the samples formed by the screened attributes;
step 3 specifically comprises the following steps:
step 3.1: building a new training sample set from the samples with the screened features;
step 3.2: each regression tree approximating the loss of the current iteration with the negative gradient so as to determine its optimal parameters; the residual calculated by each regression tree being updated and passed to the next regression tree;
step 3.3: accumulating the multiple regression trees to form the gradient lifting tree model.
Step 4: feeding the real-time monitoring data into the gradient lifting tree model to predict the effluent indexes of the sewage plant over a future period.
The gradient lifting tree model is:
f_M(x) = Σ_{m=1}^{M} Σ_{j=1}^{J} c_{m,j} · I(x ∈ R_{m,j})
where J is the number of leaf nodes, I(·) indicates whether x falls in the j-th leaf region R_{m,j} of the m-th tree, c_{m,j} is the output value of that leaf, and f_M(x) is the predicted value of the final model.
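For orientation, the following is a minimal sketch of the two-stage pipeline described above, written with scikit-learn; the estimator classes, the permutation-importance helper, the column handling and all hyper-parameters are illustrative assumptions rather than part of the claimed method.

import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor, GradientBoostingRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

def train_effluent_model(df: pd.DataFrame, target: str, keep_top: int = 5):
    # Steps 1-2: bagged regression trees, then permutation-style importance ranking.
    X, y = df.drop(columns=[target]), df[target]
    X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)
    rf = RandomForestRegressor(n_estimators=200, oob_score=True, random_state=0)
    rf.fit(X_tr, y_tr)
    imp = permutation_importance(rf, X_val, y_val, n_repeats=10, random_state=0)
    ranked = X.columns[np.argsort(imp.importances_mean)[::-1]]
    selected = list(ranked[:keep_top])                      # screened attributes

    # Step 3: gradient lifting (boosting) trees trained on the screened attributes only.
    gbt = GradientBoostingRegressor(n_estimators=300, learning_rate=0.05, max_depth=3)
    gbt.fit(X_tr[selected], y_tr)
    return gbt, selected

The returned model and screened attribute list then serve step 4, in which the real-time monitoring data are fed to the model.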
In a more specific embodiment, the method for predicting the effluent index of sewage based on random forests and gradient lifting trees comprises the following steps:
Randomly drawing samples with replacement from the original training data set to construct a plurality of sample sets; the original training data set refers to the data of each sewage index acquired by the sewage plant sensors;
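A minimal sketch of this sampling-with-replacement step (the function name and array layout are assumptions made for illustration):

import numpy as np

def bootstrap_sample(X: np.ndarray, y: np.ndarray, rng: np.random.Generator):
    # Draw one bootstrap sample (with replacement) for a single regression tree
    # and return the out-of-bag row indices that were never drawn.
    n = len(X)
    idx = rng.integers(0, n, size=n)          # n rows drawn with replacement
    oob = np.setdiff1d(np.arange(n), idx)     # rows left out of this sample
    return X[idx], y[idx], oob

Repeating this once per tree yields the plurality of sample sets, together with one out-of-bag set per regression tree.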
selecting splitting attributes for each sample set and building a regression tree (the prediction target is a continuous variable) for the sample set according to the splitting attributes; a plurality of such regression trees form a random forest;
putting the samples to be trained into the random forest for training, the random forest obtaining the feature importance of the training data from the calculated out-of-bag errors;
the samples to be trained are the data sets randomly drawn from the original training data to be put into the random forest for training;
As an implementation mode, the method for calculating the importance of the sample features from the sample set comprises the following steps. Before describing the steps, the relationship between a sample and its attributes may be expressed as:
X = <x_1, x_2, x_3, …, x_n>
where X is a sample in the sample set and x_1, x_2, …, x_n are the attributes of that sample, assuming n attributes in total;
the attributes in the method refer to the sewage indexes collected by the sewage plant sensors, such as the aeration value (dissolved oxygen, DO), influent pH, influent chemical oxygen demand (COD), influent total phosphorus (TP), influent ammonia nitrogen (NH3-N), suspended solids concentration (SS), and mixed liquor suspended solids concentration (MLSS);
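Purely for illustration, one training record built from the indicators listed above could be laid out as follows; the column names and values are hypothetical:

import pandas as pd

record = pd.DataFrame([{
    "DO": 2.1,       # aeration value (dissolved oxygen), mg/L
    "pH": 7.3,       # influent pH
    "COD": 310.0,    # influent chemical oxygen demand, mg/L
    "TP": 4.2,       # influent total phosphorus, mg/L
    "NH3N": 25.6,    # influent ammonia nitrogen, mg/L
    "SS": 180.0,     # suspended solids concentration, mg/L
    "MLSS": 3500.0,  # mixed liquor suspended solids concentration, mg/L
}])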
selecting the attribute with the minimum sum of squared errors in the sample attributes as a splitting attribute;
As an embodiment, selecting the attribute with the smallest sum of squared errors in the training samples as the split attribute specifically includes the following steps:
the predicted value produced by the random forest is the average of the output data in the subset;
the subsets are the different subsets into which each tree of the random forest divides the samples to be trained, the output data are the actual values of the feature to be predicted for the samples in each subset, and the predicted value is the average of those actual values within the subset;
all possible values under each feature attribute are traversed, and the split point for which the resulting sum of squared errors is smallest is finally selected;
here, a value means a value of the feature attribute;
the sums of squared errors of all attributes are compared, the attribute with the smallest sum of squared errors is selected as the optimal partition attribute, and the subset to be trained is divided into two child nodes;
as an embodiment, the calculation formula of the sum of squared errors in the selected training subset is:
min_{j,s} [ min_{c1} Σ_{x_i ∈ R1(j,s)} (y_i − c1)² + min_{c2} Σ_{x_i ∈ R2(j,s)} (y_i − c2)² ]
The regression tree selects a split point that divides the attribute values into two parts, R1 and R2; y_i is the target value of the training sample, c1 and c2 are the averages of the target values within the two parts, s is the candidate split value (each feature has a number of candidate values s), and j denotes the feature being split;
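A minimal sketch of this split search, assuming a plain NumPy feature matrix: for every feature j and every candidate threshold s, the sum of squared errors of the two resulting parts is evaluated, and the (j, s) pair with the smallest sum is kept.

import numpy as np

def best_split(X: np.ndarray, y: np.ndarray):
    # Exhaustive CART-style split search under the squared-error criterion.
    best_j, best_s, best_sse = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j]):
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            if len(left) == 0 or len(right) == 0:
                continue
            sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
            if sse < best_sse:
                best_j, best_s, best_sse = j, s, sse
    return best_j, best_s, best_sse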
as an embodiment, the calculating the feature importance specifically includes the following steps:
for each regression tree in the random forest (regression problem), its out-of-bag sample error, denoted err, is calculated using the corresponding out-of-bag sample 1
Randomly adding noise interference to the characteristics of the sample outside the bag, and calculating the error outside the bag again, and recording the error as err 2
Wherein, the out-of-bag sample refers to sample data which is not used as a training sample;
the formula for calculating the importance of a certain feature is as follows:
f = (1 / n) · Σ (err_2 − err_1)
where n is the number of out-of-bag samples, err_1 and err_2 are the out-of-bag errors before and after the noise perturbation, and f — the averaged difference between them — is taken as the importance value of the feature;
sorting the features according to the calculated feature importance, and screening out important features;
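A minimal sketch of this permutation-style importance calculation, assuming each fitted regression tree keeps the out-of-bag index set produced by the earlier bootstrap step; the tree interface (a fitted object with a predict method) is a placeholder assumption.

import numpy as np

def oob_permutation_importance(trees, oob_sets, X, y, feature, rng):
    # Average increase of the out-of-bag error after shuffling one feature.
    diffs = []
    for tree, oob in zip(trees, oob_sets):
        err1 = np.mean((y[oob] - tree.predict(X[oob])) ** 2)      # plain OOB error
        X_perm = X[oob].copy()
        X_perm[:, feature] = rng.permutation(X_perm[:, feature])  # noise the feature
        err2 = np.mean((y[oob] - tree.predict(X_perm)) ** 2)      # perturbed OOB error
        diffs.append(err2 - err1)
    return float(np.mean(diffs))                                  # importance value f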
constructing the sample with the screened characteristics into a new training sample;
putting the training samples into a first regression tree to predict the result of the training samples, and calculating an error value;
wherein, the error value refers to the error between the predicted value and the real value;
the error value is fed as the input to the next regression tree, which continues to calculate a new error value;
m regression trees are iterated in this way, and the fitted error values of the iterations are accumulated to form the gradient lifting tree model;
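A minimal sketch of this boosting loop under squared loss, where the error value carried to the next tree is the residual of the current model; the use of scikit-learn's DecisionTreeRegressor as the base learner and the shrinkage factor are assumptions.

import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbt(X, y, n_trees=100, learning_rate=0.1, max_depth=3):
    # Gradient lifting (boosting) trees: each new tree is fitted to the residuals.
    init = float(np.mean(y))
    f = np.full(len(y), init)                 # initial constant prediction
    trees = []
    for _ in range(n_trees):
        residual = y - f                      # error value; negative gradient of squared loss up to a factor
        tree = DecisionTreeRegressor(max_depth=max_depth).fit(X, residual)
        f += learning_rate * tree.predict(X)  # accumulate into the additive model
        trees.append(tree)
    return init, trees

def predict_gbt(init, trees, X, learning_rate=0.1):
    f = np.full(len(X), init)
    for tree in trees:
        f += learning_rate * tree.predict(X)
    return f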
As an embodiment, the construction of the regression tree specifically includes the following steps:
recursively constructing a regression tree according to a criterion of a least square error;
wherein the least squares error criterion is determined based on the following equation:
min_{j,s} [ min_{c1} Σ_{x_i ∈ R1(j,s)} (y_i − c1)² + min_{c2} Σ_{x_i ∈ R2(j,s)} (y_i − c2)² ]
according to the establishment mode of the decision tree, selecting the jth dimension feature and the corresponding threshold value s of the sample x at each decision node as a segmentation feature and a segmentation threshold value, dividing the node into two regions, wherein the specific formula is as follows:
R_1(j, s) = {x | x[j] ≤ s} and R_2(j, s) = {x | x[j] > s}
Here the sample x is one of the samples to be trained, x[j] is the value of the j-th feature of sample x, and the segmentation feature is the feature j that, together with the chosen segmentation threshold s, divides the feature attribute into the two regions;
as an embodiment, the calculation formula for dividing the node into two regions is as follows:
min_{c1} Σ_{x_i ∈ R1(j,s)} (y_i − c1)² + min_{c2} Σ_{x_i ∈ R2(j,s)} (y_i − c2)²
where y_i is the true value of the sample and c1, c2 are the predicted (mean) output values of the regression tree in the two regions;
A regression tree divides the feature space (the input space) into M units {R_1, R_2, …, R_M}; each leaf node of the regression tree corresponds to one unit and carries a fixed output value C_m. When an input feature vector x arrives, the regression tree routes it to a leaf node and returns the output value C_m of that leaf as the output of the tree. The calculation formula of the regression tree is:
f(x) = Σ_{m=1}^{M} C_m · I(x ∈ R_m)
as an embodiment, the constructing the gradient lifting tree model specifically includes the following steps:
the gradient lifting tree model is formed by iterative addition of M regression trees, and the calculation formula is as follows:
f_m(x) = f_{m−1}(x) + T(x; θ_m), m = 1, …, M,
where f_{m−1}(x) is the current lifting tree model, T(x; θ_m) is the newly generated regression tree, and θ_m are the parameters (coefficients) of that regression tree, chosen so that the error of the m-th regression tree is minimized;
the gradient lifting tree needs to select proper regression tree parameters to minimize the loss function, and the calculation formula is as follows:
θ_m = argmin_θ Σ_i L(y_i, f_{m−1}(x_i) + T(x_i; θ))
where y_i is the true value for the current m-th tree, and f_{m−1}(x_i) + T(x_i; θ_m) is the prediction of the current m-th tree, obtained by adding the output of the current m-th tree to the accumulated values of the previous m−1 trees. The purpose of this formula is to determine the parameter θ of the current m-th tree that minimizes the loss function L; unlike the previous formula, which describes the additive model of the gradient lifting tree, this one is used to obtain the parameters of the optimal regression tree so that the loss is minimized.
An approximation of the loss of the current round in the iterative process is fitted according to the negative gradient of the loss function, thereby determining the parameter θ that minimizes the loss function; using a squared loss function, this is represented by the following equation:
L(y_i, f_{m−1}(x_i) + T(x_i; θ_m)) = [y_i − f_{m−1}(x_i) − T(x_i; θ_m)]²
wherein, the loss function L is L in the calculation formula for determining the proper regression tree parameter;
the expression for the formula for approximating the loss value of the computational iteration using a negative gradient is as follows:
L(y_i, f_{m−1}(x_i) + T(x_i; θ_m)) = [y_i − f_{m−1}(x_i) − T(x_i; θ_m)]² = [r_{m,i} − T(x_i; θ_m)]²
r_{m,i} = −[∂L(y_i, f(x_i)) / ∂f(x_i)], evaluated at f(x) = f_{m−1}(x)
where L is the loss function (in this model the mean squared error is used as the loss function), f(x_i) is the predicted value obtained by training for one of the samples to be trained, and r_{m,i} is given by the negative-gradient formula; the parameter θ that minimizes the loss function is finally determined using the formulas above;
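As a worked check of this step (assuming the squared loss written above): with L(y_i, f(x_i)) = [y_i − f(x_i)]², the derivative with respect to f(x_i) is −2·[y_i − f(x_i)], so
r_{m,i} = −[∂L(y_i, f(x_i)) / ∂f(x_i)], evaluated at f = f_{m−1}, equals 2·[y_i − f_{m−1}(x_i)],
which is proportional to the residual y_i − f_{m−1}(x_i); if the loss is scaled by 1/2, as is common, the negative gradient equals the residual exactly. Under squared loss, therefore, fitting the negative gradient and fitting the error value passed to the next regression tree amount to the same thing.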
training each regression tree according to the method for determining the regression tree parameters in the previous step, and finally finishing the training when the mth tree is iteratively trained;
in training the mth regression tree, the output of the leaf nodes may be represented as follows:
c_{m,j} = argmin_c Σ_{x_i ∈ R_{m,j}} L(y_i, f_{m−1}(x_i) + c)
where c_{m,j} is the output of the j-th leaf of the m-th tree; it is accumulated with the predicted values obtained from the previous m−1 trees to give the prediction of the first m trees, so that the loss function is minimized and the first m regression trees are fully trained;
the model expression of the final gradient lifting tree is as follows:
f_M(x) = Σ_{m=1}^{M} Σ_{j=1}^{J} c_{m,j} · I(x ∈ R_{m,j})
where J is the number of leaf nodes, I(·) indicates whether x falls in the j-th leaf region R_{m,j} of the m-th tree, c_{m,j} is the output value of that leaf, and f_M(x) is the predicted value of the final model.
Finally, the data monitored by the sewage plant in real time are put into the gradient lifting tree model to predict the effluent indexes of the sewage plant over a future period.
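A minimal usage sketch of this final step, assuming the illustrative helper train_effluent_model sketched earlier in this description; the file names and the target column are hypothetical.

import pandas as pd

history = pd.read_csv("historical_monitoring.csv")           # assumed archive of sensor data
gbt, selected = train_effluent_model(history, target="effluent_COD")
realtime = pd.read_csv("realtime_monitoring.csv")            # assumed real-time monitoring export
forecast = gbt.predict(realtime[selected])                   # predicted effluent index values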
Obviously, the method for predicting the sewage effluent index based on the random forest and the gradient lifting tree provided by the invention can also be implemented as a system based on the combination of the random forest and the gradient lifting tree, comprising a sample construction module, a random forest module, a random forest training module, a data screening module, a gradient lifting tree construction module, a gradient lifting tree training module and a prediction module;
the sample construction module is used for randomly drawing samples with replacement from the original training set and constructing a plurality of sample sets, so as to facilitate the training of the subsequent random forest model;
the random forest module is used for integrating and constructing a random forest according to the established multiple regression trees;
the random forest training module is used for training the constructed sample set by the random forest so as to obtain the importance of each characteristic attribute;
the data screening module is used for sequencing the importance of each feature attribute calculated by the random forest training module, deleting the features with low importance and finally keeping the important features;
the gradient lifting tree construction module is used for constructing a gradient lifting tree model by iterating all regression trees;
the gradient lifting tree training module is used for training, with the constructed gradient lifting tree model, the training set constructed from the previously screened features;
the prediction module is used for obtaining the prediction data, putting them into the gradient lifting tree for prediction after feature screening, and obtaining the prediction result of the sewage effluent index data over a future period by accumulating the predictions of the successive trees.
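Purely as an illustration of how the listed modules could be composed, the following sketch wires scikit-learn-style regressors into one pipeline object; every class and method name here is hypothetical and not part of the claimed system.

class EffluentPredictionPipeline:
    """Composes the modules listed above: random forest training,
    data screening, gradient lifting tree training and prediction."""

    def __init__(self, rf, gbt, keep_top=5):
        self.rf, self.gbt, self.keep_top = rf, gbt, keep_top
        self.selected = None

    def fit(self, X, y):
        self.rf.fit(X, y)                                     # random forest training module
        order = self.rf.feature_importances_.argsort()[::-1]  # data screening module
        self.selected = X.columns[order[: self.keep_top]]
        self.gbt.fit(X[self.selected], y)                     # gradient lifting tree training module
        return self

    def predict(self, X_new):
        return self.gbt.predict(X_new[self.selected])         # prediction module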
In conclusion, the invention screens out the important factors with the random forest so as to further improve model training efficiency and accuracy, and predicts with a gradient lifting tree model that is more precise than a neural network, so that a sewage plant can predict whether the sewage quality over a future period will be within the national discharge standard. Finally, the aeration value supplied by the plant can be controlled so as to save plant operating cost and achieve green, safe discharge.
While the present invention has been described in detail with reference to the preferred embodiments, it should be understood that the above description should not be taken as limiting the invention. Various modifications and alterations to this invention will become apparent to those skilled in the art upon reading the foregoing description. Accordingly, the scope of the invention should be limited only by the attached claims.

Claims (2)

1. A method for predicting sewage effluent indexes based on random forests and gradient lifting trees is characterized by comprising the following steps:
step 1: drawing samples with replacement from an original training data set to form a plurality of sample sets;
step 2: constructing a random forest from the sample sets; calculating feature importance from the random forest and performing attribute screening, wherein
step 2.1: traversing the possible values under each feature attribute and finally selecting the point with the smallest sum of squared errors as the split point;
step 2.2: calculating the sum of squared errors of all attributes and selecting the attribute with the smallest error as the partition attribute;
step 2.3: constructing a regression tree for each partitioned sample set;
step 2.4: assembling the multiple regression trees into a random forest;
step 2.5: training the assembled random forest with the training set, and calculating the feature importance of the random forest by calculating the out-of-bag error on the out-of-bag samples;
step 2.6: sorting the features by importance and screening out the important features;
step 3: constructing a gradient lifting tree model from the samples formed by the screened attributes, wherein
step 3.1: building a new training sample set from the samples with the screened features;
step 3.2: each regression tree approximating the loss of the current iteration with the negative gradient so as to determine its optimal parameters; the residual calculated by each regression tree being updated and passed to the next regression tree;
step 3.3: accumulating the multiple regression trees to form the gradient lifting tree model, wherein the gradient lifting tree model is:
f_M(x) = Σ_{m=1}^{M} Σ_{j=1}^{J} c_{m,j} · I(x ∈ R_{m,j})
where J is the number of leaf nodes, I(·) indicates whether x falls in the j-th leaf region R_{m,j} of the m-th tree, c_{m,j} is the output value of that leaf, and f_M(x) is the predicted value of the final model;
step 4: feeding the real-time monitoring data into the gradient lifting tree model to predict the effluent indexes of the sewage plant over a future period.
2. The method for predicting sewage effluent indexes based on a random forest and a gradient lifting tree as set forth in claim 1, wherein step 1 further comprises the following steps: randomly drawing samples with replacement from the original training set to construct regression trees; the samples not drawn each time form out-of-bag sample sets equal in number to the regression trees.
CN201811323416.1A 2018-11-07 2018-11-07 Method for predicting sewage effluent index based on random forest and gradient lifting tree Active CN109408774B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811323416.1A CN109408774B (en) 2018-11-07 2018-11-07 Method for predicting sewage effluent index based on random forest and gradient lifting tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811323416.1A CN109408774B (en) 2018-11-07 2018-11-07 Method for predicting sewage effluent index based on random forest and gradient lifting tree

Publications (2)

Publication Number Publication Date
CN109408774A CN109408774A (en) 2019-03-01
CN109408774B true CN109408774B (en) 2022-11-08

Family

ID=65472116

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811323416.1A Active CN109408774B (en) 2018-11-07 2018-11-07 Method for predicting sewage effluent index based on random forest and gradient lifting tree

Country Status (1)

Country Link
CN (1) CN109408774B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110308705A (en) * 2019-06-19 2019-10-08 上海华高汇元工程服务有限公司 A kind of apparatus control method based on big data and artificial intelligence water quality prediction
CN112348039B (en) * 2019-08-07 2023-04-07 中国移动通信集团上海有限公司 Training method of driving behavior analysis model, driving behavior analysis method and equipment
CN110795846B (en) * 2019-10-29 2023-07-14 东北财经大学 Boundary forest model construction method, multi-task soft computing model updating method oriented to complex industrial process and application of multi-task soft computing model updating method
CN110956010B (en) * 2019-11-01 2023-04-18 国网辽宁省电力有限公司阜新供电公司 Large-scale new energy access power grid stability identification method based on gradient lifting tree
CN111429970B (en) * 2019-12-24 2024-03-22 大连海事大学 Method and system for acquiring multiple gene risk scores based on feature selection of extreme gradient lifting method
CN111260149B (en) * 2020-02-10 2023-06-23 北京工业大学 Dioxin emission concentration prediction method
CN111667107B (en) * 2020-05-29 2024-05-14 中国工商银行股份有限公司 Research and development management and control problem prediction method and device based on gradient random forest
CN112580703B (en) * 2020-12-07 2022-07-05 昆明理工大学 Method for predicting morbidity of panax notoginseng in high-incidence stage
CN112733903B (en) * 2020-12-30 2023-11-17 许昌学院 SVM-RF-DT combination-based air quality monitoring and alarming method, system, device and medium
CN113361199A (en) * 2021-06-09 2021-09-07 成都之维安科技股份有限公司 Multi-dimensional pollutant emission intensity prediction method based on time series
CN113344130B (en) * 2021-06-30 2022-01-11 广州市河涌监测中心 Method and device for generating differentiated river patrol strategy
CN113537585B (en) * 2021-07-09 2023-04-07 中海石油(中国)有限公司天津分公司 Oil field production increasing measure recommendation method based on random forest and gradient lifting decision tree
CN113743453A (en) * 2021-07-21 2021-12-03 东北大学 Population quantity prediction method based on random forest
CN114462699A (en) * 2022-01-28 2022-05-10 无锡雪浪数制科技有限公司 Optical fiber production qualification index prediction method based on random forest
CN114913683A (en) * 2022-04-22 2022-08-16 星慧照明工程集团有限公司 Traffic signal lamp monitoring system

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106991437A (en) * 2017-03-20 2017-07-28 浙江工商大学 The method and system of sewage quality data are predicted based on random forest
CN108536650A (en) * 2018-04-03 2018-09-14 北京京东尚科信息技术有限公司 Generate the method and apparatus that gradient promotes tree-model

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106991437A (en) * 2017-03-20 2017-07-28 浙江工商大学 The method and system of sewage quality data are predicted based on random forest
CN108536650A (en) * 2018-04-03 2018-09-14 北京京东尚科信息技术有限公司 Generate the method and apparatus that gradient promotes tree-model

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Prediction model of atmospheric corrosion rate based on manifold dimensionality reduction and gradient boosting tree (基于流形降维和梯度提升树的大气腐蚀速率预测模型); Liang Xiwang et al.; Equipment Environmental Engineering (装备环境工程); 2018-06-25 (No. 06); full text *

Also Published As

Publication number Publication date
CN109408774A (en) 2019-03-01

Similar Documents

Publication Publication Date Title
CN109408774B (en) Method for predicting sewage effluent index based on random forest and gradient lifting tree
CN110782093B (en) PM fusing SSAE deep feature learning and LSTM2.5Hourly concentration prediction method and system
CN109828089B (en) DBN-BP-based water quality parameter nitrous acid nitrogen online prediction method
CN108346293B (en) Real-time traffic flow short-time prediction method
CN110824915B (en) GA-DBN network-based intelligent monitoring method and system for wastewater treatment
CN109558893B (en) Rapid integrated sewage treatment fault diagnosis method based on resampling pool
CN111160776A (en) Method for detecting abnormal working condition in sewage treatment process by utilizing block principal component analysis
CN109919356B (en) BP neural network-based interval water demand prediction method
CN104123476A (en) Gas concentration prediction method and device based on extreme learning machine
CN110782658A (en) Traffic prediction method based on LightGBM algorithm
CN108647807B (en) River flow prediction method
CN112070356A (en) Method for predicting anti-carbonization performance of concrete based on RF-LSSVM model
CN112364560B (en) Intelligent prediction method for working hours of mine rock drilling equipment
CN115906954A (en) Multivariate time sequence prediction method and device based on graph neural network
CN111932039A (en) Train arrival late prediction method and device, electronic equipment and storage medium
CN108961460B (en) Fault prediction method and device based on sparse ESGP (Enterprise service gateway) and multi-objective optimization
CN106200381B (en) A method of according to the operation of processing water control by stages water factory
CN114417740B (en) Deep sea breeding situation sensing method
CN115147645A (en) Membrane module membrane pollution detection method based on multi-feature information fusion
CN115659774A (en) Dam risk Bayesian network model modeling method integrating machine learning
CN105372995A (en) Measurement and control method for sewage disposal system
CN114707692A (en) Wetland effluent ammonia nitrogen concentration prediction method and system based on hybrid neural network
KR101585545B1 (en) A method of Wavelet-based autoregressive fuzzy modeling for forecasting algal blooms
CN112819087B (en) Method for detecting abnormality of BOD sensor of outlet water based on modularized neural network
CN117196883A (en) Sewage treatment decision optimization method and system based on artificial intelligence

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant