CN113361199A

CN113361199A - Multi-dimensional pollutant emission intensity prediction method based on time series

Info

Publication number: CN113361199A
Application number: CN202110642414.4A
Authority: CN
Inventors: 黄欣逸; 李扬; 赵勐; 刘卫杰; 张伟; 宋俊男
Original assignee: Chengdu Zvan Technology Co ltd
Current assignee: Chengdu Zvan Technology Co ltd
Priority date: 2021-06-09
Filing date: 2021-06-09
Publication date: 2021-09-07

Abstract

The invention relates to a multi-dimensional pollutant emission intensity prediction method based on time series, which comprises a data preprocessing step and a GS-RF prediction model building and predicting step. The latter comprises: setting a search space and an update step length, and initializing the parameter combination and minimum error of an RF model; cyclically traversing the parameter combinations, using the data set obtained by data preprocessing and each parameter combination as the input of the RF model, and training the RF model by cross-validation to obtain its MSE; updating the parameter combination of the RF model according to the obtained MSE; outputting the optimal parameter combination when the termination condition is reached; and establishing a GS-RF prediction model with the optimal parameter combination and using it for prediction. The invention recovers data missing from gaps in the record and improves prediction on continuous time series; random forest parameters are optimized with a grid search algorithm, and each parameter combination is evaluated by cross-validation, which improves the optimization of the model.

Description

Multi-dimensional pollutant emission intensity prediction method based on time series

Technical Field

The invention relates to the technical field of pollutant emission prediction, and in particular to a multi-dimensional pollutant emission intensity prediction method based on time series.

Background

As living standards gradually improve, public awareness of environmental protection grows stronger, yet pollutant discharge remains difficult to stop. Many enterprises inevitably discharge some industrial sewage into rivers, and although the sewage may undergo some treatment, there is no guarantee that the discharged pollutants stay within the permitted limits. How to predict the pollutant emission intensity of enterprises, so that discharged pollutants can be monitored, is therefore a problem to be solved at the present stage.

Disclosure of Invention

The invention aims to overcome the defects of the prior art and provides a multi-dimensional pollutant emission intensity prediction method based on time series, which addresses the problem of predicting and analysing pollutant emissions.

The purpose of the invention is realized by the following technical scheme: the multi-dimensional pollutant emission intensity prediction method based on the time series comprises a data preprocessing step and a GS-RF prediction model building and predicting step; the GS-RF prediction model building and predicting steps comprise:

S21, setting a search space and an update step length for each parameter in the RF model, and initializing the parameter combination and the minimum error of the RF model;

S22, cyclically traversing the parameter combinations with the GS algorithm, taking the k-th pollutant data set, obtained by the data preprocessing step and constructed through the logical mapping relationship, together with the parameter combination as the input of the RF model, and training the RF model by cross-validation to obtain the MSE of the RF model;

S23, updating the parameter combination of the RF model according to the obtained MSE;

S24, judging whether the GS algorithm has reached the termination condition of the iteration; if so, ending the search and outputting the current parameter combination of the RF model as the optimal parameter combination; if not, returning to step S22 to continue the iteration;

S25, establishing a GS-RF prediction model from the optimal parameter combination obtained in step S24, and predicting the emission intensity of the k-th pollutant dimension over the time series with the prediction model.

The data preprocessing step comprises:

S11, removing duplicates from the acquired data set with SQL statements, deleting records in which a field value is NaN, and thereby completing the preliminary cleaning of the data set;

S12, retrieving the data ordered by the time field through Python's DB-API, ensuring that the data set is in increasing order along the time dimension;

S13, using the data analysis library Pandas, instantiating the data set as a data frame object, and using the set_index() function to designate the time field as the index;

S14, specifying the sampling times of the first and last records of the data frame object as the time-axis boundaries, and using the reindex() function to fill in the timestamps missing on the time axis, with the remaining fields filled with 0;

S15, preparing interpolation of the 0-filled intensity data by cyclically traversing the data set, recording the interpolation indexes and calculating the number of values to interpolate;

S16, according to the interpolation indexes and the interpolation count of the data set, using the array computing library Numpy and its interp() function to perform linear interpolation at the index positions;

S17, setting the training scale of the samples and the corresponding regression sequence, and constructing a logical mapping relationship.

The linear interpolation specifically includes:

traversing the indexes of the missing data, taking the i-th record as missing, and constructing a linear function

Interpolation is carried out on each type of pollutant data;

wherein i-d denotes the position of the first non-missing record before the i-th record, and i+h denotes the position of the first non-missing record after the i-th record; a linear neighbourhood function of each pollutant type at the missing index i is thus constructed, and the interpolated value of the corresponding pollutant is output for the input index i.
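The linear function itself is not reproduced in this text. As a hedged reconstruction from the definitions given here, a straight line through the two nearest non-missing records could be written as below. The symbols f_j and x_{r,j} (the value of the j-th pollutant in record r) and the count m of pollutant types are notation assumed for illustration and may differ from the patent's own formula.

```latex
f_j(t) = x_{i-d,\,j} + \frac{x_{i+h,\,j} - x_{i-d,\,j}}{h + d}\,\bigl(t - (i - d)\bigr),
\qquad j = 1, \dots, m
```

Evaluating f_j at t = i would then give the interpolated value of pollutant j for the missing record.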

The setting of the training scale and the corresponding regression sequence of the sample, and the constructing of the logical mapping relationship comprises:

setting a training scale L that represents the number of consecutive time nodes, wherein every L consecutive records form one training sample used to predict the data of the next time node;

dividing the data set obtained in step S16 into two parts according to the training scale L: the first n-1 records are rearranged into n-L training samples by sliding a window of length L, and the last n-L records are taken as the regression sequence;

constructing a logical mapping relationship between the data of each continuous time sequence and the data of the time node that follows it.

The training of the RF model by the cross-validation method in step S22 includes:

A1, randomly dividing the training set into k equal parts;

A2, taking 1 part as the validation set for model evaluation, and taking the remaining k-1 parts as the training set for model training;

A3, repeating step A2 k times, taking a different part as the validation set each time, to obtain k different models and their evaluation indexes;

A4, evaluating the performance of the whole model with the combined evaluation indexes of the k models.

Parameters in the RF model include: n_estimators, representing the number of trees in the random forest, with the search space set to [a₁, b₁] and the search step set to step₁; max_depth, representing the maximum depth of a tree, with the search space set to [a₂, b₂] and the search step set to step₂; min_samples_leaf, representing the minimum number of samples contained in each child node after a node branches, with the search space set to [a₃, b₃] and the search step set to step₃; min_samples_split, representing the minimum number of training samples contained in a node, with the search space set to [a₄, b₄] and the search step set to step₄; max_features, representing the maximum feature dimension considered when branching, with the search space set to [a₅, b₅] and the search step set to step₅.

The search space represents the tuning range of a parameter, the search step represents the interval between candidate values of the parameter, and a grid space is formed from the search space and search step of each parameter in the RF model:

The invention has the following advantages: 1. the time-series-based multi-dimensional pollutant emission intensity prediction method recovers the data missing from gaps in the record and improves prediction on continuous time series; 2. random forest parameters are optimized with a grid search algorithm, and each parameter combination is evaluated by cross-validation, which improves the optimization of the model; 3. because the amount of data is small, the random forest algorithm is used to avoid the drawbacks of deep RNN networks, namely their large number of weights, complex inference and strong dependence on data.

Drawings

FIG. 1 is a schematic flow diagram of the present invention;

FIG. 2 is a schematic diagram of the logical relationship architecture of the present invention;

FIG. 3 is a schematic diagram of an equally divided training set of the cross-validation method of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the detailed description of the embodiments of the present application provided below in connection with the appended drawings is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application. The invention is further described below with reference to the accompanying drawings.

As shown in fig. 1, the invention relates to a multi-dimensional pollutant emission intensity prediction method based on time series, which uses front-end equipment to periodically collect and upload intensity data for the pollutants discharged from an enterprise's pollution source outlet, and builds a time-series-based multi-dimensional pollutant emission intensity prediction model. Because the acquired data may contain gaps and duplicates on the time axis, and some of the acquired values are NaN (non-numerical), the data preprocessing step comprises the following:

S11, removing duplicates from the data with SQL statements, deleting records in which a field value is NaN, and thereby completing the preliminary cleaning of the data set;

S12, retrieving the data ordered by the time field through Python's DB-API, ensuring that the data set is in increasing order along the time dimension;

S13, using the data analysis library Pandas, instantiating the data set as a data frame (DataFrame) object, and using the set_index() method to designate the time field as the index;

the format of the pollutant emission intensity data obtained by the database by setting the index is shown as the following table:

Each data dimension therefore has a header field, and the detection time field serves as the index of the table, which is used to complete the timestamps.
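Steps S11 and S12 could be sketched with Python's built-in DB-API driver for SQLite. This is only an illustration: the table name `emission`, the columns `detect_time`, `cod` and `nh3n`, and the sample rows are hypothetical, and the sketch assumes that missing values arrive in the database as NULL.

```python
import sqlite3

# Hypothetical schema and rows; names and values are assumptions for illustration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emission (detect_time TEXT, cod REAL, nh3n REAL)")
rows = [
    ("2021-05-29 02:00:00", 31.0, 1.2),
    ("2021-05-29 00:00:00", 30.0, 1.0),
    ("2021-05-29 00:00:00", 30.0, 1.0),   # exact duplicate of the previous row
    ("2021-05-29 01:00:00", None, 1.1),   # missing value, stored as NULL
]
conn.executemany("INSERT INTO emission VALUES (?, ?, ?)", rows)

# S11: DISTINCT removes duplicate rows; IS NOT NULL drops incomplete records.
# S12: ORDER BY returns the records in increasing time order.
query = """
    SELECT DISTINCT detect_time, cod, nh3n
    FROM emission
    WHERE cod IS NOT NULL AND nh3n IS NOT NULL
    ORDER BY detect_time ASC
"""
cleaned = conn.execute(query).fetchall()
conn.close()
```

Doing the deduplication and ordering in the query keeps the later Pandas steps free of cleaning logic, although the same result could equally be reached after loading the data.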

S14, specifying the sampling times of the first and last records of the data frame object as the time-axis boundaries, and using the reindex() method to fill in the timestamps missing on the time axis, with the remaining fields filled with 0;

Timestamp filling: since step S13 designates the time field as the index, the positions of records missing between the start and end of the detection period can be located by comparison with the regular timestamps; the missing entries of the detection time field are filled in, and the remaining dimensions are filled with the value 0.
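A minimal sketch of steps S13 and S14 with Pandas follows. The column names, the sample values and the hourly sampling period are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical records with a gap between 01:00 and 04:00.
records = [
    ("2021-05-29 00:00:00", 30.0),
    ("2021-05-29 01:00:00", 30.5),
    ("2021-05-29 04:00:00", 32.0),
]
frame = pd.DataFrame(records, columns=["detect_time", "cod"])
frame["detect_time"] = pd.to_datetime(frame["detect_time"])

# S13: use the time field as the index.
frame = frame.set_index("detect_time")

# S14: build the complete hourly axis between the first and last sample,
# then reindex so that missing timestamps appear with the fill value 0.
full_axis = pd.date_range(frame.index[0], frame.index[-1], freq=pd.Timedelta(hours=1))
filled = frame.reindex(full_axis, fill_value=0)
```

After this step `filled` has one row per hour, and the rows that were absent in the source carry 0 in every pollutant column, ready for the interpolation that follows.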

S15, after the data set has been processed in step S14, interpolation still needs to be performed on the intensity data filled with 0; the data set is cyclically traversed, the interpolation indexes are recorded, and the number of values to interpolate is calculated;

S16, according to the interpolation indexes and the interpolation count of the data set, using the array computing library Numpy and its interp() method to perform linear interpolation at the index positions;

The data format after timestamp filling is shown in the following table:

where the timestamps are consecutive and increasing, and # represents the original data.

The indexes of the missing data are traversed (for example items 2, k-1 and k in the table above). Taking item i as missing, and for each of the pollutant types (m types in total), the linear function is constructed as follows:

wherein i-d denotes the position of the first non-missing record before the i-th record, and i+h denotes the position of the first non-missing record after the i-th record. A linear neighbourhood function of each pollutant type at the missing index i is thus constructed, and the interpolated value of the corresponding pollutant is output for the input index i.
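Steps S15 and S16 could look like the following sketch for a single pollutant. The series is hypothetical, and the sketch assumes that a genuine measurement is never exactly 0, so that 0 reliably marks a filled-in gap.

```python
import numpy as np

# Hypothetical intensity series after step S14; 0 marks a filled-in gap.
cod = np.array([30.0, 0.0, 0.0, 33.0, 34.0, 0.0, 36.0])

positions = np.arange(len(cod))
missing = cod == 0                 # S15: record the interpolation indexes
n_missing = int(missing.sum())     # S15: number of values to interpolate

# S16: np.interp evaluates, at each missing index, the straight line through
# the nearest non-missing neighbours on either side.
restored = cod.copy()
restored[missing] = np.interp(positions[missing], positions[~missing], cod[~missing])
```

For several pollutants the same call would be repeated per column, since each pollutant has its own neighbouring values.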

To enable inference and prediction of the multi-dimensional pollutant emission intensity along the time axis, the training scale of the samples and the corresponding regression sequence need to be set, and a logical mapping relationship constructed.

S17, setting a training scale L, and dividing the data set obtained in step S16 into two parts: the first n-1 records are rearranged into n-L training samples by sliding a window of length L, and the last n-L records are used as the regression sequence. A logical mapping relationship is constructed between the data of each continuous time sequence and the data of the time node that follows it, which completes the preprocessing of the data set.

Setting the scale L means that the number of consecutive time nodes is L, wherein every L consecutive records form one training sample used to predict the data of the next time node; the mathematical formula can be expressed as follows:

The invention can therefore predict the data of the next time node from continuous time-series data.
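The window construction of step S17 can be illustrated with a short sketch. The function name `make_windows` and the sample series are hypothetical.

```python
def make_windows(series, L):
    """Pair each run of L consecutive values with the value that follows it."""
    count = len(series) - L                       # n - L training samples
    samples = [series[i:i + L] for i in range(count)]
    targets = [series[i + L] for i in range(count)]
    return samples, targets

# Hypothetical single-pollutant series with n = 6 records.
series = [30.0, 31.0, 32.0, 33.0, 34.0, 35.0]
X, y = make_windows(series, L=3)
```

With n = 6 and L = 3, the windows are drawn from the first n-1 = 5 records and the targets are the last n-L = 3 records, matching the split described above.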

Taking one of the pollutants as an example, the data set can be represented as shown in fig. 2, so that a logical mapping relationship can be constructed for a single type of pollutant data to establish a pollutant emission prediction model, and a plurality of prediction models are required to realize the emission intensity prediction of the multi-dimensional pollutants.

Random forest (RF) performs well on large data sets, handles high-dimensional data well, is modelled with unbiased estimation, and helps build a model with good generalization and strong predictive ability.

The random forest algorithm has various adjustable parameters, and manual intervention on the parameters can play a role in optimizing the random forest regression model, so that the generalization and prediction capability of the model are improved. Important parameters of random forests are:

(1) max_depth: the maximum depth of a tree, which determines the complexity of the tree model's decisions;

(2) min_samples_split: the minimum number of training samples contained in a node, which influences the generalization of the base model;

(3) min_samples_leaf: the minimum number of samples contained in each child node after a node branches, which influences the generalization of the base model;

(4) max_features: the maximum feature dimension considered when branching, which influences the complexity of the base model;

(5) n_estimators: the number of trees in the random forest; the larger the number, the better the model performance but the lower the efficiency.

For the 5 adjustable parameters above, and using the mean squared error (MSE) function as the criterion, the invention establishes a GS-RF pollutant emission prediction model through a random forest parameter optimization method based on grid search (GS), applied to the data set obtained in data preprocessing step S17. The details are as follows:

In the method, the Grid Search (GS) algorithm lays out the parameters, each along its own dimension, side by side within a specified search space. The basic idea is to divide the search space of each parameter to be optimized into a grid and to evaluate every parameter combination in the grid once, until the optimal combination is found. The search attributes of each parameter are shown in the following table:

In the table, the search space represents the tuning range of the parameter, and the search step represents the interval between candidate values of the parameter; together they form a grid space:

The goal of the GS algorithm is to find the best combination of parameters in this grid space.
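One way to picture the grid space is as the Cartesian product of the candidate values of each parameter. The sketch below assumes integer-valued parameters; the numeric bounds and steps are hypothetical, since the text leaves them symbolic.

```python
from itertools import product

# Hypothetical (a, b, step) triples for the five parameters.
search = {
    "n_estimators":      (50, 150, 50),
    "max_depth":         (2, 6, 2),
    "min_samples_leaf":  (1, 3, 1),
    "min_samples_split": (2, 4, 2),
    "max_features":      (1, 3, 1),
}

def axis(a, b, step):
    """Candidate values a, a + step, ... up to b (b included when reachable)."""
    return list(range(a, b + 1, step))

names = list(search)
grid = [dict(zip(names, combo))
        for combo in product(*(axis(*search[name]) for name in names))]
```

The grid size is the product of the number of candidates per parameter, here 3 x 3 x 3 x 2 x 3 = 162 combinations, which is why the step sizes largely determine the cost of the search.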

As shown in fig. 3, the RF model is trained with cross-validation (CV), which allows each parameter combination in the grid space to be evaluated effectively and is an essential guide for data modelling. The k-fold CV procedure for the model is as follows:

(1) dividing the training set randomly into k equal parts;

(2) taking one of the k subsets as the validation set for model evaluation and the remaining k-1 subsets as the training set for model training, then repeating this step k times with a different subset as the validation set each time, to obtain k different models and their evaluation indexes;

(3) evaluating the performance of the whole model with the combined (average) evaluation index of the k models.

The value of k depends on the specific conditions of the data modelling. For a typical training set, the larger k is, the smaller the learning bias of the random forest on the training set, which helps improve the generalization of the model; however, for a training set with very large sample variance, the larger k is, the longer the training period of the random forest, which lowers the efficiency of the model. It is therefore necessary to choose a k value of suitable size.
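The k-fold procedure can be sketched in plain Python as follows. The helper names are hypothetical, and `fit_mean` is a deliberately trivial stand-in learner (it always predicts the training mean) used in place of a random forest so that the example stays self-contained.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Shuffle sample indices and split them into k near-equal folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]

def mse(actual, predicted):
    """Mean squared error between two equally long sequences."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)

def cross_validate(fit, error, X, y, k):
    """Average validation error over k folds; each fold is held out once."""
    scores = []
    for held_out in k_fold_indices(len(X), k):
        held = set(held_out)
        train = [i for i in range(len(X)) if i not in held]
        model = fit([X[i] for i in train], [y[i] for i in train])
        preds = [model(X[i]) for i in held_out]
        scores.append(error([y[i] for i in held_out], preds))
    return sum(scores) / k

def fit_mean(X_train, y_train):
    """Stand-in learner (assumption): always predicts the training mean."""
    mean = sum(y_train) / len(y_train)
    return lambda x: mean

X = [[i] for i in range(10)]
y = [5.0] * 10
folds = k_fold_indices(len(X), k=5)
score = cross_validate(fit_mean, mse, X, y, k=5)
```

Because every sample is held out exactly once, the averaged score uses all of the data for validation without ever scoring a model on samples it was trained on.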

The construction process of the GS-RF pollutant emission prediction model is as follows:

(1) Setting a search space and an update step length for each parameter, and initializing the parameter combination to (Inf, …, Inf) and the minimum error of the model to Inf;

Initializing the parameter combination: at this point the loop traversal of the algorithm has not yet started. The minimum error of the model is initialized to infinity (Inf) so that the parameter combination and the minimum error are guaranteed to be replaced in the first iteration of the algorithm, since any actual error is necessarily lower than Inf.

(2) Using the GS algorithm to cyclically traverse the parameter combinations, taking the k-th pollutant data set constructed through the logical mapping relationship, together with the parameter combination, as the input of the RF regressor, training the RF model, and outputting the cross-validated MSE of the model. The MSE function is:

$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - y_i'\right)^2$$

where n represents the number of samples, y_i represents the true value of the i-th sample, and y_i' represents its predicted value.

(3) Updating the parameter combination of the RF model according to the model error obtained in step (2);

The update condition for the parameter combination is as follows: when the MSE produced by the current iteration is less than the minimum MSE so far, the minimum MSE becomes the MSE of the current iteration, and the optimal parameter combination becomes the parameter combination of the current iteration.

(4) Judging whether the algorithm has reached the termination condition of the iteration; if so, the search ends and the best RF parameter combination, that is, the optimal parameter combination, is output. Otherwise, the algorithm returns to step (2) to continue the iteration.

The termination condition is as follows: the iteration ends once every parameter combination in the grid space has been traversed.

(5) Establishing a GS-RF pollutant emission prediction model with the optimal parameter combination obtained in step (4), and predicting the emission intensity of the k-th pollutant dimension over the time series.

By traversing the data set of each pollutant dimension constructed through the logical mapping relationship, and establishing a GS-RF pollutant emission prediction model for each dimension, prediction of the multi-dimensional data is achieved.
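Steps (1) to (4) amount to an exhaustive scan that keeps the best combination seen so far. The sketch below shows only that loop: `toy_error` is a hypothetical stand-in for the cross-validated MSE of an RF model, and `None` is used as the initial placeholder where the text initializes the combination to (Inf, …, Inf).

```python
from itertools import product
from math import inf

def grid_search(grid_axes, cv_error):
    """Scan every combination in the grid and keep the one with the lowest error."""
    names = list(grid_axes)
    best_params, min_error = None, inf                        # step (1): initialize
    for combo in product(*(grid_axes[n] for n in names)):     # step (2): traverse
        params = dict(zip(names, combo))
        error = cv_error(params)
        if error < min_error:                                 # step (3): update
            best_params, min_error = params, error
    return best_params, min_error                             # step (4): grid exhausted

def toy_error(params):
    """Hypothetical error surface with its minimum at max_depth=4, n_estimators=100."""
    return (params["max_depth"] - 4) ** 2 + (params["n_estimators"] - 100) ** 2 / 2500

axes = {"n_estimators": [50, 100, 150], "max_depth": [2, 4, 6]}
best, err = grid_search(axes, toy_error)
```

In a full pipeline the `cv_error` callable would train and cross-validate an RF model for the given parameters, and the loop would be run once per pollutant dimension to obtain one tuned model each.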

If n pieces of pollutant emission intensity data are predicted from the current time, the specific content can be represented as follows:

Note that T denotes the end time of the original data (the reporting time of the last record, e.g. 2021/5/29 08:00:00), tinterval denotes the sampling period of the data (here one hour, e.g. 2021/5/29 08:00:00 + 01:00:00), and the emission intensity of each pollutant type is given in mg/L.
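The table of predicted records is not reproduced in this text. Assuming that the i-th predicted record is stamped T + i × tinterval, which is consistent with the example above, the forecast timestamps could be generated as in this sketch (the function name `forecast_times` is hypothetical).

```python
from datetime import datetime, timedelta

def forecast_times(T, t_interval, n):
    """Timestamps of the n predicted records: T + t_interval, ..., T + n * t_interval."""
    return [T + i * t_interval for i in range(1, n + 1)]

T = datetime(2021, 5, 29, 8, 0, 0)     # reporting time of the last original record
t_interval = timedelta(hours=1)        # sampling period
times = forecast_times(T, t_interval, n=3)
```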

The foregoing is illustrative of the preferred embodiments of this invention, and it is to be understood that the invention is not limited to the precise form disclosed herein and that various other combinations, modifications, and environments may be resorted to, falling within the scope of the concept as disclosed herein, either as described above or as apparent to those skilled in the relevant art. And that modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.

Claims

1. The multidimensional pollutant emission intensity prediction method based on the time series is characterized by comprising the following steps: the prediction method comprises a data preprocessing step and a GS-RF prediction model building and predicting step; the GS-RF prediction model building and predicting steps comprise:

2. The time-series-based multi-dimensional pollutant emission intensity prediction method according to claim 1, characterized in that: the data preprocessing step comprises:

3. The time-series-based multi-dimensional pollutant emission intensity prediction method according to claim 2, characterized in that: the linear interpolation specifically includes:

Interpolation processing is carried out on various pollutant data;

4. The time-series-based multi-dimensional pollutant emission intensity prediction method according to claim 2, characterized in that: the setting of the training scale and the corresponding regression sequence of the sample, and the constructing of the logical mapping relationship comprises:

5. The time-series-based multi-dimensional pollutant emission intensity prediction method according to claim 4, characterized in that: the training of the RF model by the cross-validation method in step S22 includes:

A1, randomly dividing the training set into k equal parts;

6. The time-series-based multi-dimensional pollutant emission intensity prediction method according to claim 1, characterized in that: parameters in the RF model include: n_estimators, representing the number of trees in the random forest, with the search space set to [a₁, b₁] and the search step set to step₁; max_depth, representing the maximum depth of a tree, with the search space set to [a₂, b₂] and the search step set to step₂; min_samples_leaf, representing the minimum number of samples contained in each child node after a node branches, with the search space set to [a₃, b₃] and the search step set to step₃; min_samples_split, representing the minimum number of training samples contained in a node, with the search space set to [a₄, b₄] and the search step set to step₄; max_features, representing the maximum feature dimension considered when branching, with the search space set to [a₅, b₅] and the search step set to step₅.

7. The time-series-based multi-dimensional pollutant emission intensity prediction method according to claim 6, characterized in that: the search space represents the tuning range of a parameter, the search step represents the interval between candidate values of the parameter, and a grid space is formed from the search space and search step of each parameter in the RF model: