CN113361199A - Multi-dimensional pollutant emission intensity prediction method based on time series - Google Patents

Info

Publication number
CN113361199A
Authority
CN
China
Prior art keywords
data
model
time
training
emission intensity
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110642414.4A
Other languages
Chinese (zh)
Inventor
黄欣逸
李扬
赵勐
刘卫杰
张伟
宋俊男
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Zvan Technology Co ltd
Original Assignee
Chengdu Zvan Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Zvan Technology Co ltd filed Critical Chengdu Zvan Technology Co ltd
Priority to CN202110642414.4A priority Critical patent/CN113361199A/en
Publication of CN113361199A publication Critical patent/CN113361199A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F30/00Computer-aided design [CAD]
    • G06F30/20Design optimisation, verification or simulation
    • G06F30/27Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/215Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22Indexing; Data structures therefor; Storage structures
    • G06F16/2282Tablespace storage structures; Management thereof
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/242Query formulation
    • G06F16/2433Query languages
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474Sequence data queries, e.g. querying versioned data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/21Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/24Classification techniques
    • G06F18/243Classification techniques relating to the number of classes
    • G06F18/24323Tree-organised classifiers
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/04Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00Administration; Management
    • G06Q10/06Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G06Q10/063Operations research, analysis or management
    • G06Q10/0639Performance analysis of employees; Performance analysis of enterprise or organisation operations
    • G06Q10/06393Score-carding, benchmarking or key performance indicator [KPI] analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06QINFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10Services
    • G06Q50/26Government or public services
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F2119/00Details relating to the type or aim of the analysis or the optimisation
    • G06F2119/12Timing analysis or timing optimisation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Human Resources & Organizations (AREA)
  • General Engineering & Computer Science (AREA)
  • Strategic Management (AREA)
  • Databases & Information Systems (AREA)
  • Economics (AREA)
  • Tourism & Hospitality (AREA)
  • Evolutionary Computation (AREA)
  • Development Economics (AREA)
  • Educational Administration (AREA)
  • Quality & Reliability (AREA)
  • General Business, Economics & Management (AREA)
  • Marketing (AREA)
  • Software Systems (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Game Theory and Decision Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Mathematical Physics (AREA)
  • Operations Research (AREA)
  • Fuzzy Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Medical Informatics (AREA)
  • Computer Hardware Design (AREA)
  • Geometry (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Primary Health Care (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention relates to a time-series-based multi-dimensional pollutant emission intensity prediction method comprising a data preprocessing step and a step of building a GS-RF prediction model and predicting with it. The method comprises: setting a search space and an update step length, and initializing the parameter combination and minimum error of the RF model; traversing the parameter combinations in a loop, using the data set obtained by data preprocessing and each parameter combination as the input of the RF model, and training the RF model by cross-validation to obtain its MSE; updating the parameter combination of the RF model according to the obtained MSE; outputting the optimal parameter combination when the termination condition is reached; and building a GS-RF prediction model from the optimal parameter combination and predicting with it. The invention recovers data missing at gaps in the time series and improves prediction on continuous time series; random forest parameters are optimized with a grid search algorithm, and each parameter combination is comprehensively evaluated with cross-validation, improving the optimization of the model.

Description

Multi-dimensional pollutant emission intensity prediction method based on time series
Technical Field
The invention relates to the technical field of pollutant emission prediction, in particular to a multi-dimensional pollutant emission intensity prediction method based on a time sequence.
Background
As living standards gradually improve, people's environmental awareness grows stronger, yet pollutant discharge remains difficult to stop. Many enterprises inevitably discharge some industrial sewage into rivers, and although the sewage may have undergone some treatment, there is no guarantee that the discharged pollutants stay within the standard. How to predict the pollutant emission intensity of enterprises, and thereby monitor the discharged pollutants, is therefore a problem to be solved at the present stage.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a time-series-based multi-dimensional pollutant emission intensity prediction method that solves the problem of predictive analysis of pollutant emissions.
The purpose of the invention is realized by the following technical scheme: the multi-dimensional pollutant emission intensity prediction method based on the time series comprises a data preprocessing step and a GS-RF prediction model building and predicting step; the GS-RF prediction model building and predicting steps comprise:
S21, setting a search space and an update step length for each parameter in the RF model, and initializing the parameter combination and the minimum error of the RF model;
S22, traversing the parameter combinations in a loop using the GS algorithm, taking the k-th pollutant data set, obtained by the data preprocessing step and constructed through the logical mapping relationship, together with the parameter combination as the input of the RF model, and training the RF model by cross-validation to obtain the MSE of the RF model;
S23, updating the parameter combination of the RF model according to the obtained MSE;
S24, judging whether the GS algorithm has reached the termination condition of the iteration; if so, ending the search and outputting the current parameter combination of the RF model as the optimal parameter combination; if not, returning to step S22 to continue the iteration;
S25, building a GS-RF prediction model from the optimal parameter combination obtained in step S24, and predicting the emission intensity of the k-th dimension pollutant on the time series with the prediction model.
The data preprocessing step comprises:
S11, deduplicating the acquired data set with SQL statements, deleting records whose field values are NaN, and completing the preliminary cleaning of the data set;
S12, using Python's DB-API to fetch the data sorted by the time field, ensuring that the data set is in ascending order along the time dimension;
S13, using the data analysis library Pandas, instantiating the data set as a DataFrame object and designating the time field as the index with the set_index() function;
S14, taking the sampling times of the first and last records of the DataFrame object as the time axis boundaries, and using the reindex() function to fill in the time field values missing on the time axis, with the remaining fields filled with the value 0;
S15, performing interpolation on the 0-filled intensity data: traversing the data set in a loop, recording the interpolation indexes, and counting the number of interpolations;
S16, according to the interpolation indexes and the number of interpolations, performing linear interpolation at the index positions with the interp() function of the array computing library NumPy;
S17, setting the training scale of the samples and the corresponding regression sequence, and constructing the logical mapping relationship.
The linear interpolation specifically comprises:
traversing the indexes of the missing data and, taking the i-th record as missing, constructing a linear function
[Formula image BDA0003108503090000021 (linear interpolation function) is not reproduced in the text]
to interpolate each type of pollutant data;
wherein i-d denotes the position of the first non-missing record before the i-th record, and i+h denotes the position of the first non-missing record after the i-th record; a linear proximity function of each type of pollutant is thus constructed at the missing data index i, and the interpolated value of the corresponding pollutant is output for the input index i.
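The formula image itself is not available in the text. A plausible form, assuming ordinary two-point linear interpolation between the nearest non-missing neighbours, is sketched below; the symbol x^{(j)} for the intensity of the j-th pollutant type is an assumption introduced here, not notation from the patent.

```latex
% Assumed form: straight line through (i-d, x_{i-d}) and (i+h, x_{i+h}),
% evaluated at the missing index i, for each pollutant type j.
\hat{x}^{(j)}_{i} \;=\; x^{(j)}_{i-d} \;+\; \frac{d}{d+h}\,\bigl(x^{(j)}_{i+h} - x^{(j)}_{i-d}\bigr)
```

Under this assumed form, a single missing record between two known neighbours (d = h = 1) receives the mean of those neighbours.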
Setting the training scale of the samples and the corresponding regression sequence and constructing the logical mapping relationship comprises:
setting a training scale L that denotes the length of a run of consecutive time nodes, wherein every L consecutive records form one training sample used to predict the data of the next time node;
dividing the data set of n records obtained in step S16 into two parts according to the training scale L: sliding a window of length L over the first n-1 records to construct n-L training samples, and taking the last n-L records as the regression sequence;
constructing a logical mapping relationship between the data of each continuous time sequence and the data of the next time node.
Training the RF model by cross-validation in step S22 comprises:
A1, randomly dividing the training set into k equal parts;
A2, taking 1 part as the validation set for model evaluation and the remaining k-1 parts as the training set for model training;
A3, repeating step A2 k times, each time taking a different part as the validation set, to obtain k different models and their evaluation indexes;
A4, evaluating the performance of the whole model with the combined evaluation index of the k models.
The parameters in the RF model include: n_estimators, the number of trees in the random forest, with search space [a_1, b_1] and search step step_1; max_depth, the maximum depth of a tree, with search space [a_2, b_2] and search step step_2; min_samples_leaf, the minimum number of samples contained in each child node after a node splits, with search space [a_3, b_3] and search step step_3; min_samples_split, the minimum number of training samples contained in a node, with search space [a_4, b_4] and search step step_4; and max_features, the maximum feature dimension considered when splitting, with search space [a_5, b_5] and search step step_5.
The search space defines the tuning limits of a parameter and the search step defines the spacing between its candidate values; the search spaces and search steps of the parameters in the RF model together form a grid space:
[Formula image BDA0003108503090000031 (grid space) is not reproduced in the text]
The invention has the following advantages: 1. the time-series-based multi-dimensional pollutant emission intensity prediction method recovers data missing at gaps in the time series and improves prediction on continuous time series; 2. random forest parameters are optimized with a grid search algorithm, and each parameter combination is comprehensively evaluated with cross-validation, improving the optimization of the model; 3. given the small amount of data, the random forest algorithm avoids the drawbacks of deep RNN networks: heavy weights, complex inference, and strong dependence on data.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic diagram of the logical relationship architecture of the present invention;
FIG. 3 is a schematic diagram of an equally divided training set of the cross-validation method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the detailed description of the embodiments of the present application provided below in connection with the appended drawings is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application. The invention is further described below with reference to the accompanying drawings.
As shown in FIG. 1, the invention relates to a time-series-based multi-dimensional pollutant emission intensity prediction method, which uses front-end equipment to periodically collect and upload the intensity data of pollutants discharged at the discharge ports of an enterprise's pollution sources, and builds a time-series-based multi-dimensional pollutant emission intensity prediction model. Because the collected data may contain gaps and duplicates along the time axis, and some of the collected values are non-numerical NaN, the data preprocessing step comprises the following:
S11, deduplicating the data with SQL statements, deleting records whose field values are NaN, and completing the preliminary cleaning of the data set;
S12, using Python's DB-API to fetch the data sorted by the time field, ensuring that the data set is in ascending order along the time dimension;
S13, using the data analysis library Pandas, instantiating the data set as a DataFrame object and designating the time field as the index with the set_index() method;
With the index set, the pollutant emission intensity data obtained from the database has the format shown in the following table:
[Table images BDA0003108503090000041 and BDA0003108503090000051 (format of the indexed emission intensity data) are not reproduced in the text]
Each dimension of the data thus has a header field, and the detection time field serves as the index of the table, completing the timestamp.
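Steps S11 to S13 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes an in-memory SQLite database accessed through the standard `sqlite3` DB-API module, a hypothetical table `emission` with hypothetical columns `detect_time`, `cod` and `nh3n`, and it treats NaN fields as SQL NULL.

```python
import sqlite3

import pandas as pd

# Hypothetical schema: the table and column names are illustrative only.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emission (detect_time TEXT, cod REAL, nh3n REAL)")
conn.executemany(
    "INSERT INTO emission VALUES (?, ?, ?)",
    [
        ("2021-05-29 02:00:00", 31.0, 1.2),
        ("2021-05-29 00:00:00", 30.0, 1.0),
        ("2021-05-29 00:00:00", 30.0, 1.0),   # exact duplicate
        ("2021-05-29 03:00:00", None, 1.3),   # missing value (NULL)
    ],
)

# S11: DISTINCT drops exact duplicates, IS NOT NULL drops missing values.
# S12: ORDER BY returns the rows in ascending time order.
cursor = conn.execute(
    """
    SELECT DISTINCT detect_time, cod, nh3n
    FROM emission
    WHERE cod IS NOT NULL AND nh3n IS NOT NULL
    ORDER BY detect_time
    """
)
rows = cursor.fetchall()
conn.close()

# S13: build a DataFrame and use the time field as the index.
frame = pd.DataFrame(rows, columns=["detect_time", "cod", "nh3n"])
frame["detect_time"] = pd.to_datetime(frame["detect_time"])
frame = frame.set_index("detect_time")
```

Sorting in the query rather than in Pandas keeps the ordering guarantee at the point where the data is fetched, which matches the order of steps in the text.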
S14, taking the sampling times of the first and last records of the DataFrame object as the time axis boundaries, and using the reindex() method to fill in the time field values missing on the time axis, with the remaining fields filled with 0;
Timestamp filling: since step S13 designates the time field as the index, the positions of the data missing between the start and end of detection can be located by reference to the regular timestamps; the missing detection time values are filled in, and the remaining dimensions are filled with the value 0.
S15, after the data set has been processed in step S14, interpolation still needs to be performed on the 0-filled intensity data: the data set is traversed in a loop, the interpolation indexes are recorded, and the number of interpolations is counted;
S16, according to the interpolation indexes and the number of interpolations, linear interpolation is performed at the index positions with the interp() method of the array computing library NumPy;
The data format after timestamp filling is shown in the following table:
[Table image BDA0003108503090000052 (data after timestamp filling) is not reproduced in the text]
where the timestamps are consecutive and strictly increasing, and # denotes original data.
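The timestamp filling of step S14 and the bookkeeping of step S15 can be sketched as follows. The column name `cod`, the sample values and the hourly sampling period are assumptions made for illustration.

```python
import pandas as pd

# Hypothetical hourly readings with gaps at 01:00 and 03:00.
frame = pd.DataFrame(
    {"cod": [30.0, 31.0, 33.0]},
    index=pd.to_datetime(
        ["2021-05-29 00:00:00", "2021-05-29 02:00:00", "2021-05-29 04:00:00"]
    ),
)

# S14: the first and last timestamps bound the time axis; reindex() adds the
# missing timestamps and fills every other field with 0.
full_axis = pd.date_range(frame.index[0], frame.index[-1], freq=pd.Timedelta(hours=1))
filled = frame.reindex(full_axis, fill_value=0)

# S15: record the positions that were filled in and count them.
is_missing = ~full_axis.isin(frame.index)
missing_positions = [pos for pos, flag in enumerate(is_missing) if flag]
missing_count = len(missing_positions)
```

Deriving the missing positions from the timestamps, rather than from the zeros, avoids misreading a genuine reading of 0 as a gap.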
Traversing the indexes of the missing data (such as items 2, k-1 and k in the table above) and taking the i-th record as missing, the linear function for each type of pollutant (m types in total) is constructed as follows:
[Formula image BDA0003108503090000053 (linear interpolation function) is not reproduced in the text]
where i-d denotes the position of the first non-missing record before the i-th record, and i+h denotes the position of the first non-missing record after the i-th record. A linear proximity function of each type of pollutant is thus constructed at the missing data index i, and the interpolated value of the corresponding pollutant is output for the input index i.
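A minimal sketch of step S16 for one pollutant, using NumPy's `interp()` as the text names it. The sample values are made up, and the sketch follows the text's convention that a value of 0 marks a filled-in record.

```python
import numpy as np

# 0 marks a filled-in (missing) reading, as in step S14.
series = np.array([30.0, 0.0, 0.0, 33.0, 0.0, 35.0])
positions = np.arange(len(series))

missing = series == 0          # interpolation indexes (S15)
known = ~missing

# S16: np.interp evaluates, at each missing position, the straight line
# through the nearest known neighbours on either side.
restored = series.copy()
restored[missing] = np.interp(positions[missing], positions[known], series[known])
```

For multi-dimensional data the same call would be repeated per pollutant column.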
To infer and predict the multi-dimensional pollutant emission intensity along the time axis, the training scale of the samples and the corresponding regression sequence must be set, and a logical mapping relationship constructed;
S17, setting a training scale L and dividing the data set of n records obtained in step S16 into two parts: a window of length L is slid over the first n-1 records to construct n-L training samples, and the last n-L records serve as the regression sequence. A logical mapping relationship is constructed between the data of each continuous time sequence and the data of the next time node, which completes the preprocessing of the data set.
The scale L denotes that the run of consecutive time nodes has length L: every L consecutive records form one training sample used to predict the data of the next time node, which can be expressed mathematically as follows:
[Formula image BDA0003108503090000061 (mapping from L consecutive records to the next record) is not reproduced in the text]
The invention can therefore predict the data of the next time node from continuous time-series data.
Taking one pollutant as an example, the data set can be represented as shown in FIG. 2. A logical mapping relationship can thus be constructed from the data of a single type of pollutant to build one pollutant emission prediction model, and multiple prediction models are required to predict the emission intensity of multi-dimensional pollutants.
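The sliding-window construction of step S17 can be sketched for a single pollutant as follows. The function name `make_windows` and the sample readings are hypothetical.

```python
def make_windows(values, scale):
    """Return (samples, targets): each sample holds `scale` consecutive values
    and its target is the value at the next time node."""
    n = len(values)
    samples = [values[start:start + scale] for start in range(n - scale)]
    targets = values[scale:]
    return samples, targets


readings = [30.0, 31.0, 32.0, 33.0, 34.0, 35.0]      # n = 6
samples, targets = make_windows(readings, scale=3)   # L = 3
```

With n = 6 and L = 3 this gives n-L = 3 training samples drawn from the first n-1 records, paired with the last n-L records as the regression sequence.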
Random Forest (RF) performs well on large data sets, handles high-dimensional data well, and is modeled with unbiased estimation, which helps build a model with high generalization performance and strong prediction capability.
The random forest algorithm has many tunable parameters, and tuning them manually can optimize the random forest regression model and thereby improve its generalization and prediction capability. The important parameters of a random forest are:
(1) max_depth: the maximum depth of a tree, which determines the complexity of the tree model's classification decisions;
(2) min_samples_split: the minimum number of training samples contained in a node, which affects the generalization of the base model;
(3) min_samples_leaf: the minimum number of samples contained in each child node after a node splits, which affects the generalization of the base model;
(4) max_features: the maximum feature dimension considered when splitting, which affects the complexity of the base model;
(5) n_estimators: the number of trees in the random forest; the more trees, the better the model performance but the lower the efficiency.
For the 5 tunable parameters above, and using the Mean Squared Error (MSE) function as the criterion, the invention builds a GS-RF pollutant emission prediction model with a Grid Search (GS) based random forest parameter optimization method applied to the data set obtained in data preprocessing step S17, specifically as follows:
The Grid Search (GS) algorithm arranges the candidate values of each parameter along its own dimension within the specified search space. The basic idea is to divide the search space of the parameters to be optimized into a grid and to try every parameter combination in the grid once until the optimal combination is found. The search attributes of each parameter are shown in the following table:
[Table image BDA0003108503090000071 (search space and search step of each parameter) is not reproduced in the text]
In the table, the search space defines the tuning limits of a parameter and the search step defines the spacing between its candidate values, which together form a grid space:
[Formula image BDA0003108503090000072 (grid space) is not reproduced in the text]
The goal of the GS algorithm is to find the optimal parameter combination in this grid space.
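One way to enumerate such a grid space is sketched below. The numeric bounds and steps are made-up placeholders, since the text only names them [a_i, b_i] and step_i, and the helper `axis_values` is hypothetical.

```python
from itertools import product

# Hypothetical bounds and steps, written as (a_i, b_i, step_i).
search_space = {
    "n_estimators": (50, 150, 50),
    "max_depth": (2, 6, 2),
    "min_samples_leaf": (1, 3, 1),
    "min_samples_split": (2, 6, 2),
    "max_features": (1, 3, 1),
}


def axis_values(low, high, step):
    """Candidate values low, low + step, ... up to and including high."""
    return list(range(low, high + 1, step))


axes = {name: axis_values(*bounds) for name, bounds in search_space.items()}

# The grid space is the Cartesian product of the five axes.
grid = [dict(zip(axes, combo)) for combo in product(*axes.values())]
```

With three candidate values per parameter the grid holds 3^5 = 243 combinations, which shows how quickly the cost of an exhaustive search grows as steps get finer.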
As shown in FIG. 3, the RF model is trained with Cross Validation (CV), which effectively evaluates the parameter combinations of the grid space and plays an indispensable guiding role in data modeling. The model implements k-fold CV in the following steps:
(1) randomly dividing the training set into k equal parts;
(2) taking one of the k subsets as the validation set for model evaluation and the remaining k-1 subsets as the training set for model training; repeating this step k times, each time taking a different subset as the validation set, to obtain k different models and their evaluation indexes;
(3) evaluating the performance of the whole model with the combined (average) evaluation index of the k models.
The value of k depends on the specifics of the modeling task. For an ordinary training set, the larger k is, the smaller the learning bias of the random forest on the training set, which helps improve the generalization performance of the model; however, for a training set with too large a sample variance, the larger k is, the longer the training period of the random forest, which reduces the efficiency of the model. A k value of suitable size therefore needs to be set.
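The k-fold procedure above can be sketched in plain Python. The function names are hypothetical, and `fit_and_score` is a caller-supplied placeholder for training an RF model on one split and returning its validation error.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """Randomly split range(n_samples) into k folds of near-equal size."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    return [order[fold::k] for fold in range(k)]


def cross_validate(n_samples, k, fit_and_score):
    """Average the validation score over k train/validation splits.

    `fit_and_score(train_idx, valid_idx)` trains on train_idx and returns
    the error measured on valid_idx.
    """
    folds = k_fold_indices(n_samples, k)
    scores = []
    for held_out in range(k):
        valid_idx = folds[held_out]
        train_idx = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        scores.append(fit_and_score(train_idx, valid_idx))
    return sum(scores) / k
```

Each sample lands in the validation set exactly once, so the averaged score uses every record for evaluation without ever scoring a model on data it was trained on.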
The construction process of the GS-RF pollutant emission prediction model is as follows:
(1) setting the search space and the update step length of each parameter, and initializing the parameter combination and the minimum model error to (Inf, …, Inf) and Inf respectively;
Initialization: the loop traversal of the algorithm has not yet started, and the minimum model error is initialized to infinity (Inf) to guarantee that the parameter combination and the minimum error are updated and replaced on the first iteration of the algorithm (the error is necessarily lower than Inf).
(2) traversing the parameter combinations in a loop with the GS algorithm, taking the k-th pollutant data set constructed through the logical mapping relationship together with the parameter combination as the input of the regression RF, training the RF model, and outputting the cross-validated MSE of the model; the MSE function is as follows:
$$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - y_i'\right)^2$$
where n denotes the number of samples, y_i denotes the true value of the i-th sample, and y_i' denotes the corresponding predicted value.
(3) updating the parameter combination of the RF according to the model error obtained in step (2);
The update condition for the parameter combination is: when the MSE produced by the current iteration is less than the minimum MSE, the minimum MSE is set to the MSE of the current iteration and the optimal parameter combination is set to the parameter combination of the current iteration.
(4) judging whether the algorithm has reached the termination condition of the iteration; if so, the search ends and the best RF parameter combination, i.e. the optimal parameter combination, is output; otherwise, returning to step (2) to continue the iteration.
The termination condition is: when all parameter combinations in the grid space have been traversed, the end condition of the algorithm iteration is reached.
(5) building the GS-RF pollutant emission prediction model with the optimal parameter combination obtained in step (4), and predicting the emission intensity of the k-th dimension pollutant on the time series.
By traversing the pollutant data set of each dimension constructed through the logical mapping relationship and building a GS-RF pollutant emission prediction model for each dimension, prediction of the multi-dimensional data is achieved.
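The search loop of steps (1) to (4) can be sketched as follows. The callable `cv_mse` is a hypothetical stand-in for training the RF model with cross-validation and returning its MSE; a toy error surface is used here so the sketch runs without a machine learning library.

```python
import math
from itertools import product


def grid_search(axes, cv_mse):
    """Exhaustively search `axes` and return (best_params, min_error).

    `axes` maps each parameter name to its list of candidate values.
    `cv_mse(params)` returns the cross-validated MSE for one combination.
    """
    best_params = {name: math.inf for name in axes}   # step (1): (Inf, ..., Inf)
    min_error = math.inf                              # step (1): Inf
    for combo in product(*axes.values()):             # step (2): traverse the grid
        params = dict(zip(axes, combo))
        error = cv_mse(params)
        if error < min_error:                         # step (3): update on improvement
            min_error, best_params = error, params
    return best_params, min_error                     # step (4): grid exhausted


def toy_cv_mse(params):
    """Toy error surface with its minimum at max_depth=4, n_estimators=100."""
    return (params["max_depth"] - 4) ** 2 + (params["n_estimators"] - 100) ** 2 / 2500


best, err = grid_search(
    {"n_estimators": [50, 100, 150], "max_depth": [2, 4, 6]}, toy_cv_mse
)
```

In practice `cv_mse` could wrap a library model; with scikit-learn, for example, `RandomForestRegressor` accepts parameters with exactly these names and `cross_val_score` with `scoring="neg_mean_squared_error"` returns the negated MSE per fold. Whether the patent's implementation uses that library is not stated.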
If n pieces of pollutant emission intensity data are predicted from the current time, the result can be represented as follows:
[Table image BDA0003108503090000091 (predicted emission intensity at successive time nodes) is not reproduced in the text]
Note: T denotes the cutoff time of the original data (the reporting time of the last record, e.g. 2021/5/29 08:00:00), t_interval denotes the sampling period of the data (e.g. hourly, giving 2021/5/29 08:00:00 + 01:00:00), and the unit of emission intensity for each type of pollutant is mg/L.
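The timestamps of the predicted records follow from T and t_interval as T + t_interval, T + 2·t_interval, and so on. The sketch below pairs them with predicted values. The text does not say how predictions beyond the first time node are produced, so feeding each prediction back into the window is an assumption, and `predict_next` is a hypothetical placeholder for a trained GS-RF model.

```python
from datetime import datetime, timedelta


def forecast(history, scale, steps, cutoff, interval, predict_next):
    """Roll a one-step predictor forward `steps` times.

    `predict_next(window)` maps the last `scale` values to the next value.
    """
    window = list(history[-scale:])
    results = []
    for step in range(1, steps + 1):
        value = predict_next(window)
        results.append((cutoff + step * interval, value))
        window = window[1:] + [value]      # assumed: feed the prediction back in
    return results


def mean_of_window(window):
    """Placeholder model: predicts the mean of the window."""
    return sum(window) / len(window)


cutoff = datetime(2021, 5, 29, 8, 0, 0)    # T
interval = timedelta(hours=1)              # t_interval
rows = forecast([30.0, 32.0, 34.0], 3, 2, cutoff, interval, mean_of_window)
```

Because later predictions are built on earlier ones under this assumption, errors can accumulate as the horizon grows.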
The foregoing describes preferred embodiments of the invention. It is to be understood that the invention is not limited to the precise form disclosed herein, and that various other combinations, modifications, and environments may be used within the scope of the concept disclosed herein, whether as described above or as apparent to those skilled in the relevant art. Modifications and variations made by those skilled in the art without departing from the spirit and scope of the invention fall within the protection of the appended claims.

Claims (7)

1. A time-series-based multi-dimensional pollutant emission intensity prediction method, characterized in that: the prediction method comprises a data preprocessing step and a GS-RF prediction model building and predicting step; the GS-RF prediction model building and predicting step comprises:
S21, setting a search space and an update step size for each parameter of the RF model, and initializing the parameter combination and the minimum error of the RF model;
S22, cyclically traversing the parameter combinations with the GS algorithm, taking the k-th dimension pollutant data set, obtained by the data preprocessing step and constructed through the logical mapping relationship, together with a parameter combination as the input of the RF model, and training the RF model by cross-validation to obtain the MSE of the RF model;
S23, updating the parameter combination of the RF model according to the obtained MSE;
S24, judging whether the GS algorithm has reached its iteration termination condition; if so, ending the search and outputting the current parameter combination of the RF model as the optimal parameter combination; if not, returning to step S22 to continue iterating;
and S25, establishing the GS-RF prediction model from the optimal parameter combination obtained in step S24, and predicting the emission intensity of the k-th dimension pollutant over the time series with the prediction model.
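Steps S21 to S24 can be sketched as an exhaustive search loop. This is a minimal sketch, assuming GS denotes grid search and that the termination condition is exhaustion of the grid; `grid_search`, `cv_error`, and `toy_error` are hypothetical names, and `toy_error` is only a stand-in for the cross-validated MSE of a trained RF model.

```python
from itertools import product

def grid_search(param_grid, cv_error):
    """Exhaustively search param_grid and return (best_params, min_error).

    param_grid: dict mapping parameter name -> list of candidate values (S21).
    cv_error:   callable(params) -> cross-validated MSE for those parameters
                (S22); supplied by the caller.
    """
    best_params, min_error = None, float("inf")      # S21: initialise
    names = list(param_grid)
    for values in product(*(param_grid[name] for name in names)):  # S22: traverse
        params = dict(zip(names, values))
        error = cv_error(params)
        if error < min_error:                        # S23: keep the better combination
            best_params, min_error = params, error
    return best_params, min_error                    # S24: grid exhausted

def toy_error(params):
    # Hypothetical error surface with its minimum at (150, 6); not real MSE data.
    return (params["n_estimators"] - 150) ** 2 + (params["max_depth"] - 6) ** 2

grid = {"n_estimators": [50, 100, 150, 200], "max_depth": [2, 4, 6, 8]}
best, err = grid_search(grid, toy_error)
print(best, err)
```

The claim does not name a library. In practice, scikit-learn's `GridSearchCV` wrapped around a `RandomForestRegressor` with `scoring="neg_mean_squared_error"` performs a comparable search, and the model for step S25 would then be refit with the returned parameters.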
2. The time-series-based multi-dimensional pollutant emission intensity prediction method according to claim 1, characterized in that: the data preprocessing step comprises:
S11, deduplicating the acquired data set using SQL statements and deleting records whose field attributes are NaN, thereby completing the preliminary cleaning of the data set;
S12, retrieving the data ordered by the time field through Python's DB-API, ensuring that the data set is increasing along the time dimension;
S13, using the data analysis library Pandas, instantiating the data set as a DataFrame object and designating the time field as the index with the set_index() function;
S14, taking the sampling times of the first and last records of the DataFrame object as the time-axis boundaries, and using the reindex() function to fill in the missing time nodes on the time axis, with the remaining fields filled with the value 0;
S15, interpolating the 0-filled intensity data: cyclically traversing the data set, recording the indices to be interpolated, and counting the number of interpolations;
S16, according to the interpolation indices and the number of interpolations of the data set, performing linear interpolation at the indexed positions using the interp() function of the array computing library NumPy;
and S17, setting the training scale of the samples and the corresponding regression sequence, and constructing the logical mapping relationship.
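Steps S13 to S16 can be sketched with Pandas and NumPy as follows. This is an illustration under assumptions: the column names `time` and `cod` and the sample values are hypothetical, and the SQL and DB-API work of S11 and S12 is replaced here by in-memory Pandas calls.

```python
import numpy as np
import pandas as pd

# Hypothetical hourly records in which the 10:00 and 11:00 rows are missing.
raw = pd.DataFrame({
    "time": pd.to_datetime(["2021-05-29 08:00", "2021-05-29 09:00",
                            "2021-05-29 12:00"]),
    "cod": [10.0, 12.0, 18.0],
})

# Stand-in for S11-S12: drop duplicates and NaN rows, sort by time.
df = raw.drop_duplicates().dropna().sort_values("time")
df = df.set_index("time")                                    # S13

# S14: complete hourly axis between the first and last sample; new rows get 0.
axis = pd.date_range(df.index[0], df.index[-1], freq=pd.Timedelta(hours=1))
filled = df.reindex(axis, fill_value=0.0)

# S15: record which positions were added and how many there are.
missing = ~filled.index.isin(df.index)
n_missing = int(missing.sum())

# S16: linear interpolation at the recorded positions.
pos = np.arange(len(filled))
values = filled["cod"].to_numpy(dtype=float).copy()
values[missing] = np.interp(pos[missing], pos[~missing], values[~missing])
filled["cod"] = values
print(filled)
```

The sketch identifies missing rows by index membership rather than by testing for the value 0, so that a genuine reading of 0 would not be overwritten; the claim itself does not specify how the filled rows are recognised.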
3. The time-series-based multi-dimensional pollutant emission intensity prediction method according to claim 2, characterized in that: the linear interpolation specifically includes:
traversing the indices of the missing data and, letting the i-th record be missing, constructing the linear function
Figure FDA0003108503080000021
to interpolate the data of each type of pollutant;
wherein i-d denotes the position of the first non-missing record before the i-th record, and i+h denotes the position of the first non-missing record after the i-th record; a linear neighbourhood function of each type of pollutant is thereby constructed at the missing index i, and the interpolated value of the corresponding pollutant is output for the input index i.
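The neighbour-based interpolation described in this claim can be sketched in plain Python. The formula in the figure is not reproduced here, so the sketch assumes the usual straight line through the points (i-d, x[i-d]) and (i+h, x[i+h]); the function name `interpolate_missing` and the use of `None` to mark missing records are assumptions.

```python
def interpolate_missing(series):
    """Fill None entries by linear interpolation between the nearest known neighbours.

    Assumes the first and last records are present, as they are when the
    time-axis boundaries are the first and last samples.
    """
    if series[0] is None or series[-1] is None:
        raise ValueError("first and last records must be present")
    out = list(series)
    for i, x in enumerate(series):
        if x is not None:
            continue
        d = 1                       # distance back to the first non-missing record
        while series[i - d] is None:
            d += 1
        h = 1                       # distance forward to the first non-missing record
        while series[i + h] is None:
            h += 1
        left, right = series[i - d], series[i + h]
        out[i] = left + (right - left) * d / (d + h)
    return out

result = interpolate_missing([10.0, 12.0, None, None, 18.0])
print(result)
```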
4. The time-series-based multi-dimensional pollutant emission intensity prediction method according to claim 2, characterized in that: setting the training scale of the samples and the corresponding regression sequence and constructing the logical mapping relationship comprises:
setting a training scale L that represents the length of a run of consecutive time nodes, wherein every L consecutive records form one training sample used to predict the data of the next time node;
dividing the data set obtained in step S16 into two parts according to the training scale L: sliding a window of length L over the first n-1 records to construct n-L training samples, and taking the last n-L records as the regression sequence;
and constructing a logical mapping relationship between the data of each consecutive time sequence and the data of the next time node.
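The window construction in this claim can be sketched as follows: with n records and training scale L, the first n-1 records yield n-L windows, and each window maps to the record that immediately follows it. The function name `make_windows` and the sample data are hypothetical.

```python
def make_windows(data, L):
    """Return (X, y): n-L windows of length L and the n-L values that follow them."""
    n = len(data)
    if not 0 < L < n:
        raise ValueError("training scale L must satisfy 0 < L < n")
    X = [data[i:i + L] for i in range(n - L)]   # windows over the first n-1 records
    y = data[L:]                                # regression sequence: last n-L records
    return X, y

X, y = make_windows([1, 2, 3, 4, 5, 6], L=3)
for window, target in zip(X, y):
    print(window, "->", target)
```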
5. The time-series-based multi-dimensional pollutant emission intensity prediction method according to claim 4, characterized in that: the training of the RF model by cross-validation in step S22 includes:
A1, randomly dividing the training set into k equal parts;
A2, taking 1 part as the validation set for model evaluation, and taking the remaining k-1 parts as the training set for model training;
A3, repeating step A2 k times, each time taking a different part as the validation set, to obtain k different models and their evaluation metrics;
and A4, evaluating the performance of the overall model using the combined evaluation metrics of the k models.
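Steps A1 to A4 can be sketched in plain Python. This is a sketch under assumptions: the combined metric of A4 is taken to be the mean of the k per-fold MSE values, which the claim does not specify, and the names `k_fold_indices`, `cross_val_mse`, `fit_mean`, and `predict_mean` are hypothetical. The mean predictor merely stands in for an RF model.

```python
import random

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train, validation) index lists for k-fold cross-validation."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)            # A1: random split ...
    folds = [idx[i::k] for i in range(k)]       # ... into k nearly equal parts
    for i in range(k):                          # A2-A3: each part validates once
        train = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train, folds[i]

def cross_val_mse(X, y, k, fit, predict):
    """Average validation MSE over the k folds (A4, assuming a simple mean)."""
    errors = []
    for train, val in k_fold_indices(len(X), k):
        model = fit([X[j] for j in train], [y[j] for j in train])
        preds = predict(model, [X[j] for j in val])
        errors.append(sum((p - y[j]) ** 2 for p, j in zip(preds, val)) / len(val))
    return sum(errors) / k

# Toy model: always predict the mean of the training targets.
def fit_mean(X_train, y_train):
    return sum(y_train) / len(y_train)

def predict_mean(model, X_val):
    return [model] * len(X_val)

score = cross_val_mse([[0]] * 10, [3.0] * 10, k=5, fit=fit_mean, predict=predict_mean)
print(score)
```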
6. The time-series-based multi-dimensional pollutant emission intensity prediction method according to claim 1, characterized in that: parameters in the RF model include: n_estimators, representing the number of trees in the random forest, with the search space set to [a_1, b_1] and the search step set to step_1; max_depth, representing the maximum depth of a tree, with the search space set to [a_2, b_2] and the search step set to step_2; min_samples_leaf, representing the minimum number of samples contained in each child node after a node branches, with the search space set to [a_3, b_3] and the search step set to step_3; min_samples_split, representing the minimum number of training samples contained by a node, with the search space set to [a_4, b_4] and the search step set to step_4; max_features, representing the maximum number of features considered when branching, with the search space set to [a_5, b_5] and the search step set to step_5.
7. The time-series-based multi-dimensional pollutant emission intensity prediction method according to claim 6, characterized in that: the search space represents the tuning limits of a parameter, the search step represents the spacing between candidate values of the parameter, and a grid space is formed from the search space and the search step of each parameter in the RF model:
Figure FDA0003108503080000031
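One way to build such a space is sketched below, assuming it is the Cartesian product of the per-parameter candidate values a, a + step, ..., up to b. The figure is not reproduced here, and the numeric bounds and steps are hypothetical placeholders for [a_1, b_1], step_1 through [a_5, b_5], step_5.

```python
from itertools import product

def axis(a, b, step):
    """Integer candidate values a, a + step, ... that do not exceed b."""
    return list(range(a, b + 1, step))

# Hypothetical search spaces and steps for the five RF parameters.
space = {
    "n_estimators": axis(50, 200, 50),
    "max_depth": axis(2, 8, 2),
    "min_samples_leaf": axis(1, 5, 2),
    "min_samples_split": axis(2, 6, 2),
    "max_features": axis(1, 3, 1),
}

# Every point of the space is one parameter combination to evaluate.
grid = [dict(zip(space, combo)) for combo in product(*space.values())]
print(len(grid), grid[0])
```

The size of the space is the product of the axis lengths, so narrowing a search space or enlarging a step reduces the number of models that have to be trained.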
CN202110642414.4A 2021-06-09 2021-06-09 Multi-dimensional pollutant emission intensity prediction method based on time series Pending CN113361199A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110642414.4A CN113361199A (en) 2021-06-09 2021-06-09 Multi-dimensional pollutant emission intensity prediction method based on time series

Publications (1)

Publication Number Publication Date
CN113361199A true CN113361199A (en) 2021-09-07

Family

ID=77533413

Country Status (1)

Country Link
CN (1) CN113361199A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination