CN113361199A - Multi-dimensional pollutant emission intensity prediction method based on time series
- Publication number: CN113361199A
- Application number: CN202110642414.4A
- Authority: CN (China)
- Prior art keywords: data, model, time, training, emission intensity
- Legal status: Pending (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F30/00—Computer-aided design [CAD]
- G06F30/20—Design optimisation, verification or simulation
- G06F30/27—Design optimisation, verification or simulation using machine learning, e.g. artificial intelligence, neural networks, support vector machines [SVM] or training a model
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/22—Indexing; Data structures therefor; Storage structures
- G06F16/2282—Tablespace storage structures; Management thereof
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/242—Query formulation
- G06F16/2433—Query languages
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/24—Querying
- G06F16/245—Query processing
- G06F16/2458—Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
- G06F16/2474—Sequence data queries, e.g. querying versioned data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/243—Classification techniques relating to the number of classes
- G06F18/24323—Tree-organised classifiers
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/06—Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
- G06Q10/063—Operations research, analysis or management
- G06Q10/0639—Performance analysis of employees; Performance analysis of enterprise or organisation operations
- G06Q10/06393—Score-carding, benchmarking or key performance indicator [KPI] analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2119/00—Details relating to the type or aim of the analysis or the optimisation
- G06F2119/12—Timing analysis or timing optimisation
Abstract
The invention relates to a multi-dimensional pollutant emission intensity prediction method based on time series, comprising a data preprocessing step and a step of building a GS-RF prediction model and predicting with it. The latter step comprises: setting a search space and an update step length, and initializing the parameter combination and the minimum error of the RF model; traversing the parameter combinations in a loop, using the data set obtained by data preprocessing together with each parameter combination as the input of the RF model, and training the RF model by cross-validation to obtain its MSE; updating the parameter combination of the RF model according to the obtained MSE; outputting the optimal parameter combination when the termination condition is reached; and building the GS-RF prediction model from the optimal parameter combination and using it for prediction. The invention recovers data missing from gaps in the time series and improves prediction on the continuous time series; random forest parameters are optimized with a grid search algorithm, and each parameter combination is evaluated comprehensively by cross-validation, which improves the optimization of the model.
Description
Technical Field
The invention relates to the technical field of pollutant emission prediction, and in particular to a multi-dimensional pollutant emission intensity prediction method based on time series.
Background
As living standards improve, public awareness of environmental protection keeps growing, yet pollutant discharge remains difficult to stop. Many enterprises inevitably discharge some industrial sewage into rivers; although the sewage may undergo some treatment, there is no guarantee that the discharged pollutants stay within the permitted limits. How to predict the pollutant emission intensity of enterprises, so that discharged pollutants can be monitored, is therefore a problem to be solved at the present stage.
Disclosure of Invention
The invention aims to overcome the defects of the prior art by providing a multi-dimensional pollutant emission intensity prediction method based on time series, which addresses the problem of predictive analysis of pollutant emissions.
The purpose of the invention is achieved by the following technical scheme: the multi-dimensional pollutant emission intensity prediction method based on time series comprises a data preprocessing step and a step of building a GS-RF prediction model and predicting with it; the latter step comprises:
S21, setting a search space and an update step length for each parameter in the RF model, and initializing the parameter combination and the minimum error of the RF model;
S22, traversing the parameter combinations in a loop using the GS algorithm, taking the k-th pollutant data set, obtained by the data preprocessing step and constructed through the logical mapping relation, together with the parameter combination as the input of the RF model, and training the RF model by cross-validation to obtain the MSE of the RF model;
S23, updating the parameter combination of the RF model according to the obtained MSE;
S24, judging whether the GS algorithm has reached the termination condition of the iteration; if so, ending the search and outputting the current parameter combination of the RF model as the optimal parameter combination; if not, returning to step S22 to continue the iteration;
S25, building the GS-RF prediction model from the optimal parameter combination obtained in step S24, and predicting the emission intensity of the k-th dimension pollutant on the time series with the prediction model.
The data preprocessing step comprises:
S11, deduplicating the acquired data set with SQL statements and deleting records in which a field is NaN, completing the preliminary cleaning of the data set;
S12, retrieving the data ordered by the time field through Python's DB-API, ensuring that the data set increases along the time dimension;
S13, using the data analysis library Pandas, instantiating the data set as a DataFrame object and designating the time field as the index with the set_index() function;
S14, taking the sampling times of the first and last records of the DataFrame object as the boundaries of the time axis, and using the reindex() function to fill in the timestamps missing on the time axis, with the remaining fields filled with the value 0;
S15, interpolating the intensity data that were filled with 0: traversing the data set in a loop, recording the interpolation indexes and counting the number of interpolations;
S16, according to the interpolation indexes and the number of interpolations, performing linear interpolation at the index positions with the interp() function of the array computing library NumPy;
S17, setting the training scale of the samples and the corresponding regression sequence, and constructing a logical mapping relation.
The linear interpolation specifically includes:
traversing the indexes of the missing data, letting the i-th record be missing, and constructing a linear function with which each type of pollutant data is interpolated;
wherein i-d denotes the position of the first non-missing record before the i-th record, and i+h denotes the position of the first non-missing record after the i-th record; a linear proximity function of each type of pollutant at the missing index i is thus constructed, and the interpolated value of the corresponding pollutant is output for the input index i.
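The linear function itself is not reproduced in this text. One reading consistent with the definitions of i-d and i+h, assuming the standard two-point form through the nearest non-missing neighbours, is sketched below. The notation x_j(t) for the intensity of pollutant type j at index t is introduced here for illustration and is not taken from the source.

```latex
f_j(i) = x_j(i-d) + \frac{x_j(i+h) - x_j(i-d)}{(i+h) - (i-d)}\,\bigl(i - (i-d)\bigr)
       = x_j(i-d) + \frac{d}{d+h}\,\bigl(x_j(i+h) - x_j(i-d)\bigr)
```

Under this reading the interpolated value moves from x_j(i-d) towards x_j(i+h) in proportion to how far index i lies between the two known records.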
The setting of the training scale and the corresponding regression sequence of the sample, and the constructing of the logical mapping relationship comprises:
setting a training scale L, the length of a run of consecutive time nodes, wherein every L consecutive records form one training sample used to predict the data of the next time node;
splitting the data set obtained in step S16 into two parts according to the training scale L: the first n-1 samples are rearranged into n-L training samples by sliding a window of scale L, and the last n-L samples serve as the regression sequence;
constructing a logical mapping relation between the data of each continuous time sequence and the data of the next time node.
The training of the RF model by the cross-validation method in step S22 includes:
A1, randomly dividing the training set into k equal parts;
A2, taking 1 part as the validation set for model evaluation and the remaining k-1 parts as the training set for model training;
A3, repeating step A2 k times, each time taking a different part as the validation set, to obtain k different models and their evaluation indexes;
A4, evaluating the performance of the whole model with the combined evaluation index of the k models.
Parameters in the RF model include: n_estimators, the number of trees in the random forest, with search space [a_1, b_1] and search step step_1; max_depth, the maximum depth of a tree, with search space [a_2, b_2] and search step step_2; min_samples_leaf, the minimum number of samples in each child node after a node splits, with search space [a_3, b_3] and search step step_3; min_samples_split, the minimum number of training samples a node must contain, with search space [a_4, b_4] and search step step_4; max_features, the maximum number of features considered when splitting, with search space [a_5, b_5] and search step step_5.
The search space represents the tuning range of a parameter and the search step represents the interval between candidate values; the search spaces and search steps of the parameters in the RF model together form a grid space:
The invention has the following advantages: 1. it recovers data missing from gaps in the time series and improves prediction on the continuous time series; 2. random forest parameters are optimized with a grid search algorithm, and each parameter combination is evaluated comprehensively by cross-validation, which improves the optimization of the model; 3. given the small amount of data, the random forest algorithm avoids the drawbacks of deep RNN networks, namely heavy weights, complex inference and strong data dependence.
Drawings
FIG. 1 is a schematic flow diagram of the present invention;
FIG. 2 is a schematic diagram of the logical relationship architecture of the present invention;
FIG. 3 is a schematic diagram of an equally divided training set of the cross-validation method of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is obvious that the described embodiments are only a part of the embodiments of the present application, and not all the embodiments. The components of the embodiments of the present application, generally described and illustrated in the figures herein, can be arranged and designed in a wide variety of different configurations. Thus, the detailed description of the embodiments of the present application provided below in connection with the appended drawings is not intended to limit the scope of the claimed application, but is merely representative of selected embodiments of the application. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present application without making any creative effort, shall fall within the protection scope of the present application. The invention is further described below with reference to the accompanying drawings.
As shown in fig. 1, the invention relates to a multi-dimensional pollutant emission intensity prediction method based on time series, which uses front-end equipment to periodically collect and upload the intensity data of pollutants discharged at the outlets of enterprise pollution sources, and builds a time-series-based multi-dimensional pollutant emission intensity prediction model. Because the collected data may contain gaps and duplicates on the time axis, and some values are NaN (non-numerical), the data preprocessing step comprises the following:
S11, deduplicating the data with SQL statements and deleting records in which a field is NaN, completing the preliminary cleaning of the data set;
S12, retrieving the data ordered by the time field through Python's DB-API, ensuring that the data set increases along the time dimension;
S13, using the data analysis library Pandas, instantiating the data set as a DataFrame object and designating the time field as the index with the set_index() method;
The format of the pollutant emission intensity data obtained from the database after setting the index is shown in the following table:
Each dimension of the data thus has a header field, and the detection time field serves as the index of the table, which completes the timestamping.
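Steps S11 to S13 can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes an in-memory SQLite database accessed through the standard `sqlite3` DB-API module, and the table name `emission` and the columns `detect_time`, `cod` and `nh3n` are hypothetical. Missing values are represented as SQL NULL here.

```python
import sqlite3

import pandas as pd

# Hypothetical schema and sample records; the source does not specify them.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE emission (detect_time TEXT, cod REAL, nh3n REAL)")
rows = [
    ("2021-05-29 08:00:00", 31.2, 1.8),
    ("2021-05-29 08:00:00", 31.2, 1.8),  # exact duplicate record
    ("2021-05-29 10:00:00", None, 2.1),  # record with a missing field
    ("2021-05-29 09:00:00", 30.5, 1.9),
]
conn.executemany("INSERT INTO emission VALUES (?, ?, ?)", rows)

# S11: DISTINCT removes duplicates; the WHERE clause drops records with missing fields.
# S12: ORDER BY returns the records in increasing time order.
query = """
    SELECT DISTINCT detect_time, cod, nh3n
    FROM emission
    WHERE cod IS NOT NULL AND nh3n IS NOT NULL
    ORDER BY detect_time ASC
"""
cleaned = conn.execute(query).fetchall()
conn.close()

# S13: build a DataFrame and designate the time field as its index.
frame = pd.DataFrame(cleaned, columns=["detect_time", "cod", "nh3n"])
frame["detect_time"] = pd.to_datetime(frame["detect_time"])
frame = frame.set_index("detect_time")
```

After these steps `frame` holds one row per distinct, complete record, indexed by detection time in increasing order.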
S14, taking the sampling times of the first and last records of the DataFrame object as the boundaries of the time axis, and using the reindex() method to fill in the timestamps missing on the time axis, with the remaining fields filled with 0;
Timestamp filling: since step S13 designates the time field as the index, the positions of records missing between the start and end of detection can be located by comparison with the regular timestamps; the missing detection times are filled in, and the remaining dimensions are filled with the value 0.
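The timestamp filling of step S14 can be illustrated with a short sketch. The hourly sampling period, the column name `cod` and the sample values are assumptions made for the example.

```python
import pandas as pd

# Hypothetical hourly readings in which the 09:00 and 11:00 records are missing.
index = pd.to_datetime(["2021-05-29 08:00", "2021-05-29 10:00", "2021-05-29 12:00"])
frame = pd.DataFrame({"cod": [31.2, 30.1, 29.4]}, index=index)

# The first and last sampling times bound the time axis; one hour is the assumed period.
full_axis = pd.date_range(
    start=frame.index[0], end=frame.index[-1], freq=pd.Timedelta(hours=1)
)

# reindex() inserts the missing timestamps and fills their other fields with 0.
filled = frame.reindex(full_axis, fill_value=0)
```

Filling with 0 only marks the gaps; the zeros are replaced by interpolated values in the steps that follow.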
S15, after step S14, the intensity data filled with 0 still need to be interpolated: the data set is traversed in a loop, the interpolation indexes are recorded and the number of interpolations is counted;
S16, according to the interpolation indexes and the number of interpolations, linear interpolation is performed at the index positions with the interp() method of the array computing library NumPy;
The data format after timestamp filling is shown in the following table:
where the timestamps are consecutive and increasing, and # denotes original data.
Traversing the indexes of the missing data (such as items 2, k-1 and k in the table above) and letting the i-th item be missing, the linear function for each of the m types of pollutants is constructed as follows:
where i-d denotes the position of the first non-missing record before the i-th record, and i+h denotes the position of the first non-missing record after the i-th record. A linear proximity function of each type of pollutant at the missing index i is thus constructed, and the interpolated value of the corresponding pollutant is output for the input index i.
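Steps S15 and S16 can be sketched with `numpy.interp`, which evaluates, at each missing index, the straight line through the nearest known records on either side. The sample values are hypothetical, and treating every 0 as a filled gap is an assumption that only holds if genuine readings are never exactly 0.

```python
import numpy as np

# Hypothetical series after timestamp filling; 0 marks a filled (missing) reading.
series = np.array([31.2, 0.0, 30.1, 0.0, 0.0, 28.6])

# S15: traverse the data, record the interpolation indexes and count them.
missing_idx = np.flatnonzero(series == 0)
known_idx = np.flatnonzero(series != 0)
n_missing = missing_idx.size

# S16: linear interpolation at the missing index positions.
restored = series.copy()
restored[missing_idx] = np.interp(missing_idx, known_idx, series[known_idx])
```

For the missing index 1 the neighbours are indexes 0 and 2, so the restored value is the midpoint of 31.2 and 30.1; indexes 3 and 4 are spaced evenly between the values at indexes 2 and 5.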
To infer and predict the multi-dimensional pollutant emission intensity along the time axis, the training scale of the samples and the corresponding regression sequence must be set and a logical mapping relation constructed;
S17, setting a training scale L and splitting the data set obtained in step S16 into two parts: the first n-1 samples are rearranged into n-L training samples by sliding a window of scale L, and the last n-L samples serve as the regression sequence. A logical mapping relation is constructed between the data of each continuous time sequence and the data of the next time node, which completes the preprocessing of the data set.
Setting the scale L means that the run of consecutive time nodes has length L: every L consecutive records form one training sample used to predict the data of the next time node, which can be expressed mathematically as follows:
The invention can therefore predict the data of the next time node from continuous time-series data.
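The sliding-window construction of step S17 can be sketched as below for a single pollutant. The function name `make_supervised` and the sample values are hypothetical; the window arithmetic follows the text (n-L windows taken from the first n-1 samples, with the last n-L samples as targets).

```python
import numpy as np


def make_supervised(values, L):
    """Map each run of L consecutive readings to the reading that follows it."""
    values = np.asarray(values, dtype=float)
    n = len(values)
    if not 0 < L < n:
        raise ValueError("L must satisfy 0 < L < n")
    # n - L windows of length L, all drawn from the first n - 1 samples.
    X = np.array([values[s:s + L] for s in range(n - L)])
    # The last n - L samples form the regression sequence (one target per window).
    y = values[L:]
    return X, y


emission = [31.2, 30.65, 30.1, 29.6, 29.1, 28.6]
X, y = make_supervised(emission, L=3)
```

With n = 6 and L = 3 this yields three training samples, each paired with the reading at the next time node.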
Taking one of the pollutants as an example, the data set can be represented as shown in fig. 2. A logical mapping relation can thus be constructed for the data of a single type of pollutant to build one pollutant emission prediction model, and several prediction models are needed to predict the emission intensity of multi-dimensional pollutants.
Random forest (RF) performs well on large data sets, handles high-dimensional data well, and provides unbiased estimation during modeling, which helps build a model with high generalization performance and strong predictive ability.
The random forest algorithm has many adjustable parameters, and tuning them manually can optimize the random forest regression model and thereby improve its generalization and predictive ability. The important parameters of a random forest are:
(1) max_depth: the maximum depth of a tree, which determines the complexity of the tree model's decisions;
(2) min_samples_split: the minimum number of training samples a node must contain, which affects the generalization of the base model;
(3) min_samples_leaf: the minimum number of samples in each child node after a node splits, which affects the generalization of the base model;
(4) max_features: the maximum number of features considered when splitting, which affects the complexity of the base model;
(5) n_estimators: the number of trees in the random forest; more trees generally give better model performance but lower efficiency.
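The five parameter names above coincide with constructor arguments of scikit-learn's `RandomForestRegressor`. The source does not name the library, so using scikit-learn is an assumption, and the values below are arbitrary placeholders rather than tuned settings.

```python
from sklearn.ensemble import RandomForestRegressor

# Placeholder values chosen only to show where each parameter goes.
params = {
    "n_estimators": 100,     # number of trees in the forest
    "max_depth": 6,          # maximum depth of each tree
    "min_samples_split": 4,  # minimum samples a node needs before it may split
    "min_samples_leaf": 2,   # minimum samples in each child node after a split
    "max_features": 2,       # maximum number of features considered per split
}
model = RandomForestRegressor(random_state=0, **params)
```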
For the 5 adjustable parameters above, and using the mean squared error (MSE) function, the invention builds a GS-RF pollutant emission prediction model with a random forest parameter optimization method based on grid search (GS), applied to the data set obtained in data preprocessing step S17. The details are as follows:
The grid search (GS) algorithm arranges the candidate values of each parameter along its own dimension within a specified search space. The basic idea is to divide the search space of the parameters to be optimized into a grid and to evaluate every parameter combination in the grid once, so that the optimal combination is found. The search attributes of each parameter are shown in the following table:
In the table, the search space represents the tuning range of a parameter and the search step represents the interval between candidate values, which together form a grid space:
The goal of the GS algorithm is to find the best parameter combination in this grid space.
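A grid space of this kind can be sketched as the Cartesian product of the candidate values of each parameter. The bounds and steps below are hypothetical stand-ins for the [a, b] intervals and step lengths, and only three of the five parameters are shown to keep the example short.

```python
from itertools import product

# Hypothetical (a, b, step) triples; the source leaves the actual values open.
search = {
    "n_estimators": (50, 150, 50),
    "max_depth": (2, 6, 2),
    "min_samples_leaf": (1, 3, 1),
}


def axis(a, b, step):
    """Candidate values a, a + step, ... up to and including b."""
    return list(range(a, b + 1, step))


axes = {name: axis(*spec) for name, spec in search.items()}

# Every point of the grid is one parameter combination.
grid = [dict(zip(axes, combo)) for combo in product(*axes.values())]
```

Each parameter contributes three candidate values here, so the grid contains 3 x 3 x 3 = 27 combinations.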
As shown in fig. 3, the RF model is trained with cross-validation (CV), which evaluates the parameter combinations of the grid space effectively and plays an indispensable role in guiding data modeling. The model implements k-fold CV as follows:
(1) randomly dividing the training set into k parts;
(2) taking one of the k subsets as the validation set for model evaluation and the remaining k-1 subsets as the training set for model training, and repeating this step k times with a different subset as the validation set each time, to obtain k different models and their evaluation indexes;
(3) evaluating the performance of the whole model with the combined (average) evaluation index of the k models.
The value of k depends on the specific conditions of the data modeling. For a typical training set, a larger k gives the random forest a smaller learning bias on the training set, which helps improve the generalization performance of the model; however, for a training set with very large sample variance, a larger k lengthens the training period of the random forest and reduces the efficiency of the model, so a k value of suitable size must be chosen.
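The k-fold procedure above can be sketched with scikit-learn, which is an assumed library choice. The synthetic data, k = 5 and the forest settings are placeholders. `cross_val_score` reports negated MSE under the `neg_mean_squared_error` scoring name, so the sign is flipped to recover the per-fold MSE.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold, cross_val_score

# Synthetic stand-in for a windowed training set (60 samples, window length 3).
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=60)

model = RandomForestRegressor(n_estimators=20, max_depth=4, random_state=0)

# (1) random split into k parts; (2) each part serves once as the validation set.
folds = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=folds, scoring="neg_mean_squared_error")

# (3) the average of the k fold errors is the overall evaluation index.
fold_mse = -scores
cv_mse = fold_mse.mean()
```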
The construction process of the GS-RF pollutant emission prediction model is as follows:
(1) Setting a search space and an update step length for each parameter, and initializing the parameter combination to (Inf, …, Inf) and the minimum error of the model to Inf;
Initialization: the loop traversal of the algorithm has not yet started; the minimum error of the model is initialized to infinity (Inf) so that the parameter combination and the minimum error are guaranteed to be replaced in the first iteration of the algorithm (the first error is necessarily lower than Inf).
(2) Traversing the parameter combinations in a loop using the GS algorithm, taking the k-th pollutant data set constructed through the logical mapping relation together with the parameter combination as the input of the regression RF, training the RF model, and outputting the cross-validation MSE of the model; the MSE function is as follows:
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \left( y_i - y_i' \right)^2$$
where n denotes the number of samples, y_i a true sample value and y_i' the corresponding predicted value.
(3) Updating the parameter combination of the RF according to the model error obtained in step (2);
The parameter combination is updated as follows: when the MSE produced by the current iteration is less than the minimum MSE, the minimum MSE is set to the MSE of this iteration and the optimal parameter combination is set to the parameter combination of this iteration.
(4) Judging whether the algorithm has reached the termination condition of the iteration; if so, the search ends and the parameter combination of the best RF, that is, the optimal parameter combination, is output. Otherwise, returning to step (2) to continue the iteration.
The termination condition is: once all parameter combinations in the grid space have been traversed, the iteration of the algorithm ends.
(5) Building the GS-RF pollutant emission prediction model with the optimal parameter combination obtained in step (4), and predicting the emission intensity of the k-th dimension pollutant on the time series.
By traversing the data set of each pollutant dimension constructed through the logical mapping relation and building a GS-RF pollutant emission prediction model for each dimension, prediction of the multi-dimensional data is achieved.
If n pollutant emission intensity records are predicted starting from the current time, the result can be represented as follows:
Note that T denotes the cut-off time of the original data (the reporting time of the last record, e.g. 2021/5/29 08:00:00), t_interval denotes the sampling period of the data (one hour here, e.g. 2021/5/29 08:00:00 + 01:00:00), and the emission intensity of each type of pollutant is in mg/L.
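Steps (1) to (5) of the construction process can be sketched as an explicit loop for one pollutant. scikit-learn, the synthetic data, the two-parameter grid and 3-fold validation are all assumptions made to keep the example small; it is an illustration of the procedure, not the patent's implementation.

```python
import math
from itertools import product

import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for one pollutant's windowed data set.
rng = np.random.default_rng(0)
X = rng.normal(size=(40, 3))
y = X.sum(axis=1)

# Hypothetical grid over two of the five parameters.
axes = {"n_estimators": [10, 20], "max_depth": [2, 4]}

# Step (1): no best combination yet, minimum error initialised to infinity.
best_params, min_error = None, math.inf

# Step (2): traverse every combination and compute its cross-validation MSE.
for combo in product(*axes.values()):
    params = dict(zip(axes, combo))
    model = RandomForestRegressor(random_state=0, **params)
    scores = cross_val_score(model, X, y, cv=3, scoring="neg_mean_squared_error")
    error = -scores.mean()
    # Step (3): keep the combination whenever it lowers the minimum error.
    if error < min_error:
        best_params, min_error = params, error

# Step (4): the loop ends once the whole grid has been traversed.
# Step (5): build the final model from the best combination.
final_model = RandomForestRegressor(random_state=0, **best_params).fit(X, y)
```

Repeating the loop with the data set of each pollutant gives one model per dimension. scikit-learn also provides `GridSearchCV`, which performs the same exhaustive search with cross-validation.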
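Predicting n future records for one pollutant can be sketched as below. Feeding each prediction back into the window to obtain the next one is an assumption, since the source does not spell out how later steps are produced. The function `forecast` and the stand-in predictor `mean_predictor` are hypothetical; in practice the predictor would wrap the trained GS-RF model.

```python
from datetime import datetime, timedelta


def forecast(history, predict_next, L, T, t_interval, n):
    """Return n (timestamp, value) pairs following the cut-off time T."""
    window = list(history[-L:])
    results = []
    for j in range(1, n + 1):
        value = predict_next(window)
        results.append((T + j * t_interval, value))
        # Slide the window forward: drop the oldest value, append the prediction.
        window = window[1:] + [value]
    return results


def mean_predictor(window):
    """Stand-in for a trained model: predicts the mean of the window."""
    return sum(window) / len(window)


T = datetime(2021, 5, 29, 8, 0, 0)
predictions = forecast(
    [4.0, 2.0, 6.0], mean_predictor, L=3, T=T, t_interval=timedelta(hours=1), n=2
)
```

Each predicted record is stamped T + j * t_interval for j = 1, ..., n, so the first prediction in this example is dated 2021/5/29 09:00:00.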
The foregoing is illustrative of the preferred embodiments of this invention. It is to be understood that the invention is not limited to the precise form disclosed herein, and that various other combinations, modifications, and environments may be used within the scope of the concept disclosed herein, whether as described above or as apparent to those skilled in the relevant art. Modifications and variations may be effected by those skilled in the art without departing from the spirit and scope of the invention as defined by the appended claims.
Claims (7)
1. A time-series-based multi-dimensional pollutant emission intensity prediction method, characterized in that: the prediction method comprises a data preprocessing step and a GS-RF prediction model building and predicting step; the GS-RF prediction model building and predicting step comprises:
S21, setting a search space and an update step for each parameter in the RF model, and initializing the parameter combination and the minimum error of the RF model;
S22, traversing the parameter combinations in a loop using the GS algorithm, taking the kth pollutant data set, obtained by the data preprocessing step and constructed by the logical relationship, together with the parameter combination as the input of the RF model, and training the RF model by cross-validation to obtain the MSE of the RF model;
S23, updating the parameter combination of the RF model according to the obtained MSE;
S24, judging whether the GS algorithm has reached the termination condition of the iteration; if so, ending the search and outputting the current parameter combination of the RF model as the optimal parameter combination; if not, returning to step S22 to continue iterating;
S25, establishing a GS-RF prediction model from the optimal parameter combination obtained in step S24, and predicting the emission intensity of the kth pollutant dimension over the time series with the prediction model.
2. The time-series-based multi-dimensional pollutant emission intensity prediction method according to claim 1, characterized in that: the data preprocessing step comprises:
S11, de-duplicating the acquired data set using SQL statements and deleting records whose field attributes are NaN, completing the preliminary cleaning of the data set;
S12, retrieving the data ordered by the time field using Python's DB-API, ensuring that the data set increases along the time dimension;
S13, using the data analysis library Pandas, instantiating the data set as a DataFrame object and designating the time field as the index with the set_index() function;
S14, taking the sampling times of the first and last records of the DataFrame object as the time-axis boundaries, and using the reindex() function to fill in the time-field entries missing on the time axis, with the remaining fields filled with the value 0;
S15, performing interpolation processing on the 0-filled intensity data: traversing the data set in a loop, recording the interpolation indexes, and counting the number of interpolations;
S16, according to the interpolation indexes and the interpolation count of the data set, performing linear interpolation at the index positions using the interp() function of the array computing library NumPy;
S17, setting the training scale of the samples and the corresponding regression sequence, and constructing a logical mapping relation.
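Steps S14 and S15 can be sketched as below. To stay self-contained the sketch uses only the Python standard library in place of the Pandas set_index() and reindex() calls named in the claim; the function name `complete_time_axis` and the sample readings are assumptions.

```python
from datetime import datetime, timedelta


def complete_time_axis(records, period):
    """Fill gaps on a regular time axis.

    records: dict {timestamp: value}, already de-duplicated and NaN-free (S11).
    period:  sampling period as a timedelta.
    Returns (times, values, missing_idx): missing slots hold 0.0 as a
    placeholder (S14) and their positions are recorded (S15).
    """
    start, end = min(records), max(records)     # time-axis boundaries
    times, values, missing_idx = [], [], []
    t = start
    while t <= end:
        if t not in records:
            missing_idx.append(len(times))      # remember where to interpolate
        times.append(t)
        values.append(records.get(t, 0.0))
        t += period
    return times, values, missing_idx


raw = {
    datetime(2021, 5, 29, 8): 1.0,
    datetime(2021, 5, 29, 9): 2.0,
    datetime(2021, 5, 29, 12): 5.0,   # 10:00 and 11:00 were never reported
}
times, values, missing_idx = complete_time_axis(raw, timedelta(hours=1))
print(values, missing_idx)
```

Recording the missing positions separately, rather than searching for zeros afterwards, keeps a genuine reading of 0 from being mistaken for a placeholder. With Pandas, a comparable result would typically come from reindex() over a regular date range with a fill value of 0.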
3. The time-series-based multi-dimensional pollutant emission intensity prediction method according to claim 2, characterized in that: the linear interpolation specifically includes:
traversing the indexes of the missing data, letting the ith record be missing, and constructing a linear function to interpolate the data of each pollutant type;
wherein i-d denotes the position of the first non-missing record before the ith record, and i+h denotes the position of the first non-missing record after the ith record, so that a linear neighbourhood function of each pollutant type is constructed at the missing-data index i, and the interpolated value of the corresponding pollutant is output for the input index i.
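A minimal sketch of this interpolation follows, assuming the linear function is the straight line through the two nearest valid neighbours, which gives y[i] = y[i-d] + (y[i+h] - y[i-d]) * d / (d + h). The claim's own formula is not reproduced in this text, so this form and the function name `interpolate_missing` are assumptions. The first and last records are taken to be valid, as they define the time-axis boundaries.

```python
def interpolate_missing(values, missing_idx):
    """Linearly interpolate each missing position from its nearest valid neighbours."""
    missing = set(missing_idx)
    filled = list(values)
    for i in missing_idx:
        lo = i - 1
        while lo in missing:      # walk left to position i - d
            lo -= 1
        hi = i + 1
        while hi in missing:      # walk right to position i + h
            hi += 1
        d, h = i - lo, hi - i
        # Line through (i - d, y[i-d]) and (i + h, y[i+h]), evaluated at i.
        filled[i] = values[lo] + (values[hi] - values[lo]) * d / (d + h)
    return filled


# Positions 2 and 3 hold 0.0 placeholders between the valid readings 2.0 and 5.0.
print(interpolate_missing([1.0, 2.0, 0.0, 0.0, 5.0], [2, 3]))
```

NumPy's interp() function, named in step S16, performs the same piecewise-linear evaluation when given the missing indexes as query points and the valid indexes and values as the known data.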
4. The time-series-based multi-dimensional pollutant emission intensity prediction method according to claim 2, characterized in that: setting the training scale of the samples and the corresponding regression sequence, and constructing the logical mapping relation, comprises:
setting a training scale L denoting the length of consecutive time nodes, wherein every L records form one training sample used to predict the data of the next time node;
dividing the data set obtained in step S16 into two parts according to the training scale L: the first n-1 samples are reconstructed into n-L training samples by sliding a window of length L, and the last n-L samples serve as the regression sequence;
constructing a logical mapping relation between the data of each consecutive time sequence and the data of the next time node.
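The window construction can be sketched as follows for a single pollutant series; the function name `build_samples` and the sample series are assumptions made for the example.

```python
def build_samples(series, L):
    """Slide a window of length L over the first n-1 values.

    Returns (X, y): X holds n-L windows of L consecutive values, and y holds
    the value at the time node that immediately follows each window.
    """
    n = len(series)
    if not 0 < L < n:
        raise ValueError("L must satisfy 0 < L < len(series)")
    X = [series[s:s + L] for s in range(n - L)]
    y = series[L:]                  # last n-L values: the regression sequence
    return X, y


X, y = build_samples([1.0, 2.0, 3.0, 4.0, 5.0], L=3)
print(X)  # [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
print(y)  # [4.0, 5.0]
```

With n = 5 and L = 3 the windows cover the first n-1 = 4 values and yield n-L = 2 pairs, matching the counts stated in the claim.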
5. The time-series-based multi-dimensional pollutant emission intensity prediction method according to claim 4, characterized in that: training the RF model by cross-validation in step S22 includes:
A1, randomly dividing the training set into k equal parts;
A2, taking 1 part as the validation set for model evaluation, and the remaining k-1 parts as the training set for model training;
A3, repeating step A2 k times, each time taking a different part as the validation set, to obtain k different models and their evaluation indexes;
A4, evaluating the performance of the overall model using the combined evaluation indexes of the k models.
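Steps A1 to A4 can be sketched in plain Python as below. The sketch assumes that the "comprehensive" index in A4 is the mean of the k fold scores, which the claim does not state; the `fit` and `score` hooks and the toy mean-predictor model are likewise hypothetical stand-ins for RF training and MSE evaluation.

```python
import random


def k_fold_indices(n_samples, k, seed=0):
    """A1: randomly split the sample indexes into k near-equal parts."""
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    return [idx[f::k] for f in range(k)]


def cross_validate(X, y, k, fit, score):
    """A2-A4: hold out each part once and average the k fold scores.

    fit(X_train, y_train) -> model and score(model, X_val, y_val) -> float
    are caller-supplied hooks.
    """
    folds = k_fold_indices(len(X), k)
    scores = []
    for f, val_idx in enumerate(folds):
        train_idx = [i for g, fold in enumerate(folds) if g != f for i in fold]
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model, [X[i] for i in val_idx], [y[i] for i in val_idx]))
    return sum(scores) / k


def fit_mean(X_train, y_train):
    """Toy model: always predict the mean of the training targets."""
    return sum(y_train) / len(y_train)


def score_mse(model, X_val, y_val):
    """MSE of the constant prediction on the validation part."""
    return sum((t - model) ** 2 for t in y_val) / len(y_val)


X = [[v] for v in range(10)]
y = [2.0] * 10                      # constant target, so the toy model is exact
print(cross_validate(X, y, k=5, fit=fit_mean, score=score_mse))
```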
6. The time-series-based multi-dimensional pollutant emission intensity prediction method according to claim 1, characterized in that: parameters in the RF model include:
n_estimators, representing the number of trees in the random forest, with the search space set to $[a_1, b_1]$ and the search step set to $step_1$;
max_depth, representing the maximum depth of a tree, with the search space set to $[a_2, b_2]$ and the search step set to $step_2$;
min_samples_leaf, representing the minimum number of samples contained in each child node after a node is split, with the search space set to $[a_3, b_3]$ and the search step set to $step_3$;
min_samples_split, representing the minimum number of training samples a node must contain to be split, with the search space set to $[a_4, b_4]$ and the search step set to $step_4$;
max_features, representing the maximum number of features considered when splitting, with the search space set to $[a_5, b_5]$ and the search step set to $step_5$.
7. The time-series-based multi-dimensional pollutant emission intensity prediction method according to claim 6, characterized in that: the search space represents the tuning range of a parameter, the search step represents the interval between candidate values of the parameter, and a grid space is formed according to the search space and the search step of each parameter in the RF model:
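As an illustration of how such a grid space might be enumerated, the sketch below builds one axis per parameter from a lower bound, an upper bound, and a step, then takes the Cartesian product. The numeric bounds and steps are placeholders, since the claims leave a_j, b_j and step_j unspecified, and the helper name `axis` is hypothetical.

```python
from itertools import product


def axis(a, b, step):
    """Candidate values a, a+step, ... not exceeding b (integer parameters)."""
    return list(range(a, b + 1, step))


# Placeholder search spaces [a_j, b_j] and steps for the five RF parameters.
search_space = {
    "n_estimators": axis(50, 200, 50),
    "max_depth": axis(2, 8, 2),
    "min_samples_leaf": axis(1, 5, 2),
    "min_samples_split": axis(2, 6, 2),
    "max_features": axis(1, 3, 1),
}

# Every point of the grid is one parameter combination to be evaluated.
grid = [dict(zip(search_space, combo)) for combo in product(*search_space.values())]
print(len(grid))   # 4 * 4 * 3 * 3 * 3 = 432 combinations
print(grid[0])
```

The grid size is the product of the axis lengths, so a smaller step on any one parameter multiplies the number of RF models that must be trained and cross-validated.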
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110642414.4A CN113361199A (en) | 2021-06-09 | 2021-06-09 | Multi-dimensional pollutant emission intensity prediction method based on time series |
Publications (1)
Publication Number | Publication Date |
---|---|
CN113361199A | 2021-09-07 |
Family ID: 77533413
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110642414.4A Pending CN113361199A (en) | 2021-06-09 | 2021-06-09 | Multi-dimensional pollutant emission intensity prediction method based on time series |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113361199A (en) |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||