CN112819244B - Meteorological factor-based RF-HW water quality index hybrid prediction method - Google Patents

Meteorological factor-based RF-HW water quality index hybrid prediction method

Info

Publication number
CN112819244B
Authority
CN
China
Prior art keywords
data
water quality
quality index
model
factor
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110204105.9A
Other languages
Chinese (zh)
Other versions
CN112819244A (en)
Inventor
张仪萍
姚欣宇
刘小为
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang University ZJU
Original Assignee
Zhejiang University ZJU
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang University ZJU filed Critical Zhejiang University ZJU
Priority to CN202110204105.9A priority Critical patent/CN112819244B/en
Publication of CN112819244A publication Critical patent/CN112819244A/en
Application granted granted Critical
Publication of CN112819244B publication Critical patent/CN112819244B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/01 Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Systems or methods specially adapted for specific business sectors, e.g. utilities or tourism
    • G06Q50/06 Electricity, gas or water supply
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A20/00 Water conservation; Efficient water supply; Efficient water use
    • Y02A20/152 Water filtration

Abstract

The invention discloses a meteorological factor-based RF-HW water quality index hybrid prediction method, comprising the following steps: acquiring meteorological factor data and water quality index data and, after filtering outliers from the acquired data, constructing a sample database containing meteorological factor time series data and water quality index time series data; constructing a water quality index prediction model comprising a Holt-Winters model, a random forest model and a random forest correction model; in application, inputting the meteorological factor time series data and the water quality index time series data into the water quality index prediction model and calculating a mixed predicted value of the water quality index time series data at a future time. Combining the random forest algorithm with the Holt-Winters model overcomes the inherent limitations of a single model and can improve the stability and accuracy of prediction.

Description

Meteorological factor-based RF-HW water quality index hybrid prediction method
Technical Field
The invention belongs to the technical field of water environment monitoring, and particularly relates to a meteorological factor-based RF-HW water quality index hybrid prediction method.
Background
Drinking water safety is essential to people's lives, and drinking water sources, as the origin and supply of drinking water, are key to ensuring it. By predicting changes in raw water quality in rivers, lakes, reservoirs and other water sources, effective treatment measures can be taken in advance according to water quality fluctuations, ensuring the quality of supplied water.
Water quality prediction models can be broadly divided into mechanistic and non-mechanistic models. Mechanistic models describe changes in the water system mathematically, based on the physical, chemical and biological reaction mechanisms of the components involved in the water cycle. The earliest water quality model was the Streeter-Phelps model (S-P model), used to study the growth and decay of BOD and DO in one-dimensional rivers. Mechanistic water quality models have since been continuously developed and refined, giving rise to EFDC, WASP, MIKE, QUASAR, the QUAL series and other models. Model scale has grown from one dimension to three, water quality components from single to multiple, and research on the mechanisms of water quality change has deepened. Although mechanistic models have clear physical meaning and a mathematical foundation, the modeling process is complex and the many model parameters are difficult to obtain. Practical applications often require extensive simplification, with many parameters set to constants or ignored altogether, so the model and its solution carry uncertainty and the simulation results may not faithfully reflect actual conditions.
Non-mechanistic models were therefore developed, aided by advances in systems theory and computer technology. Non-mechanistic models give little consideration to how physical, chemical and biological factors influence water quality and do not address the internal mechanisms of the water environment; by modeling approach they can be divided into statistical prediction models and intelligent prediction models. Common statistical prediction models include time series models, regression models, gray theory models, chaos theory models, and the like. Because many factors in the water environment interact and monitoring data contain errors, the prediction stability and fault tolerance of statistical models are reduced.
With the development of computer technology, intelligent prediction models have gradually emerged, and Artificial Neural Networks (ANN), Support Vector Machines (SVM) and similar methods are now frequently used in water quality prediction. Neural networks used for water quality prediction tend to become trapped in local optima, so researchers have optimized the BP neural network with intelligent algorithms such as the genetic algorithm, quantum genetic algorithm, particle swarm algorithm and simulated annealing, further improving the applicability of the model. The SVM has strong learning and generalization capability and can overcome overfitting; in recent years it has been introduced into water quality prediction by researchers at home and abroad with considerable success. However, the SVM is difficult to apply to large training samples, and its accuracy depends heavily on parameter selection. In general, most existing non-mechanistic water quality models require several related water quality variables as inputs, which makes forecasting in advance difficult, and the models are not sufficiently stable.
The random forest algorithm builds on Bagging ensemble learning theory and the random subspace method. It offers high prediction accuracy, good generalization, fast convergence and few tuning parameters, and is not prone to overfitting. It is widely applied in medicine, biology, economics and other fields, but has seen relatively little use in water quality modeling.
Disclosure of Invention
In view of the above, the present invention provides a meteorological factor-based RF-HW water quality index hybrid prediction method, which combines a random forest (RF) algorithm with a Holt-Winters (HW) model to overcome the inherent limitations of a single model and can improve the stability and accuracy of prediction.
In order to achieve the purpose of the invention, the technical scheme provided by the invention is as follows:
a meteorological factor-based RF-HW water quality index hybrid prediction method comprises the following steps:
acquiring meteorological factor data and water quality index data, and after filtering abnormal values of the acquired data, constructing a sample database containing meteorological factor time series data and water quality index time series data;
constructing a water quality index prediction model comprising a Holt-Winters model, a random forest model and a random forest correction model, wherein both the input and the output of the Holt-Winters model are water quality index time series data; the input of the random forest model is meteorological factor time series data and lagged water quality index time series data, and its output is a predicted value of the water quality index time series data; the input of the random forest correction model is the predicted values of the water quality index time series data output by the Holt-Winters model and the random forest model, and its output is a mixed predicted value of the water quality index time series data; the data smoothing factor, trend smoothing factor and seasonal smoothing factor of the Holt-Winters model, and the number of trees, maximum number of features and maximum tree depth of the random forest model and the random forest correction model, are optimized using data in the sample database, thereby optimizing the parameters of the water quality index prediction model;
when the method is applied, the meteorological factor time series data and the water quality index time series data are input into the water quality index prediction model, and the mixed prediction value of the water quality index time series data at the future time is obtained through calculation.
Preferably, the collected data is subjected to outlier filtering by adopting a moving window-based quartile elimination method, wherein the moving window-based quartile elimination method is calculated by means of the following rules:
$$L_u = Q_{0.75} + a \cdot h$$
$$L_d = Q_{0.25} - a \cdot h$$
$$h = Q_{0.75} - Q_{0.25}$$
where $L_u$ and $L_d$ are the upper and lower rejection bounds of the window-length subsample; $Q_{0.75}$ and $Q_{0.25}$ are the upper and lower quartiles of the subsample, respectively; $h$ is the interquartile range, reflecting the dispersion of the middle 50% of the data; and $a$ is a constant determined according to the data distribution and the required accuracy, with a value range of 0.5 to 2.5.
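The rejection rule above can be sketched in a few lines of Python. This is a minimal illustration rather than the patented implementation: the function names, the centered window, and the use of the "inclusive" quantile method from the standard library are all assumptions, since the text does not specify how the window moves or how quartiles are interpolated.

```python
from statistics import quantiles

def quartile_bounds(window, a):
    """Return (L_d, L_u) for one window-length subsample."""
    q1, _, q3 = quantiles(window, n=4, method="inclusive")
    h = q3 - q1                      # interquartile range
    return q1 - a * h, q3 + a * h

def moving_window_filter(values, window_len, a):
    """Replace values outside [L_d, L_u] of their own window with None.

    Assumption: the window is centered on each point and truncated at
    the ends of the series.
    """
    half = window_len // 2
    cleaned = []
    for i, v in enumerate(values):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        l_d, l_u = quartile_bounds(values[lo:hi], a)
        cleaned.append(v if l_d <= v <= l_u else None)
    return cleaned

# Hypothetical dissolved-oxygen readings with one spike; a = 1 as suggested
# for dissolved oxygen in the text.
readings = [8.0, 8.1, 7.9, 8.2, 30.0, 8.0, 8.1, 7.9, 8.2]
cleaned = moving_window_filter(readings, window_len=5, a=1.0)
```

Rejected points are returned as None so that a later gap-filling step can treat them like any other missing value.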
Preferably, the meteorological factor data comprises air temperature data, rainfall data, wind speed data, wind direction data and the like, and the water quality index data comprises dissolved oxygen data, permanganate index data, ammonia nitrogen data, turbidity data and the like;
when the moving window-based quartile elimination method is applied to the air temperature data, the constant a ranges from 1 to 1.5, preferably from 1 to 1.2, and more preferably is 1;
when the moving window-based quartile elimination method is applied to the rainfall data, the constant a ranges from 0.5 to 1, preferably from 0.5 to 0.7, and more preferably is 0.5;
when the moving window-based quartile elimination method is applied to the wind speed data, the constant a ranges from 1 to 2, preferably from 1.4 to 1.6, and more preferably is 1.5;
when the moving window-based quartile elimination method is applied to the wind direction data, the constant a ranges from 1.5 to 2.5, preferably from 1.8 to 2, and more preferably is 2;
when the moving window-based quartile elimination method is applied to the dissolved oxygen data, the constant a ranges from 1 to 1.5, preferably from 1 to 1.2, and more preferably is 1;
when the moving window-based quartile elimination method is applied to the permanganate index data, the constant a ranges from 1 to 1.5, preferably from 1 to 1.2, and more preferably is 1.2;
when the moving window-based quartile elimination method is applied to the ammonia nitrogen data, the constant a ranges from 0.5 to 1, preferably from 0.6 to 0.8, and more preferably is 0.8;
when the moving window-based quartile elimination method is applied to the turbidity data, the constant a ranges from 1.5 to 2.5, preferably from 1.8 to 2, and more preferably is 2.
Preferably, an additive Holt-Winters model is adopted when constructing the Holt-Winters model, and the predicted value $\hat{y}_{t+h}$ of the water quality index time series h steps ahead is given by:
$$l_t = \alpha (y_t - s_{t-m}) + (1-\alpha)(l_{t-1} + b_{t-1})$$
$$b_t = \beta (l_t - l_{t-1}) + (1-\beta)\, b_{t-1}$$
$$s_t = \gamma (y_t - l_{t-1} - b_{t-1}) + (1-\gamma)\, s_{t-m}$$
$$\hat{y}_{t+h} = l_t + h\, b_t + s_{t+h-m}$$
where $l_t$, $b_t$ and $s_t$ are the three components of the water quality index time series data $y_t$ at time t, namely the level term, the trend term and the seasonal term; m is the period length; $s_{t-m}$ is the seasonal term corresponding to time t in the previous period; $s_{t+h-m}$ is the seasonal term corresponding to the predicted time in the previous period; $\alpha$ is the data smoothing factor, $0<\alpha<1$; $\beta$ is the trend smoothing factor, $0<\beta<1$; and $\gamma$ is the seasonal smoothing factor, $0<\gamma<1$.
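The recursion above translates directly into a short loop. The following Python sketch is illustrative only: the function name is hypothetical, the initialization of level, trend and seasonal terms from the first two periods is an assumption (the text does not state one), and the forecast horizon is restricted to at most one period so that the seasonal term of the previous period is always available.

```python
def holt_winters_additive(y, m, alpha, beta, gamma, horizon):
    """Additive Holt-Winters forecast for steps 1..horizon (horizon <= m)."""
    if len(y) < 2 * m:
        raise ValueError("need at least two full periods of data")
    if not 1 <= horizon <= m:
        raise ValueError("horizon must be between 1 and m")

    # Assumed initialization: level = mean of first period, trend = change
    # in period mean per step, seasonal terms = deviations from that mean.
    first = sum(y[:m]) / m
    second = sum(y[m:2 * m]) / m
    level = first
    trend = (second - first) / m
    season = [y[i] - first for i in range(m)]   # season[p] holds the latest s for phase p

    for t in range(m, len(y)):
        s_prev = season[t % m]                  # s_{t-m}
        prev_level, prev_trend = level, trend   # l_{t-1}, b_{t-1}
        level = alpha * (y[t] - s_prev) + (1 - alpha) * (prev_level + prev_trend)
        trend = beta * (level - prev_level) + (1 - beta) * prev_trend
        season[t % m] = gamma * (y[t] - prev_level - prev_trend) + (1 - gamma) * s_prev

    last = len(y) - 1
    # Forecast: l_t + h * b_t + s_{t+h-m}
    return [level + h * trend + season[(last + h) % m] for h in range(1, horizon + 1)]
```

With a 4-hour sampling interval, a daily cycle would correspond to m = 6; that value is an example, not one stated in the text.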
Preferably, missing values in the water quality index time series data are filled by linear interpolation before the data are input into the Holt-Winters model.
Preferably, the implementation process of the random forest model is as follows:
(1) according to the preset number of decision trees, randomly drawing K different subsample data sets from the sample database by the bootstrap method, each serving as the training subset of one decision tree, wherein each subsample has the same size as the original sample database, samples are drawn with replacement, and the data not drawn in each round form the out-of-bag data;
(2) building a decision tree for each of the K subsample data sets, repeating process a for each decision tree to grow branches recursively until a stopping condition is reached, whereupon the decision tree stops growing, and determining the predicted value of the water quality index time series data of each decision tree from the final child node (leaf);
the process a is as follows: taking the input meteorological factors and water quality indices of all categories as feature variables; randomly selecting at least one feature variable from all feature variables available at the current node; choosing, from the selected feature variables, one feature variable and a corresponding value as the optimal split variable and split value; and splitting all samples at the current node into left and right child nodes according to their values of the split variable relative to the split value;
(3) combining all decision trees into a random forest, and taking the average of the predicted values of the K decision trees as the final predicted value of the water quality index time series data.
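Steps (1) and (3) can be illustrated with a small Python sketch of bootstrap sampling and prediction averaging. The function names and the fixed random seed are illustrative assumptions; tree construction itself (step (2)) is left out here.

```python
import random

def bootstrap_indices(n_samples, rng):
    """Draw n_samples indices with replacement; return (in_bag, out_of_bag)."""
    in_bag = [rng.randrange(n_samples) for _ in range(n_samples)]
    chosen = set(in_bag)
    out_of_bag = [i for i in range(n_samples) if i not in chosen]
    return in_bag, out_of_bag

def forest_predict(tree_predictions):
    """Average the predictions of the K trees for one sample."""
    return sum(tree_predictions) / len(tree_predictions)

rng = random.Random(0)          # fixed seed only to make the sketch repeatable
K = 3                           # hypothetical number of trees
bags = [bootstrap_indices(10, rng) for _ in range(K)]
```

Each tree would be trained on its in-bag indices, while its out-of-bag indices provide held-out samples for estimating the error of that tree.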
Preferably, the quality of a split variable and split value is measured by the weighted sum of the impurities of the left and right child nodes after splitting:
$$G(x_i, v_{ij}) = \frac{n_{left}}{N_s} H(X_{left}) + \frac{n_{right}}{N_s} H(X_{right})$$
where $x_i$ denotes the i-th split variable; $v_{ij}$ denotes the j-th split value of the split variable $x_i$; $N_s$, $n_{left}$ and $n_{right}$ denote the number of samples at the current node and in the left and right child nodes after splitting, respectively; $X_{left}$ and $X_{right}$ denote the sample sets of the left and right child nodes, respectively; and $H(\cdot)$ is the function measuring node impurity, for which the mean squared error between the predicted and true values of the water quality index time series data is adopted.
Preferably, when the current node is split into left and right child nodes, a sample whose variable value is less than or equal to the split value of the current node goes to the left child node, and a sample whose variable value is greater than the split value goes to the right child node.
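The split search can be sketched as follows. This is a minimal Python illustration under stated assumptions: the function names are hypothetical, a node's prediction is taken to be the mean of its targets (so the node MSE equals the variance of the targets), and candidate split values are simply the distinct observed values of each feature.

```python
def node_mse(targets):
    """H(.): mean squared error of a node that predicts the mean of its targets."""
    if not targets:
        return 0.0
    mean = sum(targets) / len(targets)
    return sum((t - mean) ** 2 for t in targets) / len(targets)

def split_quality(column, targets, split_value):
    """Weighted sum of child impurities; values <= split_value go left."""
    left = [t for x, t in zip(column, targets) if x <= split_value]
    right = [t for x, t in zip(column, targets) if x > split_value]
    n = len(targets)
    return len(left) / n * node_mse(left) + len(right) / n * node_mse(right)

def best_split(rows, targets, candidate_features):
    """Return (quality, feature_index, split_value) with the lowest weighted MSE."""
    best = None
    for i in candidate_features:
        column = [row[i] for row in rows]
        # Skip the largest value: splitting there would leave the right child empty.
        for v in sorted(set(column))[:-1]:
            g = split_quality(column, targets, v)
            if best is None or g < best[0]:
                best = (g, i, v)
    return best

# Hypothetical rows of [air temperature, rainfall] with dissolved-oxygen targets.
rows = [[10.0, 0.0], [12.0, 5.0], [25.0, 0.0], [27.0, 5.0]]
targets = [9.0, 9.2, 6.0, 6.2]
quality, feature, value = best_split(rows, targets, candidate_features=[0, 1])
```

In a random forest, candidate_features would be the randomly selected subset of feature indices at the current node rather than all features.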
Preferably, the random forest correction model is implemented in the same way as the random forest model, except that the data processed are the predicted values of the water quality index time series data output by the Holt-Winters model and the random forest model; that is, these predicted values are used as feature variables for splitting into left and right child nodes, the decision trees are grown accordingly, and the mixed predicted value of the water quality index time series data is output.
Preferably, during application, meteorological factor time series data in a test sample are input into the parameter-optimized Holt-Winters model, and a predicted value of the water quality index time series data at a future time is calculated; at the same time, the meteorological factor time series data and the lagged water quality index time series data in the test sample are input into the parameter-optimized random forest model, and a predicted value of the water quality index time series data at the future time is calculated; the predicted values at the future time output by the Holt-Winters model and the random forest model are then input into the parameter-optimized random forest correction model, a mixed predicted value of the water quality index time series data at the future time is calculated, and the mixed predicted value is added to the sample database.
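The data flow of the application stage can be sketched as a small Python function. Everything here is an illustrative assumption: the three models are passed in as plain callables, the function and parameter names are hypothetical, and the Holt-Winters callable is fed the water quality history, following the model definition in which its input and output are both water quality index time series data.

```python
def hybrid_forecast(history, weather_forecast, n_lags,
                    hw_model, rf_model, correction_model):
    """One-step hybrid forecast; appends the result to the history."""
    hw_pred = hw_model(history)                       # Holt-Winters prediction
    lags = history[-n_lags:][::-1]                    # Y-1, Y-2, ...
    rf_pred = rf_model(list(weather_forecast) + lags) # random forest prediction
    mixed = correction_model([hw_pred, rf_pred])      # correction model output
    history.append(mixed)                             # update the sample database
    return mixed

# Stand-in models, only to show the call pattern (not trained models).
history = [8.0, 8.2, 8.4]
mixed = hybrid_forecast(
    history,
    weather_forecast=[20.0, 0.0],          # hypothetical air temperature, rainfall
    n_lags=2,
    hw_model=lambda h: h[-1] + 0.2,
    rf_model=lambda features: features[2],
    correction_model=lambda preds: (preds[0] + preds[1]) / 2,
)
```

Because the mixed value is written back into the history, repeated calls produce a rolling multi-step forecast in which later steps rely on earlier predictions.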
Compared with the prior art, the invention has the beneficial effects that at least:
the meteorological factor-based RF-HW water quality index hybrid prediction method provided by the embodiments of the invention uses a water quality index prediction model that combines the RF and HW methods, overcoming the inherent limitations of a single model. Because meteorological factors can be forecast and are easy to obtain, using them as inputs enables prediction in advance. The model database can be updated in real time, further improving the reliability of the predicted values. The method improves model stability and simplifies model inputs; it can be widely applied to water quality monitoring and embedded in a monitoring system to provide short-term water quality forecasts. The model can also be used to check automatic monitoring data, and requires no manual maintenance during operation. Automatic water quality monitoring equipment may suffer blockages and similar problems in long-term operation that bias the monitored values; comparing predicted values with monitored values allows abnormal operation of the equipment to be detected in time.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.
FIG. 1 is a flow chart of a meteorological factor-based RF-HW water quality indicator hybrid prediction method according to an embodiment of the present invention;
FIG. 2 is a graph comparing the filtered and interpolated data of the Holt-Winters model training set with the original data, according to an embodiment of the present invention;
FIG. 3 shows the distribution of model predicted values against measured values according to an embodiment of the present invention, wherein (a) is the distribution for the single model and (b) is the distribution for the water quality index prediction model provided by the embodiment.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention will be further described in detail with reference to the accompanying drawings and examples. It should be understood that the detailed description and specific examples, while indicating the scope of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention.
Fig. 1 is a flowchart of a meteorological factor-based RF-HW water quality indicator hybrid prediction method according to an embodiment of the present invention. As shown in fig. 1, the meteorological factor-based RF-HW water quality indicator hybrid prediction method provided by the embodiment includes the following steps:
Step 1, collecting meteorological factor data and water quality index data, filtering outliers from the collected data, and establishing a sample database containing meteorological factor time series data and water quality index time series data.
The water quality index data include indices such as dissolved oxygen, and the meteorological factor data are likewise represented as time series. At each time point the meteorological factors and the water quality index are in one-to-one correspondence, so that once the meteorological factors at a given time are determined, the water quality index at that time can be obtained. The initial sample database contains 13068 records in total, recorded at 4-hour intervals.
After the data are obtained, outliers must be filtered out to remove invalid data. Specifically, the data may be preprocessed by a moving window-based quartile elimination method, which is calculated by the following rules:
$$L_u = Q_{0.75} + a \cdot h$$
$$L_d = Q_{0.25} - a \cdot h$$
$$h = Q_{0.75} - Q_{0.25}$$
where $L_u$ and $L_d$ are the upper and lower rejection bounds of the window-length subsample; $Q_{0.75}$ and $Q_{0.25}$ are the upper and lower quartiles of the subsample, respectively; and $h$ is the interquartile range, reflecting the dispersion of the middle 50% of the data. In this embodiment, the constant a is 1 for the air temperature data, 0.5 for the rainfall data, and 1 for the dissolved oxygen data. Slicing the samples into windows allows more anomalies to be found, and moving the window avoids the effect of slice boundaries. After outlier filtering, 11087 usable records remain.
Step 2, constructing a training set and a test set.
The training set and the test set are constructed by extracting sample data from the sample database. The input variables in the training set of the random forest model are the meteorological factor time series X (X1, X2, …), such as air temperature and rainfall, and the lagged series (Y-1, Y-2, …) of the water quality index time series Y to be predicted; the output variable is the water quality index time series Y. The test set of the random forest model contains the meteorological factor forecast data and the lagged water quality index values for the prediction period. The training and test sets of the random forest model and the random forest correction model are historical time series of the water quality index.
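Assembling the random forest inputs from meteorological factors and lagged water quality values can be sketched as below. The function name, the list-of-lists layout and the sample numbers are assumptions made for illustration; the text does not state how many lags are used.

```python
def build_training_set(weather, quality, n_lags):
    """Build (features, targets) for the random forest model.

    weather: one list of meteorological factors per time step (X1, X2, ...)
    quality: water quality index series Y, aligned with weather
    Each feature row is [X1, X2, ..., Y-1, Y-2, ...] and the target is Y.
    """
    if len(weather) != len(quality):
        raise ValueError("weather and quality must have the same length")
    features, targets = [], []
    for t in range(n_lags, len(quality)):
        lags = [quality[t - k] for k in range(1, n_lags + 1)]
        features.append(list(weather[t]) + lags)
        targets.append(quality[t])
    return features, targets

# Hypothetical 4-hourly records: [air temperature, rainfall] and dissolved oxygen.
weather = [[20.0, 0.0], [21.0, 0.5], [19.0, 0.0], [18.0, 1.2]]
quality = [8.0, 8.2, 7.9, 8.1]
X, y = build_training_set(weather, quality, n_lags=2)
```

The first n_lags time steps are dropped because their lagged values are not available.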
In this embodiment, the water quality index is dissolved oxygen, and the corresponding water quality index historical time series is a dissolved oxygen historical time series.
Step 3, establishing the Holt-Winters model.
Because the original data may contain missing values and extreme values are removed during screening, the water quality index time series must first be completed. The model parameters, namely the data smoothing factor α, the trend smoothing factor β and the seasonal smoothing factor γ, are then determined according to the performance of the model on the training set.
In this embodiment, gaps are filled by linear interpolation, which approximates a missing value of the time series from the proportional relation between time and observed value. Given a time series T in which the observed values at times $T_i$ and $T_k$ are $y(T_i)$ and $y(T_k)$, respectively, and the sample value $y(T_j)$ at time $T_j$ is missing, where $i<j<k$, the estimate of $y(T_j)$ is:
$$y(T_j) = y(T_i) + \frac{T_j - T_i}{T_k - T_i}\,\bigl(y(T_k) - y(T_i)\bigr)$$
The missing data must not lie at either end of the time series; otherwise linear interpolation cannot be applied. For meteorological and hydrological time series with small variability, linear interpolation gives more accurate estimates than K-nearest-neighbor interpolation, spline interpolation, polynomial interpolation and kernel density estimation.
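A minimal Python sketch of this gap-filling step is given below. The function name and the use of None to mark missing values are assumptions; the end-of-series restriction described above is enforced with an explicit error.

```python
def fill_gaps_linear(times, values):
    """Fill None entries by linear interpolation between observed neighbors."""
    if values[0] is None or values[-1] is None:
        raise ValueError("gaps at either end of the series cannot be interpolated")
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        for j in range(left + 1, right):
            # (T_j - T_i) / (T_k - T_i)
            ratio = (times[j] - times[left]) / (times[right] - times[left])
            filled[j] = values[left] + ratio * (values[right] - values[left])
    return filled

# Hypothetical 4-hourly series with two consecutive missing records.
times = [0, 4, 8, 12, 16]
values = [8.0, None, None, 9.5, 9.0]
filled = fill_gaps_linear(times, values)
```

Consecutive gaps are all interpolated from the same pair of observed neighbors, so the filled segment lies on a single straight line.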
The Holt-Winters seasonal model is a form of triple exponential smoothing. It combines time series decomposition with exponential smoothing, decomposing a time series into a trend term, a seasonal term and a level term and estimating the three components separately. The model has an additive form and a multiplicative form; the additive form is adopted in this embodiment. Let $Y_n = (y_1, y_2, \ldots, y_n)$ be a periodic time series. In the additive Holt-Winters model, the three components of the time series, namely the level term $l_t$, the trend term $b_t$ and the seasonal term $s_t$, are added together, and the Holt-Winters predicted value $\hat{y}_{t+h}$ h steps ahead is given by:
$$l_t = \alpha (y_t - s_{t-m}) + (1-\alpha)(l_{t-1} + b_{t-1})$$
$$b_t = \beta (l_t - l_{t-1}) + (1-\beta)\, b_{t-1}$$
$$s_t = \gamma (y_t - l_{t-1} - b_{t-1}) + (1-\gamma)\, s_{t-m}$$
$$\hat{y}_{t+h} = l_t + h\, b_t + s_{t+h-m}$$
where $l_t$, $b_t$ and $s_t$ are the three components of the water quality index time series data $y_t$ at time t, namely the level term, the trend term and the seasonal term; m is the period length; $s_{t-m}$ is the seasonal term corresponding to time t in the previous period; and $s_{t+h-m}$ is the seasonal term corresponding to the predicted time in the previous period. FIG. 2 compares the filtered and interpolated training set data with the original data; it can be seen that in the processed data the outliers have been removed and the series has been completed. The data smoothing factor α, the trend smoothing factor β and the seasonal smoothing factor γ are then calibrated to 0.2, 0.2 and 0.08, respectively.
Step 4, establishing the random forest model.
The random forest algorithm combines multiple weak learners into a strong learner. Because the water quality index output is a continuous value, the regression form of the random forest is adopted. The meteorological factors and the lagged water quality index series are used as input, and the model parameters are determined according to the out-of-bag estimation accuracy of the model (i.e., the out-of-bag mean squared error). The main tuning parameters are the number of trees, the maximum number of features considered when searching for the best split of a decision tree, and the maximum tree depth.
The random forest model is implemented by the following steps:
(1) according to the preset number of decision trees, randomly drawing K different subsample data sets from the sample database by the bootstrap method, each serving as the training subset of one decision tree, wherein each subsample has the same size as the original sample database, samples are drawn with replacement, and the data not drawn in each round form the out-of-bag data;
(2) building a decision tree for each of the K subsample data sets, repeating process a for each decision tree to grow branches recursively until a stopping condition is reached, whereupon the decision tree stops growing, and determining the predicted value of the water quality index time series data of each decision tree from the final child node (leaf);
the process a: taking the input meteorological factors and water quality indices of all categories as feature variables; randomly selecting at least one feature variable from all feature variables available at the current node; choosing, from the selected feature variables, one feature variable and a corresponding value as the optimal split variable and split value; and splitting all samples at the current node into left and right child nodes according to their values of the split variable relative to the split value, wherein a sample whose variable value is less than or equal to the split value of the current node goes to the left child node, and a sample whose variable value is greater than the split value goes to the right child node;
(3) combining all decision trees into a random forest, and taking the average of the predicted values of the K decision trees as the final predicted value of the water quality index time series data.
In the embodiment, the quality of the segmentation variable and the segmentation value is measured by the weighted sum of the purities of the left and right sub-nodes after segmentation:
Figure BDA0002949159030000111
wherein $x_i$ denotes the i-th segmentation variable; $v_{ij}$ denotes the j-th segmentation value of the segmentation variable $x_i$; $N_s$, $n_{left}$ and $n_{right}$ denote the numbers of samples at the current node and at the left and right child nodes after segmentation, respectively; $X_{left}$ and $X_{right}$ denote the sample sets of the left and right child nodes, respectively; and $H(\cdot)$ is the function measuring node purity, for which the mean square error (MSE) between the predicted value and the true value of the water quality index time series data is adopted.
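The formula image itself is not reproduced in this text. Going only by the symbol definitions, a criterion of this kind weights the purity of each child node by its share of the samples at the current node. The sketch below assumes that form, and also assumes that the predicted value of a node is the mean of its samples; the helper names are hypothetical.

```python
def mse(values):
    """Mean squared deviation from the node mean (taken here as the node's prediction)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)

def split_score(xs, ys, threshold):
    """Weighted sum of child MSEs for the split x <= threshold / x > threshold."""
    left = [y for x, y in zip(xs, ys) if x <= threshold]
    right = [y for x, y in zip(xs, ys) if x > threshold]
    n = len(ys)
    return len(left) / n * mse(left) + len(right) / n * mse(right)

def best_split(xs, ys):
    """Try each observed value except the largest as threshold; keep the lowest score."""
    candidates = sorted(set(xs))[:-1]
    return min(candidates, key=lambda t: split_score(xs, ys, t))

xs = [1.0, 2.0, 3.0, 4.0]   # values of one feature variable
ys = [5.0, 5.0, 9.0, 9.0]   # water quality index targets (made-up numbers)
threshold = best_split(xs, ys)
```

For this toy data the split at 2.0 separates the two target groups exactly, so both child nodes have zero MSE.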
In this embodiment, the model has only 4 input variables in total, so the maximum feature number has little effect on computation time and is set to 4. During calibration, once the number of trees reaches 200 and the maximum depth reaches 50, increasing these parameters further brings no obvious gain in model accuracy while increasing computation time; the number of trees is therefore set to 200 and the maximum depth to 50.
Step 5, establishing a random forest correction model.
The random forest correction model is implemented in the same way as the random forest model, and the model parameters to be determined are the number of trees, the maximum number of features considered when searching for the optimal split of a decision tree, and the maximum depth of the trees.
The difference is that the data processed are the predicted values of the water quality index time series data output by the Holt-Winters model and the random forest model; that is, these two predicted values are used as feature variables for splitting into left and right child nodes, the decision trees are grown, and the hybrid predicted value of the water quality index time series data is output.
Step 6, predicting with the water quality index prediction model composed of the established Holt-Winters model, random forest model and random forest correction model.
The specific prediction process is as follows: the meteorological factor time series data in the test sample are input into the parameter-optimized Holt-Winters model, and a predicted value of the water quality index time series data at the future moment is calculated; at the same time, the meteorological factor time series data and the water quality index lag time series data in the test sample are input into the parameter-optimized random forest model, and a predicted value of the water quality index time series data at the future moment is calculated; the predicted values output by the Holt-Winters model and the random forest model are then input into the parameter-optimized random forest correction model, the hybrid predicted value of the water quality index time series data at the future moment is calculated, and the hybrid predicted value is added to the sample database.
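The data flow of this step can be sketched as follows. The three models are hypothetical stand-in callables (a last-value forecaster, a feature average and an equal-weight blend), chosen only so that the sketch runs; they are not the fitted Holt-Winters, random forest or correction models of the embodiment.

```python
def hybrid_forecast(hw_model, rf_model, corrector, hw_input, rf_features):
    """Feed both base forecasts to the correction model and return its output."""
    hw_pred = hw_model(hw_input)            # forecast of the Holt-Winters branch
    rf_pred = rf_model(rf_features)         # forecast of the random forest branch
    return corrector([hw_pred, rf_pred])    # hybrid (corrected) forecast

# Stand-in models, purely illustrative.
hw_model = lambda series: series[-1]
rf_model = lambda feats: sum(feats) / len(feats)
corrector = lambda preds: 0.5 * preds[0] + 0.5 * preds[1]

database = [6.0, 6.2, 6.4]                  # made-up water quality index values
value = hybrid_forecast(hw_model, rf_model, corrector, database, [6.0, 7.0])
database.append(value)                      # the hybrid forecast joins the sample database
```

The point of the arrangement is that the correction model sees only the two base forecasts as its feature variables, so it learns how to weigh them rather than relearning the raw inputs.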
In this embodiment, the original data set is divided into a training set and a test set at a ratio of 9:1. Each prediction covers only the data at the next moment; the database is then updated and prediction continues for the following moment, which simulates the real-time updating of the database in practical application.
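A minimal sketch of such a one-step-ahead evaluation loop is given below. It assumes that the database is updated with the newly observed value before the next prediction; the function name `walk_forward` and the naive last-value forecaster are hypothetical.

```python
def walk_forward(series, predict_next, train_ratio=0.9):
    """One-step-ahead evaluation: predict, then reveal the true value and move on."""
    split = int(len(series) * train_ratio)
    history = list(series[:split])          # training portion (first 90 %)
    predictions = []
    for actual in series[split:]:
        predictions.append(predict_next(history))
        history.append(actual)              # database update before the next step
    return predictions, list(series[split:])

series = [float(i) for i in range(20)]      # made-up series of 20 points
naive = lambda history: history[-1]         # stand-in forecaster: repeat the last value
preds, actuals = walk_forward(series, naive)
```

With 20 points the split leaves 18 for training and 2 for testing, and each test prediction uses all data available up to that moment.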
Fig. 3 shows the distribution of the predicted and measured values of the water quality index, in which (a) is the single model and (b) is the hybrid model. As the figure shows, by combining the two prediction methods, the R² of the straight line fitted to the predicted and measured values increases from 0.93 to 0.99, and the average relative error on the test set decreases from 5.1% to 1.2%. The accuracy of the hybrid model is thus clearly higher than that of the single model.
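The two reported metrics can be computed as sketched below, assuming that R² refers to the least-squares line through the (measured, predicted) points and that the average relative error is the mean of |predicted − measured| / |measured|. The numbers are made up and are not the data behind Fig. 3.

```python
def fit_r_squared(measured, predicted):
    """R^2 of the least-squares line through the (measured, predicted) points."""
    n = len(measured)
    mx, my = sum(measured) / n, sum(predicted) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(measured, predicted))
    sxx = sum((x - mx) ** 2 for x in measured)
    syy = sum((y - my) ** 2 for y in predicted)
    return sxy * sxy / (sxx * syy)          # squared Pearson correlation

def mean_relative_error(measured, predicted):
    """Average of |predicted - measured| / |measured| over the test set."""
    return sum(abs(p - m) / abs(m) for m, p in zip(measured, predicted)) / len(measured)

measured = [2.0, 4.0, 5.0, 8.0]
predicted = [2.1, 3.8, 5.0, 8.4]
r2 = fit_r_squared(measured, predicted)
mre = mean_relative_error(measured, predicted)
```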
The above-mentioned embodiments are intended to illustrate the technical solutions and advantages of the present invention, and it should be understood that the above-mentioned embodiments are only the most preferred embodiments of the present invention, and are not intended to limit the present invention, and any modifications, additions, equivalents, etc. made within the scope of the principles of the present invention should be included in the scope of the present invention.

Claims (10)

1. A meteorological factor-based RF-HW water quality index hybrid prediction method is characterized by comprising the following steps:
acquiring meteorological factor data and water quality index data, and after filtering abnormal values of the acquired data, constructing a sample database containing meteorological factor time sequence data and water quality index time sequence data;
constructing a water quality index prediction model comprising a Holt-Winters model, a random forest model and a random forest correction model, wherein the input and the output of the Holt-Winters model are both water quality index time series data; the random forest model takes meteorological factor time series data and water quality index lag time series data as input and outputs a predicted value of the water quality index time series data; the random forest correction model is implemented in the same way as the random forest model, takes as input the predicted values of the water quality index time series data output by the Holt-Winters model and the random forest model, and outputs a hybrid predicted value of the water quality index time series data; the data smoothing factor, trend smoothing factor and seasonal smoothing factor of the Holt-Winters model, and the number of trees, maximum feature number and maximum tree depth of the random forest model and the random forest correction model, are optimized using data in the sample database, thereby optimizing the parameters of the water quality index prediction model;
when the method is applied, the meteorological factor time series data and the water quality index time series data are input into the water quality index prediction model, and the hybrid predicted value of the water quality index time series data at the future moment is obtained by calculation.
2. The meteorological-factor-based RF-HW water quality index hybrid prediction method according to claim 1, wherein the collected data is outlier filtered using a moving window based quartile elimination method, wherein the moving window based quartile elimination method is calculated by the following rules:
$$L_u = Q_{0.75} + a \cdot h$$
$$L_d = Q_{0.25} - a \cdot h$$
$$h = Q_{0.75} - Q_{0.25}$$
in the formula, $L_u$ and $L_d$ are the upper and lower rejection bounds of the window-length subsample; $Q_{0.75}$ and $Q_{0.25}$ are the upper quartile and the lower quartile of the subsample, respectively; $h$ is the interquartile range, reflecting the dispersion of the middle 50% of the data; and $a$ is a constant determined according to the data distribution and the required accuracy, with a value range of 0.5-2.5.
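The rule can be sketched in Python as follows. The centred window, the window length of 5 and the inclusive quantile method are assumptions made for illustration only; the claim fixes only the bounds and the range of the constant a.

```python
from statistics import quantiles

def quartile_filter(values, window=5, a=1.5):
    """Mark points inside [Q0.25 - a*h, Q0.75 + a*h] of their moving window as kept."""
    half = window // 2
    keep = []
    for i, v in enumerate(values):
        lo, hi = max(0, i - half), min(len(values), i + half + 1)
        q1, _, q3 = quantiles(values[lo:hi], n=4, method="inclusive")
        h = q3 - q1                          # interquartile range of the subsample
        keep.append(q1 - a * h <= v <= q3 + a * h)
    return keep

data = [7.1, 7.0, 7.2, 15.0, 7.1, 6.9, 7.0]  # made-up readings with one spike
mask = quartile_filter(data, window=5, a=1.5)
cleaned = [v for v, ok in zip(data, mask) if ok]
```

A smaller a tightens the bounds and rejects more points, which is presumably why different value ranges of a are assigned to different variables.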
3. The meteorological-factor-based RF-HW water quality index hybrid prediction method according to claim 2, wherein the meteorological factor data comprise air temperature factor data, rainfall factor data, wind speed factor data and wind direction factor data, and the water quality index data comprise dissolved oxygen data, permanganate index data, ammonia nitrogen data and turbidity data;
when the moving window-based quartile elimination method is applied to the air temperature factor data, the value range of the constant a is 1-1.5;
when the moving window-based quartile elimination method is applied to the rainfall factor data, the value range of the constant a is 0.5-1;
when the moving window-based quartile elimination method is applied to the wind speed factor data, the value range of the constant a is 1-2;
when the moving window-based quartile elimination method is applied to the wind direction factor data, the value range of the constant a is 1.5-2.5;
when the moving window-based quartile elimination method is applied to the dissolved oxygen data, the value range of the constant a is 1-1.5;
when the moving window-based quartile elimination method is applied to the permanganate index data, the value range of the constant a is 1-1.5;
when the moving window-based quartile elimination method is applied to the ammonia nitrogen data, the value range of the constant a is 0.5-1;
when the moving window-based quartile elimination method is applied to the turbidity data, the value range of the constant a is 1.5-2.5.
4. The meteorological-factor-based RF-HW water quality index hybrid prediction method according to claim 2, wherein an additive Holt-Winters model is adopted when constructing the Holt-Winters model, and the predicted value of the water quality index time series of the additive Holt-Winters model at the future time h,
Figure FDA0003615562770000032
is given by:
$$l_t = \alpha (y_t - s_{t-m}) + (1-\alpha)(l_{t-1} + b_{t-1})$$
$$b_t = \beta (l_t - l_{t-1}) + (1-\beta) b_{t-1}$$
$$s_t = \gamma (y_t - l_{t-1} - b_{t-1}) + (1-\gamma) s_{t-m}$$
Figure FDA0003615562770000031
wherein $l_t$, $b_t$ and $s_t$ are the three components of the water quality index time series data $y_t$ at time t, namely the level term, the trend term and the seasonal term; $m$ is the period length; $s_{t-m}$ is the seasonal term corresponding to time t in the previous cycle; $s_{t+h-m}$ is the seasonal term corresponding to the predicted time in the previous cycle; $\alpha$ is the data smoothing factor, $0<\alpha<1$; $\beta$ is the trend smoothing factor, $0<\beta<1$; and $\gamma$ is the seasonal smoothing factor, $0<\gamma<1$.
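The three recursions can be run as in the sketch below. The forecast line is not legible in this text (it is one of the formula images), so the sketch assumes the usual textbook form for the additive model, level plus h times trend plus the matching seasonal term of the previous cycle. Initial values are passed in by the caller, and the function name is hypothetical.

```python
def holt_winters_additive(y, m, alpha, beta, gamma, level, trend, seasonal, horizon=1):
    """Run the additive recursion over y and forecast `horizon` steps ahead.

    `seasonal` holds the m most recent seasonal terms, oldest first.
    """
    seasonal = list(seasonal)
    for obs in y:
        s_prev = seasonal[0]                                  # s_{t-m}
        new_level = alpha * (obs - s_prev) + (1 - alpha) * (level + trend)
        new_trend = beta * (new_level - level) + (1 - beta) * trend
        new_season = gamma * (obs - level - trend) + (1 - gamma) * s_prev
        level, trend = new_level, new_trend
        seasonal = seasonal[1:] + [new_season]                # drop s_{t-m}, append s_t
    # Assumed forecast form: l_t + h * b_t + s_{t+h-m} (seasonal index wraps for h > m).
    return level + horizon * trend + seasonal[(horizon - 1) % m]

# Made-up series: constant level 10 with a period-2 seasonal swing of +1 / -1.
y = [11.0, 9.0, 11.0, 9.0]
f1 = holt_winters_additive(y, 2, 0.5, 0.5, 0.5, 10.0, 0.0, [1.0, -1.0], horizon=1)
f2 = holt_winters_additive(y, 2, 0.5, 0.5, 0.5, 10.0, 0.0, [1.0, -1.0], horizon=2)
```

Because the toy series matches the initial level and seasonal terms exactly, the recursion leaves them unchanged and the forecasts continue the pattern.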
5. The meteorological-factor-based RF-HW water quality index hybrid prediction method according to claim 1 or 4, wherein missing values in the water quality index time series data are filled in by linear interpolation before the data are input into the Holt-Winters model.
6. The meteorological-factor-based RF-HW water quality index hybrid prediction method according to claim 1, wherein the random forest model is realized by the following steps:
(1) according to the preset number K of decision trees, K different sub-sample data sets are randomly drawn from the sample database by the bootstrap method, one as the sub-training set of each decision tree; each sub-sample has the same size as the original sample database, samples are drawn with replacement, and the data not drawn in each round form the out-of-bag data;
(2) a decision tree is built for each of the K sub-sample data sets; for each tree, process a is repeated so that branches grow recursively until a stopping condition is reached, at which point the tree stops growing, and the predicted value of the water quality index time series data of each tree is determined by the final leaf node;
process a: the input meteorological factors and water quality indexes of all categories are taken as feature variables; at least one feature variable is randomly selected from all feature variables corresponding to the current node; from the selected feature variables, one feature variable and a corresponding value are chosen as the optimal segmentation variable and segmentation value; and the samples at the current node are split into left and right child nodes according to the segmentation variable and the segmentation value;
(3) all decision trees are combined into a random forest, and the average of the predicted values of the K decision trees is taken as the final predicted value of the water quality index time series data.
7. The meteorological-factor-based RF-HW water quality index hybrid prediction method according to claim 6, wherein the quality of the segmentation variables and the segmentation values is measured by the weighted sum of the purities of the left and right sub-nodes after segmentation:
Figure FDA0003615562770000041
wherein $x_i$ denotes the i-th segmentation variable; $v_{ij}$ denotes the j-th segmentation value of the segmentation variable $x_i$; $N_s$, $n_{left}$ and $n_{right}$ denote the numbers of samples at the current node and at the left and right child nodes after segmentation, respectively; $X_{left}$ and $X_{right}$ denote the sample sets of the left and right child nodes, respectively; and $H(\cdot)$ is the function measuring node purity, for which the mean square error between the predicted value and the true value of the water quality index time series data is adopted.
8. The meteorological-factor-based RF-HW water quality index hybrid prediction method according to claim 6, wherein when the current node is split into left and right child nodes, a sample whose variable value is less than or equal to the segmentation value of the current node goes to the left child node of the current node, and a sample whose variable value is greater than the segmentation value of the current node goes to the right child node of the current node.
9. The meteorological-factor-based RF-HW water quality index hybrid prediction method according to any one of claims 6 to 7, wherein the data processed by the random forest correction model are the predicted values of the water quality index time series data output by the Holt-Winters model and the random forest model; that is, these predicted values are used as feature variables for splitting into left and right child nodes, the decision trees are grown, and the hybrid predicted value of the water quality index time series data is output.
10. The meteorological-factor-based RF-HW water quality index hybrid prediction method according to claim 1, wherein, in application, the meteorological factor time series data in the test sample are input into the parameter-optimized Holt-Winters model, and a predicted value of the water quality index time series data at the future moment is calculated; at the same time, the meteorological factor time series data and the water quality index lag time series data in the test sample are input into the parameter-optimized random forest model, and a predicted value of the water quality index time series data at the future moment is calculated; the predicted values output by the Holt-Winters model and the random forest model are then input into the parameter-optimized random forest correction model, the hybrid predicted value of the water quality index time series data at the future moment is calculated, and the hybrid predicted value is added to the sample database.
CN202110204105.9A 2021-02-23 2021-02-23 Meteorological factor-based RF-HW water quality index hybrid prediction method Active CN112819244B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110204105.9A CN112819244B (en) 2021-02-23 2021-02-23 Meteorological factor-based RF-HW water quality index hybrid prediction method

Publications (2)

Publication Number Publication Date
CN112819244A CN112819244A (en) 2021-05-18
CN112819244B true CN112819244B (en) 2022-06-21

Family

ID=75865163

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110204105.9A Active CN112819244B (en) 2021-02-23 2021-02-23 Meteorological factor-based RF-HW water quality index hybrid prediction method

Country Status (1)

Country Link
CN (1) CN112819244B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114297454B (en) * 2021-12-30 2023-01-03 医渡云(北京)技术有限公司 Method and device for discretizing features, electronic equipment and computer readable medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2014157753A1 (en) * 2013-03-28 2014-10-02 부산대학교 산학협력단 System and method for providing water quality information capable of diagnosing and predicting state of water quality of water system
CN106991437A (en) * 2017-03-20 2017-07-28 浙江工商大学 The method and system of sewage quality data are predicted based on random forest
CN109242203A (en) * 2018-09-30 2019-01-18 中冶华天南京工程技术有限公司 A kind of water quality prediction of river and water quality impact factors assessment method
CN109472321A (en) * 2018-12-03 2019-03-15 北京工业大学 A kind of prediction towards time series type surface water quality big data and assessment models construction method
CN110308705A (en) * 2019-06-19 2019-10-08 上海华高汇元工程服务有限公司 A kind of apparatus control method based on big data and artificial intelligence water quality prediction

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11138090B2 (en) * 2018-10-23 2021-10-05 Oracle International Corporation Systems and methods for forecasting time series with variable seasonality

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
基于随机森林与支持向量机的水库长期径流预报;李伶杰等;《水利水运工程学报》;20201103(第04期);全文 *

Also Published As

Publication number Publication date
CN112819244A (en) 2021-05-18

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant