CN117494862B - Data-driven runoff forecasting model optimization method under limited sample based on hypothesis test - Google Patents
- Publication number
- CN117494862B (application CN202311000418.8A)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/04—Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/18—Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/04—Inference or reasoning models
- G06N5/041—Abduction
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q50/00—Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
- G06Q50/10—Services
- G06Q50/26—Government or public services
Abstract
The invention provides a data-driven runoff forecasting model optimization method under limited samples based on hypothesis testing, comprising the following steps: establishing a mapping relation between runoff and forecasting factors, selecting several data-driven models to describe the mapping relation, and building a candidate model set; for the selected model hypothesis, calibrating the model parameters to obtain a population estimate of the true model; sampling from the population model, dividing each sample set into a training set and a test set, and calibrating and testing the model under each division scheme to obtain the sampling distribution of the prediction accuracy; determining the optimal sample set division; calculating the test-set prediction accuracy and the observed value of the test statistic; deciding whether to reject the null hypothesis according to whether the observed value of the statistic falls in the rejection region; and, if several candidate models pass the hypothesis test, selecting the preferred model by the minimum-rejection-rate criterion. The method can verify the reliability of a forecasting model's inference about the unknown population and provides a decision basis for selecting the forecasting model.
Description
Technical Field
The invention relates to the technical field of runoff forecasting, and in particular to a data-driven runoff forecasting model optimization method under limited samples based on hypothesis testing.
Background
Medium- and long-term runoff forecasting is a classical and difficult problem in hydrological forecasting. Reliable runoff forecasts are an important decision basis for the scientific planning and use of water resources and for the safe and economical operation of hydraulic engineering works. Selecting a high-accuracy forecasting model, and thereby providing reliable runoff forecast information to management decision makers, has long been one of the key tasks of hydrological forecasting research.
Annual runoff forecasting with a long lead time often uses statistics-based data-driven models, which infer and predict unknown samples by mining the regularities contained in sample data. In essence, such methods choose a suitable mathematical model to fit the regularities of a finite sample and then predict unknown samples with the fitted model, aiming to reduce the prediction error caused by random uncertainty as far as possible. Because the sample population is unknown, traditional modeling usually picks several candidate models on the basis of experience, calibrates their parameters on a training set, and then selects by comparison the model with the highest prediction accuracy on a test set for operational runoff forecasting.
However, according to the theory of statistical inference, the measured runoff sample is one set of observations drawn from the population with some probability and will not necessarily recur in the future; the shorter the measured series, the more likely it is to deviate from the population distribution. Annual runoff records are relatively short, usually only several decades, so when model calibration and accuracy testing rest on such a limited measured sample, the resulting prediction accuracy index is itself a random outcome. Consequently, for a model selected by test-set prediction accuracy alone, as in the traditional method, the uncertainty of its predictions for the unknown population is difficult to verify, and no effective verification method is yet available. Hypothesis testing theory in statistics provides a theoretical basis for inference from limited samples, but how to combine it with the key steps of data-driven predictive modeling to build a hypothesis test for runoff forecasting models that focuses on prediction accuracy for unknown samples remains an open technical problem.
Disclosure of Invention
To address the above problems, the invention provides a data-driven runoff forecasting model optimization method under limited samples based on hypothesis testing, which verifies the reliability of a forecasting model's inference about the unknown population and provides a decision basis for selecting the forecasting model.
The invention is realized by the following technical scheme: a data-driven runoff forecasting model optimization method under limited samples based on hypothesis testing, comprising the following steps:
Step 1: screening forecasting factors according to historical runoff data and meteorological data, establishing a mapping relation between runoff and the forecasting factors, selecting several data-driven models to describe the mapping relation, and building a candidate model set;
Step 2: selecting one model from the candidate model set of step 1 as the model hypothesis, and defining a null hypothesis $H_0$ and an alternative hypothesis $H_1$ for testing whether the forecasting model is true, wherein:
$H_0$: the model hypothesis $y=f(X;\omega^*)+\varepsilon$, $\varepsilon\sim g(\theta^*)$ is true, i.e. the forecasting model calibrated on the training set is the true model, and its prediction accuracy $\sigma(n,m)$ on the test set $D_{L,m}=(X_m,y_m)$ equals the true model accuracy $\sigma^*$;
$H_1$: the forecasting model is not true, i.e. $\sigma(n,m)$ is not equal to $\sigma^*$;
If the forecasting model is true, $\sigma(n,m)$ lies close to $\sigma^*$ within a certain allowable deviation, and the following statistic is constructed:
$$Z=\sigma(n,m)-\sigma^* \qquad (2)$$
When $\sigma(n,m)$ makes the absolute value $|Z_0|$ of the observed value of $Z$ too large, hypothesis $H_0$ is rejected; otherwise $H_0$ is accepted;
Step 3: for the model hypothesis selected in step 2, calibrating the model parameters on the measured sample of size $L$ to obtain a population estimate of the true model, the population model being denoted $y=f(X;\omega^*)+\varepsilon$, $\varepsilon\sim g(\theta^*)$ and the estimate of the true model accuracy being denoted $\sigma^*$;
Step 4: sampling from the population model $y=f(X;\omega^*)+\varepsilon$, $\varepsilon\sim g(\theta^*)$ by the Monte Carlo method, the number of samplings being $C$ and each sampling yielding a sample set $D_L$ of size $L$; for each sampling, dividing $D_L$ into a training set $D_{L,n}=(X_n,y_n)$ and a test set $D_{L,m}=(X_m,y_m)$ of sizes $n$ and $m$ respectively, the division scheme being denoted $L(n,m)$, where the training set $D_{L,n}$ is used to calibrate the model parameters and the test set $D_{L,m}$ is used to test the model prediction accuracy; and calibrating and testing the model on the $C$ samplings under each division scheme $L(n,m)$ to obtain the sampling distribution of the prediction accuracy;
Step 5: determining the optimal sample set division $L(n_r,m_r)$ on the principle of minimum average distance, so that the sampling distribution of the statistic $Z$ is as close as possible to that under a true $H_0$; the smaller the average distance between the sampling distribution of the prediction accuracy and the true model accuracy, the closer the sampling distribution of $Z$ is to that under a true $H_0$, i.e. the sampling distribution of the prediction accuracy $\sigma(n,m)$ under the optimal division $L(n_r,m_r)$ is close to the true model accuracy $\sigma^*$, and the more probable it is that the null hypothesis $H_0$ is true;
Step 6: constructing the sampling distribution of the statistic $Z$ from the sample set of prediction accuracies $\sigma(n,m)$ under the optimal division $L(n_r,m_r)$ of step 5, and determining the rejection region from this sampling distribution and a given significance level $\alpha$;
Step 7: building a forecasting model on the measured sample with the division $L(n_r,m_r)$, and calculating the test-set prediction accuracy and the observed value $Z_0$ of the statistic $Z$;
Step 8: deciding whether to reject the null hypothesis $H_0$ according to whether the observed value $Z_0$ falls in the rejection region of the sampling distribution of $Z$; if $H_0$ is rejected, the current model fails the hypothesis test and is not suitable as the preferred model; if several candidate models pass the hypothesis test, the p-value method is further used to give the smallest significance level at which $H_0$ can be rejected, the p-value being calculated as:
$$p=P\{|Z|\ge |Z_0|\} \qquad (6)$$
The model with the smallest rejection rate $1-p$ for the null hypothesis $H_0$ is regarded as closest to the true model, and the preferred model is selected by the minimum-rejection-rate criterion.
Further, the data-driven model in step 1 consists of a model structure, parameters, input variables and output variables, and is expressed as:
$$y=f(X;\omega)+\varepsilon,\quad \varepsilon\sim g(\theta)$$
where the input variable $X$ is the set of forecasting factors related to runoff; the output variable $y$ is runoff; $f(X;\omega)$ is a model form with given structural characteristics; $\omega$ denotes the model parameters, obtained by training on historical samples; and $\varepsilon$ is a random term with distribution function $g(\theta)$.
Further, in step 3, the index of true model accuracy $\sigma^*$ includes the root mean square error, the mean absolute error and the correlation coefficient.
Further, the objective function solved for the model in step 3 consists of a fitting error function and a penalty function, with the general expression:
$$\min_{\omega}\ \frac{1}{n}\sum_{i=1}^{n}L\big(y_i,f(X_i;\omega)\big)+\lambda\Omega(\omega)$$
where $L(y_i,f(X_i;\omega))$ is the fitting error of the model, which drives the model to fit the training samples as closely as possible; $n$ is the training sample size; and $\lambda\Omega(\omega)$ is a penalty function that restrains overfitting so as to improve the generalization ability of the model on unknown samples.
Further, step 5 specifically comprises:
Step 5.1: taking the average deviation $d(n,m,\sigma^*)$ as the criterion for sample set division; the number of samplings for the statistical test of the forecasting model under division $L(n,m)$ is $C$, and the set of test-set prediction accuracies obtained from the $C$ samplings is denoted $\sigma_m$, whose $c$-th element is $\sigma_m(c)$; the root-mean-square distance is used to quantify the average deviation of $\sigma_m(c)$, $c=1,\dots,C$, from the true accuracy $\sigma^*$, defining the average deviation index:
$$d(n,m,\sigma^*)=\sqrt{\frac{1}{C}\sum_{c=1}^{C}\big(\sigma_m(c)-\sigma^*\big)^2}$$
Step 5.2: the division scheme at which $d(n,m,\sigma^*)$ attains its minimum among the candidate schemes $L(n,m)$ is taken as the sample set division and is called the "optimal sample set capacity" $L(n_r,m_r)$; under it the distribution of the prediction accuracy $\sigma(n,m)$ deviates as little as possible from the true $\sigma^*$, i.e. the distribution of the statistic $Z$ is sufficiently close to that under $H_0$.
The invention has the following advantages.
(1) In step 2, the invention proposes a hypothesis testing method for runoff forecasting models under limited samples, based on the principle from hypothesis testing theory that a small-probability event is practically impossible in a single trial; in steps 4-5, the Monte Carlo method is used to simulate the distribution of the prediction accuracy and to obtain the sampling distribution with the least risk of deviating from the true model; in steps 7-8, a model selection criterion based on the minimum rejection rate is proposed, which ensures with maximum probability that the preferred model's inference about the unknown population is reliable. As the results of FIG. 6 in the embodiment show, the credibility of the mean model selected by the hypothesis testing method is 0.91, which is 0.89 and 0.61 higher than the credibility of the models selected by the traditional methods (2:1 and 4:1 splits, respectively), verifying the advantage of the proposed method.
(2) The model selection method can be extended to prediction problems in other fields, providing a reliability-based selection method for building data-driven prediction models under limited samples.
Drawings
FIG. 1 is a schematic diagram of the probability distribution of model prediction accuracy and the two-sided rejection region in an embodiment of the invention;
FIG. 2 is a flowchart of the data-driven runoff forecasting model optimization method under limited samples based on hypothesis testing according to an embodiment of the invention;
FIG. 3 shows the annual runoff series and its first-order difference series in the embodiment of the invention;
FIG. 4 shows how the 95% confidence interval and the average deviation index of the prediction accuracy of the 6 candidate models (the mean model and others) vary with the training set size $n$ in the embodiment of the invention;
FIG. 5 shows the 95% confidence intervals of the sampling distributions of the statistic $Z$ for the 6 candidate models and the positions of $Z_0$;
FIG. 6 compares the forecasting models selected by the hypothesis testing method of the invention and by the traditional selection methods.
Detailed Description
To make the objects, technical solutions and advantages of the embodiments of the invention clearer, the technical solutions are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, embodiments of the invention. All other embodiments obtained by those skilled in the art on the basis of these embodiments without inventive effort fall within the scope of the invention.
As shown in FIG. 2, the data-driven runoff forecasting model optimization method under limited samples based on hypothesis testing of this embodiment comprises the following steps:
Step 1: Screen forecasting factors according to historical runoff data and meteorological data, select several data-driven models to establish the mapping relation between runoff and the forecasting factors, and build a candidate model set.
The data-driven model consists of a model structure, parameters, input variables and output variables, and is expressed as:
$$y=f(X;\omega)+\varepsilon,\quad \varepsilon\sim g(\theta) \qquad (1)$$
where the input variable $X$ is the set of forecasting factors correlated with runoff, such as autocorrelation factors or external causal factors; the output variable $y$ is runoff; $f(X;\omega)$ is a model form with given structural characteristics; $\omega$ denotes the model parameters, obtained by training on historical samples; and $\varepsilon$ is a random term with distribution function $g(\theta)$.
The embodiment uses annual runoff data from 1956 to 2010, a sample of size 55. The runoff series and its first-order difference series are shown in FIG. 3. Linear and nonlinear autocorrelation analyses are carried out on the 55-year measured runoff series and the difference series to screen forecasting factors and build autocorrelation models. The linear autocorrelation of the measured runoff series is not significant, so a mean model can be used. The autocorrelation of the first-order difference series is significant at orders p = 1 to 2, so the first-order differenced autoregressive models ARI(1,1) and ARI(2,1) can be used. The nonlinear autocorrelation of the measured runoff series is quantified by mutual information, and the results show a certain nonlinear autocorrelation at orders p = 1 to 3, so the support vector regression models SVR(1), SVR(2) and SVR(3) can be used. The candidate model set is therefore: the mean model, ARI(1,1), ARI(2,1), SVR(1), SVR(2) and SVR(3).
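The linear part of the autocorrelation screening described above can be sketched as follows. This is a minimal illustration only: the runoff values are synthetic (a hypothetical mean of 3000 and standard deviation of 700), the function names are assumptions, and the mutual-information analysis used for the nonlinear case is not covered.

```python
import random

def autocorr(series, lag):
    """Sample autocorrelation coefficient of `series` at the given lag."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    cov = sum((series[t] - mean) * (series[t + lag] - mean)
              for t in range(n - lag))
    return cov / var

def first_difference(series):
    """First-order difference series: y[t+1] - y[t]."""
    return [b - a for a, b in zip(series, series[1:])]

# Synthetic stand-in for a 55-year annual runoff record (hypothetical values).
rng = random.Random(0)
runoff = [3000 + rng.gauss(0, 700) for _ in range(55)]
diff = first_difference(runoff)

# Compare the autocorrelation of the raw and differenced series at orders 1 to 3.
for p in (1, 2, 3):
    print(p, round(autocorr(runoff, p), 3), round(autocorr(diff, p), 3))
```

Orders at which the coefficient of the differenced series is clearly non-zero would be candidates for the autoregressive order of an ARI(p, 1) model.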
Step 2: From the candidate model set of step 1, select each model in turn as the model hypothesis.
The known sample set $D_L=\{(X,y)\mid X\in R^{L\times d},\ y\in R^{L}\}$ of size $L$ is a sample from the population $D=(X,y)$. Because the actual population is unknown, the known sample $D_L$ is used to infer the population $D=(X,y)$ and to build the model $y=f(X;\omega^*)+\varepsilon$, $\varepsilon\sim g(\theta^*)$. The known sample set $D_L$ is divided into a training set $D_{L,n}=(X_n,y_n)$ and a test set $D_{L,m}=(X_m,y_m)$ of sizes $n$ and $m$ respectively, and the division scheme is denoted $L(n,m)$. The training set $D_{L,n}$ is used to calibrate the model parameters and the test set $D_{L,m}$ to verify the model prediction accuracy. The empirical model calibrated on the training set $D_{L,n}$ is denoted $\hat{y}=f(X;\hat{\omega}(D_{L,n}))$, where $\hat{\omega}(D_{L,n})$ is an estimate of the parameter $\omega^*$.
To verify whether the forecasting model is true, the null hypothesis $H_0$ and the alternative hypothesis $H_1$ are defined.
$H_0$: the model hypothesis is true, i.e. the forecasting model calibrated on the training set is the true model, and its prediction accuracy $\sigma(n,m)$ on the test set equals the true model accuracy $\sigma^*$.
$H_1$: the forecasting model is not true, i.e. its prediction accuracy $\sigma(n,m)$ on the test set is not equal to the true model accuracy $\sigma^*$. The statistic is constructed as:
$$Z=\sigma(n,m)-\sigma^* \qquad (2)$$
When $\sigma(n,m)$ makes the absolute value $|Z_0|$ of the observed value of $Z$ too large, $H_0$ is rejected; otherwise $H_0$ is accepted.
The sampling distribution of $Z$, its rejection region, and the observed value $Z_0$ are shown schematically in FIG. 1.
Step 3: For the model hypothesis selected in step 2, calibrate the model parameters on the measured sample of size $L=55$ to obtain a population estimate of the true model. The model is denoted $y=f(X;\omega^*)+\varepsilon$, $\varepsilon\sim g(\theta^*)$, and the estimate of the true model accuracy is denoted $\sigma^*$.
The objective function solved for the model in step 3 generally consists of a fitting error function and a penalty function, with the general expression:
$$\min_{\omega}\ \frac{1}{n}\sum_{i=1}^{n}L\big(y_i,f(X_i;\omega)\big)+\lambda\Omega(\omega) \qquad (3)$$
where $L(y_i,f(X_i;\omega))$ is the fitting error of the model, which drives the model to fit the training samples as closely as possible; $n$ is the training sample size; and $\lambda\Omega(\omega)$ is a penalty function that restrains overfitting so as to improve the generalization ability of the model on unknown samples. Traditional linear regression models usually take the minimum mean squared error (MSE) as the objective function and are solved by least squares; machine learning models such as support vector regression (SVR) usually have a parameter optimization objective, also called a loss function, that contains both an error function and a penalty function and is solved with an optimization algorithm such as gradient descent.
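To make the structure of a fitting error plus a penalty concrete, the sketch below evaluates and minimizes such an objective for a deliberately simple case. It assumes a one-parameter linear model f(X; ω) = ωX, a squared-error loss averaged over the n training samples, and the penalty Ω(ω) = ω²; these choices and the function names are illustrative assumptions, not the models used in the embodiment.

```python
def objective(omega, xs, ys, lam):
    """Mean squared fitting error plus the penalty lam * omega**2."""
    n = len(xs)
    fit_error = sum((y - omega * x) ** 2 for x, y in zip(xs, ys)) / n
    return fit_error + lam * omega ** 2

def ridge_slope(xs, ys, lam):
    """Closed-form minimizer of `objective` for the one-parameter model.

    Setting the derivative to zero gives
    omega = sum(x*y) / (sum(x*x) + n*lam).
    """
    n = len(xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    return sxy / (sxx + n * lam)

xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]
for lam in (0.0, 0.5, 1.0):
    w = ridge_slope(xs, ys, lam)
    print(lam, round(w, 4), round(objective(w, xs, ys, lam), 4))
```

With `lam = 0` the result is the ordinary least-squares slope; a larger `lam` shrinks the parameter toward zero, trading fitting error for a simpler model.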
In step 3, the index of true model accuracy $\sigma^*$ can be chosen from different measures such as the root mean square error, the mean absolute error and the correlation coefficient. Taking the root mean square error (RMSE) as an example, the test-set prediction accuracy is calculated as:
$$\sigma=\sqrt{\frac{1}{m}\sum_{i=1}^{m}\big(y_i-\hat{y}_i\big)^2} \qquad (4)$$
where $\sigma$ is the root mean square error, i.e. the square root of the mean squared error (MSE); $m$ is the test sample size; $y_i$ is the measured runoff; and $\hat{y}_i$ is the runoff predicted by the empirical model.
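The root mean square error used as the accuracy index can be computed as in the short sketch below; the function name and the sample values are hypothetical.

```python
import math

def rmse(observed, predicted):
    """Root mean square error over the m test samples."""
    m = len(observed)
    squared = sum((y - y_hat) ** 2 for y, y_hat in zip(observed, predicted))
    return math.sqrt(squared / m)

# Hypothetical measured and predicted runoff values for a 4-sample test set.
measured = [3100.0, 2800.0, 3500.0, 2950.0]
forecast = [3000.0, 3000.0, 3000.0, 3000.0]
print(rmse(measured, forecast))
```

A smaller value indicates a more accurate forecast on the test set, in the same units as the runoff data.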
Each model is fitted by estimating its parameters according to the objective function of formula (3) and the corresponding solution method. The Kolmogorov-Smirnov (K-S) test applied to the residual series $\varepsilon$ indicates that the residuals of all models follow a normal distribution. The residual root-mean-square-error estimates $\sigma^*$ of the 6 candidate models (the mean model, ARI(1,1), ARI(2,1), SVR(1), SVR(2) and SVR(3)) are 698, 804, 775, 679, 599 and 534, respectively.
Step 4: Taking each of the 6 fitted models (the mean model, ARI(1,1), ARI(2,1), SVR(1), SVR(2) and SVR(3)) as the estimated population, draw $C=500$ random samples of size $L=55$ from the population by the Monte Carlo method; each draw is a sample set $D_L$. Each sample set $D_L$ is divided in time order into a training set and a test set according to the division scheme $L(n,m)$.
46 division schemes $L(n,m)$ are set: $n$ ranges from 5 to 50 and the corresponding $m$ from 50 to 5. For each scheme, the forecasting model is trained and tested on the $C=500$ samples, giving in turn the test-set prediction accuracy sample set $\sigma_m$ for each of the 46 schemes.
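The sampling loop of step 4 can be sketched for the simplest candidate, the mean model. The sketch assumes the population is y = μ + ε with normally distributed ε; the mean μ = 3000 is a hypothetical value, σ* = 698 is the mean-model estimate quoted above, and reusing each drawn sample across all 46 division schemes is an assumption made here for brevity.

```python
import math
import random

def simulate_accuracy(mu, sigma_star, L=55, C=500,
                      n_values=range(5, 51), seed=0):
    """Return {n: [test RMSE for each of the C Monte Carlo samples]}."""
    rng = random.Random(seed)
    accuracy = {n: [] for n in n_values}
    for _ in range(C):
        # One Monte Carlo draw of a sample set D_L from the population model.
        sample = [rng.gauss(mu, sigma_star) for _ in range(L)]
        for n in n_values:
            # Split in time order: first n values train, remaining m = L - n test.
            train, test = sample[:n], sample[n:]
            fitted_mean = sum(train) / n  # calibration of the mean model
            mse = sum((y - fitted_mean) ** 2 for y in test) / len(test)
            accuracy[n].append(math.sqrt(mse))
    return accuracy

acc = simulate_accuracy(mu=3000.0, sigma_star=698.0)
for n in (5, 15, 50):
    values = acc[n]
    print(n, 55 - n, round(sum(values) / len(values), 1))
```

Each list in the returned dictionary plays the role of the prediction accuracy sample set σ_m for one division scheme L(n, m).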
Step 5: The smaller the average distance between the sampling distribution of the prediction accuracy and the true model accuracy, the closer the sampling distribution of the statistic $Z$ is to that under a true $H_0$; that is, the sampling distribution of $\sigma(n,m)$ is close to the true model accuracy $\sigma^*$ and the null hypothesis $H_0$ is more likely to be true. The optimal sample set division $L(n_r,m_r)$ is therefore determined on the principle of minimum average distance, so that the sampling distribution of $Z$ approximates that under a true $H_0$ as closely as possible. The specific steps are:
Step 5.1: The average deviation $d(n,m,\sigma^*)$ of formula (5) is used as the criterion for sample set division. The number of samplings for the statistical test of the forecasting model under division $L(n,m)$ is $C$, and the set of test-set prediction accuracies (root mean square prediction errors) obtained from the $C$ samplings is denoted $\sigma_m$, whose $c$-th element is $\sigma_m(c)$. The root-mean-square distance is used to quantify the average deviation of $\sigma_m(c)$, $c=1,\dots,C$, from the true accuracy $\sigma^*$, defining the average deviation index:
$$d(n,m,\sigma^*)=\sqrt{\frac{1}{C}\sum_{c=1}^{C}\big(\sigma_m(c)-\sigma^*\big)^2} \qquad (5)$$
Step 5.2: The division scheme at which $d(n,m,\sigma^*)$ attains its minimum among the candidate schemes $L(n,m)$ is taken as the sample set division and is called the "optimal sample set capacity" $L(n_r,m_r)$. Under it the distribution of the prediction accuracy $\sigma(n,m)$ deviates as little as possible from the true $\sigma^*$, i.e. the distribution of the statistic $Z$ in formula (2) is sufficiently close to that under $H_0$.
From the test-set prediction accuracy sample sets of the 46 schemes $L(n,m)$ in step 4, the average deviation $d(n,m,\sigma^*)$ of the $\sigma_m$ samples from $\sigma^*$ under each scheme is calculated according to formula (5). The 95% confidence interval and the average deviation index of the prediction accuracy of the 6 candidate models (the mean model, ARI(1,1), ARI(2,1), SVR(1), SVR(2) and SVR(3)) are plotted against the training set size $n$ in FIG. 4.
Analysis of the average deviation $d(n,m,\sigma^*)$ of the 6 models in FIG. 4 (a) to (f) shows that, as the training set size $n$ increases (and $m$ decreases), $d(n,m,\sigma^*)$ of each model first decreases and then increases, reaching a minimum at a particular $n$ and $m$. The optimal sample set division of each model can be determined accordingly: the training sample sizes are n=15, 21, 24, 26 and 25, and the test set sizes are 55-n.
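The selection of the optimal division by minimum average deviation can be sketched as below. The accuracy samples are tiny hand-made lists (hypothetical values chosen only to make the arithmetic easy to follow), and the function names are assumptions.

```python
import math

def average_deviation(sigma_samples, sigma_star):
    """Root-mean-square distance of the sampled accuracies from sigma_star."""
    C = len(sigma_samples)
    total = sum((s - sigma_star) ** 2 for s in sigma_samples)
    return math.sqrt(total / C)

def optimal_split(accuracy_by_n, sigma_star, L):
    """Return (n_r, m_r, d) where n_r minimizes the average deviation."""
    d = {n: average_deviation(samples, sigma_star)
         for n, samples in accuracy_by_n.items()}
    n_r = min(d, key=d.get)
    return n_r, L - n_r, d

# Hypothetical accuracy samples for three division schemes (C = 2 each).
accuracy_by_n = {
    10: [650.0, 760.0],
    15: [690.0, 705.0],
    20: [640.0, 700.0],
}
n_r, m_r, d = optimal_split(accuracy_by_n, sigma_star=698.0, L=55)
print(n_r, m_r, {n: round(v, 2) for n, v in d.items()})
```

In practice the dictionary would hold the C = 500 simulated accuracies for each of the 46 schemes rather than two values per scheme.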
Step 6: Construct the sampling distribution of the statistic $Z$ from the sample set of prediction accuracies $\sigma(n,m)$ under the optimal division $L(n_r,m_r)$ of step 5, and determine the rejection region from the sampling distribution and the given significance level $\alpha$.
The statistic $Z$ is constructed according to formula (2) from the test-set prediction accuracies $\sigma_m$ under the optimal division, and the confidence interval at significance level $\alpha=0.05$ is obtained from the sampling distribution of $Z$; for the two-sided test, the lower and upper limits of the interval are the $\alpha/2=2.5\%$ and $1-\alpha/2=97.5\%$ quantiles of the sampling distribution of $Z$.
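A sketch of the two-sided rejection region built from empirical quantiles follows. The linear-interpolation quantile rule and the function names are assumptions, and the accuracy samples are a synthetic, evenly spaced list used only for illustration.

```python
def empirical_quantile(values, q):
    """Quantile of `values` at level q (0..1) with linear interpolation."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] * (1 - frac) + ordered[hi] * frac

def acceptance_interval(sigma_samples, sigma_star, alpha=0.05):
    """Lower and upper limits of Z = sigma - sigma_star for a two-sided test."""
    z = [s - sigma_star for s in sigma_samples]
    return (empirical_quantile(z, alpha / 2),
            empirical_quantile(z, 1 - alpha / 2))

def rejects_h0(z0, interval):
    """True when the observed Z_0 falls outside the acceptance interval."""
    lower, upper = interval
    return z0 < lower or z0 > upper

# Synthetic accuracy samples 0, 1, ..., 100 with sigma_star = 50.
interval = acceptance_interval(list(range(101)), sigma_star=50, alpha=0.05)
print(interval, rejects_h0(60, interval), rejects_h0(0, interval))
```

Values of Z_0 below the lower limit or above the upper limit fall in the rejection region, so the corresponding model hypothesis is rejected at level α.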
Step 7: Following the calibration and testing procedure of step 4, calculate the test-set prediction accuracy of the measured sample under the optimal sample division, and then calculate the observed value $Z_0$ of the statistic $Z$ according to formula (2).
The 95% confidence intervals of the sampling distributions of $Z$ for the 6 candidate models (significance level 0.05) and the positions of $Z_0$ are plotted in FIG. 5.
Step 8: according to the quantiles of Z 0 falling in the Z sampling distribution, the p value is calculated by adopting a formula (6), and the credibility of the assumption of the 6 models is respectively quantized. The confidence p values and reject rates 1-p are shown in table 1.
Table 1. Hypothesis test results of the forecasting models under the optimal sample capacity division
Comparative analysis of the test results of the 6 models in Table 1 shows that the mean model and SVR (1) have the highest credibility (lowest rejection rates), followed by ARI (2, 1); ARI (1, 1) and SVR (2) have lower credibility; and SVR (3) is rejected as an untrustworthy model at the 0.05 significance level. It is therefore recommended to select the mean model and SVR (1) as the optimal models, with optimal sample capacities (n, m) of (15, 40) and (26, 29), respectively.
Further, the forecasting model optimization method proposed above is compared with the existing methods:
The hypothesis test method is the method provided by the invention, in which model optimization is performed based on the minimum rejection rate criterion under the optimal sample set division; the conventional optimization method divides the measured samples into a training set and a test set according to empirical ratios (2:1 and 4:1), and performs model optimization based on the highest prediction accuracy on the test set.
The correspondence between the test accuracy σ and the rejection rate 1-p of the 6 models on the measured samples is given for the hypothesis test method, the conventional optimization method (2:1) and the conventional optimization method (4:1), respectively, as shown in Fig. 6.
In Fig. 6 (a), the optimal model (the mean model) and the suboptimal model SVR (1) selected by the hypothesis test method (minimum rejection rate criterion) not only have higher test accuracy on the measured sample than the other 4 models, but also have the lowest rejection rates of the hypothesis that the model is true, 0.09 and 0.1 respectively. In Fig. 6 (b), the conventional optimization method (2:1) recommends SVR (3), which has the highest accuracy, as the optimal model, but its rejection rate is 0.98, so the hypothesis that the model is true is rejected at the 95% confidence level. In Fig. 6 (c), the conventional optimization method (4:1) recommends ARI (2, 1), which has the highest accuracy, as the optimal model and SVR (3) as the suboptimal model, but their rejection rates reach 0.7 and 0.85, so the credibility of these models is low. It can be seen that selecting models solely on test accuracy, as the conventional optimization methods (2:1 and 4:1) do, may lead to unreasonable results because of the sampling uncertainty of the prediction accuracy statistic.
The credibility of the mean model selected by the hypothesis test method is 0.91, which is 0.89 and 0.61 higher than that of the models selected by the conventional optimization methods (2:1 and 4:1), respectively, thereby verifying the superiority of the proposed hypothesis test optimization method.
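The credibility figures quoted above follow directly from the rejection rates reported for Fig. 6, since credibility is p and the rejection rate is 1-p. The short check below uses only the values stated in the text; the dictionary keys are labels chosen for illustration.

```python
# Rejection rates 1-p of the model recommended by each method (values from the text).
rejection_rate = {
    "hypothesis test: mean model": 0.09,
    "conventional 2:1: SVR (3)": 0.98,
    "conventional 4:1: ARI (2, 1)": 0.70,
}

# Credibility p = 1 - rejection rate, rounded to two decimals as in the text.
credibility = {name: round(1 - r, 2) for name, r in rejection_rate.items()}

base = credibility["hypothesis test: mean model"]
gain_vs_2_to_1 = round(base - credibility["conventional 2:1: SVR (3)"], 2)
gain_vs_4_to_1 = round(base - credibility["conventional 4:1: ARI (2, 1)"], 2)
print(base, gain_vs_2_to_1, gain_vs_4_to_1)
```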
The foregoing is only a specific embodiment of the present invention, but the protection scope of the present invention is not limited thereto. Any changes or substitutions that can readily be conceived by those skilled in the art within the technical scope disclosed by the present invention shall fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.
Claims (4)
1. The data-driven runoff forecasting model optimization method under the limited sample based on hypothesis test is characterized by comprising the following steps:
Step 1: screening forecasting factors according to historical runoff data and meteorological data, establishing a mapping relation between runoff and the forecasting factors, selecting a plurality of data-driven models to describe the mapping relation, and establishing an alternative model set; each data-driven model consists of a model structure, parameters, input variables and output variables, and the model expression is as follows:
y=f(X;ω)+ε,ε~g(θ)
Wherein, the input variable X is a predictor related to runoff; the output variable y is runoff; f(X; ω) is a model form with certain structural characteristics; ω is the model parameter, obtained through training on historical samples; ε is a random term with distribution function g(θ);
Step 2: according to the alternative model set in step 1, sequentially selecting each alternative model as the model hypothesis; a sample set D L of known sample capacity L is a sample from the population D; a model y=f(X; ω*)+ε, ε~g(θ*) is constructed from the known sample D L to infer the population D, where ω* is the model parameter inferred for the population D from the sample D L and the distribution function is then g(θ*); this model serves as the population estimate of the real model, and the estimate of the model real accuracy is σ*; the known sample set D L is divided into a training set D L,n and a test set D L,m with sample capacities n and m respectively, and the sample set division scheme is denoted L(n, m); the training set D L,n is used for model parameter calibration, and the test set D L,m is used for testing the model prediction accuracy; in the forecasting model calibrated with the training set D L,n, the calibrated parameter ω(D L,n) is an estimate of the parameter ω*;
Defining an original (null) hypothesis H0 and an alternative hypothesis H1 for testing the authenticity of the forecasting model; wherein:
H0: the model hypothesis y=f(X; ω*)+ε, ε~g(θ*) is true, and the forecasting model calibrated with the training set D L,n is a real model, i.e., its prediction accuracy σ(n, m) on the test set D L,m is equal to the model real accuracy σ*;
H1: the forecasting model is not true, i.e., the prediction accuracy σ(n, m) is not equal to σ*;
If the forecasting model is true, σ(n, m) approximates σ* within a certain allowable deviation range, and a statistic is constructed accordingly:
Z=σ(n,m)-σ*;
When σ(n, m) makes the absolute value of the observed value Z0 of the statistic Z excessively large, the hypothesis H0 is rejected; otherwise, the hypothesis H0 is accepted;
Step 3: for the model hypothesis selected in step 2, calibrating the model parameters with the measured sample D L 0 of sample capacity L to obtain the population estimate of the real model, the estimate of the model real accuracy being σ*;
Step 4: sampling from the population model y=f(X; ω*)+ε, ε~g(θ*) by the Monte Carlo method, the number of samplings being denoted C and each of the C samples having sample capacity L; for each division scheme L(n, m), performing model calibration and testing on the C samples to obtain the sampling distribution of the prediction accuracy;
Step 5: determining the optimal sample set division capacity L(nr, mr) on the principle of minimum average distance, so that the sampling distribution of the statistic Z is as close as possible to the real H0; the smaller the average distance index between the sampling distribution and the model real accuracy, the closer the sampling distribution of the statistic Z is to the real H0, i.e., the sampling distribution of the prediction accuracy σ(n, m) under the optimal sample set division capacity L(nr, mr) is close to the model real accuracy σ*, and the probability that the original hypothesis H0 is true is larger;
Step 6: constructing the sampling distribution of the statistic Z from the prediction accuracy σ(n, m) samples under the optimal sample set division capacity L(nr, mr) of step 5, and determining the rejection region of the statistic Z from its sampling distribution and the given significance level α;
Step 7: based on the measured sample, constructing a forecasting model with sample set division capacity L(nr, mr), and calculating the prediction accuracy on the test set and the observed value Z0 of the statistic Z;
Step 8: judging whether to reject the original hypothesis H0 according to whether the observed value Z0 of the statistic Z falls in the rejection region of the sampling distribution of the statistic Z; if the original hypothesis H0 is rejected, the current model does not pass the hypothesis test and is not suitable as a preferred model; after the hypothesis test has been performed on each alternative model, if a plurality of alternative models pass the hypothesis test, the p-value method is further used to give the minimum significance level at which the original hypothesis H0 can be rejected, the p value being calculated as follows:
p=P{|z|≥z0};
the model with the minimum rejection rate 1-p for the original hypothesis H0 is considered to be closest to the real model, and model optimization is performed based on the criterion of minimum rejection rate.
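By way of a non-limiting illustration, steps 3 and 4 can be sketched for the simplest alternative model, a mean model y=μ+ε. The sketch assumes a Gaussian random term, root mean square error as the accuracy index, the in-sample error of the fully calibrated model as the estimate σ*, and a split that takes the first n values as the training set; the function names and the synthetic "measured" series are hypothetical and are not taken from the claims.

```python
import numpy as np

def rmse(y, y_hat):
    """Root mean square error between observations y and a forecast y_hat."""
    return float(np.sqrt(np.mean((np.asarray(y, dtype=float) - y_hat) ** 2)))

def sample_accuracy_distribution(measured, n, C=2000, seed=0):
    """Monte Carlo sampling distribution of test-set RMSE for a mean model."""
    measured = np.asarray(measured, dtype=float)
    L = measured.size
    # Step 3: calibrate the model on the whole measured sample.
    mu_star = measured.mean()
    sigma_star = rmse(measured, mu_star)  # assumed estimate of the real accuracy
    rng = np.random.default_rng(seed)
    sigma_nm = np.empty(C)
    for c in range(C):
        # Step 4: draw one sample of capacity L from the population model.
        y = mu_star + sigma_star * rng.standard_normal(L)
        train, test = y[:n], y[n:]
        mu_hat = train.mean()             # calibration on the training set
        sigma_nm[c] = rmse(test, mu_hat)  # accuracy on the test set
    return sigma_star, sigma_nm

# Synthetic stand-in for a measured runoff series of capacity L=55.
measured = np.random.default_rng(1).gamma(shape=4.0, scale=25.0, size=55)
sigma_star, sigma_nm = sample_accuracy_distribution(measured, n=15)
z_samples = sigma_nm - sigma_star  # samples of the statistic Z
print(sigma_star, float(z_samples.mean()))
```

Repeating the call for each candidate n yields the accuracy samples from which the division scheme of step 5 and the rejection region of step 6 are derived.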
2. The hypothesis testing based data-driven runoff forecasting model optimization method under limited samples as claimed in claim 1, wherein: the model real precision index sigma * in the step 3 comprises root mean square error, average absolute error and correlation coefficient.
3. The hypothesis testing based data-driven runoff forecasting model optimization method under limited samples as claimed in claim 1, wherein: the objective function of model solution in the step 3 consists of a fitting error function and a penalty function, and the general expression of the objective function is as follows:
Wherein F(y i, f(X i; ω)) is the fitting error of the model, which requires the model to fit the training samples as closely as possible; N is the training sample capacity; λΩ(ω) is a penalty function, introduced to restrain overfitting of the model and thereby improve the generalization ability of the model on unknown sample sets.
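The expression referred to in claim 3 does not appear in this text. One form consistent with the symbols defined there is given below; the summation over the N training samples without a 1/N factor and the minimization over ω are assumptions.

```latex
\min_{\omega}\; \sum_{i=1}^{N} F\bigl(y_i,\, f(X_i;\omega)\bigr) + \lambda\,\Omega(\omega)
```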
4. The hypothesis testing based data-driven runoff forecasting model optimization method under limited samples as claimed in claim 1, wherein: the step 5 specifically comprises the following steps:
Step 5.1: taking the average deviation d(n, m, σ*) as the discrimination index for sample set division; the number of statistical test samplings of the forecasting model with sample set division capacity L(n, m) is denoted C, the set of test set prediction accuracy samples from the C samplings is denoted σm, and the c-th sample is σm(c); a root mean square distance index is adopted to quantify the average deviation of σm(c) (c∈C) from the true accuracy σ*, defining the average deviation index d(n, m, σ*), the calculation formula being as follows:
Step 5.2: if d(n, m, σ*) attains its minimum among the multiple sample set division schemes L(n, m), the sample set division capacity at this point is called the "optimal sample set division capacity" L(nr, mr); at this point, the distribution of the prediction accuracy σ(n, m) deviates sufficiently little from the true σ*, i.e., the distribution of the statistic Z is sufficiently close to H0.
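The calculation formula of step 5.1 does not appear in this text. Read literally, the root mean square distance between the C accuracy samples σm(c) and σ* described there would take the following form, which is a reconstruction from the wording of the claim rather than the original expression:

```latex
d(n, m, \sigma^{*}) = \sqrt{\frac{1}{C}\sum_{c=1}^{C}\bigl(\sigma_m(c) - \sigma^{*}\bigr)^{2}}
```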
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311000418.8A CN117494862B (en) | 2023-08-09 | 2023-08-09 | Data-driven runoff forecasting model optimization method under limited sample based on hypothesis test |
Publications (2)
Publication Number | Publication Date |
---|---|
CN117494862A (en) | 2024-02-02 |
CN117494862B (en) | 2024-05-28 |
Family
ID=89675076
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202311000418.8A, Active, CN117494862B (en) | 2023-08-09 | 2023-08-09 | Data-driven runoff forecasting model optimization method under limited sample based on hypothesis test |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN117494862B (en) |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20120085171A (en) * | 2011-10-06 | 2012-07-31 | 주식회사 켐에쎈 | Multiple linear regression-artificial neural network hybrid model predicting parachor of pure organic compound |
CN109816167A (en) * | 2019-01-18 | 2019-05-28 | 昆仑(重庆)河湖生态研究院(有限合伙) | Runoff Forecast method and Runoff Forecast device |
CN110659996A (en) * | 2019-10-25 | 2020-01-07 | 重庆第二师范学院 | Stock investment risk early warning system and method based on machine learning |
CN111523732A (en) * | 2020-04-25 | 2020-08-11 | 中国海洋大学 | Japanese anchovy model screening and predicting method in winter |
CN112258019A (en) * | 2020-10-19 | 2021-01-22 | 佛山众陶联供应链服务有限公司 | Coal consumption assessment method |
CN113705877A (en) * | 2021-08-23 | 2021-11-26 | 武汉大学 | Real-time monthly runoff forecasting method based on deep learning model |
CN114966233A (en) * | 2022-05-16 | 2022-08-30 | 国网电力科学研究院武汉南瑞有限责任公司 | Lightning forecasting system and method based on deep neural network |
CN115099469A (en) * | 2022-06-06 | 2022-09-23 | 中国长江电力股份有限公司 | Medium-and-long-term runoff prediction method based on optimal climate factor and precision weight coefficient |
CN115455833A (en) * | 2022-09-20 | 2022-12-09 | 北京理工大学 | Pneumatic uncertainty characterization method considering classification |
Non-Patent Citations (1)
Title |
---|
Design Flood Estimation Using Univariate and Multivariate Frequency Analysis Methods; Muhammad Rizwan; Doctoral Dissertations Electronic Journal; 2023-01-15; full text * |
Also Published As
Publication number | Publication date |
---|---|
CN117494862A (en) | 2024-02-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN106934237A (en) | Radar cross-section redaction measures of effectiveness creditability measurement implementation method | |
CN112966949B (en) | Tunnel construction risk assessment method and device and storage medium | |
CN111310981A (en) | Reservoir water level trend prediction method based on time series | |
CN104462808A (en) | Method for fitting safe horizontal displacement and dynamic data of variable sliding window of water level | |
CN114036452B (en) | Yield evaluation method applied to discrete production line | |
CN105005822A (en) | Optimal step length and dynamic model selection based ultrahigh arch dam response prediction method | |
CN109523077B (en) | Wind power prediction method | |
CN113269384B (en) | Method for early warning health state of river system | |
CN113484813A (en) | Intelligent ammeter fault rate estimation method and system under multi-environment stress | |
CN117131977B (en) | Runoff forecasting sample set partitioning method based on misjudgment risk minimum criterion | |
CN109657287B (en) | Hydrological model precision identification method based on comprehensive scoring method | |
CN117592609B (en) | On-line monitoring method, monitoring terminal and storage medium for canal water utilization coefficient | |
CN117494862B (en) | Data-driven runoff forecasting model optimization method under limited sample based on hypothesis test | |
CN117851908A (en) | Improved on-line low-voltage transformer area electric energy meter misalignment monitoring method and device | |
CN117408171A (en) | Hydrologic set forecasting method of Copula multi-model condition processor | |
CN116681205A (en) | Method for evaluating and predicting development degree of rammed earth site gully disease | |
CN116595330A (en) | Runoff change attribution method considering uncertainty of hydrologic modeling | |
Atukeren | The relationship between the F-test and the Schwarz criterion: implications for Granger-causality tests | |
CN108665090B (en) | Urban power grid saturation load prediction method based on principal component analysis and Verhulst model | |
CN113627621B (en) | Active learning method for optical network transmission quality regression estimation | |
CN113987416B (en) | Oil gas resource amount calculation method and system based on confidence level | |
CN109117495B (en) | Robust data coordination method, device and storage medium in alumina production evaporation process | |
CN116720662B (en) | Distributed energy system applicability evaluation method based on set pair analysis | |
Chen et al. | Uncertain random accelerated degradation modelling and statistical analysis with aleatory and epistemic uncertainties from multiple dimensions | |
CN111882100B (en) | Hydrologic set interval forecast building method based on multi-model random linear combination |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||