CN117494862B - Data-driven runoff forecasting model optimization method under limited sample based on hypothesis test - Google Patents

Data-driven runoff forecasting model optimization method under limited sample based on hypothesis test

Info

Publication number
CN117494862B
CN117494862B (application CN202311000418.8A)
Authority
CN
China
Prior art keywords
model
sample
hypothesis
prediction
sigma
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311000418.8A
Other languages
Chinese (zh)
Other versions
CN117494862A (en)
Inventor
丁小玲
胡维忠
陈尚法
罗斌
苏培芳
唐海华
周超
蔡林杰
周廷璋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Changjiang Institute of Survey Planning Design and Research Co Ltd
Original Assignee
Changjiang Institute of Survey Planning Design and Research Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Changjiang Institute of Survey Planning Design and Research Co Ltd
Priority to CN202311000418.8A
Publication of CN117494862A
Application granted
Publication of CN117494862B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/04 Forecasting or optimisation specially adapted for administrative or management purposes, e.g. linear programming or "cutting stock problem"
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 Complex mathematical operations
    • G06F17/18 Complex mathematical operations for evaluating statistical data, e.g. average values, frequency distributions, probability functions, regression analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/04 Inference or reasoning models
    • G06N5/041 Abduction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q50/00 Information and communication technology [ICT] specially adapted for implementation of business processes of specific business sectors, e.g. utilities or tourism
    • G06Q50/10 Services
    • G06Q50/26 Government or public services


Abstract

The invention provides a data-driven runoff forecasting model optimization method under limited samples based on hypothesis testing, comprising the following steps: establishing the mapping relation between runoff and predictors, selecting several data-driven models to describe this mapping, and building a candidate model set; for a selected model hypothesis, calibrating the model parameters as an estimate of the true-model population; sampling from the population model, dividing each sample set into a training set and a test set, and performing model calibration and testing for each division scheme to obtain the sampling distribution of the prediction accuracy; determining the optimal sample-set division; calculating the test-set prediction accuracy and the observed value of the test statistic; judging whether to reject the null hypothesis according to whether the observed statistic falls in the rejection region; and, if several candidate models pass the hypothesis test, performing model optimization according to the minimum-rejection-rate criterion. The method can verify the reliability of a forecasting model for inference and prediction about the unknown population, and provides a decision basis for forecasting-model selection.

Description

Data-driven runoff forecasting model optimization method under limited sample based on hypothesis test
Technical Field
The invention relates to the technical field of runoff forecasting, in particular to a data-driven runoff forecasting model optimization method under limited samples based on hypothesis testing.
Background
Medium- and long-term runoff forecasting is a classical and difficult problem in hydrological forecasting, and reliable runoff forecast information is an important decision basis for the scientific planning and utilization of water resources and for the safe and economical operation of hydraulic engineering. Selecting a high-accuracy forecasting model through model comparison, so as to provide reliable runoff forecast information to decision makers, has therefore always been one of the key tasks of hydrological forecasting research.
Annual runoff forecasting with a long lead time usually adopts statistics-based data-driven models, which infer and predict unknown samples by mining the intrinsic laws contained in the sample data. The essence of such forecast modelling is to select a suitable mathematical model to fit the laws of the finite samples and then to predict unknown samples with the fitted model, in the hope of reducing the random prediction error as far as possible. In terms of model selection, because the sample population is unknown, the traditional modelling approach usually selects several candidate models from many alternatives according to empirical knowledge, calibrates the model parameters on a training set, and then, by comparison, selects the model with the higher prediction accuracy on a test set for actual runoff forecasting.
However, according to mathematical statistical inference theory, the measured runoff samples are a set of observations drawn from the population according to probability and will not necessarily recur in the future, and the shorter the measured sample series, the larger the probability that it deviates from the population distribution. Because annual runoff observation series are relatively short, usually only a few decades, model calibration and prediction-accuracy testing are performed on limited measured samples, and the resulting prediction-accuracy indices also occur according to probability. Models selected in the traditional way, with the test-set prediction accuracy as the only criterion, are therefore difficult to verify with respect to the prediction uncertainty about the unknown population, and no effective verification method is yet available. Although hypothesis-testing theory in statistics provides a theoretical basis for statistical inference from limited samples, how to combine it with the key steps of data-driven forecast modelling to construct a hypothesis-testing method for runoff forecasting models, focused on the prediction accuracy of a model for unknown samples, remains a technical problem to be solved.
Disclosure of Invention
Aiming at the above problems, the invention provides a data-driven runoff forecasting model optimization method under limited samples based on hypothesis testing, which verifies the reliability of a forecasting model for inference and prediction about the unknown population and provides a decision basis for forecasting-model selection.
The invention is realized by the following technical scheme. A data-driven runoff forecasting model optimization method under limited samples based on hypothesis testing comprises the following steps:
Step 1: screening predictors from historical runoff data and meteorological data, establishing the mapping relation between runoff and the predictors, selecting several data-driven models to describe the mapping relation, and building a candidate model set;
Step 2: selecting one model from the candidate model set of Step 1 as the model hypothesis, and defining the null hypothesis H0 and the alternative hypothesis H1 used to test the authenticity of the forecasting model, wherein:
H0: the model hypothesis y = f(X; ω*) + ε, ε ~ g(θ*) is true; the forecasting model f(X; ω(D_{L,n})) obtained by calibration on the training set is the true model, and its prediction accuracy σ(n, m) on the test set D_{L,m} = (X_m, y_m) equals the true model accuracy σ*;
H1: the forecasting model f(X; ω(D_{L,n})) is not true, i.e. σ(n, m) is not equal to σ*;
if the forecasting model is true, σ(n, m) approximates σ* within a certain allowable deviation range, and the statistic is constructed as
Z = σ(n, m) - σ*    (2)
when σ(n, m) makes the absolute value of the observed value Z0 of the statistic Z excessively large, the hypothesis H0 is rejected; otherwise the hypothesis H0 is accepted;
Step 3: for the model hypothesis selected in Step 2, calibrating the model parameters on the measured sample D_L^0 of sample size L as an estimate of the true-model population, the population model being denoted y = f(X; ω*) + ε, ε ~ g(θ*) and the true model accuracy being estimated as σ*;
Step 4: sampling the population model y = f(X; ω*) + ε, ε ~ g(θ*) with sample size L by the Monte Carlo method, the number of samplings being C; denoting one drawn sample set by D_L, dividing each sample set D_L into a training set D_{L,n} = (X_n, y_n) and a test set D_{L,m} = (X_m, y_m) of sample sizes n and m respectively, and denoting the sample-set division scheme by L(n, m); the training set D_{L,n} is used for model parameter calibration and the test set D_{L,m} for testing the model prediction accuracy; for each division scheme L(n, m), performing model calibration and testing on the C samplings to obtain the sampling distribution of the prediction accuracy;
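By way of illustration of Step 4, the following Python sketch (function and variable names are hypothetical, and it assumes the population model of Step 3 is available as a callable with normally distributed residuals) draws C Monte Carlo samples of size L from the estimated population model, splits each chronologically under a division L(n, m), re-calibrates the model on the training part, and collects the sampling distribution of the test-set RMSE:

```python
import numpy as np

def rmse(y_true, y_pred):
    # root mean square error, used as the prediction-accuracy index sigma
    return np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2))

def accuracy_sampling_distribution(f_pop, sigma_eps, X_pool, fit_model, n, m, C=500, seed=0):
    """Sampling distribution of the test-set RMSE under one division L(n, m).

    f_pop     : callable X -> y, the population model f(X; omega*) estimated in Step 3
    sigma_eps : standard deviation of the residual term epsilon (normality assumed)
    X_pool    : (L, d) array of predictor values, with L = n + m
    fit_model : callable (X_train, y_train) -> predict function (model re-calibration)
    """
    rng = np.random.default_rng(seed)
    L = n + m
    sigmas = np.empty(C)
    for c in range(C):
        # one Monte Carlo sample D_L of size L drawn from the population model
        y = f_pop(X_pool) + rng.normal(0.0, sigma_eps, size=L)
        X_tr, y_tr = X_pool[:n], y[:n]      # training set D_{L,n} (chronological split)
        X_te, y_te = X_pool[n:], y[n:]      # test set D_{L,m}
        predict = fit_model(X_tr, y_tr)     # model calibration on the training set
        sigmas[c] = rmse(y_te, predict(X_te))
    return sigmas                            # C sampled values of sigma(n, m)
```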
Step 5: determining the optimal sample-set division L(nr, mr) by the minimum-mean-distance principle, so that the sampling distribution of the statistic Z is as close as possible to that under a true H0; the smaller the mean-distance index between the sampling distribution of Z and the true model accuracy, the closer the sampling distribution of the prediction accuracy σ(n, m) under the optimal division L(nr, mr) is to the true model accuracy σ*, and the larger the probability that the null hypothesis H0 holds;
Step 6: constructing the sampling distribution of the statistic Z from the sample of prediction accuracies σ(n, m) under the optimal sample-set division L(nr, mr) of Step 5, and determining the rejection region from this sampling distribution and the given significance level α;
Step 7: based on the measured sample D_L^0, constructing the forecasting model under the sample-set division L(nr, mr), and calculating the test-set prediction accuracy and the observed value Z0 of the statistic Z;
Step 8: judging whether to reject the null hypothesis H0 according to whether the observed value Z0 of the statistic Z falls in the rejection region of the sampling distribution of Z; if H0 is rejected, the current model fails the hypothesis test and is not suitable as a preferred model; if several candidate models pass the hypothesis test, the p-value method is further adopted to give the smallest significance level at which the null hypothesis H0 could be rejected, the p-value being calculated as
p = P{|Z| ≥ |Z0|}    (6)
the model with the smallest rejection rate 1-p of the null hypothesis H0 is regarded as closest to the true model, and model optimization is performed according to the minimum-rejection-rate criterion.
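A minimal sketch of the Step 8 decision, assuming the Monte Carlo sample of the statistic Z from Step 6 is available as an array (all names are hypothetical): it computes the empirical two-sided p-value and the rejection rate 1-p, and ranks the candidate models by the minimum-rejection-rate criterion.

```python
import numpy as np

def empirical_p_value(Z_samples, z0):
    # two-sided empirical p-value: share of the H0 sampling distribution with |Z| >= |Z0|
    return float(np.mean(np.abs(Z_samples) >= abs(z0)))

def rank_by_rejection_rate(candidates):
    """candidates: dict {model name: (Z_samples, z0)}.

    Returns (name, rejection rate 1 - p) pairs, smallest rejection rate first."""
    rates = {name: 1.0 - empirical_p_value(Z, z0) for name, (Z, z0) in candidates.items()}
    return sorted(rates.items(), key=lambda kv: kv[1])
```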
Further, the data-driven model in step 1 consists of a model structure, parameters, input variables and output variables, with the model expression
y = f(X; ω) + ε, ε ~ g(θ)
where the input variable X is a predictor correlated with runoff; the output variable y is the runoff; f(X; ω) is a model form with certain structural characteristics; ω is the model parameter, obtained by training on historical samples; and ε is a random term with distribution function g(θ).
Further, the true model accuracy index σ* in step 3 may be the root mean square error, the mean absolute error or the correlation coefficient.
Further, the objective function for model solution in step 3 consists of a fitting-error term and a penalty term; its general expression is
min over ω of  Σ_{i=1}^{n} L(y_i, f(X_i; ω)) + λΩ(ω)
where L(y_i, f(X_i; ω)) is the model fitting error, which requires the model to fit the training samples as closely as possible; n is the training-sample size; and λΩ(ω) is a penalty term introduced to constrain model over-fitting and improve the generalization ability of the model on unknown samples.
Further, step 5 specifically comprises:
Step 5.1: taking the mean deviation d(n, m, σ*) as the criterion for sample-set division; denoting the number of statistical-test samplings of the forecasting model under the sample-set division L(n, m) by C and the sample of test-set prediction accuracies over the C samplings by σ_m, with the c-th value σ_m(c); quantifying the mean deviation of σ_m(c), c ∈ C, from the true accuracy σ* by the root-mean-square distance, and defining the mean-deviation index
d(n, m, σ*) = sqrt( (1/C) Σ_{c=1}^{C} [σ_m(c) - σ*]^2 )
Step 5.2: the division L(n, m) at which d(n, m, σ*) attains its minimum over the candidate division schemes is taken as the optimal sample-set division, denoted L(nr, mr); under this division the distribution of the prediction accuracy σ(n, m) deviates as little as possible from the true σ*, i.e. the distribution of the statistic Z is sufficiently close to that under H0.
The present invention has the following advantages.
(1) Based on the practical-impossibility principle for small-probability events in hypothesis-testing theory, step 2 of the invention provides a hypothesis-testing method for runoff forecasting models under limited samples; steps 4-5 use the Monte Carlo method to simulate the distribution of the prediction accuracy and obtain the sampling distribution with the smallest risk of deviating from the true model; steps 7-8 provide a model-selection criterion based on the minimum rejection rate, which guarantees with the greatest probability the reliability of the selected model for inference and prediction about the unknown population. As shown by the results of Fig. 6 in the embodiment, the credibility of the mean model selected by the hypothesis-testing method is 0.91, which is 0.89 and 0.61 higher than that of the models selected by the traditional optimization methods (2:1 and 4:1 divisions), verifying the superiority of the proposed hypothesis-testing selection method.
(2) The forecasting-model optimization method can be extended to prediction problems in other fields, providing a reliability-criterion-based selection method for building data-driven forecasting models under limited samples.
Drawings
FIG. 1 is a schematic diagram of the probability distribution of the model prediction accuracy and the two-sided-test rejection region in an embodiment of the invention;
FIG. 2 is a flowchart of the data-driven runoff forecasting model optimization method under limited samples based on hypothesis testing according to an embodiment of the invention;
FIG. 3 is a diagram of the measured annual runoff series and its first-order difference series in the embodiment of the invention;
FIG. 4 is a graph of the 95% confidence intervals of the prediction accuracy and the mean-deviation index of the six candidate models (mean model, etc.) as functions of the training-set size n in the embodiment of the invention;
FIG. 5 is a graph of the 95% confidence intervals of the sampling distributions of the statistic Z for the six candidate models (mean model, etc.) and the locations of the observed values Z0;
FIG. 6 is a graph comparing the model-selection results of the hypothesis-testing method of the invention and the traditional optimization methods.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are some embodiments of the present invention, but not all embodiments of the present invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
As shown in Fig. 2, the data-driven runoff forecasting model optimization method under limited samples based on hypothesis testing according to this embodiment comprises the following steps:
Step 1: Predictors are screened from the historical runoff data and meteorological data, several data-driven models are selected to establish the mapping relation between runoff and the predictors, and a candidate model set is built.
The data-driven model consists of a model structure, parameters, input variables and output variables, and its expression is:
y = f(X; ω) + ε, ε ~ g(θ)    (1)
where the input variable X is a predictor correlated with runoff, such as an autocorrelation factor or an external causal factor; the output variable y is the runoff; f(X; ω) is a model form with certain structural characteristics; ω is the model parameter, obtained by training on historical samples; and ε is a random term with distribution function g(θ).
In the example, annual runoff data from 1956 to 2010 were collected, giving a sample size of 55. The runoff series and its first-order difference series are shown in Fig. 3. Linear and nonlinear autocorrelation analyses were carried out on the 55-year measured runoff series and its difference series to screen predictors and establish autocorrelation models. Because the linear autocorrelation of the measured runoff series is not significant, a mean model can be used; the autocorrelation of the first-order difference series is significant for orders p = 1-2, so the first-order-differenced autoregressive models ARI(1,1) and ARI(2,1) can be adopted; the nonlinear autocorrelation of the measured runoff series, quantified with mutual information, is appreciable for orders p = 1-3, so the support vector regression models SVR(1), SVR(2) and SVR(3) can be used. The candidate model set is therefore constructed as: mean model, ARI(1,1), ARI(2,1), SVR(1), SVR(2) and SVR(3).
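One possible way to assemble such a candidate set in Python is sketched below, using statsmodels and scikit-learn as stand-ins for the mean model, the first-order-differenced autoregressive models and the support-vector-regression models; here ARI(p, 1) is written as ARIMA(p, 1, 0), and the kernel choice and function names are illustrative assumptions rather than the exact configuration of the embodiment.

```python
import numpy as np
from sklearn.svm import SVR
from statsmodels.tsa.arima.model import ARIMA

def fit_mean_model(y):
    # mean model: the forecast is simply the mean of the training series
    mu = float(np.mean(y))
    return lambda steps=1: np.full(steps, mu)

def fit_ari(y, p):
    # ARI(p, 1): autoregression of order p on the first-order differenced series
    res = ARIMA(y, order=(p, 1, 0)).fit()
    return lambda steps=1: res.forecast(steps)

def fit_svr(y, p):
    # SVR(p): support vector regression on the p most recent runoff values
    X = np.column_stack([y[i:len(y) - p + i] for i in range(p)])
    model = SVR(kernel="rbf").fit(X, y[p:])
    return lambda last_p_values: model.predict(np.atleast_2d(last_p_values))

def build_candidates(y):
    # candidate set of the example: mean model, ARI(1,1), ARI(2,1), SVR(1), SVR(2), SVR(3)
    return {
        "mean": fit_mean_model(y),
        "ARI(1,1)": fit_ari(y, 1),
        "ARI(2,1)": fit_ari(y, 2),
        "SVR(1)": fit_svr(y, 1),
        "SVR(2)": fit_svr(y, 2),
        "SVR(3)": fit_svr(y, 3),
    }
```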
Step 2: From the candidate model set of Step 1, each candidate model is selected in turn as the model hypothesis.
The known sample set D_L = {(X, y) | X ∈ R^{L×d}, y ∈ R^L} of sample size L is a sample drawn from the population D = (X, y); because the actual population is unknown, the known sample D_L is used to infer the population and to construct the model y = f(X; ω*) + ε, ε ~ g(θ*). The known sample set D_L is divided into a training set D_{L,n} = (X_n, y_n) and a test set D_{L,m} = (X_m, y_m) of sample sizes n and m respectively, and the division scheme is denoted L(n, m). The training set D_{L,n} is used for model parameter calibration and the test set D_{L,m} for checking the model prediction accuracy. The empirical model calibrated on the training set D_{L,n} is denoted f(X; ω(D_{L,n})), where ω(D_{L,n}) is an estimate of the parameter ω*.
To test the authenticity of the forecasting model, the null hypothesis H0 and the alternative hypothesis H1 are defined.
H0: the model hypothesis is true; the forecasting model f(X; ω(D_{L,n})) obtained by calibration on the training set is the true model, and its prediction accuracy σ(n, m) on the test set equals the true model accuracy σ*.
H1: the forecasting model f(X; ω(D_{L,n})) is not true, i.e. its prediction accuracy σ(n, m) on the test set is not equal to the true model accuracy σ*. The statistic is constructed as
Z = σ(n, m) - σ*    (2)
When σ(n, m) makes the absolute value of the observed value Z0 of the statistic Z excessively large, the hypothesis H0 is rejected; otherwise the hypothesis H0 is accepted.
A schematic diagram of the sampling distribution and rejection region of Z and of the observed value Z0 is shown in Fig. 1.
Step 3: For the model hypothesis selected in Step 2, the model parameters are calibrated on the measured sample of sample size L = 55 as an estimate of the true-model population; the population model is denoted y = f(X; ω*) + ε, ε ~ g(θ*), and the true model accuracy is estimated as σ*.
The objective function for model solution in Step 3 generally consists of a fitting-error term and a penalty term; its general expression is
min over ω of  Σ_{i=1}^{n} L(y_i, f(X_i; ω)) + λΩ(ω)    (3)
where L(y_i, f(X_i; ω)) is the model fitting error, which requires the model to fit the training samples as closely as possible; n is the training-sample size; and λΩ(ω) is a penalty term introduced to constrain model over-fitting and improve the generalization ability of the model on unknown samples. Traditional linear regression models usually take the minimum mean squared error (Mean Squared Error, MSE) as the objective function and are solved by least squares; machine-learning models such as support vector regression (Support Vector Regression, SVR) usually have a parameter-optimization objective (also called a loss function) comprising an error term and a penalty term, and are solved by optimization algorithms such as gradient descent.
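As one concrete instance of the penalized objective in equation (3), the sketch below fits a linear model by penalized least squares with a quadratic penalty λ‖ω‖²; the quadratic (ridge-type) penalty and the function name are assumptions chosen for illustration, not the only admissible form of Ω(ω).

```python
import numpy as np

def fit_penalized_linear(X, y, lam=1.0):
    """Minimise sum_i (y_i - f(X_i; w))^2 + lam * ||w||^2 (fitting error plus penalty).

    Closed-form ridge-type solution; lam plays the role of lambda in equation (3)."""
    Xb = np.hstack([X, np.ones((X.shape[0], 1))])   # append an intercept column
    P = lam * np.eye(Xb.shape[1])
    P[-1, -1] = 0.0                                  # do not penalise the intercept
    w = np.linalg.solve(Xb.T @ Xb + P, Xb.T @ y)
    return lambda X_new: np.hstack([X_new, np.ones((X_new.shape[0], 1))]) @ w
```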
The true model accuracy index σ* in Step 3 may be chosen as the root mean square error, the mean absolute error, the correlation coefficient or another index. Taking the root mean square error (Root Mean Square Error, RMSE) as an example, the test-set prediction-accuracy index is calculated as
σ = sqrt( (1/m) Σ_{i=1}^{m} (y_i - ŷ_i)^2 )    (4)
where σ is the root mean square error, i.e. the square root of the mean squared error (Mean Square Error, MSE); m is the test-sample size; y_i is the measured runoff; and ŷ_i is the runoff predicted by the empirical model.
The parameters of each fitted model are estimated according to the objective function of equation (3) and the corresponding solution method; the residual series ε of every candidate model is judged by the K-S test to follow a normal distribution. The residual root-mean-square-error estimates σ* of the six candidate models (mean model, ARI(1,1), ARI(2,1), SVR(1), SVR(2) and SVR(3)) are 698, 804, 775, 679, 599 and 534, respectively.
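The residual normality check mentioned above can be carried out, for example, with the one-sample Kolmogorov-Smirnov test in SciPy (a sketch; the residual array and threshold are hypothetical, and estimating the normal parameters from the same residuals makes the test approximate):

```python
import numpy as np
from scipy import stats

def residual_normality_and_sigma(residuals, alpha=0.05):
    """One-sample K-S test of calibrated-model residuals against a fitted normal law."""
    residuals = np.asarray(residuals, dtype=float)
    mu, sd = residuals.mean(), residuals.std(ddof=1)
    # estimating mu and sd from the same residuals makes the K-S test approximate
    stat, p = stats.kstest(residuals, "norm", args=(mu, sd))
    sigma_star = np.sqrt(np.mean(residuals ** 2))    # residual RMSE as the estimate of sigma*
    return {"ks_stat": stat, "p_value": p, "normal": p > alpha, "sigma_star": sigma_star}
```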
Step 4: For each of the six estimated population models (mean model, ARI(1,1), ARI(2,1), SVR(1), SVR(2) and SVR(3)), C = 500 random samples of size L = 55 are drawn from the population by the Monte Carlo method, one drawn sample being recorded as a sample set D_L. Each sample set D_L is divided chronologically into training and test sets of sizes L(n, m).
Forty-six L(n, m) schemes are set: n ranges from 5 to 50 and the corresponding m from 50 down to 5. For each L(n, m) scheme, forecasting-model training and testing are carried out on the C = 500 samples, and the test-set prediction-accuracy samples σ_m of the 46 schemes are obtained in turn.
Step 5: The smaller the mean-distance index between the sampling distribution of the prediction accuracy and the true model accuracy, the closer the sampling distribution of the statistic Z is to that under a true H0, i.e. the closer the sampling distribution of σ(n, m) is to the true model accuracy σ*, and the larger the probability that the null hypothesis H0 holds. The optimal sample-set division L(nr, mr) is therefore determined by the minimum-mean-distance principle, so that the sampling distribution of the statistic Z approximates that under a true H0 as closely as possible. The specific steps are as follows:
Step 5.1: The mean deviation d(n, m, σ*) of equation (5) is adopted as the criterion for sample-set division. The number of statistical-test samplings of the forecasting model under the division L(n, m) is C, and the sample of test-set prediction accuracies (root-mean-square prediction errors) over the C samplings is σ_m, with the c-th value σ_m(c). The mean deviation of σ_m(c), c ∈ C, from the true accuracy σ* is quantified by the root-mean-square distance, defining the mean-deviation index
d(n, m, σ*) = sqrt( (1/C) Σ_{c=1}^{C} [σ_m(c) - σ*]^2 )    (5)
Step 5.2: The division L(n, m) at which d(n, m, σ*) attains its minimum over the candidate division schemes is the optimal sample-set division, denoted L(nr, mr). Under this division the distribution of the prediction accuracy σ(n, m) deviates as little as possible from the true σ*, i.e. the distribution of the statistic Z of equation (2) is sufficiently close to that under H0.
From the test-set prediction-accuracy samples of the 46 L(n, m) schemes obtained in Step 4, the mean deviation d(n, m, σ*) of the σ_m samples from σ* under each scheme is calculated according to equation (5). The 95% confidence intervals of the prediction accuracy and the mean-deviation index of the six candidate models (mean model, ARI(1,1), ARI(2,1), SVR(1), SVR(2) and SVR(3)) are plotted against the training-set size n, as shown in Fig. 4.
Analysing the mean deviation d(n, m, σ*) of the six models in Figs. 4(a)-(f): as the training-set size n increases (and m decreases), d(n, m, σ*) of each model first decreases and then increases, reaching a minimum at certain n and m. The optimal sample-set divisions of the models can be determined accordingly; the optimal training-set sizes are n = 15, 21, 24, 26 and 25, and the corresponding test-set sizes are 55 - n.
Step 6: The sampling distribution of the statistic Z is constructed from the prediction-accuracy samples σ(n, m) under the optimal sample-set division L(nr, mr) of Step 5, and the rejection region is determined from this sampling distribution and the given significance level α.
The statistic Z is constructed according to equation (2) from the test-set prediction accuracies σ_m under the optimal sample-set division; the confidence interval at significance level α = 0.05 is obtained from the sampling distribution of Z, the lower and upper bounds of the two-sided test being taken as the α/2 = 2.5% and 1 - α/2 = 97.5% quantiles of the sampling distribution of Z, respectively.
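Step 6 then amounts to taking the empirical α/2 and 1 - α/2 quantiles of the Z samples as the bounds of the two-sided rejection region (a sketch with hypothetical array names):

```python
import numpy as np

def rejection_region(Z_samples, alpha=0.05):
    # two-sided rejection-region bounds from the Monte Carlo sampling distribution of Z
    lower = np.quantile(Z_samples, alpha / 2)        # 2.5% quantile for alpha = 0.05
    upper = np.quantile(Z_samples, 1 - alpha / 2)    # 97.5% quantile
    return lower, upper

def reject_h0(z0, Z_samples, alpha=0.05):
    lower, upper = rejection_region(Z_samples, alpha)
    return z0 < lower or z0 > upper                  # H0 rejected when Z0 falls in the rejection region
```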
Step 7: Following the same calibration and testing procedure, the test-set prediction accuracy of the measured sample under the optimal sample-set division is calculated, and the observed value Z0 of the statistic Z is then computed according to equation (2).
The 95% confidence intervals of the sampling distributions of the statistic Z for the six candidate models and the locations of the observed values Z0, at a significance level of 0.05, are plotted in Fig. 5.
Step 8: According to the quantile at which Z0 falls in the sampling distribution of Z, the p-value is calculated with equation (6), quantifying the credibility of the hypothesis for each of the six models. The credibility p-values and the rejection rates 1-p are shown in Table 1.
Table 1. Hypothesis-test results of the forecasting models under the optimal sample-set division
Comparing the test results of the six models in Table 1: the mean model and SVR(1) have the highest credibility, followed by ARI(2,1); ARI(1,1) and SVR(2) have lower credibility; and SVR(3) is judged an untrustworthy model at the 0.05 significance level. The mean model and SVR(1) are therefore recommended as the optimal models, with optimal sample-set divisions of (15, 40) and (26, 29), respectively.
The forecasting-model selection method proposed above is further compared with existing methods.
The hypothesis-testing method is the method proposed by the invention, in which model optimization is based on the minimum-rejection-rate criterion under the optimal sample-set division; the traditional optimization method divides the measured samples into a training set and a test set in empirical ratios (2:1 and 4:1) and selects the model with the highest test-set prediction accuracy.
The correspondence between the test accuracy σ on the measured samples and the rejection rate 1-p of the six models under the hypothesis-testing method, the traditional method (2:1) and the traditional method (4:1) is shown in Fig. 6.
In Fig. 6(a), the optimal mean model and the sub-optimal model SVR(1) selected by the hypothesis-testing method (minimum-rejection-rate criterion) not only have higher test accuracy on the measured sample than the other four models, but also have the lowest rejection rates of the model-is-true hypothesis, 0.09 and 0.10 respectively. In Fig. 6(b), the traditional method (2:1) recommends SVR(3) as the optimal model with the highest accuracy, but its rejection rate is 0.98, so the hypothesis that the model is true is rejected at the 95% confidence level. In Fig. 6(c), the traditional method (4:1) recommends ARI(2,1) as the optimal model and SVR(3) as the sub-optimal model by accuracy, but their rejection rates reach 0.70 and 0.85, so the reliability of these models is low. It can be seen that selecting models solely by test accuracy, as in the traditional methods (2:1 and 4:1), may lead to unreasonable selection results because of the sampling uncertainty of the prediction-accuracy statistics.
The credibility of the mean model selected by the hypothesis-testing method is 0.91, which is 0.89 and 0.61 higher than that of the models selected by the traditional methods (2:1 and 4:1), respectively, verifying the superiority of the proposed hypothesis-testing selection method.
The foregoing is only a specific embodiment of a certain embodiment of the present invention, but the protection scope of the present invention is not limited thereto, and any changes or substitutions that are easily conceivable by those skilled in the art within the technical scope of the present invention should be covered in the protection scope of the present invention. Therefore, the protection scope of the present invention should be subject to the protection scope of the claims.

Claims (4)

1. A data-driven runoff forecasting model optimization method under limited samples based on hypothesis testing, characterized by comprising the following steps:
Step 1: screening predictors from historical runoff data and meteorological data, establishing the mapping relation between runoff and the predictors, selecting several data-driven models to describe the mapping relation, and building a candidate model set; the data-driven model consists of a model structure, parameters, input variables and output variables, with the model expression
y = f(X; ω) + ε, ε ~ g(θ)
wherein the input variable X is a predictor correlated with runoff; the output variable y is the runoff; f(X; ω) is a model form with certain structural characteristics; ω is the model parameter, obtained by training on historical samples; and ε is a random term with distribution function g(θ);
Step 2: from the candidate model set of Step 1, selecting each candidate model in turn as the model hypothesis; a sample set D_L of known sample size L is a sample from the population D; the known sample D_L is used to infer the population D and to construct the model y = f(X; ω*) + ε, ε ~ g(θ*), wherein ω* is the model parameter inferred for the population D from the sample D_L and g(θ*) is the corresponding distribution function, serving as an estimate of the true-model population, the true model accuracy being estimated as σ*; dividing the known sample set D_L into a training set D_{L,n} and a test set D_{L,m} of sample sizes n and m respectively, and denoting the sample-set division scheme by L(n, m); the training set D_{L,n} is used for model parameter calibration and the test set D_{L,m} for testing the model prediction accuracy; the forecasting model calibrated on the training set D_{L,n} is denoted f(X; ω(D_{L,n})), wherein ω(D_{L,n}) is an estimate of the parameter ω*;
defining the null hypothesis H0 and the alternative hypothesis H1 used to test the authenticity of the forecasting model, wherein:
H0: the model hypothesis y = f(X; ω*) + ε, ε ~ g(θ*) is true, the forecasting model f(X; ω(D_{L,n})) calibrated on the training set D_{L,n} is the true model, and its prediction accuracy σ(n, m) on the test set D_{L,m} equals the true model accuracy σ*;
H1: the forecasting model f(X; ω(D_{L,n})) is not true, i.e. the prediction accuracy σ(n, m) is not equal to σ*;
if the forecasting model is true, σ(n, m) approximates σ* within a certain allowable deviation range, and the statistic is constructed as
Z = σ(n, m) - σ*
when σ(n, m) makes the absolute value of the observed value Z0 of the statistic Z excessively large, the hypothesis H0 is rejected; otherwise the hypothesis H0 is accepted;
Step 3: for the model hypothesis selected in Step 2, calibrating the model parameters on the measured sample D_L^0 of sample size L as an estimate of the true-model population, the true model accuracy being estimated as σ*;
Step 4: sampling the population model y = f(X; ω*) + ε, ε ~ g(θ*) with sample size L by the Monte Carlo method, the number of samplings being C, and, for each division scheme L(n, m), performing model calibration and testing on the C samplings to obtain the sampling distribution of the prediction accuracy;
Step 5: determining the optimal sample-set division L(nr, mr) by the minimum-mean-distance principle, so that the sampling distribution of the statistic Z is as close as possible to that under a true H0; the smaller the mean-distance index between the sampling distribution of Z and the true model accuracy, the closer the sampling distribution of the prediction accuracy σ(n, m) under the optimal division L(nr, mr) is to the true model accuracy σ*, and the larger the probability that the null hypothesis H0 holds;
Step 6: constructing the sampling distribution of the statistic Z from the sample of prediction accuracies σ(n, m) under the optimal sample-set division L(nr, mr) of Step 5, and determining the rejection region from this sampling distribution and the given significance level α;
Step 7: based on the measured sample D_L^0, constructing the forecasting model under the sample-set division L(nr, mr), and calculating the test-set prediction accuracy and the observed value Z0 of the statistic Z;
Step 8: judging whether to reject the null hypothesis H0 according to whether the observed value Z0 of the statistic Z falls in the rejection region of the sampling distribution of Z; if the null hypothesis H0 is rejected, the current model fails the hypothesis test and is not suitable as a preferred model; after performing the hypothesis test on each candidate model, if several candidate models pass the hypothesis test, further adopting the p-value method to give the smallest significance level at which the null hypothesis H0 could be rejected, the p-value being calculated as
p = P{|Z| ≥ |Z0|};
the model with the smallest rejection rate 1-p of the null hypothesis H0 is regarded as closest to the true model, and model optimization is performed according to the minimum-rejection-rate criterion.
2. The data-driven runoff forecasting model optimization method under limited samples based on hypothesis testing as claimed in claim 1, wherein: the true model accuracy index σ* in step 3 comprises the root mean square error, the mean absolute error and the correlation coefficient.
3. The data-driven runoff forecasting model optimization method under limited samples based on hypothesis testing as claimed in claim 1, wherein: the objective function for model solution in step 3 consists of a fitting-error term and a penalty term, its general expression being
min over ω of  Σ_{i=1}^{n} F(y_i, f(X_i; ω)) + λΩ(ω)
wherein F(y_i, f(X_i; ω)) is the model fitting error, which requires the model to fit the training samples as closely as possible; n is the training-sample size; and λΩ(ω) is a penalty term introduced to constrain model over-fitting and improve the generalization ability of the model on unknown samples.
4. The data-driven runoff forecasting model optimization method under limited samples based on hypothesis testing as claimed in claim 1, wherein step 5 specifically comprises:
Step 5.1: taking the mean deviation d(n, m, σ*) as the criterion for sample-set division; denoting the number of statistical-test samplings of the forecasting model under the sample-set division L(n, m) by C and the sample of test-set prediction accuracies over the C samplings by σ_m, with the c-th value σ_m(c); quantifying the mean deviation of σ_m(c), c ∈ C, from the true accuracy σ* by the root-mean-square distance, and defining the mean-deviation index
d(n, m, σ*) = sqrt( (1/C) Σ_{c=1}^{C} [σ_m(c) - σ*]^2 )
Step 5.2: if d(n, m, σ*) attains its minimum over the set of division schemes L(n, m), the division at that point is the optimal sample-set division, denoted L(nr, mr); under this division the distribution of the prediction accuracy σ(n, m) deviates sufficiently little from the true σ*, i.e. the distribution of the statistic Z is sufficiently close to that under H0.
CN202311000418.8A 2023-08-09 2023-08-09 Data-driven runoff forecasting model optimization method under limited sample based on hypothesis test Active CN117494862B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311000418.8A CN117494862B (en) 2023-08-09 2023-08-09 Data-driven runoff forecasting model optimization method under limited sample based on hypothesis test

Publications (2)

Publication Number Publication Date
CN117494862A CN117494862A (en) 2024-02-02
CN117494862B true CN117494862B (en) 2024-05-28

Family

ID=89675076

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311000418.8A Active CN117494862B (en) 2023-08-09 2023-08-09 Data-driven runoff forecasting model optimization method under limited sample based on hypothesis test

Country Status (1)

Country Link
CN (1) CN117494862B (en)

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20120085171A (en) * 2011-10-06 2012-07-31 주식회사 켐에쎈 Multiple linear regression-artificial neural network hybrid model predicting parachor of pure organic compound
CN109816167A (en) * 2019-01-18 2019-05-28 昆仑(重庆)河湖生态研究院(有限合伙) Runoff Forecast method and Runoff Forecast device
CN110659996A (en) * 2019-10-25 2020-01-07 重庆第二师范学院 Stock investment risk early warning system and method based on machine learning
CN111523732A (en) * 2020-04-25 2020-08-11 中国海洋大学 Japanese anchovy model screening and predicting method in winter
CN112258019A (en) * 2020-10-19 2021-01-22 佛山众陶联供应链服务有限公司 Coal consumption assessment method
CN113705877A (en) * 2021-08-23 2021-11-26 武汉大学 Real-time monthly runoff forecasting method based on deep learning model
CN114966233A (en) * 2022-05-16 2022-08-30 国网电力科学研究院武汉南瑞有限责任公司 Lightning forecasting system and method based on deep neural network
CN115099469A (en) * 2022-06-06 2022-09-23 中国长江电力股份有限公司 Medium-and-long-term runoff prediction method based on optimal climate factor and precision weight coefficient
CN115455833A (en) * 2022-09-20 2022-12-09 北京理工大学 Pneumatic uncertainty characterization method considering classification


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Design Flood Estimation Using Univariate and Multivariate Frequency Analysis Methods; Muhammad Rizwan; Doctoral dissertation (electronic journal edition); 2023-01-15; full text *

Also Published As

Publication number Publication date
CN117494862A (en) 2024-02-02

Similar Documents

Publication Publication Date Title
CN106934237A (en) Radar cross-section redaction measures of effectiveness creditability measurement implementation method
CN112966949B (en) Tunnel construction risk assessment method and device and storage medium
CN111310981A (en) Reservoir water level trend prediction method based on time series
CN104462808A (en) Method for fitting safe horizontal displacement and dynamic data of variable sliding window of water level
CN114036452B (en) Yield evaluation method applied to discrete production line
CN105005822A (en) Optimal step length and dynamic model selection based ultrahigh arch dam response prediction method
CN109523077B (en) Wind power prediction method
CN113269384B (en) Method for early warning health state of river system
CN113484813A (en) Intelligent ammeter fault rate estimation method and system under multi-environment stress
CN117131977B (en) Runoff forecasting sample set partitioning method based on misjudgment risk minimum criterion
CN109657287B (en) Hydrological model precision identification method based on comprehensive scoring method
CN117592609B (en) On-line monitoring method, monitoring terminal and storage medium for canal water utilization coefficient
CN117494862B (en) Data-driven runoff forecasting model optimization method under limited sample based on hypothesis test
CN117851908A (en) Improved on-line low-voltage transformer area electric energy meter misalignment monitoring method and device
CN117408171A (en) Hydrologic set forecasting method of Copula multi-model condition processor
CN116681205A (en) Method for evaluating and predicting development degree of rammed earth site gully disease
CN116595330A (en) Runoff change attribution method considering uncertainty of hydrologic modeling
Atukeren The relationship between the F-test and the Schwarz criterion: implications for Granger-causality tests
CN108665090B (en) Urban power grid saturation load prediction method based on principal component analysis and Verhulst model
CN113627621B (en) Active learning method for optical network transmission quality regression estimation
CN113987416B (en) Oil gas resource amount calculation method and system based on confidence level
CN109117495B (en) Robust data coordination method, device and storage medium in alumina production evaporation process
CN116720662B (en) Distributed energy system applicability evaluation method based on set pair analysis
Chen et al. Uncertain random accelerated degradation modelling and statistical analysis with aleatory and epistemic uncertainties from multiple dimensions
CN111882100B (en) Hydrologic set interval forecast building method based on multi-model random linear combination

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant