CN116627953B

CN116627953B - Method for repairing loss of groundwater level monitoring data

Info

Publication number: CN116627953B
Application number: CN202310591040.7A
Authority: CN
Inventors: 孙永华; 张王宽; 成星路; 曹许悦; 王衍昭
Original assignee: Capital Normal University
Current assignee: Capital Normal University
Priority date: 2023-05-24
Filing date: 2023-05-24
Publication date: 2023-10-27
Anticipated expiration: 2043-05-24
Also published as: CN116627953A

Abstract

The invention discloses a method for repairing loss of groundwater level monitoring data, which comprises the following steps: step 1, acquiring and arranging groundwater level monitoring data to form a groundwater space-time data set; step 2, interpolating the time series data containing missing values by using a BTF model; step 3, detecting abnormal values in the interpolated time series data by using an isolated forest model; and step 4, interpolating the time series data again until the data is repaired completely. The method effectively overcomes the defect that some abnormal values are generated when only the BTF interpolation method is used, and improves the reliability and accuracy of the interpolation results; the method is suitable for groundwater monitoring data with gaps in long time series, solves the problems in existing methods that the interpolation quality for continuously missing time series data is poor and that the interpolated values fail to follow the real temporal trend, and has good accuracy and high applicability.

Description

Method for repairing loss of groundwater level monitoring data

Technical Field

The invention belongs to the technical field of ground water level monitoring, and particularly relates to a method for repairing ground water level monitoring data loss.

Background

Because groundwater level data is complex, groundwater monitoring coverage in most areas is insufficient or the observation period is short, so long time series of groundwater level data are currently lacking in most areas. At present, groundwater level time series are mostly repaired with existing time series interpolation methods, including: (1) statistics-based filling, which fills gaps with the median, mean, mode, or a similar statistic of the time series; (2) linear interpolation, which fits the known groundwater level data to a function and then interpolates the missing values; and (3) the front-back weighted average method, which takes a weighted average according to the distance to the preceding and following time points. However, groundwater level data varies strongly over time, and factors such as seasonal increases in precipitation and artificial irrigation affect groundwater changes. These methods ignore the temporal information in the time series, and groundwater level data often contains long runs of consecutive missing values. Repair methods based on statistical principles therefore perform poorly and cannot be applied directly to repairing groundwater level time series data.

The most recent space-time data interpolation method, the Bayesian temporal factorization (BTF) time series interpolation method, is based on the Bayesian temporal factorization framework; compared with other methods, it achieves better results when interpolating long groundwater time series. However, because the graphical model assumes Gaussian distributions, the influence of abnormal values is unavoidable during interpolation. How to eliminate the influence of abnormal values, and thereby improve the accuracy and applicability of repairing missing groundwater level data, is therefore a technical problem that urgently needs to be solved.

Disclosure of Invention

The invention aims to provide a method for repairing the loss of groundwater level monitoring data, which aims to solve the technical problems.

In order to achieve the above purpose, the present invention provides the following technical solutions:

the invention discloses a method for repairing loss of groundwater level monitoring data, which comprises the following steps:

step 1, acquiring and arranging groundwater level monitoring data to form a groundwater space-time data set: acquiring the water level monitoring data recorded at groundwater level monitoring points over past years, sorting it into time series data with time labels and a uniform recording interval, and marking missing values to form the groundwater space-time data set;

step 2, interpolation is carried out on the time series data containing the missing values by utilizing a BTF model: taking a groundwater space-time data set as basic data, selecting time sequence data containing missing values in the data set as a target sequence, selecting complete time sequence data in the data set as a training set, training a BTF model by using the training set, adjusting model parameters by combining a Gibbs sampling algorithm, and then interpolating the target sequence by using the model after training until interpolation of all the time sequence data containing missing values is completed;

step 3, abnormal value detection is carried out on the interpolated time series data by utilizing an isolated forest model: selecting complete time sequence data in the underground water space-time data set, dividing the selected time sequence data into training data and test data, training an isolated forest model by using the training data, verifying the model by using the test data, adjusting model parameters, detecting abnormal values of the interpolated time sequence data by using the trained isolated forest model, marking the abnormal values in the interpolation result as missing values again, and removing the missing values;

step 4, interpolating the time series data again until the data is repaired completely: interpolating the time series data subjected to abnormal value detection again by using a KNN algorithm model, and repairing the missing values in the time series data until the missing values in all the time series data are filled, finally obtaining complete groundwater level monitoring data.

Further, the specific process of training the BTF model by using the training set in step 2 is as follows: randomly deleting 10%, 20%, 30% and 40% of the time series data in the training set to simulate the missing patterns of real groundwater level time series data, then interpolating the time series by using the BTF model, and evaluating the interpolation results with two indexes, the mean absolute percentage error (MAPE) and the root mean square error (RMSE), to compare the interpolation effect and efficiency of BTF models with different parameters and determine the optimal model parameters;

the mean absolute percentage error MAPE is calculated as:

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{a_i - b_i}{a_i}\right| \times 100\%$$

the root mean square error RMSE is calculated as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(a_i - b_i\right)^2}$$

wherein $a_i$ is the i-th value of the original groundwater level time series data; $b_i$ is the corresponding interpolation result; and n is the number of values compared.
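The two evaluation indexes can be sketched in a few lines of Python. This is a minimal illustration, assuming plain lists of paired values and non-zero original values; the function names `mape` and `rmse` are illustrative and do not come from the patent.

```python
import math


def mape(original, interpolated):
    """Mean absolute percentage error, in percent, over paired values."""
    if not original or len(original) != len(interpolated):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(a - b) / abs(a) for a, b in zip(original, interpolated))
    return 100.0 * total / len(original)


def rmse(original, interpolated):
    """Root mean square error over paired values."""
    if not original or len(original) != len(interpolated):
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum((a - b) ** 2 for a, b in zip(original, interpolated))
    return math.sqrt(total / len(original))
```

Both indexes are computed only on the positions that were deliberately deleted, since those are the positions where the true value is known.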

Further, in the step 3, the dividing the selected time series data into training data and test data specifically includes: 70% of the selected time series data are classified as training data and 30% are classified as test data.

Further, in the step 3, the specific process of performing outlier detection on the interpolated time-series data by using the trained isolated forest model is as follows:

step 31, randomly selecting r sample data points from the training set as a sub-sampling set $Q = \{q_1, q_2, \ldots, q_r\}$, each data point having dimension z, and placing the sub-sampling set at the root node of the tree;

step 32, randomly selecting a dimension B and a splitting point p from the current sub-sampling set, wherein p is between the maximum value and the minimum value of the dimension B in the current sub-sampling set;

step 33, for each sample data point $q_i$ in the sub-sampling set, $1 \le i \le r$, dividing according to its value $q_i(B)$ in dimension B: if $q_i(B) < p$, the point is placed in the left subtree, and otherwise in the right subtree;

step 34, repeatedly executing the steps 32-33, and continuously constructing new left and right subtrees until one of the following conditions is met:

1) Only one data point or a plurality of identical data points are left in Q, and cannot be further divided;

2) The height of the isolation tree reaches a limited height;

step 35, repeatedly executing the steps 31-34 until the number of the isolation trees reaches the designated number N, and forming an isolated forest by the isolation trees;

step 36, for any interpolated groundwater level data l, calculating the path length h(l) of the data l in each isolation tree by traversing every isolation tree in the isolated forest, then calculating the expectation E(h(l)) of the path length of the data l over the isolated forest, and normalizing it with the average path length C(r) of an isolation tree:

$$C(r) = 2H(r-1) - \frac{2(r-1)}{r}$$

wherein H(r) is the harmonic number, estimated as $H(r) = \ln(r) + \delta$, wherein δ is the Euler constant;

the anomaly score s of the interpolated groundwater level data l is defined as:

$$s(l, r) = 2^{-\frac{E(h(l))}{C(r)}}$$

anomaly identification is performed on the interpolated groundwater level data l according to the following criteria:

1) When $E(h(l)) \to 0$, i.e., $s \to 1$, the groundwater level data l is identified as abnormal data;

2) When $E(h(l)) \to r-1$, i.e., $s \to 0$, the groundwater level data l is identified as normal data.
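Steps 31 to 36 can be sketched as follows. This is a simplified illustration, not the patented implementation. It assumes one-dimensional water level values, so the random dimension B is always the single available dimension. The height limit `ceil(log2(r))`, the defaults `n_trees=100` and `sample_size=64`, and all function names are assumptions made for this sketch.

```python
import math
import random

EULER_GAMMA = 0.5772156649  # Euler's constant, the delta in H(r) = ln(r) + delta


def average_path_length(r):
    """C(r) = 2H(r-1) - 2(r-1)/r, used to normalise path lengths."""
    if r <= 1:
        return 0.0
    if r == 2:
        return 1.0  # H(1) = 1 exactly; the log estimate is poor this low
    return 2.0 * (math.log(r - 1) + EULER_GAMMA) - 2.0 * (r - 1) / r


def build_tree(points, height, height_limit, rng):
    """Steps 32-34: split at a random point p until a stop condition holds."""
    if height >= height_limit or len(points) <= 1 or min(points) == max(points):
        return {"size": len(points)}
    p = rng.uniform(min(points), max(points))
    return {
        "split": p,
        "left": build_tree([q for q in points if q < p], height + 1, height_limit, rng),
        "right": build_tree([q for q in points if q >= p], height + 1, height_limit, rng),
    }


def path_length(tree, value, depth=0):
    """h(l): edges traversed, plus C(size) for a leaf that was not fully split."""
    if "split" not in tree:
        return depth + average_path_length(tree["size"])
    branch = tree["left"] if value < tree["split"] else tree["right"]
    return path_length(branch, value, depth + 1)


def anomaly_scores(train, queries, n_trees=100, sample_size=64, seed=0):
    """Steps 31, 35, 36: build N trees and return s = 2 ** (-E(h) / C(r))."""
    r = min(sample_size, len(train))
    if r < 2:
        raise ValueError("need at least two training points")
    rng = random.Random(seed)
    height_limit = math.ceil(math.log2(r))
    forest = [build_tree(rng.sample(train, r), 0, height_limit, rng)
              for _ in range(n_trees)]
    scores = []
    for value in queries:
        expected = sum(path_length(tree, value) for tree in forest) / n_trees
        scores.append(2.0 ** (-expected / average_path_length(r)))
    return scores
```

A value far outside the training range is isolated after few splits, so its expected path length is short and its score is closer to 1 than that of a value inside the range.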

Further, in step 4, the specific process of interpolating the time series data subjected to abnormal value detection again by using the KNN algorithm model and repairing the missing values in the time series data is as follows: for each missing value in the time series data, calculating the Euclidean distance between the missing value and the surrounding known groundwater level recorded values, then sorting the distances in ascending order, selecting the first k groundwater level recorded values, and calculating their mean as the complement value of the missing value;

the Euclidean distance is calculated as:

$$d = \sqrt{(x_i - y_i)^2 + (x_j - y_j)^2}$$

wherein $x_i$, $x_j$ are the coordinates of the missing value; $y_i$, $y_j$ are the coordinates of a known recorded value;

the complement value is calculated as:

$$\mathrm{mean} = \frac{1}{k}\sum_{i=1}^{k} w_i$$

wherein mean is the complement value corresponding to the missing value; $w_i$ is the i-th groundwater level recorded value.
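The KNN re-interpolation can be sketched as below. This is a minimal illustration under a simplifying assumption: each record is located only by its time index within a single series, so the Euclidean distance reduces to the absolute difference of indices. The function name `knn_fill`, the use of `None` as the missing marker, and the default `k=3` are illustrative choices.

```python
def knn_fill(series, k=3):
    """Fill None entries with the mean of the k nearest known values.

    Distance is measured along the time index, which is an assumption of
    this sketch rather than something the source specifies.
    """
    known = [(t, v) for t, v in enumerate(series) if v is not None]
    if len(known) < k:
        raise ValueError("fewer than k known values available")
    filled = list(series)
    for t, value in enumerate(series):
        if value is None:
            # Sort known records by ascending distance and keep the first k.
            nearest = sorted(known, key=lambda item: abs(item[0] - t))[:k]
            filled[t] = sum(w for _, w in nearest) / k
    return filled
```

Only originally known values are used as neighbours, so one filled gap never feeds into the estimate for another.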

The beneficial effects of the invention are as follows: the invention provides a method for repairing the loss of groundwater level monitoring data, in which an isolated forest model identifies abnormal values in the time series data interpolated by the BTF model, and a KNN algorithm model then interpolates the data again so that the groundwater level monitoring data is repaired completely. This overcomes the defect that some abnormal values are generated when only the BTF interpolation method is used, and improves the reliability and accuracy of the interpolation results. The method is suitable for groundwater monitoring data with gaps in long time series, solves the problems in existing methods that the interpolation quality for continuously missing time series data is poor and that the interpolated values fail to follow the real temporal trend, and has good accuracy and high applicability.

The invention will be described in further detail with reference to the drawings and the detailed description.

Drawings

FIG. 1 is a flow chart of a method according to the present invention;

FIG. 2 is a graph showing the comparison of the interpolation effect of data using the BTF method and the IF-BTFK method.

Detailed Description

The invention discloses a method for repairing loss of groundwater level monitoring data, as shown in figure 1, the method comprises the following steps:

Step 1, acquiring and arranging groundwater level monitoring data to form a groundwater space-time data set: acquiring the water level monitoring data recorded at groundwater level monitoring points over past years, sorting it into time series data with time labels and a uniform recording interval, and marking missing values to form the groundwater space-time data set.
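Step 1 can be illustrated with a short sketch that places raw records on a regular time grid and marks the gaps. It assumes records keyed by calendar date and a fixed step in days; the function name `to_regular_series`, the `None` marker and the 5-day step in the usage note are hypothetical.

```python
from datetime import date, timedelta


def to_regular_series(records, start, end, step_days):
    """Return (date, level) pairs on a regular grid; level is None where missing.

    records maps a date to the water level recorded on that date.
    """
    if step_days <= 0:
        raise ValueError("step_days must be positive")
    series = []
    current = start
    while current <= end:
        series.append((current, records.get(current)))
        current += timedelta(days=step_days)
    return series
```

For example, with records on 1 and 11 January 2006 and a 5-day step up to 21 January, the grid has five time labels, of which three are marked as missing.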

Step 2, interpolation is carried out on the time series data containing the missing values by utilizing a BTF model: taking a groundwater space-time data set as basic data, selecting time sequence data containing missing values in the data set as a target sequence, selecting complete time sequence data in the data set as a training set, training a BTF model by using the training set, adjusting model parameters by combining a Gibbs sampling algorithm, and then interpolating the target sequence by using the model after training until interpolation of all the time sequence data containing missing values is completed.

The specific process of training the BTF model by using the training set is as follows: randomly deleting 10%, 20%, 30% and 40% of the time series data in the training set to simulate the missing patterns of real groundwater level time series data, then interpolating the time series by using the BTF model, and evaluating the interpolation results with two indexes, the mean absolute percentage error (MAPE) and the root mean square error (RMSE), to compare the interpolation effect and efficiency of BTF models with different parameters and determine the optimal model parameters;

the mean absolute percentage error MAPE is calculated as:

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{a_i - b_i}{a_i}\right| \times 100\%$$

the root mean square error RMSE is calculated as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(a_i - b_i\right)^2}$$

wherein $a_i$ is the i-th value of the original groundwater level time series data, $b_i$ is the corresponding interpolation result, and n is the number of values compared.

The BTF model, i.e. the Bayesian temporal factorization framework model, is a graphical model that characterizes global and local consistency in large-scale time series data by integrating low-rank matrix/tensor factorization and a vector autoregressive (VAR) process into a single probabilistic graphical model; it can efficiently perform probabilistic prediction and produce uncertainty estimates.

In particular, the groundwater space-time data may be defined as a three-dimensional tensor $D \in \mathbb{R}^{m \times n \times t}$, wherein m and n respectively represent the number of monitoring stations and the time length (monitoring time), and t represents the number (frequency) of groundwater level records in each year. The groundwater record data in tensor D is indexed using $(i, j, t) \in \Omega$. The portion of the data that does not contain missing values is selected, and three low-rank factor matrices U, V and X are randomly initialized according to the CANDECOMP/PARAFAC (CP) decomposition, where X is the frequency factor matrix and U and V are the space-time factor matrices. Each missing entry $d_{(i,j,t)}$ in the tensor is assumed to obey an independent Gaussian distribution, and each observation in D is assumed to obey a Gaussian distribution with precision τ, wherein R is the rank of the factor matrices:

$$d_{(i,j,t)} \sim \mathcal{N}\left(\sum_{r=1}^{R} u_{ir}\, v_{jr}\, x_{tr},\ \tau^{-1}\right)$$
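Under a CP decomposition, each tensor entry is approximated by summing, over the rank, the product of one element from each factor matrix. The sketch below shows only this reconstruction step on nested lists. It is not the Bayesian sampling procedure, and the function names `cp_entry` and `cp_fill` are illustrative.

```python
def cp_entry(U, V, X, i, j, t):
    """CP estimate of entry (i, j, t): sum over r of U[i][r] * V[j][r] * X[t][r]."""
    rank = len(U[i])
    return sum(U[i][r] * V[j][r] * X[t][r] for r in range(rank))


def cp_fill(D, U, V, X):
    """Copy tensor D, replacing None entries with their CP estimates."""
    return [
        [
            [
                D[i][j][t] if D[i][j][t] is not None else cp_entry(U, V, X, i, j, t)
                for t in range(len(X))
            ]
            for j in range(len(V))
        ]
        for i in range(len(U))
    ]
```

Observed entries are kept as they are; only the positions marked as missing receive the low-rank estimate.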

Under the Gaussian assumption, τ represents the noise level in the groundwater observation data, and the precision τ is assumed to follow a Gamma distribution to improve the robustness of the method:

$$\tau \sim \mathrm{Gamma}(\alpha, \beta)$$

To correctly estimate the missing groundwater record data, the row vectors $u_i$ and $v_j$ of the factor matrices U and V are assumed to follow multivariate Gaussian distributions:

$$u_i \sim \mathcal{N}\left(\mu_u, \Lambda_u^{-1}\right), \qquad v_j \sim \mathcal{N}\left(\mu_v, \Lambda_v^{-1}\right)$$

Under the Bayesian setting, the hyperparameters of the model are assumed to obey a Gaussian-Wishart distribution, which enhances the robustness of the method. The prior distributions of μ and Λ are defined as follows:

$$(\mu_u, \Lambda_u) \sim \text{Gaussian-Wishart}(\mu_0, \beta_0, W_0, \nu_0)$$

$$(\mu_v, \Lambda_v) \sim \text{Gaussian-Wishart}(\mu_0, \beta_0, W_0, \nu_0)$$

wherein $\nu_0$ represents the degrees of freedom, $W_0$ represents the R × R scale matrix, and $\mu_0 \in \mathbb{R}^{R}$ is the mean vector, which may be defined as the zero vector.

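The frequency factor matrix X is handled differently from the space-time factor matrices. Because the frequency factor matrix has time series characteristics, VAR can be used to predict missing data in the groundwater time series. The autoregressive method assumes a linear dependency between the variables of the same groundwater time series. For the t-th observation $x_t$ among the annual observations in the groundwater time series, the linear relation is expressed as follows:

$$x_t = \sum_{k=1}^{d} A_k\, x_{t-h_k} + \epsilon_t$$

wherein $A_k$ is an R × R coefficient matrix and $\epsilon_t$ is a Gaussian noise vector. The matrix A and the vector $v_t$ can be expressed as:

$$A = [A_1, \ldots, A_d]^{T} \in \mathbb{R}^{(Rd) \times R}, \qquad v_t = \begin{bmatrix} x_{t-h_1} \\ \vdots \\ x_{t-h_d} \end{bmatrix} \in \mathbb{R}^{Rd}$$

In summary, VAR can be expressed as $x_t = A^{T} v_t + \epsilon_t$. Furthermore, the lag set is defined as $\mathcal{L} = \{h_1, \ldots, h_d\}$. The time factor matrix X is:

$$x_t \sim \begin{cases} \mathcal{N}(0, I_R), & t \in \{1, \ldots, h_d\} \\ \mathcal{N}(A^{T} v_t, \Sigma), & \text{otherwise} \end{cases}$$

The conjugate matrix normal inverse Wishart (MNIW) prior is then placed on the coefficient matrix A and the covariance matrix Σ:

$$A \sim \mathcal{MN}_{(Rd) \times R}(M_0, \Psi_0, \Sigma), \qquad \Sigma \sim \mathcal{IW}(S_0, \nu_0)$$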
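The VAR relation on the factor vectors can be illustrated with a one-step mean prediction. The sketch assumes the coefficient matrices and the lag set are already known and leaves out the Gaussian noise term; the function name `var_predict` and the list-based representation are illustrative.

```python
def var_predict(history, coeffs, lags):
    """Mean one-step VAR prediction: sum over k of A_k applied to x_{t - h_k}.

    history holds the past factor vectors in time order, coeffs holds one
    R x R matrix per lag, and lags holds the positive lag values h_k.
    """
    if len(coeffs) != len(lags):
        raise ValueError("one coefficient matrix is needed per lag")
    if max(lags) > len(history):
        raise ValueError("history is shorter than the largest lag")
    rank = len(history[-1])
    prediction = [0.0] * rank
    for A, h in zip(coeffs, lags):
        past = history[-h]  # the vector h steps before the predicted one
        for row in range(rank):
            prediction[row] += sum(A[row][col] * past[col] for col in range(rank))
    return prediction
```

Seasonal behaviour can be captured by including a lag equal to one full cycle of records in the lag set.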

Finally, the method is iterated with the Gibbs sampling algorithm. The factor matrices are sampled with the Gibbs sampling algorithm to obtain the dependency between the groundwater observations $d_{(i,j,t)}$ and the VAR hyperparameters. After the Gibbs sampling algorithm reaches a steady state, all missing groundwater observations can be approximated by Markov chain Monte Carlo (MCMC): after sampling g times, the average of the samples is taken as the interpolation result. The number of samples is not limited, but in practice the accuracy of the interpolation result does not keep improving as the number of samples increases and may even decrease. The number of samples should therefore be set to the value beyond which the accuracy, having risen to a certain level, no longer rises with further sampling.
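The averaging step can be sketched as follows, assuming each Gibbs iteration yields one list of estimates for the missing entries and that a number of early iterations is discarded before the chain is treated as steady. The function name `posterior_mean` and the `burn_in` parameter are hypothetical.

```python
def posterior_mean(samples, burn_in):
    """Average the samples kept after discarding the first burn_in iterations."""
    kept = samples[burn_in:]
    if not kept:
        raise ValueError("no samples left after burn-in")
    width = len(kept[0])
    return [sum(sample[i] for sample in kept) / len(kept) for i in range(width)]
```

The number of kept samples plays the role of g in the text.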

Step 3, abnormal value detection is carried out on the interpolated time series data by utilizing an isolated forest model: selecting complete time sequence data in the underground water space-time data set, dividing the selected time sequence data into training data and test data, training an isolated forest model by using the training data, mining the relation between an abnormal value and a normal value in the time sequence data, verifying the model by the test data, and adjusting model parameters until the isolated forest model can effectively identify the abnormal value in the time sequence data. And then, performing outlier detection on the interpolated time series data by using the trained isolated forest model, marking the outlier in the interpolation result as a missing value again, and removing the missing value.
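The data split and the re-marking of detected abnormal values can be sketched as below. The score threshold is an assumption of this sketch: the source tunes the number of trees and the abnormal data proportion rather than naming a fixed cut-off. The function names and the default 70% training fraction are illustrative.

```python
def split_train_test(series, train_fraction=0.7):
    """Split a complete series into leading training and trailing test parts."""
    cut = round(len(series) * train_fraction)
    return series[:cut], series[cut:]


def mark_outliers_missing(values, scores, threshold):
    """Replace values whose anomaly score exceeds threshold with None."""
    if len(values) != len(scores):
        raise ValueError("values and scores must have equal length")
    return [None if s > threshold else v for v, s in zip(values, scores)]
```

The entries set to `None` are then treated exactly like the originally missing values in the re-interpolation of step 4.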

The isolated forest (Isolation Forest) model is adopted because the abnormal values in the groundwater time series data after BTF interpolation generally have two characteristics: they are few and they are different. That is, the interpolated abnormal values are sparsely distributed and lie far from the dense cluster of normal values, so they are also called easily isolated points. For a set of continuous groundwater time series data, the core of the isolated forest model is to randomly sample the data and construct a number of isolation trees (iTrees), which together make up the isolated forest.

Specifically, abnormal value detection on the interpolated time series data with the trained isolated forest model follows steps 31 to 36 described in the Disclosure of Invention: a sub-sampling set of r points is drawn at random; a dimension B and a splitting point p are chosen at random; each sample data point $q_i$, $1 \le i \le r$, is placed in the left subtree if $q_i(B) < p$ and in the right subtree otherwise; splitting is repeated until the remaining points cannot be divided further or the height of the isolation tree reaches the limited height; N isolation trees are built to form the isolated forest; and each interpolated groundwater level value is scored by its expected path length over the forest.

The specific process of interpolating the time series data again after abnormal value detection and repairing the missing values in the time series data is as follows: for each missing value in the time series data, calculating the Euclidean distance between the missing value and the surrounding known groundwater level recorded values, then sorting the distances in ascending order, selecting the first k groundwater level recorded values, and calculating their mean as the complement value of the missing value;

the Euclidean distance is calculated as:

$$d = \sqrt{(x_i - y_i)^2 + (x_j - y_j)^2}$$

wherein $x_i$, $x_j$ are the coordinates of the missing value and $y_i$, $y_j$ are the coordinates of a known recorded value;

the complement value is calculated as:

$$\mathrm{mean} = \frac{1}{k}\sum_{i=1}^{k} w_i$$

wherein mean is the complement value corresponding to the missing value and $w_i$ is the i-th groundwater level recorded value.

Example 1

The present embodiment is a specific application example of the above method.

Linyi City is located in southeastern Shandong, at 34°17′ to 36°23′ north latitude and 117°25′ to 119°11′ east longitude. Controlled by geological structures, a series of fault-block uplifts and fault-block depressions have formed within the city, and the medium and low mountain terrain formed by fault-block uplifts of ancient crystalline rock, such as Mengshan, is the natural watershed for surface water and groundwater in the city. According to the occurrence conditions of groundwater in Linyi, the water-bearing properties of the rock and the hydraulic characteristics of the groundwater, the groundwater of the whole city is divided into four types of water-bearing rock groups: the loose rock pore water-bearing rock group, the pore-fissure water-bearing rock group, the carbonate rock fissure-karst water-bearing rock group, and the bedrock fissure water-bearing rock group.

The groundwater level monitoring data obtained in this embodiment is the water level record data of 8 monitoring points over a 10-year span (2006-2015), with 6 records per month. The data of monitoring points $W_1$, $W_2$ and $W_6$ is complete, the data of the remaining monitoring points is incomplete, and the missing rate of most of the time series data is about 40%.

The original groundwater level monitoring data of the 8 monitoring points is converted into a time series data set T ($T = \{t_1, \ldots, t_8\}$), and the groundwater data missing on a given date is marked. Furthermore, the complete time series $t_1$, $t_2$, $t_6$ in the data set T are selected to construct a time series subset $T_1$.

The time series data sets T and $T_1$ are each converted into a three-dimensional tensor.
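The conversion into a three-dimensional tensor can be sketched as a reshape of each station's series into one row per year. The layout of stations by years by records per year follows the tensor definition in the description. The function name `to_tensor` is illustrative, and the value 72 mentioned in the usage note is derived from 6 records per month over 12 months.

```python
def to_tensor(series_by_station, records_per_year):
    """Reshape flat per-station series into a stations x years x records tensor."""
    tensor = []
    for series in series_by_station:
        if len(series) % records_per_year != 0:
            raise ValueError("series length is not a whole number of years")
        years = [
            series[k:k + records_per_year]
            for k in range(0, len(series), records_per_year)
        ]
        tensor.append(years)
    return tensor
```

With 8 monitoring points, 10 years and 72 records per year, this layout gives a tensor of size 8 × 10 × 72.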

Entries of the tensor built from the complete subset $T_1$ are randomly deleted at rates of 10%, 20%, 30% and 40%. The tensor with randomly missing data is input into the BTF model, the number of Gibbs sampling iterations is set to between 100 and 500, and the optimal number of sampling iterations for fitting the model hyperparameters is found through the accuracy evaluation indexes RMSE and MAPE. Smaller RMSE and MAPE values represent better interpolation results that are closer to the true values. The experimental results are shown in Table 1.
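The random deletion used to build the evaluation data can be sketched as follows. It assumes a flat list of complete values and a fixed seed for repeatability; the function name `mask_randomly` and the returned list of hidden positions are illustrative.

```python
import random


def mask_randomly(values, rate, seed=0):
    """Hide a fraction of values; return the masked copy and the hidden indices."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0 and 1")
    rng = random.Random(seed)
    n_missing = round(len(values) * rate)
    hidden = set(rng.sample(range(len(values)), n_missing))
    masked = [None if i in hidden else v for i, v in enumerate(values)]
    return masked, sorted(hidden)
```

The hidden indices are kept so that the interpolated values at those positions can later be compared with the true values through RMSE and MAPE.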

TABLE 1

The tensor built from the full data set T is input into the trained BTF model for interpolation, and the interpolated tensor is then converted back into a time series $T_R$ ($T_R = \{t_{R1}, \ldots, t_{R8}\}$).

The time series subset $T_1$ is divided into training data and test data, and the isolated forest model is trained and verified. The parameters of the initialization algorithm include the number of trees and the proportion of abnormal data. The model is trained with the training data: the normalized data is imported into the isolated forest model, the change trend of each subsequence ($t_1$, $t_2$, $t_6$) and the relation between abnormal values and normal values are mined, the parameters of the model are adjusted continuously, and the number of trees and the abnormal data proportion of the isolated forest model are finally determined.

The subsequences ($t_{R1}$ to $t_{R8}$) of the time series $T_R$ are input into the trained isolated forest model one by one, and the abnormal values in each subsequence are marked as missing values and removed.

The marked time series $T_R$ is input into the KNN algorithm model, the Euclidean distance between each missing value and the surrounding known groundwater level data is calculated, and the mean of the k nearest groundwater level data is taken as the complement value of the missing value, until all marked groundwater level data has been re-interpolated.

This example compares the results of the groundwater time series interpolation method using BTF only with those of the method of the present invention, i.e. the groundwater time series interpolation method based on BTF and KNN with isolated forest abnormal value detection (IF-BTFK), as shown in Table 2 and FIG. 2.

TABLE 2

From the above, it can be seen that the groundwater level data has two distinct features: 1. the fluctuation of the change is small; 2. the data missing part is more, and the phenomenon of continuous data missing exists. The existing space-time interpolation method is directly used for long-time groundwater data interpolation, and although the change trend of the data can be effectively simulated, partial abnormal values are unavoidable. The repair method for the loss of the underground water level monitoring data can well overcome the problem of abnormal value generation in the interpolation process, so that the interpolated time series data is closer to the real time series data.

Finally, it should be noted that the above description is only for the purpose of illustrating the technical solution of the present invention and not for the purpose of limiting the same, and that although the present invention has been described in detail with reference to the preferred arrangement, it will be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the technical solution of the present invention.

Claims

1. A method for repairing a loss of groundwater level monitoring data, the method comprising the steps of:

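step 1, acquiring and arranging groundwater level monitoring data to form a groundwater space-time data set: acquiring the water level monitoring data recorded at groundwater level monitoring points over past years, sorting it into time series data with time labels and a uniform recording interval, and marking missing values to form the groundwater space-time data set;

step 2, interpolating the time series data containing missing values by using a BTF model: taking the groundwater space-time data set as basic data, selecting time series data containing missing values in the data set as a target sequence, selecting complete time series data in the data set as a training set, training the BTF model by using the training set, adjusting model parameters in combination with a Gibbs sampling algorithm, and then interpolating the target sequence by using the trained model until interpolation of all the time series data containing missing values is completed;

step 3, performing abnormal value detection on the interpolated time series data by using an isolated forest model: selecting complete time series data in the groundwater space-time data set, dividing the selected time series data into training data and test data, training the isolated forest model by using the training data, verifying the model by using the test data, adjusting model parameters, detecting abnormal values in the interpolated time series data by using the trained isolated forest model, marking the abnormal values in the interpolation result as missing values again, and removing them;

step 4, interpolating the time series data again until the data is repaired completely: interpolating the time series data subjected to abnormal value detection again by using a KNN algorithm model, and repairing the missing values in the time series data until the missing values in all the time series data are filled, finally obtaining complete groundwater level monitoring data;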

the specific process of training the BTF model by using the training set in step 2 is as follows: randomly deleting 10%, 20%, 30% and 40% of the time series data in the training set to simulate the missing patterns of real groundwater level time series data, then interpolating the time series by using the BTF model, and evaluating the interpolation results with two indexes, the mean absolute percentage error (MAPE) and the root mean square error (RMSE), to compare the interpolation effect and efficiency of BTF models with different parameters and determine the optimal model parameters;

the mean absolute percentage error MAPE is calculated as:

$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{a_i - b_i}{a_i}\right| \times 100\%$$

the root mean square error RMSE is calculated as:

$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(a_i - b_i\right)^2}$$

wherein $a_i$ is the i-th value of the original groundwater level time series data; $b_i$ is the corresponding interpolation result; and n is the number of values compared;

the BTF model, i.e. the Bayesian temporal factorization framework model, is a graphical model that characterizes global and local consistency in large-scale time series data by integrating low-rank matrix/tensor factorization and a vector autoregressive (VAR) process into a single probabilistic graphical model, the graphical model being capable of performing probabilistic prediction and producing uncertainty estimates;

the groundwater space-time data is defined as a three-dimensional tensor $D \in \mathbb{R}^{m \times n \times t}$, wherein m and n respectively represent the number of monitoring stations and the time length, and t represents the number of groundwater level records in each year; the groundwater record data in tensor D is indexed using $(i, j, t) \in \Omega$; the portion of the data that does not contain missing values is selected, and three low-rank factor matrices U, V and X are randomly initialized according to the CANDECOMP/PARAFAC decomposition, wherein X is the frequency factor matrix, and U and V are the space-time factor matrices; each missing entry $d_{(i,j,t)}$ in the tensor is assumed to obey a Gaussian distribution, and each observation in D is assumed to obey a Gaussian distribution with precision τ, wherein R is the rank of the factor matrices:

$$d_{(i,j,t)} \sim \mathcal{N}\left(\sum_{r=1}^{R} u_{ir}\, v_{jr}\, x_{tr},\ \tau^{-1}\right)$$

under the Gaussian assumption, τ represents the noise level in the groundwater observation data, and the precision τ is assumed to obey a Gamma distribution:

$$\tau \sim \mathrm{Gamma}(\alpha, \beta)$$

the row vectors $u_i$ and $v_j$ of the factor matrices U and V are assumed to follow multivariate Gaussian distributions:

$$u_i \sim \mathcal{N}\left(\mu_u, \Lambda_u^{-1}\right), \qquad v_j \sim \mathcal{N}\left(\mu_v, \Lambda_v^{-1}\right)$$

under the Bayesian setting, the hyperparameters of the model are assumed to obey a Gaussian-Wishart distribution, and the prior distributions of μ and Λ are defined as follows:

$$(\mu_u, \Lambda_u) \sim \text{Gaussian-Wishart}(\mu_0, \beta_0, W_0, \nu_0)$$

$$(\mu_v, \Lambda_v) \sim \text{Gaussian-Wishart}(\mu_0, \beta_0, W_0, \nu_0)$$

wherein $\nu_0$ represents the degrees of freedom, $W_0$ represents the R × R scale matrix, and $\mu_0 \in \mathbb{R}^{R}$ represents the mean vector;

the autoregressive method assumes a linear dependency between the variables of the same groundwater time series; for the t-th observation $x_t$ among the annual observations in the groundwater time series, the linear relation is expressed as follows:

$$x_t = \sum_{k=1}^{d} A_k\, x_{t-h_k} + \epsilon_t$$

wherein $A_k$ is an R × R coefficient matrix and $\epsilon_t$ is a Gaussian noise vector; the matrix A and the vector $v_t$ are expressed as:

$$A = [A_1, \ldots, A_d]^{T} \in \mathbb{R}^{(Rd) \times R}, \qquad v_t = \begin{bmatrix} x_{t-h_1} \\ \vdots \\ x_{t-h_d} \end{bmatrix} \in \mathbb{R}^{Rd}$$

in summary, VAR is expressed as $x_t = A^{T} v_t + \epsilon_t$; the lag set is defined as $\mathcal{L} = \{h_1, \ldots, h_d\}$; the time factor matrix X is:

$$x_t \sim \begin{cases} \mathcal{N}(0, I_R), & t \in \{1, \ldots, h_d\} \\ \mathcal{N}(A^{T} v_t, \Sigma), & \text{otherwise} \end{cases}$$

finally, iterating the method by adopting a Gibbs sampling algorithm; sampling the factor matrix by using Gibbs sampling algorithm to obtain a groundwater observation value d _(i,j,t) Dependency relationship with VAR hyper-parameters; after the Gibbs sampling algorithm reaches a stable state, all the missing groundwater observation values are approximately solved by adopting Markov Chain Monte Carlo (MCMC), and an average result is obtained as an interpolation result after sampling g times; the sampling frequency is set as a frequency value which does not rise with the increase of the sampling frequency after the precision rises to a certain degree;

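in step 3, the specific process of performing abnormal value detection on the interpolated time series data by using the trained isolated forest model is as follows:

step 31, randomly selecting r sample data points from the training set as a sub-sampling set $Q = \{q_1, q_2, \ldots, q_r\}$, each data point having dimension z, and placing the sub-sampling set at the root node of the tree;

step 32, randomly selecting a dimension B and a splitting point p from the current sub-sampling set, wherein p is between the maximum value and the minimum value of the dimension B in the current sub-sampling set;

step 33, for each sample data point $q_i$ in the sub-sampling set, $1 \le i \le r$, dividing according to its value $q_i(B)$ in dimension B: if $q_i(B) < p$, the point is placed in the left subtree, and otherwise in the right subtree;

step 34, repeatedly executing steps 32-33 and continuously constructing new left and right subtrees until one of the following conditions is met:

1) Only one data point or a plurality of identical data points are left in Q, and cannot be further divided;

2) The height of the isolation tree reaches a limited height;

step 35, repeatedly executing steps 31-34 until the number of isolation trees reaches the designated number N, the isolation trees forming the isolated forest;

step 36, for any interpolated groundwater level data l, calculating the path length h(l) of the data l in each isolation tree by traversing every isolation tree in the isolated forest, then calculating the expectation E(h(l)) of the path length of the data l over the isolated forest, and normalizing it with the average path length C(r) of an isolation tree:

$$C(r) = 2H(r-1) - \frac{2(r-1)}{r}$$

wherein H(r) is the harmonic number, estimated as $H(r) = \ln(r) + \delta$, wherein δ is the Euler constant; r is the number of samples in the sub-sampling set Q;

the anomaly score s of the interpolated groundwater level data l is defined as:

$$s(l, r) = 2^{-\frac{E(h(l))}{C(r)}}$$

anomaly identification is performed on the interpolated groundwater level data l according to the following criteria:

1) When $E(h(l)) \to 0$, i.e., $s \to 1$, the groundwater level data l is identified as abnormal data;

2) When $E(h(l)) \to r-1$, i.e., $s \to 0$, the groundwater level data l is identified as normal data;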

in the step 4, interpolation is performed again on the time series data subjected to abnormal value detection by using the KNN algorithm model, and the specific process of repairing the missing value in the time series data is as follows: for each missing value in the time series data, calculating the distance between the missing value and other surrounding known groundwater level recorded values through Euclidean distance, then sorting and selecting the first k groundwater level recorded values according to ascending order of the distance, and calculating the mean value of the first k groundwater level recorded values as the complement value of the missing value;

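the Euclidean distance is calculated as:

$$d = \sqrt{(x_i - y_i)^2 + (x_j - y_j)^2}$$

wherein $x_i$, $x_j$ are the coordinates of the missing value; $y_i$, $y_j$ are the coordinates of a known recorded value;

the complement value is calculated as:

$$\mathrm{mean} = \frac{1}{k}\sum_{i=1}^{k} w_i$$

wherein mean is the complement value corresponding to the missing value; $w_i$ is the i-th groundwater level recorded value; k is the number of known groundwater level recorded values around the missing value.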

2. The method for repairing a loss of groundwater level monitoring data according to claim 1,

in the step 3, the step of dividing the selected time series data into training data and test data specifically includes: 70% of the selected time series data are classified as training data and 30% are classified as test data.