CN116627953B - Method for repairing loss of groundwater level monitoring data - Google Patents

Method for repairing loss of groundwater level monitoring data

Info

Publication number
CN116627953B
CN116627953B CN202310591040.7A CN202310591040A CN116627953B CN 116627953 B CN116627953 B CN 116627953B CN 202310591040 A CN202310591040 A CN 202310591040A CN 116627953 B CN116627953 B CN 116627953B
Authority
CN
China
Prior art keywords
data
model
time
missing
value
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310591040.7A
Other languages
Chinese (zh)
Other versions
CN116627953A (en)
Inventor
孙永华
张王宽
成星路
曹许悦
王衍昭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Capital Normal University
Original Assignee
Capital Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Capital Normal University
Priority to CN202310591040.7A
Publication of CN116627953A
Application granted
Publication of CN116627953B
Legal status: Active
Anticipated expiration

Links

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01F MEASURING VOLUME, VOLUME FLOW, MASS FLOW OR LIQUID LEVEL; METERING BY VOLUME
    • G01F23/00 Indicating or measuring liquid level or level of fluent solid material, e.g. indicating in terms of volume or indicating by means of an alarm
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2365 Ensuring data consistency and integrity
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Quality & Reliability (AREA)
  • Fluid Mechanics (AREA)
  • Computer Security & Cryptography (AREA)
  • Geophysics And Detection Of Objects (AREA)

Abstract

The invention discloses a method for repairing missing groundwater level monitoring data, comprising the following steps: step 1, acquiring and organizing groundwater level monitoring data to form a groundwater spatiotemporal data set; step 2, interpolating the time series data that contain missing values using a BTF model; step 3, detecting outliers in the interpolated time series data using an isolation forest model; and step 4, interpolating the time series data again until the data are fully repaired. The method overcomes the drawback that BTF interpolation used alone produces some outliers, and improves the reliability and accuracy of the interpolation results. It is suited to groundwater monitoring data with long runs of missing values, addresses the poor interpolation quality of existing methods on consecutively missing time series data and their failure to follow the true temporal trend, and offers good accuracy and broad applicability.

Description

Method for repairing loss of groundwater level monitoring data
Technical Field
The invention belongs to the technical field of groundwater level monitoring, and particularly relates to a method for repairing missing groundwater level monitoring data.
Background
Because groundwater level data are complex, groundwater monitoring coverage in most areas is insufficient or the observation period is short, so most areas currently lack long time series of groundwater level data. Gaps in groundwater level time series are at present mostly filled with existing time series interpolation methods, including: (1) statistical filling, which fills gaps with the median, mean, mode or a similar statistic of the time series; (2) linear interpolation, which fits a function to the known groundwater level data and then interpolates the missing values; and (3) forward-backward weighted averaging, which takes a weighted average of the preceding and following observations according to their distance in time. Groundwater level data vary strongly over time, and factors such as seasonal increases in precipitation and artificial irrigation affect groundwater change. The methods above ignore the temporal information in the series, and groundwater level data often contain long runs of consecutive missing values. Repair methods based on statistical principles therefore perform poorly and cannot be applied directly to groundwater level time series.
The most recent spatiotemporal data interpolation method, Bayesian temporal factorization (BTF) time series interpolation, is built on a Bayesian temporal factorization framework and interpolates long groundwater time series better than the other methods. However, because its graphical model assumes Gaussian distributions, outliers inevitably affect the interpolation. How to eliminate the influence of outliers, and thereby improve the accuracy and applicability of groundwater level data repair, is therefore a technical problem that urgently needs solving.
Disclosure of Invention
The invention aims to provide a method for repairing missing groundwater level monitoring data that solves the technical problems described above.
To achieve this purpose, the invention provides the following technical solution:
The invention discloses a method for repairing missing groundwater level monitoring data, which comprises the following steps:
step 1, acquiring and organizing groundwater level monitoring data to form a groundwater spatiotemporal data set: acquire the water level monitoring data recorded at the groundwater level monitoring points over past years, organize them into time-stamped time series with a uniform recording interval, and mark the missing values to form the groundwater spatiotemporal data set;
step 2, interpolating the time series data containing missing values using a BTF model: with the groundwater spatiotemporal data set as the base data, select the time series that contain missing values as target series and the complete time series as the training set; train the BTF model on the training set, tuning the model parameters with a Gibbs sampling algorithm; then interpolate the target series with the trained model until every time series containing missing values has been interpolated;
step 3, detecting outliers in the interpolated time series data using an isolation forest model: select the complete time series in the groundwater spatiotemporal data set and divide them into training data and test data; train the isolation forest model on the training data, validate it on the test data, and adjust the model parameters; then use the trained isolation forest model to detect outliers in the interpolated time series, re-mark the outliers in the interpolation results as missing values, and remove them;
step 4, interpolating the time series data again until the data are fully repaired: interpolate the outlier-screened time series again with a KNN algorithm model to fill their missing values, until the missing values in all time series have been filled, finally obtaining complete groundwater level monitoring data.
Further, the BTF model in step 2 is trained on the training set as follows: 10%, 20%, 30% and 40% of the time series data in the training set are deleted at random to simulate the missingness of real groundwater level time series; the series are then interpolated with the BTF model; and the interpolation accuracy and efficiency of BTF models with different parameters are compared using two evaluation metrics, the mean absolute percentage error (MAPE) and the root mean square error (RMSE), to determine the optimal model parameters;
the MAPE is calculated as:
$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{a_i - b_i}{a_i}\right| \times 100\%$$
the RMSE is calculated as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(a_i - b_i\right)^2}$$
where $a_i$ is the $i$-th value of the original groundwater level time series, $b_i$ is the corresponding interpolated value, and $n$ is the number of values compared.
Further, in the step 3, the dividing the selected time series data into training data and test data specifically includes: 70% of the selected time series data are classified as training data and 30% are classified as test data.
Further, in step 3, the trained isolation forest model detects outliers in the interpolated time series data as follows:
step 31, randomly select $r$ sample data points from the training set as a sub-sample set $Q = \{q_1, q_2, \ldots, q_r\}$, where each data point has dimension $z$, and take this set as the root node of a tree;
step 32, randomly select a dimension $B$ and a split point $p$ from the current sub-sample set, where $p$ lies between the minimum and maximum values of dimension $B$ in the current sub-sample set;
step 33, partition each sample data point $q_i$ ($1 \le i \le r$) in the sub-sample set by its value $q_i(B)$ in dimension $B$: if $q_i(B) < p$, place it in the left subtree, otherwise place it in the right subtree;
step 34, repeat steps 32-33 to keep building new left and right subtrees until one of the following conditions is met:
1) only one data point, or several identical data points, remain in $Q$ and cannot be divided further;
2) the isolation tree reaches its height limit;
step 35, repeat steps 31-34 until the number of isolation trees reaches the specified number $N$; these isolation trees form the isolation forest;
step 36, for any interpolated groundwater level value $l$, traverse each isolation tree in the isolation forest to compute the path length $h(l)$ of $l$ in that tree, then compute the expectation $E(h(l))$ of the path length of $l$ over the forest, and normalize it by the average path length $C(r)$ of an isolation tree:
$$C(r) = 2H(r-1) - \frac{2(r-1)}{r}$$
where $H(\cdot)$ is the harmonic number, approximated by $H(r) = \ln(r) + \delta$, and $\delta$ is Euler's constant;
the anomaly score $s$ of the interpolated groundwater level value $l$ is defined as:
$$s(l, r) = 2^{-\frac{E(h(l))}{C(r)}}$$
the interpolated groundwater level value $l$ is classified according to the following criteria:
1) when $E(h(l)) \to 0$, i.e. $s \to 1$, $l$ is identified as abnormal data;
2) when $E(h(l)) \to r-1$, i.e. $s \to 0$, $l$ is identified as normal data.
Further, in step 4, the KNN algorithm model re-interpolates the outlier-screened time series data and fills the missing values as follows: for each missing value in the time series, compute the Euclidean distance between it and the surrounding known groundwater level records, sort the records in ascending order of distance, select the first $k$ records, and take their mean as the fill value for the missing value;
the Euclidean distance is calculated as:
$$d = \sqrt{(x_i - y_i)^2 + (x_j - y_j)^2}$$
where $(x_i, x_j)$ are the coordinates of the missing value and $(y_i, y_j)$ are the coordinates of a known recorded value;
the fill value is calculated as:
$$\mathrm{mean} = \frac{1}{k}\sum_{i=1}^{k} w_i$$
where mean is the fill value for the missing value and $w_i$ is the $i$-th of the $k$ selected groundwater level records.
The beneficial effects of the invention are as follows: the invention provides a method for repairing missing groundwater level monitoring data in which an isolation forest model identifies outliers in the time series interpolated by the BTF model, and a KNN algorithm model then interpolates those points again, so that the groundwater level monitoring data are fully repaired. This overcomes the drawback that BTF interpolation used alone produces some outliers, and improves the reliability and accuracy of the interpolation results. The method is suited to groundwater monitoring data with long runs of missing values, addresses the poor interpolation quality of existing methods on consecutively missing time series data and their failure to follow the true temporal trend, and offers good accuracy and broad applicability.
The invention will be described in further detail with reference to the drawings and the detailed description.
Drawings
FIG. 1 is a flow chart of a method according to the present invention;
FIG. 2 is a graph showing the comparison of the interpolation effect of data using the BTF method and the IF-BTFK method.
Detailed Description
The invention discloses a method for repairing loss of groundwater level monitoring data, as shown in figure 1, the method comprises the following steps:
Step 1, acquiring and organizing groundwater level monitoring data to form a groundwater spatiotemporal data set: acquire the water level monitoring data recorded at the groundwater level monitoring points over past years, organize them into time-stamped time series with a uniform recording interval, and mark the missing values to form the groundwater spatiotemporal data set.
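As an illustration of step 1, the following minimal sketch places dated readings on an evenly spaced grid and marks the gaps. The function name `to_regular_series`, the 5-day interval and the sample readings are assumptions made for the example; the patent only requires a uniform recording interval.

```python
from datetime import date, timedelta

def to_regular_series(records, start, end, step_days):
    """Place dated readings on an evenly spaced grid; gaps become None."""
    by_day = dict(records)                       # {date: water level}
    series = []
    day = start
    while day <= end:
        series.append((day, by_day.get(day)))    # None marks a missing value
        day += timedelta(days=step_days)
    return series

# Hypothetical readings; a 5-day step is an assumed recording interval.
readings = [(date(2006, 1, 1), 12.4), (date(2006, 1, 11), 12.1)]
grid = to_regular_series(readings, date(2006, 1, 1), date(2006, 1, 16), 5)
missing = [d for d, level in grid if level is None]
print(missing)
```

Readings that do not fall exactly on a grid date would need to be snapped to the nearest grid date first; that step is left out of this sketch.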
Step 2, interpolating the time series data containing missing values using a BTF model: with the groundwater spatiotemporal data set as the base data, select the time series that contain missing values as target series and the complete time series as the training set; train the BTF model on the training set, tuning the model parameters with a Gibbs sampling algorithm; then interpolate the target series with the trained model until every time series containing missing values has been interpolated.
The BTF model is trained on the training set as follows: 10%, 20%, 30% and 40% of the time series data in the training set are deleted at random to simulate the missingness of real groundwater level time series; the series are then interpolated with the BTF model; and the interpolation accuracy and efficiency of BTF models with different parameters are compared using two evaluation metrics, the mean absolute percentage error (MAPE) and the root mean square error (RMSE), to determine the optimal model parameters;
the MAPE is calculated as:
$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{a_i - b_i}{a_i}\right| \times 100\%$$
the RMSE is calculated as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(a_i - b_i\right)^2}$$
where $a_i$ is the $i$-th value of the original groundwater level time series, $b_i$ is the corresponding interpolated value, and $n$ is the number of values compared.
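The evaluation loop described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: `mask_at_random` is a hypothetical helper, the water levels are made-up values, and the mean-fill imputer is only a placeholder standing in for the BTF model.

```python
import math
import random

def mape(original, imputed):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(a - b) / abs(a) for a, b in zip(original, imputed)) / len(original)

def rmse(original, imputed):
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, imputed)) / len(original))

def mask_at_random(series, rate, seed=0):
    """Hide a fraction `rate` of the values; return the masked series and hidden indices."""
    rng = random.Random(seed)
    hidden = set(rng.sample(range(len(series)), round(rate * len(series))))
    return [None if i in hidden else v for i, v in enumerate(series)], sorted(hidden)

truth = [12.0, 12.2, 12.1, 11.9, 12.4, 12.3, 12.0, 11.8, 12.1, 12.2]  # made-up levels
masked, hidden = mask_at_random(truth, rate=0.3)

# Placeholder imputer (mean of the known values), standing in for BTF.
known_mean = sum(v for v in masked if v is not None) / (len(truth) - len(hidden))
filled = [known_mean if v is None else v for v in masked]

held_out_truth = [truth[i] for i in hidden]
held_out_fill = [filled[i] for i in hidden]
print(mape(held_out_truth, held_out_fill), rmse(held_out_truth, held_out_fill))
```

Scoring only the held-out positions keeps the metrics from being diluted by values the imputer never had to estimate.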
The BTF model, i.e. the Bayesian temporal factorization framework, is a graphical model that captures global and local consistency in large-scale time series data by integrating low-rank matrix/tensor factorization and a vector autoregressive (VAR) process into a single probabilistic graphical model; it can efficiently perform probabilistic prediction and produce uncertainty estimates.
Specifically, the groundwater spatiotemporal data can be defined as a three-dimensional tensor $\mathcal{D} \in \mathbb{R}^{m \times n \times t}$, where $m$ and $n$ denote the number of monitoring stations and the length of the monitoring period, respectively, and $t$ denotes the number (frequency) of groundwater level records in each year. The groundwater records in tensor $\mathcal{D}$ are indexed by $(i, j, t) \in \Omega$. The portion of the data without missing values is selected, and three low-rank factor matrices $U$, $V$ and $X$ are randomly initialized according to the CANDECOMP/PARAFAC (CP) decomposition, where $X$ is the frequency factor matrix and $U$ and $V$ are the spatiotemporal factor matrices. Each entry $d_{(i,j,t)}$ of the tensor is assumed to follow an independent Gaussian distribution with precision $\tau$:
$$d_{(i,j,t)} \sim \mathcal{N}\!\left(\sum_{r=1}^{R} u_{ir}\, v_{jr}\, x_{tr},\; \tau^{-1}\right)$$
where $R$ is the rank of the decomposition.
Under the Gaussian assumption, $\tau$ reflects the noise level in the groundwater observations, and the precision $\tau$ is assumed to follow a Gamma distribution to improve the robustness of the method:
$$\tau \sim \mathrm{Gamma}(\alpha, \beta)$$
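To make the CP observation model concrete, the sketch below computes the mean of one tensor entry from the three factor matrices and draws a noisy observation with precision τ. The function names, the rank of 2 and the factor values are assumptions for illustration; τ is passed in directly rather than sampled from its Gamma prior.

```python
import random

def cp_entry(U, V, X, i, j, t):
    """Mean of d_(i,j,t) under the CP model: sum over r of U[i][r] * V[j][r] * X[t][r]."""
    return sum(u * v * x for u, v, x in zip(U[i], V[j], X[t]))

def sample_observation(mean, tau, rng):
    """Draw one observation from N(mean, 1/tau); tau is a precision, so std = tau ** -0.5."""
    return rng.gauss(mean, tau ** -0.5)

# Toy rank-2 factors with a single row each (illustrative values only).
U = [[1.0, 2.0]]   # station factors
V = [[3.0, 4.0]]   # year factors
X = [[5.0, 6.0]]   # within-year (frequency) factors

mean = cp_entry(U, V, X, 0, 0, 0)          # 1*3*5 + 2*4*6
noisy = sample_observation(mean, tau=4.0, rng=random.Random(0))
print(mean, noisy)
```

A larger precision τ means a smaller noise standard deviation, so observations cluster more tightly around the low-rank reconstruction.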
To compute the missing groundwater records correctly, the row vectors $u_i$ and $v_j$ of the factor matrices $U$ and $V$ are assumed to follow multivariate Gaussian distributions:
$$u_i \sim \mathcal{N}(\mu_u, \Lambda_u^{-1}), \qquad v_j \sim \mathcal{N}(\mu_v, \Lambda_v^{-1})$$
In the Bayesian setting, the hyperparameters of the model are assumed to follow a Gaussian-Wishart distribution, which strengthens the robustness of the method. The prior distributions of $\mu$ and $\Lambda$ are defined as follows:
$$(\mu_u, \Lambda_u) \sim \text{Gaussian-Wishart}(\mu_0, \beta_0, W_0, \nu_0)$$
$$(\mu_v, \Lambda_v) \sim \text{Gaussian-Wishart}(\mu_0, \beta_0, W_0, \nu_0)$$
where $\nu_0$ is the number of degrees of freedom, $W_0$ is an $R \times R$ scale matrix, and $\mu_0 \in \mathbb{R}^{R}$ is a mean vector that may be set to the zero vector.
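The frequency factor matrix $X$ is handled differently from the spatiotemporal factor matrices. Because the frequency factor matrix has time series characteristics, a VAR process can be used to predict missing data in the groundwater time series. The autoregressive approach assumes a linear dependence among the variables of the same groundwater time series. For the $t$-th of the annual observations in the groundwater time series, $x_t \in \mathbb{R}^{R}$, the linear relation is expressed as:
$$x_t = \sum_{k=1}^{d} A_k\, x_{t-h_k} + \epsilon_t$$
where $A_k$ ($k = 1, \ldots, d$) is an $R \times R$ coefficient matrix and $\epsilon_t$ is a Gaussian noise vector. The matrix $A$ and the vector $v_t$ can be expressed as:
$$A = [A_1, \ldots, A_d]^{T} \in \mathbb{R}^{(Rd) \times R}, \qquad v_t = \begin{bmatrix} x_{t-h_1} \\ \vdots \\ x_{t-h_d} \end{bmatrix} \in \mathbb{R}^{Rd}$$
In summary, the VAR process can be written as $x_t = A^{T} v_t + \epsilon_t$. Furthermore, the lag set is defined as $\mathcal{L} = \{h_1, \ldots, h_d\}$, so that, once all lagged vectors are available, the rows of the time factor matrix $X$ follow:
$$x_t \sim \mathcal{N}\!\left(A^{T} v_t,\; \Sigma\right)$$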
A conjugate matrix normal inverse Wishart (MNIW) prior is then placed on the coefficient matrix $A$ and the covariance matrix $\Sigma$.
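The VAR relation for the frequency factors can be illustrated with a noise-free one-step prediction, i.e. the mean term of $x_t = A^{T} v_t + \epsilon_t$ written as a sum over lags. The function name `var_mean`, the two lags and the coefficient values below are assumptions for the example, not values from the patent.

```python
def var_mean(history, lags, coeffs):
    """Noise-free VAR prediction: sum over k of A_k applied to x_{t - h_k}.

    history: list of R-vectors x_0 .. x_{t-1}
    lags:    [h_1, ..., h_d]
    coeffs:  [A_1, ..., A_d], each an R x R matrix (list of rows)
    """
    t = len(history)
    R = len(history[0])
    pred = [0.0] * R
    for h, A in zip(lags, coeffs):
        x_lag = history[t - h]                   # the lagged factor vector x_{t-h}
        for row in range(R):
            pred[row] += sum(A[row][col] * x_lag[col] for col in range(R))
    return pred

# Toy example with R = 2 and lag set {1, 2} (illustrative values only).
history = [[4.0, 8.0], [2.0, 6.0]]               # x_{t-2}, x_{t-1}
A1 = [[0.5, 0.0], [0.0, 0.5]]
A2 = [[0.25, 0.0], [0.0, 0.25]]
print(var_mean(history, lags=[1, 2], coeffs=[A1, A2]))
```

In the full model, Gaussian noise with covariance Σ is added to this mean, and the coefficient matrices are themselves sampled rather than fixed.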
Finally, the method is iterated with a Gibbs sampling algorithm. The factor matrices are sampled with the Gibbs sampling algorithm to capture the dependence between the groundwater observations $d_{(i,j,t)}$ and the VAR hyperparameters. After the Gibbs sampler reaches a steady state, all missing groundwater observations can be approximated by Markov chain Monte Carlo (MCMC): $g$ further samples are drawn and their average is taken as the interpolation result. The number of samples is not restricted, but in practice accuracy does not keep improving as the number of samples grows. The number of samples should therefore be set to the value beyond which accuracy no longer improves.
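The final averaging step can be sketched on its own, leaving the sampler aside. In this minimal example, `draws` stands for $g$ reconstructed series obtained after the sampler has reached a steady state; the function name and the numbers are hypothetical.

```python
def posterior_mean_fill(observed, draws):
    """Fill each None in `observed` with the mean of that position across the draws."""
    g = len(draws)
    return [
        value if value is not None else sum(d[i] for d in draws) / g
        for i, value in enumerate(observed)
    ]

observed = [1.0, None, 3.0]                       # one gap at index 1
draws = [[1.1, 2.0, 2.9], [0.9, 3.0, 3.1]]        # g = 2 post-burn-in reconstructions
print(posterior_mean_fill(observed, draws))
```

Observed values are kept as they are; only the missing positions take the averaged estimate.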
Step 3, detecting outliers in the interpolated time series data using an isolation forest model: select the complete time series in the groundwater spatiotemporal data set and divide them into training data and test data; train the isolation forest model on the training data to learn the relation between abnormal and normal values in the time series; validate the model on the test data and adjust the model parameters until the isolation forest model can reliably identify outliers in the time series. The trained isolation forest model is then used to detect outliers in the interpolated time series; the outliers in the interpolation results are re-marked as missing values and removed.
The isolation forest model is used because outliers in BTF-interpolated groundwater time series generally have two characteristics: they are few and they are different. That is, the interpolated outliers are sparsely distributed and lie far from the dense cluster of normal values, which makes them easy to isolate. For a set of continuous groundwater time series data, the core of the isolation forest model is to sample the data at random and build a number of isolation trees (iTrees), which together form the isolation forest.
Specifically, the trained isolation forest model detects outliers in the interpolated time series data as follows:
step 31, randomly select $r$ sample data points from the training set as a sub-sample set $Q = \{q_1, q_2, \ldots, q_r\}$, where each data point has dimension $z$, and take this set as the root node of a tree;
step 32, randomly select a dimension $B$ and a split point $p$ from the current sub-sample set, where $p$ lies between the minimum and maximum values of dimension $B$ in the current sub-sample set;
step 33, partition each sample data point $q_i$ ($1 \le i \le r$) in the sub-sample set by its value $q_i(B)$ in dimension $B$: if $q_i(B) < p$, place it in the left subtree, otherwise place it in the right subtree;
step 34, repeat steps 32-33 to keep building new left and right subtrees until one of the following conditions is met:
1) only one data point, or several identical data points, remain in $Q$ and cannot be divided further;
2) the isolation tree reaches its height limit;
step 35, repeat steps 31-34 until the number of isolation trees reaches the specified number $N$; these isolation trees form the isolation forest;
step 36, for any interpolated groundwater level value $l$, traverse each isolation tree in the isolation forest to compute the path length $h(l)$ of $l$ in that tree, then compute the expectation $E(h(l))$ of the path length of $l$ over the forest, and normalize it by the average path length $C(r)$ of an isolation tree:
$$C(r) = 2H(r-1) - \frac{2(r-1)}{r}$$
where $H(\cdot)$ is the harmonic number, approximated by $H(r) = \ln(r) + \delta$, and $\delta$ is Euler's constant;
the anomaly score $s$ of the interpolated groundwater level value $l$ is defined as:
$$s(l, r) = 2^{-\frac{E(h(l))}{C(r)}}$$
the interpolated groundwater level value $l$ is classified according to the following criteria:
1) when $E(h(l)) \to 0$, i.e. $s \to 1$, $l$ is identified as abnormal data;
2) when $E(h(l)) \to r-1$, i.e. $s \to 0$, $l$ is identified as normal data.
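Steps 31-36 can be sketched in plain Python as below. This is an illustrative sketch rather than the patent's implementation: the height limit of ceil(log2 r), the addition of C(size) at leaves that still hold several points, and the synthetic one-dimensional water levels are assumptions borrowed from common isolation forest practice, not details stated in the text.

```python
import math
import random

EULER_GAMMA = 0.5772156649  # Euler's constant (delta in the text)

def c(r):
    """Average path length C(r) = 2H(r-1) - 2(r-1)/r, with H(i) ~ ln(i) + delta."""
    if r <= 1:
        return 0.0
    return 2.0 * (math.log(r - 1) + EULER_GAMMA) - 2.0 * (r - 1) / r

def build_tree(points, height, max_height, rng):
    """Steps 32-34: split on a random dimension and split point until a stop condition holds."""
    if height >= max_height or len(points) <= 1 or all(p == points[0] for p in points):
        return {"size": len(points)}             # leaf node
    dim = rng.randrange(len(points[0]))          # random dimension B
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:                                 # nothing to split on in this dimension
        return {"size": len(points)}
    split = rng.uniform(lo, hi)                  # split point p between min and max
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, height + 1, max_height, rng),
            "right": build_tree(right, height + 1, max_height, rng)}

def path_length(tree, x, depth=0):
    """h(x): edges from the root to the leaf x falls into, plus C(size) for unsplit leaves (assumed convention)."""
    if "size" in tree:
        return depth + c(tree["size"])
    branch = "left" if x[tree["dim"]] < tree["split"] else "right"
    return path_length(tree[branch], x, depth + 1)

def build_forest(data, n_trees, sample_size, seed=0):
    """Steps 31 and 35: build N trees, each from r randomly drawn points."""
    rng = random.Random(seed)
    r = min(sample_size, len(data))
    max_height = math.ceil(math.log2(max(r, 2)))  # assumed height limit
    return [build_tree(rng.sample(data, r), 0, max_height, rng) for _ in range(n_trees)], r

def anomaly_score(forest, r, x):
    """Step 36: s = 2 ** (-E(h(x)) / C(r)); values near 1 suggest an outlier."""
    mean_path = sum(path_length(tree, x) for tree in forest) / len(forest)
    return 2.0 ** (-mean_path / c(r))

# Synthetic 1-D water levels between 10.00 and 11.99 (illustrative only).
normal = [[10.0 + 0.01 * i] for i in range(200)]
forest, r = build_forest(normal, n_trees=100, sample_size=64)
score_typical = anomaly_score(forest, r, [11.0])
score_spike = anomaly_score(forest, r, [25.0])
print(score_typical, score_spike)
```

A value far outside the training range is separated after fewer splits than a value in the middle of the range, so its average path is shorter and its score is higher.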
Step 4, interpolating the time series data again until the data are fully repaired: interpolate the outlier-screened time series again with a KNN algorithm model to fill their missing values, until the missing values in all time series have been filled, finally obtaining complete groundwater level monitoring data.
The outlier-screened time series data are re-interpolated and their missing values filled as follows: for each missing value in the time series, compute the Euclidean distance between it and the surrounding known groundwater level records, sort the records in ascending order of distance, select the first $k$ records, and take their mean as the fill value for the missing value;
the Euclidean distance is calculated as:
$$d = \sqrt{(x_i - y_i)^2 + (x_j - y_j)^2}$$
where $(x_i, x_j)$ are the coordinates of the missing value and $(y_i, y_j)$ are the coordinates of a known recorded value;
the fill value is calculated as:
$$\mathrm{mean} = \frac{1}{k}\sum_{i=1}^{k} w_i$$
where mean is the fill value for the missing value and $w_i$ is the $i$-th of the $k$ selected groundwater level records.
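The KNN fill described above can be sketched as follows. The text does not say what the two coordinates of a record are, so this example assumes a (station index, time index) pair; the function name `knn_fill` and the sample records are hypothetical.

```python
import math

def knn_fill(missing_coord, known, k):
    """Return the mean level of the k known records nearest to `missing_coord`.

    known: list of ((c1, c2), level) pairs; distances are Euclidean.
    """
    ranked = sorted(known, key=lambda rec: math.dist(missing_coord, rec[0]))
    nearest = ranked[:k]                          # the k closest known records
    return sum(level for _, level in nearest) / len(nearest)

# Assumed coordinates: (station index, time index).
known = [((0, 1), 10.0), ((0, 2), 12.0), ((0, 9), 30.0)]
print(knn_fill((0, 3), known, k=2))               # uses the records at time 2 and time 1
```

If the two coordinates are on different scales, they would normally be rescaled before distances are computed, so that neither dominates the neighbor ranking.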
Example 1
The present embodiment is a specific application example of the above method.
Linyi City is located in southeastern Shandong Province, between 34°17′ and 36°23′ north latitude and 117°25′ and 119°11′ east longitude. Under the control of the geological structure, a series of fault-block uplifts and fault-block depressions have formed within Linyi City; the medium and low mountains formed by fault-block uplifts of ancient crystalline rock, such as Mengshan Mountain, are the natural watershed of the city's surface water and groundwater. According to the occurrence conditions of groundwater in Linyi City, the water-bearing properties of the rock, and the hydraulic characteristics of the groundwater, the city's groundwater is divided into four types of water-bearing rock groups: the loose-rock pore water-bearing group, the rock pore-fissure water-bearing group, the carbonate-rock fissure-karst water-bearing group, and the bedrock fissure water-bearing group.
The groundwater level monitoring data used in this embodiment are the water level records of 8 monitoring points over a 10-year span (2006-2015), with 6 records per month. The data of monitoring points $W_1$, $W_2$ and $W_6$ are complete; the data of the remaining monitoring points are incomplete, and most of those time series have a missing rate of about 40%.
The original groundwater level monitoring data of the 8 monitoring points are converted into a time series data set $T = \{t_1, \ldots, t_8\}$, and the dates on which groundwater data are missing are marked. The complete time series $t_1$, $t_2$ and $t_6$ in $T$ are then selected to form a time series subset $T_1$.
The time series data sets $T$ and $T_1$ are each converted into a three-dimensional tensor.
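The conversion to a three-dimensional tensor can be sketched as a reshape of each station's series into years by records per year. The shape 8 × 10 × 72 used below is an inference from the figures in this embodiment (8 points, 10 years, 6 records per month), not a shape stated in the patent, and `to_tensor` is a hypothetical helper.

```python
def to_tensor(series_by_station, years, records_per_year):
    """Return tensor[i][j][t]: station i, year j, t-th record of that year."""
    tensor = []
    for series in series_by_station:
        if len(series) != years * records_per_year:
            raise ValueError("each series must hold years * records_per_year values")
        tensor.append([
            series[j * records_per_year:(j + 1) * records_per_year]
            for j in range(years)
        ])
    return tensor

# Synthetic stand-in data: 8 stations, 720 records each (10 years x 72 per year).
series = [[float(s * 1000 + n) for n in range(720)] for s in range(8)]
tensor = to_tensor(series, years=10, records_per_year=72)
print(len(tensor), len(tensor[0]), len(tensor[0][0]))
```

Missing readings can stay in place as `None` (or NaN) entries, so that their positions in the tensor are preserved for the imputation step.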
Entries of the tensor built from $T_1$ are deleted at random at rates of 10%, 20%, 30% and 40%. The tensors with randomly missing data are input to the BTF model, the number of Gibbs sampling iterations is set between 100 and 500, and the sampling count that gives the best fit of the model hyperparameters is found using the accuracy metrics RMSE and MAPE. Smaller RMSE and MAPE values indicate interpolation results closer to the true values. The experimental results are shown in Table 1.
TABLE 1
The tensor built from $T$ is input to the trained BTF model for interpolation, and the interpolated tensor is then converted back into the time series $T_R = \{t_{R1}, \ldots, t_{R8}\}$.
The time series subset $T_1$ is divided into training data and test data, and the isolation forest model is trained and validated. The parameters to initialize are the number of trees and the proportion of abnormal data. The model is trained on the training data: the normalized data are fed to the isolation forest model, which learns the trend of each subsequence ($t_1$, $t_2$, $t_6$) and the relation between abnormal and normal values; the model parameters are adjusted repeatedly, and the number of trees and the abnormal data proportion of the isolation forest model are finally determined.
The subsequences $t_{R1}$ to $t_{R8}$ of the time series $T_R$ are input one by one to the trained isolation forest model, and the outliers in each subsequence are marked as missing values and removed.
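One way to turn anomaly scores and an abnormal data proportion into missing-value marks is sketched below. The patent does not state how the proportion is applied, so flagging the highest-scoring fraction of values is an assumption, as are the function name and the sample scores.

```python
def mark_outliers_missing(values, scores, contamination):
    """Replace the highest-scoring fraction of values with None (a missing mark)."""
    n_flag = round(contamination * len(values))
    by_score = sorted(range(len(values)), key=lambda i: scores[i], reverse=True)
    flagged = set(by_score[:n_flag])              # indices treated as outliers
    return [None if i in flagged else v for i, v in enumerate(values)]

# Hypothetical interpolated levels and their anomaly scores.
values = [12.1, 12.0, 19.5, 12.2, 11.9]
scores = [0.42, 0.40, 0.81, 0.44, 0.41]
print(mark_outliers_missing(values, scores, contamination=0.2))
```

The marked positions are then treated like any other gap and passed to the KNN re-interpolation.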
The marked time series $T_R$ is input to the KNN algorithm model, which computes the Euclidean distance between each missing value and the surrounding known groundwater level data and takes the mean of the $k$ nearest groundwater level values as the fill value, until all marked groundwater level data have been re-interpolated.
This embodiment compares the groundwater time series interpolation method that uses BTF alone with the method of the present invention, i.e. the BTF- and KNN-based groundwater time series interpolation method with isolation forest outlier detection (IF-BTFK); the results are shown in Table 2 and FIG. 2.
TABLE 2
It can be seen from the above that groundwater level data have two distinct features: 1. their fluctuations are small; 2. many data are missing, including runs of consecutive missing values. When an existing spatiotemporal interpolation method is applied directly to long groundwater time series, it can reproduce the trend of the data but inevitably produces some outliers. The method of the invention for repairing missing groundwater level monitoring data overcomes the outliers generated during interpolation, so that the interpolated time series are closer to the real time series.
Finally, it should be noted that the above description is only for the purpose of illustrating the technical solution of the present invention and not for the purpose of limiting the same, and that although the present invention has been described in detail with reference to the preferred arrangement, it will be understood by those skilled in the art that modifications and equivalents may be made thereto without departing from the spirit and scope of the technical solution of the present invention.

Claims (2)

1. A method for repairing a loss of groundwater level monitoring data, the method comprising the steps of:
step 1, acquiring and arranging ground water level monitoring data to form a ground water space-time data set: acquiring water level monitoring data of underground water level monitoring points in the past year, sorting the water level monitoring data into time series data with time labels and the same recording time interval, and marking missing values to form an underground water space-time data set;
step 2, interpolation is carried out on the time series data containing the missing values by utilizing a BTF model: taking a groundwater space-time data set as basic data, selecting time sequence data containing missing values in the data set as a target sequence, selecting complete time sequence data in the data set as a training set, training a BTF model by using the training set, adjusting model parameters by combining a Gibbs sampling algorithm, and then interpolating the target sequence by using the model after training until interpolation of all the time sequence data containing missing values is completed;
step 3, abnormal value detection is carried out on the interpolated time series data by utilizing an isolated forest model: selecting complete time sequence data in the underground water space-time data set, dividing the selected time sequence data into training data and test data, training an isolated forest model by using the training data, verifying the model by using the test data, adjusting model parameters, detecting abnormal values of the interpolated time sequence data by using the trained isolated forest model, marking the abnormal values in the interpolation result as missing values again, and removing the missing values;
step 4, interpolating the time sequence data again until the data is repaired completely: interpolation is carried out on the time series data subjected to abnormal value detection again by using a KNN algorithm model, missing values in the time series data are repaired until the missing values in all the time series data are complemented, and finally complete underground water level monitoring data are obtained;
the specific process of training the BTF model by using the training set in the step 2 is as follows: randomly deleting 10%, 20%, 30% and 40% of the time series data in the training set to simulate the missing-data conditions of real groundwater level time series data, then interpolating the time series with the BTF model, and, for the interpolation results, introducing two evaluation indexes, the mean absolute percentage error MAPE and the root mean square error RMSE, to compare the interpolation accuracy and efficiency of BTF models with different parameters and determine the optimal model parameters;
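the mean absolute percentage error MAPE is calculated as:
$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{a_i - b_i}{a_i}\right| \times 100\%$$
the root mean square error RMSE is calculated as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(a_i - b_i\right)^2}$$
wherein $a_i$ is the i-th value of the original groundwater level time series data; $b_i$ is the corresponding interpolation result; and $n$ is the number of values compared;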
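The evaluation procedure above (random deletion at several rates, then scoring with MAPE and RMSE) can be sketched as follows. This is an illustrative sketch under stated assumptions: the helper names `delete_at_random`, `mape` and `rmse` are hypothetical, MAPE is taken in its usual form (which requires non-zero original values), and the numbers are made up. The BTF interpolation itself is not shown.

```python
import math
import random


def mape(original, interpolated):
    """Mean absolute percentage error in percent; original values must be non-zero."""
    n = len(original)
    return 100.0 / n * sum(abs((a - b) / a) for a, b in zip(original, interpolated))


def rmse(original, interpolated):
    """Root mean square error between original values and interpolation results."""
    n = len(original)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, interpolated)) / n)


def delete_at_random(series, rate, seed=0):
    """Replace a fraction `rate` of the series with None; return (masked, dropped indices)."""
    rng = random.Random(seed)
    n_drop = round(rate * len(series))
    dropped = set(rng.sample(range(len(series)), n_drop))
    masked = [None if i in dropped else v for i, v in enumerate(series)]
    return masked, sorted(dropped)


# Made-up complete series; each deletion rate yields one simulated test case.
series = [10.0 + 0.1 * i for i in range(20)]
cases = {rate: delete_at_random(series, rate) for rate in (0.1, 0.2, 0.3, 0.4)}

# Scoring two deleted values against made-up interpolation results.
truth = [10.0, 20.0]
estimate = [11.0, 18.0]
example_mape = mape(truth, estimate)   # both relative errors are 10 %
example_rmse = rmse(truth, estimate)   # sqrt((1 + 4) / 2)
```

In practice the two metrics would be evaluated only at the deleted positions, once per deletion rate and per candidate parameter setting, and the setting with the lowest errors would be kept.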
the BTF model, i.e., the Bayesian temporal factorization framework model, integrates low-rank matrix/tensor factorization and a vector autoregressive (VAR) process into a single probabilistic graphical model that characterizes global and local consistency in large-scale time series data and is capable of performing probabilistic predictions and producing uncertainty estimates;
the groundwater space-time data are defined as a three-dimensional tensor $D \in \mathbb{R}^{m \times n \times t}$, wherein m and n respectively represent the number of monitoring stations and the time length, and t represents the number of groundwater level records in each year; the groundwater records in the tensor D are indexed by $(i, j, t) \in \Omega$; a part of the data that contains no missing data is selected, and three low-rank factor matrices U, V and X of rank R are randomly initialized according to the CANDECOMP/PARAFAC (CP) decomposition, wherein X is a frequency factor matrix, and U and V are space-time factor matrices; each entry $d_{(i,j,t)}$ of the tensor D, whether missing or observed, is assumed to follow a Gaussian distribution with precision τ:
$$d_{(i,j,t)} \sim \mathcal{N}\!\left(\sum_{r=1}^{R} u_{ir}\, v_{jr}\, x_{tr},\ \tau^{-1}\right)$$
under the Gaussian assumption, τ represents the noise level in the groundwater observations, and the precision τ is assumed to obey a Gamma distribution:
$$\tau \sim \mathrm{Gamma}(\alpha, \beta)$$
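the row vectors $u_i$ and $v_j$ of the factor matrices U and V are assumed to follow multivariate Gaussian distributions:
$$u_i \sim \mathcal{N}\!\left(\mu_u, \Lambda_u^{-1}\right), \qquad v_j \sim \mathcal{N}\!\left(\mu_v, \Lambda_v^{-1}\right)$$
in the Bayesian setting, the hyperparameters of the model are assumed to obey Gaussian-Wishart distributions, and the prior distributions of μ and Λ are defined as follows:
$$(\mu_u, \Lambda_u) \sim \text{Gaussian-Wishart}(\mu_0, \beta_0, W_0, \nu_0)$$
$$(\mu_v, \Lambda_v) \sim \text{Gaussian-Wishart}(\mu_0, \beta_0, W_0, \nu_0)$$
wherein $\nu_0$ represents the degrees of freedom, $W_0$ represents the R × R scale matrix, and $\mu_0 \in \mathbb{R}^{R}$ represents the mean vector;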
the autoregressive method assumes that a linear dependency exists between the variables of the same groundwater time series; for the t-th observation $x_t \in \mathbb{R}^{R}$ among the annual observations in the groundwater time series, the linear relationship is expressed as follows:
$$x_t = \sum_{k=1}^{d} A_k\, x_{t-h_k} + \epsilon_t$$
wherein $A_k$ is an R × R coefficient matrix and $\epsilon_t$ is a Gaussian noise vector; the matrix A and the vector $v_t$ are expressed as:
$$A = \left[A_1, \ldots, A_d\right]^{T} \in \mathbb{R}^{(Rd) \times R}, \qquad v_t = \left[x_{t-h_1}^{T}, \ldots, x_{t-h_d}^{T}\right]^{T} \in \mathbb{R}^{Rd}$$
in summary, the VAR is expressed as $x_t = A^{T} v_t + \epsilon_t$; the lag set is defined as $\mathcal{L} = \{h_1, \ldots, h_d\}$; the time factor matrix X is:
a conjugate matrix normal inverse Wishart (MNIW) prior is then placed on the coefficient matrix A and the covariance matrix Σ:
finally, the model is solved iteratively with the Gibbs sampling algorithm: the factor matrices are sampled with the Gibbs sampling algorithm to obtain the dependency between the groundwater observation values $d_{(i,j,t)}$ and the VAR hyperparameters; after the Gibbs sampling algorithm reaches a stationary state, all missing groundwater observation values are approximated by Markov chain Monte Carlo (MCMC), and the average over g samples is taken as the interpolation result; the number of samples g is set to the value beyond which the accuracy no longer improves as the number of samples increases;
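Three building blocks of the description above can be illustrated with a short sketch: reconstructing one tensor entry from the CP factor rows, computing the noise-free VAR mean of a factor vector from its lagged predecessors, and averaging g sampled estimates. This is a simplified illustration only, assuming the usual CP form in which an entry is the sum over r of the products of the matching factor components. The Gibbs updates themselves are not shown, and all function names and numbers are hypothetical.

```python
def cp_entry(U, V, X, i, j, t):
    """Reconstruct entry (i, j, t) as the sum over r of U[i][r] * V[j][r] * X[t][r]."""
    return sum(u * v * x for u, v, x in zip(U[i], V[j], X[t]))


def var_mean(coeffs, lags, history):
    """Noise-free VAR mean: sum over k of A_k applied to the vector lagged by h_k.

    coeffs  -- list of R x R coefficient matrices A_k (nested lists)
    lags    -- list of positive integer lags h_k, one per coefficient matrix
    history -- past factor vectors, oldest first, so history[-h] is the lag-h vector
    """
    rank = len(history[-1])
    x_next = [0.0] * rank
    for a_k, h_k in zip(coeffs, lags):
        lagged = history[-h_k]
        for row in range(rank):
            x_next[row] += sum(a_k[row][col] * lagged[col] for col in range(rank))
    return x_next


def average_of_samples(samples):
    """Average of g sampled estimates of one missing value."""
    return sum(samples) / len(samples)


# Made-up rank-2 factors for a single station, time step and record index.
U = [[1.0, 2.0]]
V = [[0.5, 1.0]]
X = [[2.0, 3.0]]
entry = cp_entry(U, V, X, 0, 0, 0)          # 1*0.5*2 + 2*1*3

# Made-up VAR with lag set {1, 2} and diagonal coefficient matrices.
A1 = [[0.5, 0.0], [0.0, 0.5]]
A2 = [[0.1, 0.0], [0.0, 0.1]]
history = [[1.0, 2.0], [2.0, 4.0]]          # lag-2 vector, then lag-1 vector
x_mean = var_mean([A1, A2], [1, 2], history)

estimate = average_of_samples([7.0, 7.2, 6.8])
```

Averaging over samples drawn after the sampler has stabilized is what turns the individual draws into a single interpolation value for each missing entry.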
in the step 3, the specific process of performing outlier detection on the interpolated time series data by using the trained isolated forest model is as follows:
step 31, randomly selecting r sample data points from the training set as a sub-sampling set $Q = \{q_1, q_2, \ldots, q_r\}$, wherein the dimension of each data point is z, and taking Q as the root node of the tree;
step 32, randomly selecting a dimension B and a splitting point p from the current sub-sampling set, wherein p is between the maximum value and the minimum value of the dimension B in the current sub-sampling set;
step 33, for each sample data point $q_i$, $1 \le i \le r$, in the sub-sampling set, dividing according to its value $q_i(B)$ in dimension B: if $q_i(B) < p$, the point is assigned to the left subtree; otherwise it is assigned to the right subtree;
step 34, repeatedly executing the steps 32-33, and continuously constructing new left and right subtrees until one of the following conditions is met:
1) Only one data point or a plurality of identical data points are left in Q, and cannot be further divided;
2) The height of the isolation tree reaches a limited height;
step 35, repeatedly executing the steps 31-34 until the number of the isolation trees reaches the designated number N, and forming an isolated forest by the isolation trees;
step 36, for any interpolated groundwater level data l, calculating the path length h(l) of the data l in each isolation tree by traversing each isolation tree in the isolated forest, then calculating the expectation E(h(l)) of the path length of the data l over the isolated forest, the average path length of an isolation tree being denoted C(r):
$$C(r) = 2H(r-1) - \frac{2(r-1)}{r}$$
wherein H(·) is the harmonic function, $H(r) = \ln(r) + \delta$, wherein δ is the Euler constant; r is the number of samples in the sub-sampling set Q;
the anomaly score s of the interpolated groundwater level data l is defined as:
$$s(l, r) = 2^{-\frac{E(h(l))}{C(r)}}$$
anomaly identification is performed on the interpolated groundwater level data l according to the following criteria:
1) when $E(h(l)) \to 0$, i.e., $s \to 1$, the groundwater level data l is identified as abnormal data;
2) when $E(h(l)) \to r - 1$, i.e., $s \to 0$, the groundwater level data l is identified as normal data;
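Steps 31 to 36 can be sketched in a few lines for one-dimensional data (z = 1, so the dimension B is always the water level itself). This is a minimal illustration, not the implementation of the invention. Three choices are assumptions borrowed from the original isolation forest literature rather than from the text above: the height limit ceil(log2 r), the leaf correction that adds C(size) for leaves still holding several points, and the convention C(2) = 1. All names and sample values are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649015329


def c(r):
    """Average path length C(r) = 2H(r-1) - 2(r-1)/r with H(i) = ln(i) + gamma."""
    if r <= 1:
        return 0.0
    if r == 2:
        return 1.0  # assumed convention, not stated in the text above
    return 2.0 * (math.log(r - 1) + EULER_GAMMA) - 2.0 * (r - 1) / r


def build_tree(points, height, limit, rng):
    """Steps 32 to 34: split recursively until a stop condition is met."""
    lo, hi = min(points), max(points)
    if height >= limit or lo == hi:          # height limit, or identical points
        return {"size": len(points)}
    p = rng.uniform(lo, hi)                  # split point between min and max
    left = [q for q in points if q < p]
    right = [q for q in points if q >= p]
    if not left or not right:                # degenerate split (p equal to lo)
        return {"size": len(points)}
    return {"split": p,
            "left": build_tree(left, height + 1, limit, rng),
            "right": build_tree(right, height + 1, limit, rng)}


def build_forest(train, n_trees, r, seed=0):
    """Steps 31 and 35: draw r points per tree until n_trees trees exist."""
    rng = random.Random(seed)
    limit = math.ceil(math.log2(r))          # assumed height limit
    return [build_tree(rng.sample(train, r), 0, limit, rng) for _ in range(n_trees)]


def path_length(tree, value, height=0):
    """Path length h(l) of one value in one isolation tree."""
    if "size" in tree:
        return height + c(tree["size"])      # assumed leaf correction
    branch = tree["left"] if value < tree["split"] else tree["right"]
    return path_length(branch, value, height + 1)


def anomaly_score(forest, value, r):
    """Step 36: s = 2 ** (-E(h(l)) / C(r))."""
    expected = sum(path_length(tree, value) for tree in forest) / len(forest)
    return 2.0 ** (-expected / c(r))


# Made-up training levels between 10.00 and 10.99; 25.0 lies far outside them.
train = [10.0 + 0.01 * i for i in range(100)]
forest = build_forest(train, n_trees=50, r=32, seed=0)
score_out = anomaly_score(forest, 25.0, r=32)
score_mid = anomaly_score(forest, 10.5, r=32)
```

The value far outside the training range is separated after few splits, so its expected path length is short and its score lies closer to 1 than that of a value in the middle of the range.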
in the step 4, interpolation is performed again on the time series data subjected to abnormal value detection by using the KNN algorithm model, and the specific process of repairing the missing value in the time series data is as follows: for each missing value in the time series data, calculating the distance between the missing value and other surrounding known groundwater level recorded values through Euclidean distance, then sorting and selecting the first k groundwater level recorded values according to ascending order of the distance, and calculating the mean value of the first k groundwater level recorded values as the complement value of the missing value;
the Euclidean distance is calculated as:
$$\mathrm{dist} = \sqrt{(x_i - y_i)^2 + (x_j - y_j)^2}$$
wherein $x_i, x_j$ are the coordinates of the missing value; $y_i, y_j$ are the coordinates of a known recorded value;
the complement value is calculated as:
$$\mathrm{mean} = \frac{1}{k}\sum_{i=1}^{k} w_i$$
wherein mean is the complement value corresponding to the missing value; $w_i$ is the i-th groundwater level recorded value; k is the number of known groundwater level recorded values around the missing value.
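The KNN repair of a single missing value can be sketched as follows. This is a minimal illustration under stated assumptions: each record is addressed by a pair of coordinates, interpreted here as (station index, time index), which is one possible reading of the coordinates in the distance formula. The function name `knn_fill` and the sample records are hypothetical.

```python
import math


def knn_fill(known, target, k):
    """Mean of the k known records nearest to `target` by Euclidean distance.

    known  -- list of ((coord_i, coord_j), level) for known recorded values
    target -- (coord_i, coord_j) of the missing value
    k      -- number of nearest known records to average
    """
    if not 1 <= k <= len(known):
        raise ValueError("k must be between 1 and the number of known records")

    def distance(item):
        (y_i, y_j), _ = item
        return math.hypot(target[0] - y_i, target[1] - y_j)

    # Ascending order of distance, then keep the first k records.
    nearest = sorted(known, key=distance)[:k]
    return sum(level for _, level in nearest) / k


# Made-up records for station 0; the value at time index 3 is missing.
known = [((0, 1), 12.0), ((0, 2), 12.4), ((0, 4), 13.0), ((0, 9), 20.0)]
filled = knn_fill(known, target=(0, 3), k=3)
```

With k = 3 the three records at time indices 1, 2 and 4 are averaged, and the distant record at time index 9 is ignored.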
2. The method for repairing a loss of groundwater level monitoring data according to claim 1,
in the step 3, the step of dividing the selected time series data into training data and test data specifically includes: 70% of the selected time series data are classified as training data and 30% are classified as test data.
CN202310591040.7A 2023-05-24 2023-05-24 Method for repairing loss of groundwater level monitoring data Active CN116627953B (en)


Publications (2)

Publication Number Publication Date
CN116627953A CN116627953A (en) 2023-08-22
CN116627953B true CN116627953B (en) 2023-10-27

Family

ID=87591469





Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant