CN116627953A - Method for repairing loss of groundwater level monitoring data - Google Patents

Publication number
CN116627953A
Authority
CN
China
Prior art keywords
data
value
groundwater level
model
time series
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310591040.7A
Other languages
Chinese (zh)
Other versions
CN116627953B (en)
Inventor
孙永华
张王宽
成星路
曹许悦
王衍昭
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Capital Normal University
Original Assignee
Capital Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Capital Normal University filed Critical Capital Normal University
Priority to CN202310591040.7A priority Critical patent/CN116627953B/en
Publication of CN116627953A publication Critical patent/CN116627953A/en
Application granted granted Critical
Publication of CN116627953B publication Critical patent/CN116627953B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G01 MEASURING; TESTING
    • G01F MEASURING VOLUME, VOLUME FLOW, MASS FLOW OR LIQUID LEVEL; METERING BY VOLUME
    • G01F23/00 Indicating or measuring liquid level or level of fluent solid material, e.g. indicating in terms of volume or indicating by means of an alarm
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/23 Updating
    • G06F16/2365 Ensuring data consistency and integrity
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/953 Querying, e.g. by the use of web search engines
    • G06F16/9537 Spatial or temporal dependent retrieval, e.g. spatiotemporal queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/24323 Tree-organised classifiers

Abstract

The invention discloses a method for repairing missing groundwater level monitoring data, comprising the following steps: step 1, acquiring and organizing groundwater level monitoring data to form a groundwater spatiotemporal data set; step 2, interpolating the time series data that contain missing values using a BTF model; step 3, detecting outliers in the interpolated time series data using an isolation forest model; and step 4, interpolating the time series data again until the data are fully repaired. The method overcomes the tendency of BTF interpolation used alone to produce some outliers, and improves the reliability and accuracy of the interpolation results. It is suited to groundwater monitoring data with long runs of missing values, addresses the poor interpolation quality of existing methods for consecutively missing time series data and their failure to follow the true trend of the series, and has good accuracy and high applicability.

Description

Method for repairing loss of groundwater level monitoring data
Technical Field
The invention belongs to the technical field of groundwater level monitoring, and in particular relates to a method for repairing missing groundwater level monitoring data.
Background
Because groundwater level data are complex, groundwater monitoring coverage in most areas is insufficient or the observation period is short, so most areas currently lack long time series of groundwater level data. At present, gaps in groundwater level time series are mostly filled with existing time series interpolation methods, including: (1) statistical filling, which inserts the median, mean, mode or a similar statistic of the series; (2) linear interpolation, which fits a function to the known groundwater level data and then interpolates the missing values; and (3) forward-backward weighted averaging, which takes a weighted average of the preceding and following values according to their distance in time. Groundwater level data vary strongly with time, and factors such as seasonal increases in precipitation and artificial irrigation affect groundwater changes. The methods above ignore the temporal information in the series, and groundwater level data often contain long runs of consecutive missing values, so repair methods based on statistical principles perform poorly and cannot be applied directly to groundwater level time series.
The most recent spatiotemporal data interpolation method, Bayesian temporal factorization (BTF) time series interpolation, is based on a Bayesian temporal factorization framework and gives better results than other methods when interpolating long groundwater time series. However, because the graphical model is built on Gaussian distributions, the interpolation is inevitably affected by outliers. How to eliminate the influence of outliers, and thereby improve the accuracy and applicability of groundwater level data repair, is therefore a technical problem that urgently needs to be solved.
Disclosure of Invention
The invention aims to provide a method for repairing the loss of groundwater level monitoring data, which aims to solve the technical problems.
In order to achieve the above purpose, the present invention provides the following technical solutions:
The invention discloses a method for repairing missing groundwater level monitoring data, comprising the following steps:
Step 1, acquiring and organizing groundwater level monitoring data to form a groundwater spatiotemporal data set: acquire the water level monitoring data recorded at groundwater level monitoring points over past years, organize them into time-stamped time series data with a uniform recording interval, and mark the missing values to form the groundwater spatiotemporal data set;
Step 2, interpolating the time series data that contain missing values using a BTF model: take the groundwater spatiotemporal data set as the base data, select the time series containing missing values as target sequences and the complete time series as the training set, train the BTF model on the training set, adjust the model parameters with a Gibbs sampling algorithm, and then interpolate the target sequences with the trained model until all time series containing missing values have been interpolated;
Step 3, detecting outliers in the interpolated time series data using an isolation forest model: select the complete time series in the groundwater spatiotemporal data set, divide them into training data and test data, train the isolation forest model on the training data, validate it on the test data and adjust the model parameters, then use the trained isolation forest model to detect outliers in the interpolated time series data, mark the outliers in the interpolation result as missing values again, and remove them;
Step 4, interpolating the time series data again until the data are fully repaired: interpolate the outlier-screened time series data again using a KNN algorithm model, repairing the missing values until all missing values in all time series are filled, and finally obtaining complete groundwater level monitoring data.
Further, the specific process of training the BTF model with the training set in step 2 is as follows: randomly delete 10%, 20%, 30% and 40% of the time series data in the training set to simulate the missing-data conditions of real groundwater level time series, then interpolate the time series with the BTF model, and evaluate the interpolation results with two indexes, the mean absolute percentage error (MAPE) and the root mean square error (RMSE), to compare the interpolation quality and efficiency of BTF models with different parameters and determine the optimal model parameters;
the mean absolute percentage error MAPE is calculated as:
$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{a_i - b_i}{a_i}\right| \times 100\%$$
the root mean square error RMSE is calculated as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(a_i - b_i\right)^2}$$
where a_i is the i-th value of the original groundwater level time series data, b_i is the corresponding interpolation result, and n is the number of values compared.
Further, in step 3, dividing the selected time series data into training data and test data specifically means: 70% of the selected time series data are assigned to the training data and 30% to the test data.
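The 70/30 division can be sketched as follows. This is only an illustration: the text does not say whether the split is random or chronological, so the chronological split and the helper name `split_series` are assumptions.

```python
def split_series(values, train_fraction=0.7):
    """Split a complete series into training and test parts.

    Assumption: the split is chronological, so the first 70% of the
    records train the model and the last 30% test it.
    """
    cut = int(round(len(values) * train_fraction))
    return values[:cut], values[cut:]


levels = [12.1, 12.3, 12.2, 12.4, 12.6, 12.5, 12.7, 12.9, 12.8, 13.0]
train, test = split_series(levels)
print(len(train), len(test))  # 7 3
```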
Further, in step 3, the specific process of detecting outliers in the interpolated time series data with the trained isolation forest model is as follows:
Step 31, randomly select r sample data points from the training set as a sub-sampling set Q = {q_1, q_2, …, q_r}, where each data point has dimension z, and place the set at the root node of the tree;
Step 32, randomly select a dimension B and a split point p for the current sub-sampling set, where p lies between the maximum and minimum values of dimension B in the current sub-sampling set;
Step 33, for each sample data point q_i in the sub-sampling set, 1 ≤ i ≤ r, divide according to its value q_i(B) in dimension B: if q_i(B) < p, the point goes to the left subtree, otherwise to the right subtree;
Step 34, repeat steps 32 and 33 to keep constructing new left and right subtrees until one of the following conditions is met:
1) Only one data point, or several identical data points, remain in Q and cannot be divided further;
2) The height of the isolation tree reaches the height limit;
Step 35, repeat steps 31 to 34 until the number of isolation trees reaches the specified number N; these isolation trees form the isolation forest;
Step 36, for any interpolated groundwater level data point l, traverse every isolation tree in the isolation forest to calculate the path length h(l) of l in each tree, then calculate the expectation E(h(l)) of the path length of l over the forest, and normalize it with the average path length C(r) of an isolation tree:
$$C(r) = 2H(r-1) - \frac{2(r-1)}{r}$$
where H(r) is the harmonic number, approximated by H(r) = ln(r) + δ, and δ is Euler's constant (approximately 0.5772);
the anomaly score s of the interpolated groundwater level data point l is defined as:
$$s(l, r) = 2^{-\frac{E(h(l))}{C(r)}}$$
the interpolated groundwater level data point l is classified according to the following criteria:
1) When E(h(l)) → 0, i.e. s → 1, the groundwater level data l is identified as abnormal data;
2) When E(h(l)) → r − 1, i.e. s → 0, the groundwater level data l is identified as normal data.
Further, in step 4, the specific process of interpolating the outlier-screened time series data again with the KNN algorithm model and repairing the missing values is as follows: for each missing value in the time series data, calculate the Euclidean distance between the missing value and the surrounding known groundwater level records, sort the records in ascending order of distance and select the first k, and take the mean of these k groundwater level records as the fill value for the missing value;
the Euclidean distance is calculated as:
$$d = \sqrt{(x_i - y_i)^2 + (x_j - y_j)^2}$$
where x_i, x_j are the coordinates of the missing value and y_i, y_j are the coordinates of a known recorded value;
the fill value is calculated as:
$$\mathrm{mean} = \frac{1}{k}\sum_{i=1}^{k} w_i$$
where mean is the fill value for the missing value and w_i is the i-th of the k selected groundwater level records.
The beneficial effects of the invention are as follows: the invention provides a method for repairing missing groundwater level monitoring data in which an isolation forest model identifies outliers in the time series data interpolated by the BTF model, and a KNN algorithm model then interpolates the data again, so that the groundwater level monitoring data are fully repaired. This overcomes the tendency of BTF interpolation used alone to produce some outliers and improves the reliability and accuracy of the interpolation results. The method is suited to groundwater monitoring data with long runs of missing values, addresses the poor interpolation quality of existing methods for consecutively missing time series data and their failure to follow the true trend of the series, and has good accuracy and high applicability.
The invention will be described in further detail with reference to the drawings and the detailed description.
Drawings
FIG. 1 is a flow chart of a method according to the present invention;
FIG. 2 is a graph showing the comparison of the interpolation effect of data using the BTF method and the IF-BTFK method.
Detailed Description
The invention discloses a method for repairing missing groundwater level monitoring data. As shown in FIG. 1, the method comprises the following steps:
Step 1, acquiring and organizing groundwater level monitoring data to form a groundwater spatiotemporal data set: acquire the water level monitoring data recorded at groundwater level monitoring points over past years, organize them into time-stamped time series data with a uniform recording interval, and mark the missing values to form the groundwater spatiotemporal data set.
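Step 1 can be sketched as placing dated readings on a uniform time grid and marking the gaps. The function name `to_regular_series`, the 5-day interval and the sample readings are hypothetical and only illustrate the idea.

```python
from datetime import date, timedelta


def to_regular_series(records, start, end, step_days):
    """Place dated readings on a uniform time grid.

    records: dict mapping date -> water level.
    Grid dates with no reading are marked as missing with None.
    """
    series = []
    current = start
    while current <= end:
        series.append((current, records.get(current)))
        current += timedelta(days=step_days)
    return series


# Hypothetical readings at a 5-day interval with one gap.
readings = {
    date(2006, 1, 1): 12.4,
    date(2006, 1, 6): 12.6,
    date(2006, 1, 16): 12.1,
}
series = to_regular_series(readings, date(2006, 1, 1), date(2006, 1, 16), 5)
missing_dates = [d for d, level in series if level is None]
print(missing_dates)  # [datetime.date(2006, 1, 11)]
```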
Step 2, interpolating the time series data that contain missing values using a BTF model: take the groundwater spatiotemporal data set as the base data, select the time series containing missing values as target sequences and the complete time series as the training set, train the BTF model on the training set, adjust the model parameters with a Gibbs sampling algorithm, and then interpolate the target sequences with the trained model until all time series containing missing values have been interpolated.
The specific process of training the BTF model with the training set is as follows: randomly delete 10%, 20%, 30% and 40% of the time series data in the training set to simulate the missing-data conditions of real groundwater level time series, then interpolate the time series with the BTF model, and evaluate the interpolation results with two indexes, the mean absolute percentage error (MAPE) and the root mean square error (RMSE), to compare the interpolation quality and efficiency of BTF models with different parameters and determine the optimal model parameters;
the mean absolute percentage error MAPE is calculated as:
$$\mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\left|\frac{a_i - b_i}{a_i}\right| \times 100\%$$
the root mean square error RMSE is calculated as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(a_i - b_i\right)^2}$$
where a_i is the i-th value of the original groundwater level time series data, b_i is the corresponding interpolation result, and n is the number of values compared.
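The evaluation procedure above (hide a fraction of a complete series, fill it back in, score the filled values) can be sketched as below. This is a minimal sketch under stated assumptions: `fill_with_mean` is a stand-in imputer and not the BTF model, the sine-shaped series is synthetic, and all function names are hypothetical.

```python
import math
import random


def mape(original, filled):
    """Mean absolute percentage error, in percent."""
    n = len(original)
    return 100.0 / n * sum(abs(a - b) / abs(a) for a, b in zip(original, filled))


def rmse(original, filled):
    """Root mean square error."""
    n = len(original)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(original, filled)) / n)


def mask_randomly(values, rate, rng):
    """Return a copy with a fraction `rate` of entries set to None, plus the hidden indices."""
    hidden = sorted(rng.sample(range(len(values)), round(len(values) * rate)))
    masked = list(values)
    for i in hidden:
        masked[i] = None
    return masked, hidden


def fill_with_mean(masked):
    """Stand-in imputer (NOT the BTF model): fill gaps with the mean of the known values."""
    known = [v for v in masked if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in masked]


rng = random.Random(0)
truth = [10.0 + math.sin(i / 6.0) for i in range(60)]  # synthetic complete series
for rate in (0.1, 0.2, 0.3, 0.4):
    masked, hidden = mask_randomly(truth, rate, rng)
    filled = fill_with_mean(masked)
    a = [truth[i] for i in hidden]   # true values at the hidden positions
    b = [filled[i] for i in hidden]  # imputed values at the same positions
    print(f"rate={rate:.0%} MAPE={mape(a, b):.2f}% RMSE={rmse(a, b):.3f}")
```

Scoring only the hidden positions keeps the untouched values from diluting the error.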
The BTF model, i.e. the Bayesian temporal factorization framework, is a graphical model that integrates low-rank matrix/tensor factorization and a vector autoregressive (VAR) process into a single probabilistic model. It can characterize global and local consistency in large-scale time series data, perform probabilistic prediction efficiently, and produce uncertainty estimates.
Specifically, the groundwater spatiotemporal data can be defined as a three-dimensional tensor $D \in \mathbb{R}^{m \times n \times t}$, where m and n denote the number of monitoring stations and the length of the monitoring period (monitoring time), respectively, and t denotes the number of groundwater level records in each year (the recording frequency). The groundwater records in tensor D are indexed by (i, j, t) ∈ Ω. The portion of the data that contains no missing values is selected, and three low-rank factor matrices U, V and X are randomly initialized according to the CANDECOMP/PARAFAC (CP) decomposition, where X is the frequency factor matrix and U and V are the spatiotemporal factor matrices. Each missing entry d_(i,j,t) of the tensor is assumed to follow an independent Gaussian distribution, and each observation in D is assumed to follow a Gaussian distribution with precision τ:
$$d_{(i,j,t)} \sim \mathcal{N}\left(\sum_{r=1}^{R} u_{ir} v_{jr} x_{tr},\ \tau^{-1}\right)$$
where R is the rank of the factor matrices. Under the Gaussian assumption, τ reflects the noise level in the groundwater observations, and the precision τ is assumed to follow a Gamma distribution to improve the robustness of the method:
$$\tau \sim \mathrm{Gamma}(\alpha, \beta)$$
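The CP form of a tensor entry and the role of the precision τ can be illustrated with a small sketch. The factor values, the rank of 2 and the function names are hypothetical; this shows only the reconstruction of one entry and one Gaussian draw, not the full BTF model.

```python
import math
import random


def cp_entry(U, V, X, i, j, t):
    """CP reconstruction of one tensor entry: sum over r of U[i][r] * V[j][r] * X[t][r]."""
    rank = len(U[0])
    return sum(U[i][r] * V[j][r] * X[t][r] for r in range(rank))


def sample_observation(mean, tau, rng):
    """Draw from a Gaussian with the given mean and precision tau (variance 1 / tau)."""
    return rng.gauss(mean, 1.0 / math.sqrt(tau))


# Toy rank-2 factors: 2 stations, 2 years, 3 records per year.
U = [[1.0, 0.5], [0.8, 0.2]]
V = [[1.0, 1.0], [0.9, 1.1]]
X = [[10.0, 2.0], [10.5, 1.5], [11.0, 1.0]]

mean = cp_entry(U, V, X, i=0, j=1, t=2)
print(mean)
print(sample_observation(mean, tau=100.0, rng=random.Random(0)))
```

A larger τ means a smaller variance 1/τ, so draws stay closer to the CP mean.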
To compute the missing groundwater records correctly, the row vectors u_i and v_j of the factor matrices U and V are assumed to follow multivariate Gaussian distributions:
$$u_i \sim \mathcal{N}(\mu_u, \Lambda_u^{-1}), \qquad v_j \sim \mathcal{N}(\mu_v, \Lambda_v^{-1})$$
In the Bayesian setting, the hyperparameters of the model are assumed to follow a Gaussian-Wishart distribution, which strengthens the robustness of the method. The prior distributions of μ and Λ are defined as follows:
$$(\mu_u, \Lambda_u) \sim \text{Gaussian-Wishart}(\mu_0, \beta_0, W_0, \nu_0)$$
$$(\mu_v, \Lambda_v) \sim \text{Gaussian-Wishart}(\mu_0, \beta_0, W_0, \nu_0)$$
where ν_0 is the number of degrees of freedom, W_0 is an R × R scale matrix, and $\mu_0 \in \mathbb{R}^{R}$ is a mean vector that may be set to the zero vector.
The frequency factor matrix X is handled differently from the spatiotemporal factor matrices. Because it has time series characteristics, a VAR process can be used to predict missing data in the groundwater time series. The autoregressive approach assumes a linear dependency among the variables of the same groundwater time series. For the factor vector x_t of the t-th observation within the annual observations of the groundwater time series, the linear relation is expressed as:
$$x_t = \sum_{k=1}^{d} A_k x_{t-h_k} + \epsilon_t$$
where A_k is an R × R coefficient matrix, ε_t is a Gaussian noise vector, d is the number of lags and h_k is the k-th lag. The matrix A and the vector v_t can be expressed as:
$$A = [A_1, \ldots, A_d]^{\top} \in \mathbb{R}^{(Rd) \times R}, \qquad v_t = [x_{t-h_1}^{\top}, \ldots, x_{t-h_d}^{\top}]^{\top} \in \mathbb{R}^{Rd}$$
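In summary, the VAR process can be written as $x_t = A^{\top} v_t + \epsilon_t$. Furthermore, the lag set of d time lags is defined as $\mathcal{L} = \{h_1, \ldots, h_d\}$. The prior on the time factor matrix X is:
$$x_t \sim \begin{cases}\mathcal{N}(0, I_R), & t \in \{1, \ldots, h_d\} \\ \mathcal{N}(A^{\top} v_t, \Sigma), & \text{otherwise}\end{cases}$$
A conjugate matrix normal inverse Wishart (MNIW) prior is then placed on the coefficient matrix A and the covariance matrix Σ:
$$A \sim \mathcal{MN}_{(Rd) \times R}(M_0, \Psi_0, \Sigma), \qquad \Sigma \sim \mathcal{IW}(S_0, \nu_0)$$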
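The VAR relation described above, in which x_t depends linearly on earlier factor vectors at the lags in the lag set, can be sketched as a one-step prediction of the conditional mean. The coefficient values, the lag set {1, 2} and the function name `var_predict` are hypothetical, and the noise term is left out.

```python
def var_predict(X, t, lags, coeffs):
    """One-step VAR mean for x_t: the sum over k of A_k applied to x_{t - h_k}.

    X: list of R-dimensional factor vectors (rows of the factor matrix).
    lags: the lag set [h_1, ..., h_d].
    coeffs: list of d coefficient matrices, each R x R (nested lists).
    The noise term is omitted, so this returns the conditional mean only.
    """
    R = len(X[0])
    pred = [0.0] * R
    for h, A in zip(lags, coeffs):
        past = X[t - h]
        for row in range(R):
            pred[row] += sum(A[row][col] * past[col] for col in range(R))
    return pred


# Toy example with R = 2 and lag set {1, 2}.
X = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
A1 = [[0.5, 0.0], [0.0, 0.5]]
A2 = [[0.1, 0.0], [0.0, 0.1]]
print(var_predict(X, t=3, lags=[1, 2], coeffs=[A1, A2]))
```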
Finally, the method is iterated with a Gibbs sampling algorithm. The factor matrices are sampled with the Gibbs sampling algorithm to obtain the dependency between the groundwater observations d_(i,j,t) and the VAR hyperparameters. After the Gibbs sampler reaches a steady state, all missing groundwater observations can be approximated by Markov chain Monte Carlo (MCMC): the average of g samples is taken as the interpolation result. There is no fixed limit on the number of samples, but in practice the accuracy of the interpolation result does not keep improving as the number of samples increases. The number of samples should therefore be set at the point where the accuracy has risen to a plateau and no longer improves with further sampling.
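The idea of discarding early draws and averaging g later draws can be shown on a toy problem. The sketch below runs a Gibbs sampler on a standard bivariate normal, which is an assumption chosen for brevity and is not the BTF posterior; the burn-in length and the value of g are likewise illustrative.

```python
import math
import random


def gibbs_bivariate_normal(rho, burn_in, g, rng):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Each step draws x given y and then y given x from their Gaussian conditionals.
    The first `burn_in` draws are discarded; the next g draws are averaged.
    """
    cond_std = math.sqrt(1.0 - rho * rho)
    x, y = 5.0, 5.0  # deliberately poor starting point
    kept = []
    for step in range(burn_in + g):
        x = rng.gauss(rho * y, cond_std)
        y = rng.gauss(rho * x, cond_std)
        if step >= burn_in:
            kept.append((x, y))
    mean_x = sum(p[0] for p in kept) / g
    mean_y = sum(p[1] for p in kept) / g
    return mean_x, mean_y, len(kept)


mean_x, mean_y, n_kept = gibbs_bivariate_normal(0.5, burn_in=200, g=2000, rng=random.Random(0))
print(n_kept, round(mean_x, 2), round(mean_y, 2))
```

The averaged draws should land near the true mean of zero even though the chain starts far from it.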
Step 3, detecting outliers in the interpolated time series data using an isolation forest model: select the complete time series in the groundwater spatiotemporal data set, divide them into training data and test data, train the isolation forest model on the training data to mine the relation between abnormal and normal values in the time series, validate the model on the test data, and adjust the model parameters until the isolation forest model can effectively identify outliers in the time series data. The trained isolation forest model is then used to detect outliers in the interpolated time series data, and the outliers in the interpolation result are marked as missing values again and removed.
The isolation forest model is adopted because outliers in groundwater time series data after BTF interpolation generally have two characteristics: they are few and they are different. That is, these interpolated outliers are sparsely distributed and lie far from the dense normal values, so they are also called easily isolated points. For a set of continuous groundwater time series data, the core of the isolation forest model is to sample randomly and construct a number of isolation trees (iTrees), which together form the isolation forest.
Specifically, the process of detecting outliers in the interpolated time series data with the trained isolation forest model is as follows:
Step 31, randomly select r sample data points from the training set as a sub-sampling set Q = {q_1, q_2, …, q_r}, where each data point has dimension z, and place the set at the root node of the tree;
Step 32, randomly select a dimension B and a split point p for the current sub-sampling set, where p lies between the maximum and minimum values of dimension B in the current sub-sampling set;
Step 33, for each sample data point q_i in the sub-sampling set, 1 ≤ i ≤ r, divide according to its value q_i(B) in dimension B: if q_i(B) < p, the point goes to the left subtree, otherwise to the right subtree;
Step 34, repeat steps 32 and 33 to keep constructing new left and right subtrees until one of the following conditions is met:
1) Only one data point, or several identical data points, remain in Q and cannot be divided further;
2) The height of the isolation tree reaches the height limit;
Step 35, repeat steps 31 to 34 until the number of isolation trees reaches the specified number N; these isolation trees form the isolation forest;
Step 36, for any interpolated groundwater level data point l, traverse every isolation tree in the isolation forest to calculate the path length h(l) of l in each tree, then calculate the expectation E(h(l)) of the path length of l over the forest, and normalize it with the average path length C(r) of an isolation tree:
$$C(r) = 2H(r-1) - \frac{2(r-1)}{r}$$
where H(r) is the harmonic number, approximated by H(r) = ln(r) + δ, and δ is Euler's constant (approximately 0.5772);
the anomaly score s of the interpolated groundwater level data point l is defined as:
$$s(l, r) = 2^{-\frac{E(h(l))}{C(r)}}$$
the interpolated groundwater level data point l is classified according to the following criteria:
1) When E(h(l)) → 0, i.e. s → 1, the groundwater level data l is identified as abnormal data;
2) When E(h(l)) → r − 1, i.e. s → 0, the groundwater level data l is identified as normal data.
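Steps 31 to 36 can be sketched in plain Python as follows. This is a minimal sketch, not a tuned implementation: the height limit of ceil(log2(r)), the tree count, the sub-sample size and the synthetic water levels are assumptions, and a node stops splitting when the chosen dimension has a single value.

```python
import math
import random

EULER_GAMMA = 0.5772156649


def c(r):
    """Average path length C(r) used to normalize path lengths."""
    if r <= 1:
        return 0.0
    if r == 2:
        return 1.0
    return 2.0 * (math.log(r - 1) + EULER_GAMMA) - 2.0 * (r - 1) / r


def build_tree(points, depth, max_depth, rng):
    """Steps 32-34: split on a random dimension and split point until a stop condition."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # identical values on this dimension, cannot split
        return {"size": len(points)}
    split = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, depth + 1, max_depth, rng),
            "right": build_tree(right, depth + 1, max_depth, rng)}


def path_length(tree, point, depth=0):
    """Step 36: depth at which `point` lands, plus C(size) for unsplit leaves."""
    if "dim" not in tree:
        return depth + c(tree["size"])
    branch = "left" if point[tree["dim"]] < tree["split"] else "right"
    return path_length(tree[branch], point, depth + 1)


def fit_forest(data, n_trees, r, rng):
    """Steps 31 and 35: grow n_trees trees, each on a random sub-sample of size r."""
    max_depth = math.ceil(math.log2(r))  # assumed height limit
    return [build_tree(rng.sample(data, r), 0, max_depth, rng) for _ in range(n_trees)]


def anomaly_score(forest, point, r):
    """s = 2 ** (-E(h) / C(r)); values near 1 suggest an outlier."""
    mean_h = sum(path_length(t, point) for t in forest) / len(forest)
    return 2.0 ** (-mean_h / c(r))


rng = random.Random(0)
normal_levels = [(12.0 + rng.gauss(0.0, 0.2),) for _ in range(300)]  # synthetic data
forest = fit_forest(normal_levels, n_trees=100, r=64, rng=rng)
print(anomaly_score(forest, (12.0,), 64), anomaly_score(forest, (20.0,), 64))
```

A level far from the training range is isolated after few splits, so its score comes out higher than that of a typical level.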
Step 4, interpolating the time series data again until the data are fully repaired: interpolate the outlier-screened time series data again using a KNN algorithm model, repairing the missing values until all missing values in all time series are filled, and finally obtaining complete groundwater level monitoring data.
The specific process of interpolating the outlier-screened time series data again and repairing the missing values is as follows: for each missing value in the time series data, calculate the Euclidean distance between the missing value and the surrounding known groundwater level records, sort the records in ascending order of distance and select the first k, and take the mean of these k groundwater level records as the fill value for the missing value;
the Euclidean distance is calculated as:
$$d = \sqrt{(x_i - y_i)^2 + (x_j - y_j)^2}$$
where x_i, x_j are the coordinates of the missing value and y_i, y_j are the coordinates of a known recorded value;
the fill value is calculated as:
$$\mathrm{mean} = \frac{1}{k}\sum_{i=1}^{k} w_i$$
where mean is the fill value for the missing value and w_i is the i-th of the k selected groundwater level records.
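The KNN fill described above can be sketched as below. The text does not say what the two coordinates of a record are, so this sketch assumes they are the record's (row, column) position in a grid, for example (station index, time index); the function name `knn_fill` is hypothetical.

```python
import math


def knn_fill(grid, k):
    """Fill each None in a 2-D grid with the mean of its k nearest known values.

    Assumption: a record's coordinates are its (row, column) position,
    for example (station index, time index); distance is Euclidean.
    Only originally known values are used as neighbours.
    """
    known = [(i, j, v) for i, row in enumerate(grid)
             for j, v in enumerate(row) if v is not None]
    filled = [list(row) for row in grid]
    for i, row in enumerate(grid):
        for j, v in enumerate(row):
            if v is None:
                nearest = sorted(
                    known, key=lambda rec: math.hypot(i - rec[0], j - rec[1])
                )[:k]
                filled[i][j] = sum(rec[2] for rec in nearest) / len(nearest)
    return filled


levels = [[12.0, None, 12.5, 13.0]]
print(knn_fill(levels, k=2))  # [[12.0, 12.25, 12.5, 13.0]]
```

Using only originally known values as neighbours keeps one filled value from influencing the next.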
Example 1
The present embodiment is a specific application example of the above method.
Linyi City is located in southeastern Shandong Province, at 34°17′ to 36°23′ north latitude and 117°25′ to 119°11′ east longitude. Controlled by geological structure, a series of fault-block uplifts and fault-block depressions has formed within Linyi, and the middle and low mountain terrain formed by fault-block uplifts of ancient crystalline rock, such as Mengshan, is the natural watershed for surface water and groundwater in the city. According to the occurrence conditions of groundwater in Linyi, the water-bearing properties of the rock and the hydraulic characteristics of the groundwater, the groundwater of the whole city is divided into four types of water-bearing rock groups: the loose rock pore water-bearing rock group, the pore-fissure water-bearing rock group, the carbonate rock fissure-karst water-bearing rock group, and the bedrock fissure water-bearing rock group.
The groundwater level monitoring data obtained in this embodiment are the water level records of 8 monitoring points over a 10-year span (2006 to 2015), with 6 records per month. The data of monitoring points W_1, W_2 and W_6 are complete; the data of the remaining monitoring points are incomplete, and the missing rate of most of the time series is about 40%.
The original groundwater level monitoring data of the 8 monitoring points are converted into a time series data set T (T = {t_1, …, t_8}), and groundwater data missing on any date are marked. The complete time series t_1, t_2 and t_6 in the data set T are then selected to construct a time series subset T_1.
The time series data sets T and T_1 are each converted into a three-dimensional tensor.
Values are deleted at random from the tensor built from T_1 to produce missing rates of 10%, 20%, 30% and 40%. The tensors with randomly missing data are input to the BTF model, the number of Gibbs sampling iterations is set between 100 and 500, and the sampling count that best fits the model hyperparameters is found using the accuracy evaluation indexes RMSE and MAPE. Smaller RMSE and MAPE values indicate better interpolation results that are closer to the true values. The experimental results are shown in Table 1.
TABLE 1
The tensor built from T is input to the trained BTF model for interpolation, and the interpolated tensor is then converted back into the time series T_R (T_R = {t_R1, …, t_R8}).
The time series subset T_1 is divided into training data and test data, and an isolation forest model is trained and validated. The algorithm parameters to initialize include the number of trees and the proportion of abnormal data. The model is trained with the training data: the normalized data are imported into the isolation forest model, the trend of each subsequence (t_1, t_2, t_6) and the relation between abnormal and normal values are mined, the model parameters are adjusted iteratively, and the number of trees and the abnormal data proportion of the isolation forest model are finally determined.
The subsequences t_R1 to t_R8 of the time series T_R are input one by one into the trained isolation forest model, and the outliers in each subsequence are marked as missing values and removed.
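Re-marking flagged values as missing can be sketched as follows. The 0.6 score threshold, the sample scores and the choice to leave original observations untouched are assumptions made for illustration; the text specifies none of them.

```python
def mark_outliers_as_missing(values, scores, was_interpolated, threshold=0.6):
    """Set interpolated values whose anomaly score exceeds `threshold` to None.

    Assumptions: the 0.6 threshold is illustrative only, and original
    observations are left untouched so that only interpolated values are
    re-marked as missing.
    """
    return [
        None if filled and score > threshold else value
        for value, score, filled in zip(values, scores, was_interpolated)
    ]


levels = [12.1, 15.9, 12.3, 12.2]
scores = [0.42, 0.71, 0.45, 0.66]
was_interpolated = [False, True, True, False]
print(mark_outliers_as_missing(levels, scores, was_interpolated))
# [12.1, None, 12.3, 12.2]
```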
The marked time series T_R is input to the KNN algorithm model. The Euclidean distance between each missing value and the surrounding known groundwater level data is calculated, and the mean of the k nearest groundwater level values is taken as the fill value for the missing value, until all marked groundwater level data have been re-interpolated.
This example compares the results of the groundwater time series interpolation method that uses BTF alone with those of the method of the present invention, i.e. the BTF- and KNN-based groundwater time series interpolation method with isolation forest outlier detection (IF-BTFK), as shown in Table 2 and FIG. 2.
TABLE 2
As can be seen from the above, groundwater level data have two distinct features: (1) the fluctuations are small; (2) much of the data is missing, including runs of consecutive missing values. When an existing spatiotemporal interpolation method is applied directly to long groundwater time series, it can simulate the trend of the data effectively, but some outliers are unavoidable. The repair method of the present invention overcomes the generation of outliers during interpolation, so that the interpolated time series is closer to the true time series.
Finally, it should be noted that the above description only illustrates the technical solution of the present invention and does not limit it. Although the present invention has been described in detail with reference to the preferred embodiments, those skilled in the art will understand that modifications and equivalent substitutions may be made without departing from the spirit and scope of the technical solution of the present invention.

Claims (5)

1. A method for repairing a loss of groundwater level monitoring data, the method comprising the steps of:
step 1, acquiring and arranging groundwater level monitoring data to form a groundwater spatio-temporal data set: acquiring water level monitoring data of groundwater level monitoring points in the past year, sorting the water level monitoring data into time series data with time labels and a uniform recording interval, and marking missing values to form the groundwater spatio-temporal data set;
step 2, interpolating the time series data containing missing values by using a BTF model: taking the groundwater spatio-temporal data set as basic data, selecting time series data containing missing values in the data set as a target sequence, selecting complete time series data in the data set as a training set, training the BTF model by using the training set, adjusting model parameters in combination with a Gibbs sampling algorithm, and then interpolating the target sequence by using the trained model, until all time series data containing missing values have been interpolated;
step 3, performing abnormal value detection on the interpolated time series data by using an isolated forest model: selecting complete time series data in the groundwater spatio-temporal data set, dividing the selected time series data into training data and test data, training the isolated forest model by using the training data, verifying the model by using the test data, adjusting model parameters, detecting abnormal values in the interpolated time series data by using the trained isolated forest model, and marking the abnormal values in the interpolation result as missing values again and removing them;
step 4, interpolating the time series data again until the data are completely repaired: interpolating the time series data subjected to abnormal value detection again by using a KNN algorithm model to repair the missing values, until the missing values in all time series data have been filled, finally obtaining complete groundwater level monitoring data.
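The data arrangement in step 1 can be illustrated with a minimal sketch, assuming daily records held as `(timestamp, level)` pairs. The function name `to_regular_series`, the use of `None` as the missing-value marker, and the sample values are hypothetical choices made for illustration, not details taken from the claims.

```python
from datetime import datetime, timedelta

def to_regular_series(records, start, end, step):
    """Place (timestamp, level) records on a uniform time grid.

    Grid points without a record are marked as missing with None.
    (Hypothetical helper; the marker and the grid are illustrative choices.)
    """
    by_time = {t: v for t, v in records}
    series = []
    t = start
    while t <= end:
        series.append((t, by_time.get(t)))  # None marks a missing value
        t += step
    return series

# Illustrative records from one monitoring point; two days are absent.
records = [
    (datetime(2023, 1, 1), 12.31),
    (datetime(2023, 1, 2), 12.29),
    (datetime(2023, 1, 5), 12.35),
]
series = to_regular_series(
    records, datetime(2023, 1, 1), datetime(2023, 1, 5), timedelta(days=1)
)
missing = [t for t, v in series if v is None]
print(len(series), "grid points,", len(missing), "marked as missing")
```

Repeating this for every monitoring point on the same grid would give one row per point, which is the shape a spatio-temporal data set of this kind typically takes.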
2. The method for repairing a loss of groundwater level monitoring data according to claim 1, wherein the specific process of training the BTF model by using the training set in step 2 is as follows: randomly deleting 10%, 20%, 30% and 40% of the time series data in the training set to simulate the missing-data situation of real groundwater level time series data, then interpolating the time series by using the BTF model, and evaluating the interpolation results with two indexes, the mean absolute percentage error MAPE and the root mean square error RMSE, so as to compare the interpolation effect and efficiency of BTF models with different parameters and determine the optimal model parameters;
the mean absolute percentage error MAPE is calculated as:
$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{a_i - b_i}{a_i} \right|$$
the root mean square error RMSE is calculated as:
$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left( a_i - b_i \right)^2}$$
wherein a_i is the i-th value of the original groundwater level time series data; b_i is the corresponding interpolation result; and n is the number of values compared.
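The evaluation procedure of claim 2 can be sketched as follows. The BTF model itself is not implemented here: `mean_fill` is a deliberately simple stand-in imputer so that the example runs, and `mask_random`, the fixed seed, and the synthetic `levels` series are all assumptions made for illustration. The errors are computed only on the deleted positions, which is one reasonable reading of the claim.

```python
import math
import random

def mape(actual, imputed):
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs((a - b) / a) for a, b in zip(actual, imputed)) / len(actual)

def rmse(actual, imputed):
    """Root mean square error."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(actual, imputed)) / len(actual))

def mask_random(series, rate, seed=0):
    """Randomly replace a fraction `rate` of the values with None."""
    rng = random.Random(seed)
    dropped = sorted(rng.sample(range(len(series)), round(rate * len(series))))
    drop_set = set(dropped)
    masked = [None if i in drop_set else v for i, v in enumerate(series)]
    return masked, dropped

def mean_fill(masked):
    """Stand-in imputer (NOT BTF): fill gaps with the mean of the known values."""
    known = [v for v in masked if v is not None]
    mean = sum(known) / len(known)
    return [mean if v is None else v for v in masked]

# Synthetic, slowly varying water levels used only for illustration.
levels = [12.0 + 0.05 * math.sin(i / 5.0) for i in range(100)]

for rate in (0.1, 0.2, 0.3, 0.4):
    masked, dropped = mask_random(levels, rate)
    filled = mean_fill(masked)
    a = [levels[i] for i in dropped]   # original values at deleted positions
    b = [filled[i] for i in dropped]   # imputed values at the same positions
    print(f"{rate:.0%} deleted: MAPE={mape(a, b):.3f}%  RMSE={rmse(a, b):.4f}")
```

In practice the loop would be repeated for each candidate parameter set of the imputation model, and the set with the lowest errors across the four deletion rates would be kept.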
3. The method for repairing a loss of groundwater level monitoring data according to claim 1, wherein in step 3, the dividing the selected time series data into training data and test data is specifically as follows: 70% of the selected time series data are classified as training data and 30% are classified as test data.
4. The method for repairing a loss of groundwater level monitoring data according to claim 1, wherein the specific process of performing outlier detection on the interpolated time series data by using the trained isolated forest model in step 3 is as follows:
step 31, randomly selecting r sample data points from the training set as a sub-sampling set Q = {q_1, q_2, …, q_r}, each data point having dimension z, and taking Q as the root node of the tree;
step 32, randomly selecting a dimension B and a splitting point p from the current sub-sampling set, wherein p is between the maximum value and the minimum value of the dimension B in the current sub-sampling set;
step 33, for each sample data point q_i (1 ≤ i ≤ r) in the sub-sampling set, dividing according to its value q_i(B) in dimension B: if q_i(B) < p, the data point is divided into the left subtree; otherwise it is divided into the right subtree;
step 34, repeatedly executing the steps 32-33, and continuously constructing new left and right subtrees until one of the following conditions is met:
1) Only one data point or a plurality of identical data points are left in Q, and cannot be further divided;
2) The height of the isolation tree reaches a limited height;
step 35, repeatedly executing the steps 31-34 until the number of the isolation trees reaches the designated number N, and forming an isolated forest by the isolation trees;
step 36, for any interpolated groundwater level datum l, calculating the path length h(l) of l in each isolation tree by traversing every isolation tree in the isolated forest, then calculating the expected path length E(h(l)) of l in the isolated forest, and normalizing it by the average path length C(r) of an isolation tree:
$$C(r) = 2H(r-1) - \frac{2(r-1)}{r}$$
wherein H(r) is the harmonic function, H(r) = ln(r) + δ, and δ is the Euler constant;
the anomaly score s of the interpolated groundwater level data l is defined as:
$$s(l, r) = 2^{-\frac{E(h(l))}{C(r)}}$$
anomaly identification is performed on the interpolated groundwater level data l according to the following criteria:
1) When E(h(l)) → 0, i.e., s → 1, the groundwater level data l is identified as abnormal data;
2) When E(h(l)) → r-1, i.e., s → 0, the groundwater level data l is identified as normal data.
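Steps 31 to 36 can be sketched in plain Python as below. Several details are assumptions rather than statements of the claim: the height limit is set to ceil(log2 r), the split dimension is drawn only from dimensions that still vary, the leaf adjustment `c(size)` follows the usual isolation forest convention, and the sample data, seeds, and function names are hypothetical.

```python
import math
import random

EULER_GAMMA = 0.5772156649  # Euler constant (delta in the claim)

def c(r):
    """Average path length C(r) = 2H(r-1) - 2(r-1)/r with H(m) = ln(m) + delta."""
    if r <= 1:
        return 0.0
    if r == 2:
        return 1.0
    return 2.0 * (math.log(r - 1) + EULER_GAMMA) - 2.0 * (r - 1) / r

def build_tree(points, height, limit, rng):
    # Stop conditions of step 34: one point, identical points, or height limit.
    if height >= limit or len(points) <= 1 or all(p == points[0] for p in points):
        return {"size": len(points)}
    # Assumption: pick the split dimension among those that still vary.
    dims = [d for d in range(len(points[0]))
            if min(p[d] for p in points) < max(p[d] for p in points)]
    dim = rng.choice(dims)
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    split = rng.uniform(lo, hi)  # split point between min and max (step 32)
    left = [p for p in points if p[dim] < split]
    right = [p for p in points if p[dim] >= split]
    return {"dim": dim, "split": split,
            "left": build_tree(left, height + 1, limit, rng),
            "right": build_tree(right, height + 1, limit, rng)}

def path_length(tree, x, depth=0):
    if "size" in tree:
        # Assumption: unresolved leaves are credited with c(size) extra depth.
        return depth + c(tree["size"])
    child = tree["left"] if x[tree["dim"]] < tree["split"] else tree["right"]
    return path_length(child, x, depth + 1)

def build_forest(data, n_trees=100, r=64, seed=0):
    rng = random.Random(seed)
    r = min(r, len(data))
    limit = math.ceil(math.log2(r))  # assumed height limit
    return [build_tree(rng.sample(data, r), 0, limit, rng) for _ in range(n_trees)], r

def anomaly_score(forest, r, x):
    mean_h = sum(path_length(t, x) for t in forest) / len(forest)
    return 2.0 ** (-mean_h / c(r))

# Synthetic one-dimensional levels with small fluctuation, for illustration only.
data_rng = random.Random(1)
normal = [(12.0 + data_rng.gauss(0, 0.05),) for _ in range(200)]
forest, r = build_forest(normal)
s_typical = anomaly_score(forest, r, (12.0,))
s_spike = anomaly_score(forest, r, (15.0,))
print(f"typical level score={s_typical:.3f}, spike score={s_spike:.3f}")
```

An interpolated value whose score exceeds a chosen threshold would then be marked as missing again, as step 3 of claim 1 describes. The threshold itself is not specified in the claims and would have to be tuned on the test data.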
5. The method for repairing a loss of groundwater level monitoring data according to claim 1, wherein in step 4, the specific process of interpolating the time series data subjected to abnormal value detection again by using the KNN algorithm model to repair the missing values is as follows: for each missing value in the time series data, calculating the Euclidean distance between the missing value and the surrounding known groundwater level recorded values, then sorting the recorded values in ascending order of distance, selecting the first k of them, and taking their mean as the complement value of the missing value;
the Euclidean distance is calculated as:
$$d = \sqrt{(x_i - y_i)^2 + (x_j - y_j)^2}$$
wherein x_i, x_j are the coordinates of the missing value; y_i, y_j are the coordinates of the known recorded value;
the complement value is calculated as:
$$\mathrm{Mean} = \frac{1}{k} \sum_{i=1}^{k} w_i$$
wherein Mean is the complement value corresponding to the missing value; w_i is the i-th of the k selected groundwater level recorded values.
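The KNN repair of claim 5 can be sketched as below, assuming each value is located by a two-dimensional coordinate. What the two coordinates represent (for example spatial position, or time index and monitoring point index) is not stated in the claim, so the layout here, the function name `knn_fill`, and the sample values are hypothetical.

```python
import math

def knn_fill(missing_xy, known, k=3):
    """Complement value for one missing entry.

    missing_xy: (x, y) coordinates of the missing value.
    known: list of ((x, y), level) pairs for known recorded values.
    Returns the mean level of the k nearest known records.
    """
    # Sort known records in ascending order of Euclidean distance.
    ranked = sorted(known, key=lambda rec: math.dist(missing_xy, rec[0]))
    nearest = ranked[:k]
    return sum(level for _, level in nearest) / len(nearest)

# Illustrative known records; the last one is far away and should be ignored.
known = [
    ((0.0, 0.0), 12.0),
    ((1.0, 0.0), 12.2),
    ((0.0, 1.0), 12.4),
    ((5.0, 5.0), 20.0),
]
value = knn_fill((0.2, 0.2), known, k=3)
print(f"complement value: {value:.2f}")
```

Using an unweighted mean matches the complement formula in the claim. A distance-weighted mean is a common variant, but it is not what the claim describes.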
CN202310591040.7A 2023-05-24 2023-05-24 Method for repairing loss of groundwater level monitoring data Active CN116627953B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310591040.7A CN116627953B (en) 2023-05-24 2023-05-24 Method for repairing loss of groundwater level monitoring data

Publications (2)

Publication Number Publication Date
CN116627953A true CN116627953A (en) 2023-08-22
CN116627953B CN116627953B (en) 2023-10-27

Family

ID=87591469

Country Status (1)

Country Link
CN (1) CN116627953B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant