CN112765141A - Continuous large-scale water quality missing data filling method based on transfer learning - Google Patents

Continuous large-scale water quality missing data filling method based on transfer learning

Info

Publication number
CN112765141A
Authority
CN
China
Prior art keywords
data
water quality
training
filling
training samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202110040587.9A
Other languages
Chinese (zh)
Inventor
蒋鹏
陈锃
许欢
刘俊
林广�
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Dianzi University
Original Assignee
Hangzhou Dianzi University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Dianzi University filed Critical Hangzhou Dianzi University
Priority to CN202110040587.9A priority Critical patent/CN112765141A/en
Publication of CN112765141A publication Critical patent/CN112765141A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/049 Temporal neural networks, e.g. delay elements, oscillating neurons or pulsed inputs
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention relates to a method for filling continuous large-scale missing water quality data based on transfer learning. The method first preprocesses the data and constructs training and test samples with a sliding window algorithm. Data filling then proceeds as follows: the training samples of the target domain and of the source domain are fused into a new mixed training sample set; in each iteration, a new weak learner for filling data is constructed; the average prediction filling error is calculated on the mixed training samples; the weight iteration update coefficients of the source-domain and target-domain training samples are calculated separately; the weights of the source-domain and target-domain training samples are updated for the next iteration; and the output values of all weak learners are weighted and averaged to obtain the final predicted filling value of the strong learner. When handling large-scale continuous missing data, the invention improves filling accuracy by about 15% to 25%.

Description

Continuous large-scale water quality missing data filling method based on transfer learning
Technical Field
The invention relates to a water quality missing data filling method, in particular to a large-scale continuous water quality missing data filling method based on transfer learning.
Background
With the rapid development of industrialization and urbanization, water resource conservation and water pollution control have become some of the most pressing topics of concern worldwide. To control water pollution and reduce its adverse effects on aquatic ecosystems and human society, many researchers have carried out extensive work (including spatiotemporal prediction of water quality, evaluation of pollutant impact factors, and data-driven water quality models) to improve water quality monitoring in small watersheds.
In these studies, an effective, high-quality water quality data set is an important prerequisite for reasonable and reliable results. However, most water quality data, such as ammonia nitrogen, pH, and dissolved oxygen, are obtained by automatic sampling with front-end biological heavy metal sensors at different water quality monitoring sites. The raw data contain a large number of missing values caused by factors such as equipment failure, periodic maintenance, insufficient sampling, and manual changes to sensor parameter settings. These missing data severely increase the limitations and difficulty of subsequent water quality research. Therefore, as more and more water quality research turns to data-driven analysis, missing data has become an urgent problem in this field.
Most existing studies have explored classical statistical methods (mean, median, etc.) or emerging machine learning and deep learning methods (expectation maximization, fuzzy clustering, support vector regression, extreme learning machines, etc.) to fill in missing data. However, these methods struggle with large-scale continuous data loss: conventional methods are applicable only when the missing rate is below 30%, and they do not consider missing rates of 50% to 90%. As the missing rate increases, the data around a gap can no longer provide the prior statistical information or the number of training samples needed to fill the gap accurately. These methods are therefore not applicable to large-scale continuous missing data.
With the advent of the big data era, the knowledge contained in data bears on many aspects of the country and society, and improved data processing and analysis techniques require complete and accurate data sets. Most existing data, however, is noisy or incomplete because of gaps in periodic sampling and analysis or input errors. How to handle such data problems effectively has therefore become a crucial task. The invention focuses on filling large-scale continuous missing data in the water quality field, which distinguishes it from conventional missing data filling methods.
Disclosure of Invention
To address the inability of existing techniques to fill large-scale continuous missing water quality data, the invention provides a large-scale continuous water quality missing data filling method based on transfer learning.
The invention comprises the following steps:
Data preprocessing:
cleaning and standardizing the incomplete data sequence collected from a sensor of a water quality monitoring station, and defining it as the experimental data;
using a time series similarity query method to find the monitoring station whose data are most similar to the incomplete data sequence, and taking those data as the reference data;
constructing training and test samples with a sliding window algorithm;
Data filling:
setting a water quality monitoring site that has only a small number of training samples and large-scale continuous missing data as the target domain, and a water quality monitoring site with complete training samples as the source domain;
fusing the training samples of the target domain and of the source domain into a new mixed training sample set;
initializing the weight distributions of the source-domain and target-domain training samples and the weak learner's weight update coefficient, setting the maximum number of iterations, and defining the weight distribution of the mixed training samples;
starting the iterative operation:
in each iteration, constructing a new weak learner for filling data;
calculating the average prediction filling error on the mixed training samples;
calculating the weight iteration update coefficients of the source-domain and target-domain training samples separately;
updating the weights of the source-domain and target-domain training samples for iteration t+1 from the weights at iteration t; this completes the training of one weak learner, and the iteration process restarts until the maximum number of iterations is reached, then jumps to the output;
Output: taking the weighted average of the output values of all weak learners to obtain the final predicted filling value of the strong learner.
The beneficial effects of the invention are as follows. In the TrAdaBoost-LSTM algorithm designed by the invention, the LSTM is well suited to time series data and can capture long-term dependencies, while transfer learning rests on the idea that related things are interconnected and transfers knowledge between similar data domains. The invention selects any water quality monitoring station with large-scale continuous missing data as the target-domain sample and uses a time series similarity query algorithm, dynamic time warping (DTW), to select the complete data of another monitoring station as the source-domain sample. Experimental results evaluated with indices such as RMSE, MAE, MAPE, and R-square show that, compared with conventional statistical, machine learning, and deep learning filling methods, the disclosed method improves filling accuracy by 15% to 25% on large-scale continuous missing data, and it also offers a potential reference for similar research in other fields.
Drawings
FIG. 1 shows the large-scale continuous missing data filling framework;
FIG. 2 illustrates the sliding window algorithm;
FIG. 3 shows the filling result for an on-site water quality monitoring station.
Detailed Description
As shown in FIG. 1, the missing data filling framework proposed by the invention can be divided into two parts: data preprocessing and execution of the filling algorithm.
In data preprocessing, the incomplete data sequence collected from the sensors of a water quality monitoring station is first cleaned and standardized, and is defined as the experimental data. Second, a time series similarity query method (dynamic time warping, DTW, in the invention) is used to find the monitoring station whose data are most similar to the incomplete data sequence, and those data are taken as the reference data. Finally, training and test samples are constructed with a sliding window algorithm.
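The preprocessing steps above can be sketched in a few lines. The following is a minimal, dependency-free illustration rather than the patent's implementation: the function names (`zscore`, `dtw_distance`, `most_similar_station`), the use of `None` to mark a missing reading, the absolute-difference local cost, and the choice to compare only the observed values of the incomplete series are all assumptions made for this example.

```python
import math
from statistics import mean, pstdev


def zscore(series):
    """Standardize the observed values; None marks a missing reading (assumption)."""
    observed = [x for x in series if x is not None]
    mu, sigma = mean(observed), pstdev(observed)
    sigma = sigma or 1.0  # guard against a constant series
    return [None if x is None else (x - mu) / sigma for x in series]


def dtw_distance(a, b):
    """Classic dynamic time warping with an absolute-difference local cost."""
    inf = math.inf
    prev = [0.0] + [inf] * len(b)  # row 0 of the cumulative cost table
    for x in a:
        curr = [inf]
        for j, y in enumerate(b, start=1):
            cost = abs(x - y)
            curr.append(cost + min(prev[j], curr[j - 1], prev[j - 1]))
        prev = curr
    return prev[-1]


def most_similar_station(target, candidates):
    """Name of the complete candidate series closest to the target under DTW."""
    observed = [x for x in zscore(target) if x is not None]
    return min(
        candidates,
        key=lambda name: dtw_distance(observed, zscore(candidates[name])),
    )


# Hypothetical data: one incomplete target series and two complete candidates.
target = [1.0, 2.0, None, None, 3.0, 2.0]
candidates = {
    "station_A": [1.0, 2.0, 3.0, 3.0, 3.0, 2.0],
    "station_B": [3.0, 1.0, 3.0, 1.0, 3.0, 1.0],
}
reference_station = most_similar_station(target, candidates)
```

Standardizing both series before the comparison keeps stations with different measurement ranges comparable; whether the patent applies DTW before or after normalization is not stated, so that ordering is also an assumption here.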
In executing the filling algorithm, the invention proposes TrAdaBoost-LSTM, a novel filling algorithm that fuses the instance-based transfer learning algorithm TrAdaBoost with an advanced deep learning algorithm, the long short-term memory (LSTM) neural network.
Note: the formulas for the time series similarity query method (such as dynamic time warping, DTW), the instance-based transfer learning algorithm TrAdaBoost, the deep learning LSTM algorithm, and the subsequent evaluation indices, namely root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and the coefficient of determination (R-square), are taken to be well known to those skilled in the art and are not expanded here.
The key or innovative technical points of the proposed filling framework are as follows:
1. As shown in FIG. 2, the sliding window algorithm is a common method in time series analysis. The main idea is to take, before the current time t, the
Figure BDA0002895177400000051
consecutive data points, namely
Figure BDA0002895177400000052
and associate them with the current time t. Here,
Figure BDA0002895177400000053
is called the sliding window size. The mathematical expression of the time series sliding window is
Figure BDA0002895177400000054
where $S = [S_1, S_2, S_3, \dots, S_N]$ is a complete time series,
Figure BDA0002895177400000055
is called an input (feature) of the time series S, and $\{S_t\}$ is called the output corresponding to that feature.
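The sliding window construction described above can be illustrated as follows. This is a minimal sketch under stated assumptions: the parameter name `window` stands in for the window-size symbol, which appears only in a formula image in the source, and skipping any sample that touches a missing reading (marked `None`) is an illustrative choice, not something the patent specifies.

```python
def sliding_window_samples(series, window):
    """Pair each value S_t with the `window` values that immediately precede it."""
    samples = []
    for t in range(window, len(series)):
        features = series[t - window:t]  # the `window` values before time t
        label = series[t]                # the value at time t
        # Assumption: a sample is usable only if no value in it is missing.
        if label is None or any(v is None for v in features):
            continue
        samples.append((features, label))
    return samples


complete = sliding_window_samples([1, 2, 3, 4, 5], window=2)
with_gap = sliding_window_samples([1, 2, None, 4, 5, 6], window=2)
```

With a long continuous gap, very few windows survive in the incomplete series, which is why the method borrows training samples from a complete source-domain station.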
2. In the invention, a water quality monitoring site that has only a small number of training samples and large-scale continuous missing data (a missing proportion above 50%) is defined as the target domain, and a water quality monitoring site with complete training samples is defined as the source domain. The steps of the TrAdaBoost-LSTM missing data filling algorithm are as follows:
Input: training samples of the source domain:
Figure BDA0002895177400000061
training samples for the target domain:
Figure BDA0002895177400000062
where
Figure BDA0002895177400000063
and
Figure BDA0002895177400000064
are the inputs of the training model,
Figure BDA0002895177400000065
and
Figure BDA0002895177400000066
are the outputs of the training model, M is the number of source-domain training samples, and N is the number of target-domain training samples.
Step 1: Fuse the training samples of the target domain and the source domain into a new mixed training sample set: $\{F_k, L_k\}$ $(k = 1, 2, 3, \dots, M+N)$.
Step 2: Initialize the weight distributions of the source-domain and target-domain training samples:
Figure BDA0002895177400000067
Figure BDA0002895177400000068
And
Figure BDA0002895177400000069
Initialize the weight iteration update coefficient of the mixed training samples in the weak learner (LSTM):
Figure BDA00028951774000000610
Set the maximum number of iterations iter, and define the weight distribution ω of the mixed training samples as
Figure BDA00028951774000000611
Step 3: In each iteration, construct a new weak learner for filling data, an LSTM. The input of the weak learner is $\{F_k\}$ $(k = 1, 2, 3, \dots, M+N)$ and its output is $\{L_k\}$ $(k = 1, 2, 3, \dots, M+N)$.
Step 4: Calculate the average prediction filling error on the mixed training samples. The predicted filling value on the training set is $Y_k$ $(k = 1, 2, 3, \dots, M+N)$. The filling error is given by formula (2):
Figure BDA00028951774000000612
Step 5: Calculate the weight iteration update coefficient $\beta_t$ of the source-domain and target-domain training samples, as shown in formula (3):
Figure BDA0002895177400000071
Step 6: Update the weights of the source-domain and target-domain training samples for iteration t+1 from the weights at iteration t, respectively, as follows:
Figure BDA0002895177400000072
This completes the training of one weak learner. Return to Step 3 until the maximum number of iterations is reached, then jump to the output.
Output: Take the weighted average of the output values of all weak learners to obtain the final predicted filling value of the strong learner, as given by formula (4):
Figure BDA0002895177400000073
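The iteration in Steps 1 to 6 and the final weighted average can be sketched end to end. The following is a hedged illustration only: formulas (2) to (4) and the initial weights survive only as images in the source, so the code substitutes the conventional TrAdaBoost-style expressions (uniform initial weights, a fixed source coefficient `beta_src`, `beta_t = eps / (1 - eps)`, and learner weights `log(1 / beta_t)`), all of which are assumptions. The LSTM weak learner is replaced by a weighted least-squares line on a single scalar feature so the example runs without any deep learning library; the names `fit_weighted_line` and `tradaboost_fill` are hypothetical.

```python
import math


def fit_weighted_line(xs, ys, ws):
    """Weighted least-squares line; a stand-in for the LSTM weak learner."""
    total = sum(ws)
    mx = sum(w * x for w, x in zip(ws, xs)) / total
    my = sum(w * y for w, y in zip(ws, ys)) / total
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx if sxx > 0 else 0.0
    return lambda x: my + slope * (x - mx)


def tradaboost_fill(source, target, iterations=10):
    """Train on (feature, label) pairs and return a predictor for the target domain."""
    m, n = len(source), len(target)
    # Step 1: fuse source and target samples (source first, then target).
    xs = [f for f, _ in source] + [f for f, _ in target]
    ys = [lab for _, lab in source] + [lab for _, lab in target]
    # Step 2: assumed uniform initial weights and classic TrAdaBoost source coefficient.
    weights = [1.0 / (m + n)] * (m + n)
    beta_src = 1.0 / (1.0 + math.sqrt(2.0 * math.log(m) / iterations))
    learners = []
    for _ in range(iterations):
        total = sum(weights)
        p = [w / total for w in weights]              # normalized weight distribution
        model = fit_weighted_line(xs, ys, p)          # Step 3: new weak learner
        abs_err = [abs(model(x) - y) for x, y in zip(xs, ys)]
        scale = max(abs_err) or 1.0
        err = [e / scale for e in abs_err]            # per-sample error scaled to [0, 1]
        eps = sum(pk * ek for pk, ek in zip(p, err))  # Step 4: weighted average error
        eps = min(max(eps, 1e-10), 0.499)             # keep beta_t strictly inside (0, 1)
        beta_t = eps / (1.0 - eps)                    # Step 5 (assumed form)
        learners.append((model, math.log(1.0 / beta_t)))
        # Step 6: shrink poorly fitted source samples, boost poorly fitted target samples.
        weights = [
            p[k] * (beta_src ** err[k] if k < m else beta_t ** (-err[k]))
            for k in range(m + n)
        ]
    # Output: weighted average over all weak learners.
    norm = sum(alpha for _, alpha in learners)
    return lambda x: sum(alpha * h(x) for h, alpha in learners) / norm


# Hypothetical data: a complete source station and a sparse, slightly shifted target.
source = [(float(x), 2.0 * x + 1.0) for x in range(10)]
target = [(2.0, 5.5), (5.0, 11.5), (8.0, 17.5)]
predict = tradaboost_fill(source, target, iterations=10)
filled_value = predict(4.0)
```

In the patent's setting each feature would be a sliding-window vector and each weak learner an LSTM trained with the current sample weights; only the reweighting logic carries over from this sketch.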
The invention uses four indices to evaluate filling performance: root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE), and R-square. Using the missing dissolved oxygen concentration data from on-site water quality monitoring stations in the Qiantang River basin, Hangzhou, Zhejiang, as the experimental case, the method was compared with five conventional filling algorithms: mean filling (MEAN), autoregressive integrated moving average (ARIMA), support vector regression (SVR), extreme learning machine (ELM), and long short-term memory network (LSTM). The results are shown in Table 1. Compared with the other five algorithms, the proposed algorithm has the lowest RMSE, MAE, and MAPE and the highest R-square at both low and high missing rates, which indicates that the invention fills missing data effectively. The filling result for the on-site dissolved oxygen concentration is shown in FIG. 3.
Table 1: Performance comparison of different filling algorithms at different missing rates
Figure BDA0002895177400000081
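The four evaluation indices named above have standard definitions, which the description leaves unexpanded. A minimal sketch follows, assuming `y_true` holds the withheld true values and `y_pred` the filled values at the same positions; the function names are illustrative, and MAPE is returned in percent and is undefined when a true value is zero.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent; true values must be nonzero."""
    return 100.0 * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred)) / len(y_true)


def r_square(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical values, for illustration only (not taken from Table 1).
y_true = [2.0, 4.0, 6.0]
y_pred = [2.0, 5.0, 6.0]
scores = {
    "RMSE": rmse(y_true, y_pred),
    "MAE": mae(y_true, y_pred),
    "MAPE": mape(y_true, y_pred),
    "R-square": r_square(y_true, y_pred),
}
```

Lower RMSE, MAE, and MAPE and a higher R-square indicate a better fill, which is the direction of comparison used in Table 1.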
The above embodiments are merely to illustrate the technical solutions of the present invention and not to limit the present invention, and the present invention has been described in detail with reference to the preferred embodiments. It will be understood by those skilled in the art that various modifications and equivalent arrangements may be made without departing from the spirit and scope of the present invention and it should be understood that the present invention is to be covered by the appended claims.

Claims (7)

1. A continuous large-scale water quality missing data filling method based on transfer learning, characterized by comprising the following steps:
data preprocessing:
cleaning and standardizing the incomplete data sequence collected from a sensor of a water quality monitoring station, and defining it as the experimental data;
using a time series similarity query method to find the monitoring station whose data are most similar to the incomplete data sequence, and taking those data as the reference data;
constructing training and test samples with a sliding window algorithm;
data filling:
setting a water quality monitoring site that has only a small number of training samples and large-scale continuous missing data as the target domain, and a water quality monitoring site with complete training samples as the source domain;
fusing the training samples of the target domain and of the source domain into a new mixed training sample set;
initializing the weight distributions of the source-domain and target-domain training samples and the weak learner's weight update coefficient, setting the maximum number of iterations, and defining the weight distribution of the mixed training samples;
starting the iterative operation:
in each iteration, constructing a new weak learner for filling data;
calculating the average prediction filling error on the mixed training samples;
calculating the weight iteration update coefficients of the source-domain and target-domain training samples separately;
updating the weights of the source-domain and target-domain training samples for iteration t+1 from the weights at iteration t; this completes the training of one weak learner, and the iteration process restarts until the maximum number of iterations is reached, then jumps to the output;
output: taking the weighted average of the output values of all weak learners to obtain the final predicted filling value of the strong learner.
2. The continuous large-scale water quality missing data filling method based on the transfer learning according to claim 1, characterized in that: the method for time series similarity query uses a dynamic time warping algorithm.
3. The continuous large-scale water quality missing data filling method based on the transfer learning according to claim 1, characterized in that:
weight distribution after initialization of training samples of the source domain:
Figure FDA0002895177390000021
weight distribution after initialization of training samples of the target domain:
Figure FDA0002895177390000022
the weak learner is a long short-term memory (LSTM) network, and the weight iteration update coefficient of the mixed training samples in the LSTM is:
Figure FDA0002895177390000023
where M is the number of source-domain training samples, N is the number of target-domain training samples, $i = 1, 2, 3, \dots, M$, $j = 1, 2, 3, \dots, N$, and iter is the maximum number of iterations.
4. The continuous large-scale water quality missing data filling method based on the transfer learning according to claim 3, characterized in that: the filling error is calculated as follows:
Figure FDA0002895177390000024
where $Y_k$ is the predicted filling value, $L_k$ is the output of the weak learner LSTM,
Figure FDA0002895177390000031
k=1,2,3…,M+N。
5. The continuous large-scale water quality missing data filling method based on the transfer learning according to claim 4, characterized in that: the weight iteration update coefficient $\beta_t$ is calculated as follows:
Figure FDA0002895177390000032
6. The continuous large-scale water quality missing data filling method based on the transfer learning according to claim 5, characterized in that:
the new weight of the source-domain training sample at iteration t+1 is:
Figure FDA0002895177390000033
the new weight of the target-domain training sample at iteration t+1 is:
Figure FDA0002895177390000034
7. The continuous large-scale water quality missing data filling method based on the transfer learning according to claim 6, characterized in that: the final predicted filling value $pred\_value_k$ is calculated as follows:
Figure FDA0002895177390000035

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110040587.9A CN112765141A (en) 2021-01-13 2021-01-13 Continuous large-scale water quality missing data filling method based on transfer learning

Publications (1)

Publication Number Publication Date
CN112765141A true CN112765141A (en) 2021-05-07

Family

ID=75700023

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110040587.9A Pending CN112765141A (en) 2021-01-13 2021-01-13 Continuous large-scale water quality missing data filling method based on transfer learning

Country Status (1)

Country Link
CN (1) CN112765141A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113447970A (en) * 2021-06-28 2021-09-28 潍柴动力股份有限公司 Navigation data continuous and reliable data filling method and device and navigation system

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104933418A (en) * 2015-06-25 2015-09-23 西安理工大学 Population size counting method of double cameras
CN107944874A (en) * 2017-12-13 2018-04-20 阿里巴巴集团控股有限公司 Air control method, apparatus and system based on transfer learning
CN109143199A (en) * 2018-11-09 2019-01-04 大连东软信息学院 Sea clutter small target detecting method based on transfer learning
CN109948715A (en) * 2019-03-22 2019-06-28 杭州电子科技大学 A kind of water monitoring data missing values complementing method
KR20200068056A (en) * 2018-11-26 2020-06-15 한국과학기술원 Method and System for Power Load Forecasting based on Pattern Tagging

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
DONGJIE TANG et al.: "Improving the transferability of the crash prediction model using the TrAdaBoost.R2 algorithm", ACCIDENT ANALYSIS & PREVENTION *
JUN MA et al.: "Transfer learning for long-interval consecutive missing values imputation without external features in air pollution time series", ADVANCED ENGINEERING INFORMATICS *

Similar Documents

Publication Publication Date Title
CN105391083B (en) Wind power interval short term prediction method based on variation mode decomposition and Method Using Relevance Vector Machine
Wang et al. An ensemble hybrid forecasting model for annual runoff based on sample entropy, secondary decomposition, and long short-term memory neural network
CN108009674A (en) Air PM2.5 concentration prediction methods based on CNN and LSTM fused neural networks
CN111310968A (en) LSTM neural network circulation hydrological forecasting method based on mutual information
CN109583565B (en) Flood prediction method based on attention model long-time and short-time memory network
CN110751318B (en) Ultra-short-term power load prediction method based on IPSO-LSTM
CN108876021B (en) Medium-and-long-term runoff forecasting method and system
CN105740984A (en) Product concept performance evaluation method based on performance prediction
CN110824915A (en) GA-DBN network-based intelligent monitoring method and system for wastewater treatment
CN107798431A (en) A kind of Medium-and Long-Term Runoff Forecasting method based on Modified Elman Neural Network
CN112733997A (en) Hydrological time series prediction optimization method based on WOA-LSTM-MC
CN112330065A (en) Runoff forecasting method based on basic flow segmentation and artificial neural network model
CN110807490A (en) Intelligent prediction method for construction cost of power transmission line based on single-base tower
CN115906954A (en) Multivariate time sequence prediction method and device based on graph neural network
CN113033081A (en) Runoff simulation method and system based on SOM-BPNN model
CN114694379A (en) Traffic flow prediction method and system based on self-adaptive dynamic graph convolution
Ye et al. A nonlinear interactive grey multivariable model based on dynamic compensation for forecasting the economy-energy-environment system
CN112765141A (en) Continuous large-scale water quality missing data filling method based on transfer learning
CN114117953A (en) Hydrological model structure diagnosis method, runoff forecasting method and device
CN114239397A (en) Soft measurement modeling method based on dynamic feature extraction and local weighted deep learning
CN106021924B (en) Sewage online soft sensor method based on more attribute gaussian kernel function fast correlation vector machines
CN111310974A (en) Short-term water demand prediction method based on GA-ELM
CN110909492A (en) Sewage treatment process soft measurement method based on extreme gradient lifting algorithm
CN116561569A (en) Industrial power load identification method based on EO feature selection and AdaBoost algorithm
CN115577856A (en) Method and system for predicting construction cost and controlling balance of power transformation project

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20210507