CN112765141A

CN112765141A - Continuous large-scale water quality missing data filling method based on transfer learning

Info

Publication number: CN112765141A
Application number: CN202110040587.9A
Authority: CN
Inventors: 蒋鹏; 陈锃; 许欢; 刘俊; 林广�
Original assignee: Hangzhou Dianzi University
Current assignee: Hangzhou Dianzi University
Priority date: 2021-01-13
Filing date: 2021-01-13
Publication date: 2021-05-07

Abstract

The invention relates to a continuous large-scale water quality missing data filling method based on transfer learning. The method first preprocesses the data and uses a sliding window algorithm to construct training and test samples. Data filling is then carried out as follows: the training samples of a target domain and of a source domain are fused into a new mixed training sample set; in each iteration, a new weak learner for filling data is constructed; the average prediction filling error on the mixed training samples is calculated; the weight iteration update coefficients of the source-domain and target-domain training samples are calculated respectively; the weights of the source-domain and target-domain training samples are updated for the next iteration; and the output values of all weak learners are combined by weighted average to obtain the final predicted filling value of the strong learner. When handling large-scale continuous missing data, the invention improves filling accuracy by about 15% to 25%.

Description

Continuous large-scale water quality missing data filling method based on transfer learning

Technical Field

The invention relates to a water quality missing data filling method, in particular to a large-scale continuous water quality missing data filling method based on transfer learning.

Background

With the rapid development of industrialization and urbanization, water resource conservation and water pollution control have become some of the most pressing and widely discussed topics in the world. To control water pollution and reduce its adverse effects on aquatic ecosystems and human society, many researchers have carried out a large body of work (including spatiotemporal prediction of water quality, evaluation of water pollutant impact factors, and data-driven water quality models) to improve the level of water quality monitoring in small watersheds.

In such studies, an effective, high-quality water quality data set is an important prerequisite for reasonable and reliable research results. However, most water quality data, such as ammonia nitrogen, pH and dissolved oxygen, are obtained by automatic sampling from front-end biological and heavy-metal sensors at different water quality monitoring sites. The raw water quality data contain a large number of missing values because of factors such as equipment failure, periodic maintenance, insufficient sampling, and manual changes to sensor parameter settings. These missing data severely increase the limitations and difficulty of subsequent water quality research. Therefore, as more and more water quality research turns to data-based analysis, missing data has become an urgent problem in this field.

Most existing studies fill in missing data with classical statistical methods (mean, median, etc.) or with emerging machine learning and deep learning methods (expectation maximization, fuzzy clustering, support vector regression, extreme learning machines, etc.). However, these methods struggle with large-scale continuous data loss: conventional methods are applicable only when the missing rate is below 30%, and they do not consider missing rates of 50% to 90%. As the missing rate increases, the data surrounding a gap can no longer provide the prior statistical information or the number of training samples needed for accurate filling. These methods are therefore not applicable to large-scale continuous missing data.

With the advent of the big data era, the knowledge contained in data bears on every aspect of the country and society, and improving data processing and analysis technology requires complete and accurate data sets. Most existing data, however, contain noise or missing values caused by periodic gaps in sampling and analysis or by input errors. How to solve this data problem effectively has therefore become a crucial task. The invention focuses on filling large-scale continuous missing data in the water quality field, which distinguishes it from traditional missing data filling methods.

Disclosure of Invention

The invention provides a large-scale continuous water quality missing data filling method based on transfer learning, aiming at the problem that the existing technology cannot fill large-scale continuous water quality missing data.

The invention comprises the following steps:

data preprocessing:

cleaning and normalizing the incomplete data sequence collected from a sensor of a water quality monitoring station, and defining it as the experimental data;

finding out the data of the monitoring station most similar to the incomplete data sequence by using a time sequence similarity query method and determining the data as reference data;

constructing a training and testing sample by using a sliding window algorithm;

data filling:

setting a water quality monitoring site which contains a small amount of training samples and has large-scale continuous lack of data as a target domain, and setting a water quality monitoring site with a complete training sample as a source domain;

fusing the training sample of the target domain and the training sample of the source domain into a new mixed training sample set;

initializing the weight distributions of the source-domain and target-domain training samples and the weight update coefficient of the weak learner; setting the maximum number of iterations and defining the weight distribution of the mixed training samples;

starting iterative operation:

in each iteration, a new weak learner for filling data is constructed;

calculating an average prediction filling error on the newly mixed training sample;

respectively calculating the weight iteration update coefficients of the training samples of the source domain and the training samples of the target domain;

updating the weights of the source-domain and target-domain training samples for iteration t+1 from the weights at iteration t; this completes the training of one weak learner, and the iteration process restarts until the maximum number of iterations is reached, after which the method jumps to the output step;

output: taking the weighted average of the output values of all weak learners to obtain the final predicted filling value of the strong learner.

The beneficial effects of the invention are as follows. In the TrAdaBoost-LSTM algorithm designed by the invention, the LSTM is well suited to time-series data and can capture long-term dependencies in the data, while transfer learning rests on the idea that similar data domains are related and that knowledge can be transferred between them. The invention selects any water quality monitoring station containing large-scale continuous missing data as the target-domain sample and uses a time-series similarity query algorithm, dynamic time warping (DTW), to select the complete data of another monitoring station as the source-domain sample. Experimental results show that, measured by indicators such as RMSE, MAE, MAPE and R-square, the filling method of the invention improves filling accuracy by 15% to 25% over traditional statistical, machine learning and deep learning filling methods when handling large-scale continuous missing data, and also offers a potential reference for similar research in other fields.

Drawings

FIG. 1 shows the large-scale continuous missing data filling framework;

FIG. 2 illustrates the sliding window algorithm;

FIG. 3 shows the filling result for a water quality monitoring station.

Detailed Description

As shown in FIG. 1, the missing data filling framework proposed by the present invention can be divided into two parts: data preprocessing and execution of the filling algorithm.

In the data preprocessing process, the incomplete data sequence collected from a sensor of a water quality monitoring station is first cleaned, normalized, and defined as the experimental data. Second, a time-series similarity query method (dynamic time warping (DTW) in the present invention) is used to find the data of the monitoring station most similar to the incomplete data sequence, which is taken as the reference data. Finally, the training and test samples are constructed using a sliding window algorithm.
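To make the reference-station search concrete, the sketch below z-scores each series and picks the candidate with the smallest dynamic time warping distance. It is a minimal illustration under stated assumptions: the function names (`zscore`, `dtw_distance`, `most_similar_station`) are hypothetical, missing readings are encoded as NaN and simply dropped from the query before matching, and the textbook O(nm) DTW recurrence with absolute difference as the local cost is used.

```python
import numpy as np

def zscore(x):
    """Standard (z-score) normalization that ignores NaN gaps."""
    x = np.asarray(x, dtype=float)
    return (x - np.nanmean(x)) / np.nanstd(x)

def dtw_distance(a, b):
    """Textbook dynamic time warping distance between two 1-D sequences."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    cost = np.full((len(a) + 1, len(b) + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])  # local cost
            cost[i, j] = d + min(cost[i - 1, j],      # step in a only
                                 cost[i, j - 1],      # step in b only
                                 cost[i - 1, j - 1])  # step in both
    return cost[len(a), len(b)]

def most_similar_station(incomplete, candidates):
    """Return (best_name, distances) for a dict of complete candidate series.

    Assumption: missing readings in `incomplete` are NaN and are dropped
    before matching; the candidates are assumed to be complete.
    """
    query = zscore(incomplete)
    query = query[~np.isnan(query)]
    distances = {name: dtw_distance(query, zscore(series))
                 for name, series in candidates.items()}
    return min(distances, key=distances.get), distances
```

In this sketch the station returned by `most_similar_station` would play the role of the reference data; for long records a windowed or approximate DTW would likely be needed to keep the quadratic cost manageable.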

In the filling algorithm execution process, the invention proposes a novel filling algorithm, TrAdaBoost-LSTM, which fuses the instance-based transfer learning algorithm TrAdaBoost with an advanced deep learning algorithm, the long short-term memory neural network (LSTM).

Note: the time-series similarity query method involved in the proposed filling framework (such as dynamic time warping (DTW)), the instance-based transfer learning algorithm TrAdaBoost, the deep-learning-based LSTM algorithm, and the formulas for the evaluation indicators used later, namely root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE) and the coefficient of determination (R-square), are assumed to be understood and well known to those skilled in the art and are not expanded herein.

The key or innovative technical points involved in the filling framework proposed by the invention are as follows:

1. As shown in FIG. 2, the sliding window algorithm is a common method in time-series analysis. Its main idea is to focus on the $\tau$ consecutive data points before the current time $t$, namely $\{S_{t-\tau}, S_{t-\tau+1}, \dots, S_{t-1}\}$, and to associate them with the current time $t$. Here $\tau$ is called the sliding window size. The mathematical expression of the time-series sliding window is given by formula (1):

$$\{S_{t-\tau}, S_{t-\tau+1}, \dots, S_{t-1}\} \rightarrow \{S_t\} \tag{1}$$

where $S = [S_1, S_2, S_3, \dots, S_N]$ is the complete time series, $\{S_{t-\tau}, \dots, S_{t-1}\}$ is called one input of the time series $S$, and $\{S_t\}$ is called the corresponding output.
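The sliding window construction can be sketched as follows. This is a minimal illustration rather than the exact procedure of the invention: the function name `sliding_window_samples` and the parameter `window` (the sliding window size) are hypothetical, and the series is assumed to be a complete one-dimensional NumPy-compatible sequence.

```python
import numpy as np

def sliding_window_samples(series, window):
    """Pair each value S_t with the `window` values that precede it.

    Returns (inputs, outputs) where inputs[n] holds the `window`
    consecutive values before outputs[n].
    """
    series = np.asarray(series, dtype=float)
    if not 0 < window < len(series):
        raise ValueError("window must be between 1 and len(series) - 1")
    inputs = np.stack([series[t - window:t]
                       for t in range(window, len(series))])
    outputs = series[window:]
    return inputs, outputs
```

For a series of length 6 and a window of 3, this yields three samples: the inputs are the windows starting at positions 1, 2 and 3, and the outputs are the 4th, 5th and 6th values.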

2. In the invention, a water quality monitoring site that contains only a small number of training samples and has large-scale continuous missing data (the proportion of missing data exceeds 50%) is defined as the target domain, and a water quality monitoring site with complete training samples is defined as the source domain. The steps of the TrAdaBoost-LSTM missing data filling algorithm are as follows:

Input: the training samples of the source domain, $\{F_S^i, L_S^i\}\ (i = 1, 2, 3, \dots, M)$, and the training samples of the target domain, $\{F_T^j, L_T^j\}\ (j = 1, 2, 3, \dots, N)$, where $F_S^i$ and $F_T^j$ are the inputs of the training model and $L_S^i$ and $L_T^j$ are the outputs of the training model; M is the number of source-domain training samples and N is the number of target-domain training samples.

Step 1: Fuse the training samples of the target domain and the source domain into a new mixed training sample set: $\{F^k, L^k\}\ (k = 1, 2, 3, \dots, M+N)$.

Step 2: Initialize the weight distributions of the source-domain and target-domain training samples, and initialize the weight iteration update coefficient of the mixed training samples for the weak learner (LSTM). Set the maximum number of iterations iter and define the weight distribution $\omega$ of the mixed training samples.

Step 3: In each iteration, construct a new weak learner for filling data, namely an LSTM. The input of the weak learner is $\{F^k\}\ (k = 1, 2, 3, \dots, M+N)$ and its output is $\{L^k\}\ (k = 1, 2, 3, \dots, M+N)$.

Step 4: Calculate the average prediction filling error on the mixed training samples. The predicted filling value of the training set is $Y^k\ (k = 1, 2, 3, \dots, M+N)$, and the filling error is given by formula (2).

Step 5: Calculate the weight iteration update coefficient $\beta_t$ for the source-domain and target-domain training samples, as given by formula (3).

Step 6: Update the weights of the source-domain and target-domain training samples for iteration t+1 from the weights at iteration t, with the source-domain and target-domain weights updated separately.

This completes the training of one weak learner. Return to Step 3 until the maximum number of iterations is reached, then jump to the output step.

Output: Take the weighted average of the output values of all weak learners to obtain the final predicted filling value of the strong learner, as given by formula (4).
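As a rough illustration of Steps 1 to 6 and the output step, the sketch below follows the standard TrAdaBoost weight-update scheme. Several details are assumptions rather than statements of this method: the initial weights 1/M and 1/N, the fixed source coefficient beta = 1 / (1 + sqrt(2 ln M / iter)), the per-iteration coefficient beta_t = eps / (1 - eps), the voting weights ln(1 / beta_t), and the scaling of errors to [0, 1]. A weighted linear regressor (`fit_weighted_linear`, a hypothetical helper) stands in for the LSTM weak learner to keep the example short and dependency-free.

```python
import numpy as np

def fit_weighted_linear(X, y, w):
    """Stand-in weak learner (NOT an LSTM): weighted least squares + intercept."""
    A = np.hstack([X, np.ones((len(X), 1))])
    sw = np.sqrt(w)[:, None]
    coef, *_ = np.linalg.lstsq(A * sw, y * sw[:, 0], rcond=None)
    return lambda Z: np.hstack([Z, np.ones((len(Z), 1))]) @ coef

def tradaboost_fill(Xs, ys, Xt, yt, X_new, n_iter=10):
    """TrAdaBoost-style regression sketch; formulas are assumed, see lead-in."""
    M, N = len(Xs), len(Xt)
    # Step 1: fuse source and target samples into one mixed set.
    X, y = np.vstack([Xs, Xt]), np.concatenate([ys, yt])
    # Step 2: assumed initial weights and fixed source-domain coefficient.
    w = np.concatenate([np.full(M, 1.0 / M), np.full(N, 1.0 / N)])
    beta = 1.0 / (1.0 + np.sqrt(2.0 * np.log(M) / n_iter))
    learners, votes = [], []
    for _ in range(n_iter):
        p = w / w.sum()
        predict = fit_weighted_linear(X, y, p)       # Step 3: new weak learner
        err = np.abs(predict(X) - y)
        err = err / max(err.max(), 1e-12)            # scale errors to [0, 1]
        # Step 4: weighted average error on the mixed set, kept below 0.5.
        eps = float(np.clip(np.sum(p * err), 1e-6, 0.499))
        beta_t = eps / (1.0 - eps)                   # Step 5
        # Step 6: shrink poorly fitted source samples, boost target samples.
        w[:M] *= beta ** err[:M]
        w[M:] *= beta_t ** (-err[M:])
        learners.append(predict)
        votes.append(np.log(1.0 / beta_t))
    # Output: weighted average of all weak learners (assumed voting weights).
    votes = np.array(votes) / np.sum(votes)
    return sum(v * f(X_new) for v, f in zip(votes, learners))
```

The design intent of this scheme is that source-domain samples that disagree with the current learner lose influence over the iterations, while hard target-domain samples gain influence. Replacing `fit_weighted_linear` with an LSTM trained under per-sample weights would bring the sketch closer to the TrAdaBoost-LSTM described here.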

The invention uses four indicators to evaluate the data filling performance of the algorithm: root mean square error (RMSE), mean absolute error (MAE), mean absolute percentage error (MAPE) and R-square. Taking the missing dissolved oxygen concentration data from a water quality monitoring site in the Qiantang River basin in Hangzhou, Zhejiang, as the experimental case, the proposed algorithm was compared with five conventional filling algorithms: mean filling (MEAN), autoregressive integrated moving average (ARIMA), support vector regression (SVR), extreme learning machine (ELM) and long short-term memory network (LSTM). The results are shown in Table 1. Table 1 shows that, compared with the other five algorithms, the proposed algorithm has the lowest RMSE, MAE and MAPE and the highest R-square at both low and high missing rates, which indicates that the invention fills missing data effectively. The filling result for the dissolved oxygen concentration at the site is shown in FIG. 3.
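For reference, the four indicators can be computed as in the sketch below, using their usual textbook definitions. The function name `fill_metrics` is hypothetical, and MAPE is assumed to be reported in percent and requires nonzero true values.

```python
import numpy as np

def fill_metrics(y_true, y_pred):
    """Return RMSE, MAE, MAPE (percent) and R-square for filled values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = np.sqrt(np.mean(err ** 2))
    mae = np.mean(np.abs(err))
    mape = np.mean(np.abs(err / y_true)) * 100.0  # undefined if y_true has zeros
    r2 = 1.0 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2)
    return {"RMSE": rmse, "MAE": mae, "MAPE": mape, "R2": r2}
```

Lower RMSE, MAE and MAPE and a higher R-square indicate a better fill, which is the ordering used when the algorithms are compared.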

Table 1. Performance comparison of different filling algorithms under different missing rates

The above embodiments are merely to illustrate the technical solutions of the present invention and not to limit the present invention, and the present invention has been described in detail with reference to the preferred embodiments. It will be understood by those skilled in the art that various modifications and equivalent arrangements may be made without departing from the spirit and scope of the present invention and it should be understood that the present invention is to be covered by the appended claims.

Claims

1. A continuous large-scale water quality missing data filling method based on transfer learning is characterized by comprising the following steps:

data preprocessing:

constructing a training and testing sample by using a sliding window algorithm;

data filling:

starting iterative operation:

in each iteration, a new weak learner for filling data is constructed;

2. The continuous large-scale water quality missing data filling method based on the transfer learning according to claim 1, characterized in that: the method for time series similarity query uses a dynamic time warping algorithm.

3. The continuous large-scale water quality missing data filling method based on the transfer learning according to claim 1, characterized in that:

weight distribution after initialization of training samples of the source domain:

weight distribution after initialization of training samples of the target domain:

the weak learner is a long short-term memory (LSTM) network, and the weight iteration update coefficient of the mixed training samples in the LSTM is:

where M is the number of source-domain training samples, N is the number of target-domain training samples, $i = 1, 2, 3, \dots, M$, $j = 1, 2, 3, \dots, N$, and iter is the maximum number of iterations.

4. The continuous large-scale water quality missing data filling method based on the transfer learning according to claim 3, characterized in that: the filling error is calculated as follows:

wherein $Y^k$ is the predicted filling value, $L^k$ is the output of the weak learner LSTM, and $k = 1, 2, 3, \dots, M+N$.

5. The continuous large-scale water quality missing data filling method based on the transfer learning according to claim 4, characterized in that: the weight iteration update coefficient $\beta_t$ is calculated as follows:

6. The continuous large-scale water quality missing data filling method based on the transfer learning according to claim 5, characterized in that:

the new weight of a source-domain training sample at iteration t+1 is:

the new weight of a target-domain training sample at iteration t+1 is:

7. The continuous large-scale water quality missing data filling method based on the transfer learning according to claim 6, characterized in that: the final predicted filling value $pred\_value^k$ is calculated as follows: