CN115982658A - Hydrological data anomaly identification and repair method based on federated learning framework - Google Patents

Hydrological data anomaly identification and repair method based on federated learning framework

Info

Publication number
CN115982658A
Authority
CN
China
Prior art keywords
data
hydrological
model
layer
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211546472.8A
Other languages
Chinese (zh)
Inventor
陈浙梁
童增来
姚东
李歆遒
言薇
徐斌
沈凯华
钱克宠
刘林海
张紫琳
王玉明
倪宪汉
李欢
吕耀光
金晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Hydrological Management Center
Original Assignee
Zhejiang Hydrological Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Hydrological Management Center
Priority to CN202211546472.8A
Publication of CN115982658A
Legal status: Pending

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A — TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 10/00 — TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE at coastal zones; at river basins
    • Y02A 10/40 — Controlling or monitoring, e.g. of flood or hurricane; Forecasting, e.g. risk assessment or mapping

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a hydrological data anomaly identification and repair method based on a federated learning framework, comprising a model training process and an identification-and-repair process. In the model training process, the training hydrological data are first preprocessed and anomaly-processed; a federated learning architecture is then built, the server initializes the model parameters and sends the global model to each client. After receiving the model, each client learns the features of its local data: the model extracts context information with a bidirectional LSTM equipped with an attention mechanism, then optimizes the anomaly detection and data repair functions simultaneously through adversarial learning, and finally the model is updated through iterative interaction between the clients and the server. With this new method, anomaly identification and repair are carried out simultaneously while the privacy of the hydrological telemetry data is protected, providing support for improving hydrological forecasting performance and reducing losses caused by uncertain disasters.

Description

Hydrological data anomaly identification and repair method based on federated learning framework
Technical Field
The invention relates to the field of hydrological data processing, and in particular to a hydrological data anomaly identification and repair method based on a federated learning framework.
Background
With the increasing uncertainty of global natural disasters, the construction of intelligent hydrology is receiving more and more attention. Its aim is to build an integrated air-space-ground hydrological telemetry system with technologies such as cloud computing and big data at its core, so that hydrological phenomena occurring in nature can be observed and recorded more accurately and in real time, providing a data foundation for hydrological research. As the main source of hydrological data, hydrological telemetry equipment clearly carries the burden of data acquisition and storage. Whether the telemetry equipment can reliably provide true and accurate hydrological data directly affects basic decisions such as flood-control and drought-relief scheduling, ecological environment protection, and comprehensive development of water resources. However, during actual operation of the telemetry equipment, the acquired hydrological data often exhibit anomalies such as numerical errors, partial loss and long data gaps, caused by system faults, equipment aging, weak signals at remote sites and the like. This seriously affects the integrity, authenticity and accuracy of the hydrological data and directly degrades the statistical-analysis capability of the various hydrological models. Therefore, mining the potential features of the data through anomaly identification of hydrological data, while repairing the abnormal data at the same time, is of great significance for improving hydrological forecasting performance and reducing losses caused by uncertain disasters.
However, existing methods for anomaly identification and repair of hydrological data mainly have the following problems: 1) in practice, identification and repair of abnormal data usually need to be solved together, but most research focuses on anomaly detection and ignores the importance of repairing the abnormal data; 2) most models do not consider the latent time-series information of features such as water level, rainfall and flow in hydrological telemetry data, so the accuracy of anomaly identification is low and the quality of data repair is poor; 3) the privacy issues contained in the telemetry data are ignored.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides a generative adversarial network model based on a federated learning framework and a long short-term memory (LSTM) network, so that anomaly identification and repair of hydrological telemetry data can be completed simultaneously on the premise of protecting data privacy.
In the method, the original hydrological data of each client of the federated learning framework (i.e. each hydrological telemetry device) are first structurally modelled into corresponding time-series data; the server then initializes the parameters and sends the generative adversarial network model and the global model parameters to be optimized to each client. After receiving them, the client inputs the processed time series into the generative adversarial network: the discriminator network is used to identify abnormal sequences, the generator network reconstructs and repairs the abnormal sequences, and the two networks are trained adversarially and optimized step by step. Meanwhile, a long short-term memory network is embedded in the model and an attention mechanism is introduced to learn the potential characteristics of the data and capture its time dependencies. The client then sends the trained local model parameters to the server, which aggregates them into new global model parameters and sends these back to the clients. Finally, while the data privacy of the hydrological telemetry stations is protected, the hydrological telemetry data can be repaired and, at the same time, abnormal data can be identified.
The invention achieves the above aim through the following technical scheme: a hydrological data anomaly identification and repair method based on a federated learning framework comprises a model training process and an identification-and-repair process, wherein the model training process comprises the following steps:
S1: preprocessing training hydrological data and performing anomaly processing;
S2: building a federated learning architecture and initializing model parameters;
S3: the clients optimize the anomaly detection and data repair functions through adversarial learning;
S4: the local clients interact with the server to update the global parameters;
the identification and repair process specifically comprises: preprocessing the original hydrological data and inputting them into the trained model, the output being the repaired data.
Preferably, step S1 specifically comprises the following steps:
S1.1: screening the hydrological data and removing noise data and repeated data, i.e. retaining only data of research significance;
S1.2: normalizing the screened hydrological data and processing them into matrix sequences F_T with the same time slot;
S1.3: artificially dirtying a set proportion (about 10%) of the processed matrix sequences F_T into abnormal data, used for testing the anomaly detection and data repair functions. Dirtying includes adding any one or several of offset anomalies, sequence anomalies and extreme-value anomalies.
During formal identification and repair, the preprocessing of the data includes S1.1 and S1.2 but not S1.3.
Preferably, step S2 specifically includes the steps of:
S2.1: taking K hydrological telemetry stations as clients and a cloud server as the server (i.e. a trusted third party) to build the federated learning framework, K being the total number of hydrological telemetry stations;
S2.2: defining the data set size of the k-th client as Data_k, 1 ≤ k ≤ K; the total data set size used for local training is then
Data = Σ_{k=1}^{K} Data_k
S2.3: the server initializes the global model parameters, namely generates training parameters for the resisting network and the LSTM network, and sends the global model and the initial parameters to each client.
Preferably, the model comprises a generator and a discriminator, wherein the generator comprises an LSTM network and a full connection layer, and the discriminator comprises a bidirectional LSTM network with attention mechanism and a full connection layer;
when predicting the hydrological data at time T+1, the hydrological data before time T+1 need to be processed into a matrix sequence F_T; the matrix sequence input to the generator first passes through a neuron comprising three gates that control the cell state, namely:
a forget gate, which obtains the information f_t to be discarded:
f_t = σ(W_f x_t + W_f h_{t-1} + b_f)
an input gate, which obtains the information i_t to be memorized and the candidate cell state C̃_t (an intermediate variable):
i_t = σ(W_i x_t + W_i h_{t-1} + b_i)
C̃_t = tanh(W_c x_t + W_c h_{t-1} + b_c)
according to the forgotten information f_t and the update information i_t, the new cell state C_t is obtained (f_t × C_{t-1} denotes selective discarding of past information, and i_t × C̃_t denotes selective retention):
C_t = f_t × C_{t-1} + i_t × C̃_t
and an output gate:
o_t = σ(W_o x_t + W_o h_{t-1} + b_o)
h_t = o_t · tanh(C_t)
The hidden-layer state h_t output by the generator at time t is finally calculated. C_t is used to carry long-term memory information and h_t to carry short-term memory information; at initialization, C_0 and h_0 default to all-zero matrices. Here x_t is the input of the matrix sequence F_T at the t-th time point, h_{t-1} denotes the hidden-layer state at time t-1, W denotes the weight vector of the gating unit or cell state indicated by its subscript, and b denotes the corresponding bias: W_f and b_f are the weight vector and bias of the forget gate, W_i and b_i those of the input gate, W_c and b_c those of the candidate cell state, and W_o and b_o those of the output gate; σ is the activation function, generally the Sigmoid function, and tanh is also an activation function. In this process there are three inputs in total, namely the input value x_t at time t, the hidden state h_{t-1} output by the generator at the previous time, and the cell state C_{t-1} of the neuron; the outputs at the current time are the hidden state h_t of the generator and the cell state C_t. Finally, the LSTM sends the hidden state h_T of the last time step T, which integrates all useful information before it, to a fully connected layer network:
x_{T+1} = Linear(h_T)
and the value x_{T+1} to be predicted at the next time is obtained. Linear denotes a fully connected layer. The above is the network structure of the generator: it uses the gating mechanism to control the transmission state, memorizes information that needs to be remembered for a long time, forgets unimportant content, and mines time-series patterns of F_T such as relatively long intervals and delays. With the LSTM network structure, suppose a segment x_{t1}, ..., x_{tu} (tu < T) of the sequence F_T is missing or otherwise abnormal; all node information before that segment can be used for cyclic regression prediction, i.e. x_{t1} is predicted from the data nodes before time t1, x_{t1} is then filled into the originally abnormal position, x_{t2} is predicted again in combination with the preceding node information, and so on in a loop until the data repair process is finished;
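A possible PyTorch sketch of the generator described above is given below. The class name, the hidden size and the repair/predict_next helpers are assumptions made for this example; nn.LSTM is used in place of the hand-written gate equations, which it implements internally.

```python
import torch
import torch.nn as nn


class Generator(nn.Module):
    """Generator sketch: LSTM network + fully connected layer (sizes are assumptions)."""

    def __init__(self, n_features: int = 7, hidden_size: int = 64):
        super().__init__()
        # nn.LSTM already contains the forget gate f_t, input gate i_t, output gate o_t and cell state C_t.
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, n_features)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, T, n_features) matrix sequence F_T; returns a reconstruction F_T' of the same shape.
        h_all, _ = self.lstm(seq)
        return self.fc(h_all)

    def predict_next(self, seq: torch.Tensor) -> torch.Tensor:
        # x_{T+1} = Linear(h_T): predict the next time point from the last hidden state.
        _, (h_last, _) = self.lstm(seq)
        return self.fc(h_last[-1])

    @torch.no_grad()
    def repair(self, seq: torch.Tensor, anomaly_idx) -> torch.Tensor:
        # Cyclic regression repair: predict each abnormal point from the (already repaired) points
        # before it, write the prediction back, then move on to the next abnormal point.
        repaired = seq.clone()
        for t in sorted(anomaly_idx):
            if t > 0:                                   # at least one preceding point is needed
                repaired[:, t, :] = self.predict_next(repaired[:, :t, :])
        return repaired
```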
The bidirectional LSTM network of the discriminator comprises a forward LSTM network layer and a backward LSTM network layer, both with the same structure as the LSTM network layer in the generator. The matrix sequence fed to the forward LSTM network layer is the forward input, and the forward output hidden state at time t is denoted h_t^→; the matrix sequence fed to the backward LSTM network is the backward input, and the backward output hidden state at time t is denoted h_t^←. The hidden state h_t of the bidirectional LSTM is calculated by the following formula, where Concat() combines the forward and backward hidden-layer state information:
h_t = Concat(h_t^→, h_t^←)
Furthermore, in order to improve the learning ability of the discriminator, an attention mechanism is introduced; the attention layer extracts a weight matrix by the following formula:
α = softmax(w^T tanh(H))
and the product r of H and the weight matrix α is taken as the output of the attention layer:
r = H α^T
where H is the output of the bidirectional LSTM layer, i.e. the hidden-layer state information at all time points {h_1, ..., h_T}, of size v × T, v being the dimension of the hidden-layer state information h_T and T the length of the sequence; w^T is the transpose of a parameter vector obtained by training and learning, which is continuously optimized as the model is trained; α is the weight matrix and r is the output of this layer. The output then enters a fully connected layer network Linear() and an activation function Sigmoid, which fixes the value to the [0,1] interval:
PSY_T = Sigmoid(Linear(r))
giving the probability PSY_T that each timestamp t of the sequence F_T is real, where PSY_T is of size T × 1 and all its entries lie in [0,1]. The above is the overall structure of the discriminator network: it takes the sequence F_T as input and PSY_T as final output, and performs anomaly detection on the sequence timestamp by timestamp. By adding the above structural improvements to the generative adversarial network, the ability of the discriminator to detect anomalies and of the generator to fit the data can be enhanced simultaneously, thereby improving the performance of the model as a whole.
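A corresponding sketch of the discriminator, again with assumed layer sizes, follows. The attention weights are applied per timestamp here so that the output keeps the T-dimensional shape of PSY_T; this is one possible reading of the formulas above, not the only one.

```python
import torch
import torch.nn as nn


class Discriminator(nn.Module):
    """Discriminator sketch: bidirectional LSTM + attention + fully connected layer (sizes are assumptions)."""

    def __init__(self, n_features: int = 7, hidden_size: int = 64):
        super().__init__()
        self.bilstm = nn.LSTM(n_features, hidden_size, batch_first=True, bidirectional=True)
        self.w = nn.Parameter(torch.randn(2 * hidden_size))    # trainable attention vector w
        self.fc = nn.Linear(2 * hidden_size, 1)                 # Linear() before the Sigmoid

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, T, n_features); H concatenates the forward and backward hidden states.
        h, _ = self.bilstm(seq)                                  # (batch, T, 2*hidden)
        # alpha = softmax(w^T tanh(H)): one attention weight per timestamp.
        alpha = torch.softmax(torch.tanh(h) @ self.w, dim=1)     # (batch, T)
        # Weight every hidden state by its attention score (the product of H and alpha).
        r = h * alpha.unsqueeze(-1)                              # (batch, T, 2*hidden)
        # PSY_T = Sigmoid(Linear(r)): probability that each timestamp is normal/real.
        return torch.sigmoid(self.fc(r)).squeeze(-1)             # (batch, T)
```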
Preferably, step S3 specifically comprises the following steps:
S3.1: initializing and fixing the generator G, and starting to train the discriminator D. The real data F_T and the data F_T' forged by G are used as the inputs of D and pass through the bidirectional LSTM layer, the attention layer and the fully connected layer in turn, and the discrimination result is finally output. If the input of the discriminator is F_T, i.e. real and normal hydrological data, the output judgment values at all time points should be as close to 1 as possible; otherwise the output should tend to 0. Since the classifier of D generally uses the Sigmoid function, training the discriminator D is a process of minimizing its cross entropy, with the following loss function:
L_D = −E_{x∼P_{F_T}}[log D(x)] − E_{x∼P_{F_T'}}[log(1 − D(x))]
where P_{F_T} is the distribution of the real data samples, P_{F_T'} is the distribution of the generated or abnormal data samples, and E_{x∼P}[·] denotes the expectation over x drawn from distribution P. Clearly, for the discriminator D, whether the input is generated data or abnormal data, the output result is desired to be as close to 0 as possible;
S3.2: optimization of the generator: with the sequence F_T to be repaired as input, G(F_T) = F_T' is output through the LSTM layer and the fully connected layer. For the generator, the generated data should deceive the discriminator as far as possible, so its training is a process of maximizing the cross entropy, with the following loss function:
L_G = E_{x∼P_{F_T'}}[log(1 − D(x))]
Finally, the global optimal solution is obtained if and only if P_{F_T} = P_{F_T'};
S3.3: the K clients compute their respective loss gradients L_k(w) to update the local models:
L_k(w) = (1/Data_k) Σ_{j=1}^{Data_k} l_j(x_j; w) + λ·s(w)
where s(·) is a regularization function, l_j(x_j; w) denotes the loss of the j-th sample, w is the local weight parameter, and λ ∈ [0,1] is used to balance the losses.
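The adversarial optimization of S3.1–S3.3 could be organised roughly as in the sketch below, assuming the generator maps a sequence F_T to a reconstruction F_T' of the same shape and the discriminator returns a per-timestamp probability. The learning rate, the number of discriminator steps per generator step, and the use of an L2 penalty as the regularization term s(·) are illustrative choices, not values fixed by the invention.

```python
import torch
import torch.nn as nn


def local_update(generator, discriminator, loader, lam=0.1, d_steps=3, lr=1e-3):
    """One round of local adversarial training (S3.1-S3.3), returning the updated local parameters w_k."""
    bce = nn.BCELoss()
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr)
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr)

    for real in loader:                                    # real: (batch, T, n_features) = F_T
        # S3.1: train D several times per G step; D should output ~1 on F_T and ~0 on F_T'.
        for _ in range(d_steps):
            fake = generator(real).detach()
            d_real, d_fake = discriminator(real), discriminator(fake)
            loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
            opt_d.zero_grad()
            loss_d.backward()
            opt_d.step()

        # S3.2: train G so that D(F_T') is pushed towards 1 ("fool the discriminator").
        fake = generator(real)
        d_fake = discriminator(fake)
        loss_g = bce(d_fake, torch.ones_like(d_fake))

        # S3.3: add lambda * s(w); an L2 penalty stands in for the regularization function s(.) here.
        reg = sum(p.pow(2).sum() for p in generator.parameters())
        loss_g = loss_g + lam * reg
        opt_g.zero_grad()
        loss_g.backward()
        opt_g.step()

    return generator.state_dict(), discriminator.state_dict()
```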
Preferably, step S4 specifically includes the steps of:
All K clients send their own local model training parameters w_k to the server to update the global parameter W_{z+1}. Unlike traditional centralized training methods, federated learning updates the global training model through a secure parameter-aggregation mechanism; in addition, to reduce the communication overhead during the transmission of model parameters, the federated averaging algorithm is adopted to accelerate the convergence of the model. That is, the server updates the global model parameters according to:
W_{z+1} = Σ_{k=1}^{K} (n_k / n) · w_k
where n_k and w_k are respectively the number of samples on client k and its local weights, and n is the total number of samples over all selected clients. After the latest global weights W_{z+1} are obtained, the updated global model is sent to client k again for the next round of optimization and updating.
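A minimal sketch of this aggregation step is given below, assuming the clients exchange PyTorch state_dicts; the function name is an assumption for the example.

```python
import copy


def fedavg(client_states, client_sizes):
    """Federated averaging (S4): W_{z+1} = sum_k (n_k / n) * w_k."""
    n = float(sum(client_sizes))                      # total number of samples over all selected clients
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(
            (n_k / n) * state[key].float()            # weight each client's parameters by n_k / n
            for state, n_k in zip(client_states, client_sizes)
        )
    return global_state
```

In this sketch the averaging would be applied separately to the generator and discriminator state_dicts, after which the server loads the result back into the global model and broadcasts it for the next round.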
Preferably, the identification and repair process comprises the following steps:
the k-th client processes its hydrological data into a matrix sequence F_kT and then downloads from the server the global weight parameters W updated in the last round; at this point, the generator G and the discriminator D at the local end both have the optimal data repair capability and the optimal anomaly discrimination capability. The k-th client therefore first inputs the matrix sequence F_kT into the discriminator of the local model to obtain D(F_kT) = PSY_T, i.e. the probability in [0,1] that each timestamp t in the sequence F_kT is normal, with values above 0.5 considered normal;
if abnormal points are included, the matrix sequence F_kT to be repaired is input into the generator and the data are reconstructed, i.e. data repair G(F_kT) = F_kT'; the parts finally identified as abnormal are replaced by the repaired data, and the repaired sequence F_kT' is finally restored to the original data by inverse normalization. At this point, the data anomaly detection and repair is finished.
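Putting the pieces together, the identification-and-repair pass for one client might look roughly like the following sketch, reusing the assumed Generator and Discriminator interfaces from the earlier examples; the 0.5 threshold follows the description above.

```python
import torch


@torch.no_grad()
def identify_and_repair(generator, discriminator, f_kt, data_min, data_max, threshold=0.5):
    """Flag abnormal timestamps in F_kT, replace them with the generator's reconstruction, de-normalize."""
    psy = discriminator(f_kt).squeeze(0)                       # PSY_T: probability each timestamp is normal
    anomaly_idx = (psy <= threshold).nonzero(as_tuple=True)[0].tolist()

    repaired = f_kt.clone()
    if anomaly_idx:
        reconstruction = generator(f_kt)                       # G(F_kT) = F_kT'
        repaired[:, anomaly_idx, :] = reconstruction[:, anomaly_idx, :]

    # Inverse min-max normalization back to the original units.
    mins = torch.as_tensor(data_min, dtype=repaired.dtype)
    maxs = torch.as_tensor(data_max, dtype=repaired.dtype)
    return repaired * (maxs - mins) + mins
```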
The invention has the advantages that the anomaly identification and repair of the hydrological telemetering data can be completed simultaneously on the premise of protecting the data privacy, and the accuracy and the reliability are high.
Drawings
FIG. 1 is a flow chart of model training according to the present invention;
FIG. 2 is a diagram of a federated learning framework training architecture of the present invention;
FIG. 3 is a diagram of an internal structure of a long-short term memory network according to the present invention;
FIG. 4 is a schematic diagram of the attention-based bidirectional long short-term memory network model of the present invention;
FIG. 5 is a diagram of the generative adversarial network model architecture of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below.
The embodiment of the invention provides a hydrological data anomaly detection and repair method based on a federated learning framework, comprising a model training process and an identification-and-repair process; as shown in Fig. 1 and Fig. 2, the training process comprises the following steps:
S1: preprocessing the training hydrological data and performing anomaly processing, i.e. extracting useful data from the original data set, with the following steps:
S1.1: the original data are cleaned. The data are hydrological telemetry data acquired over the 90 days from 1 January 2022 to 31 March 2022 by four hydrological telemetry stations in Hangzhou City, Jinhua City, Shaoxing City and Lishui City of Zhejiang Province, which ensures that the data source for model training and testing is true and reliable. The attributes of the data include:
No. | Name | Note
1 | sid | Telemetry terminal SID code
2 | Longitude_hy | Longitude of the hydrological telemetry station
3 | Latitude_hy | Latitude of the hydrological telemetry station
4 | Terminal_na | Terminal name
5 | Current_sta | Current state
6 | Telemetry_pro | Telemetry project
7 | Monitoring_ele | Monitoring element
8 | Watershed_na | Name of the river basin
9 | Rain_area | Rain-collecting area
10 | Responsible_un | Responsible unit
11 | Sensor_model | Sensor model
12 | Device_In | Device information
13 | …… | ……
TABLE 1
As can be seen from the table, the original data set is rich in content, involves a complex amount of information, and covers a great deal of private information. Clearly, for the goals of data anomaly detection and repair, the sensor monitoring data are the target. In addition, because the devices at different telemetry sites differ in model and geographical location, the data-recording intervals, the collected data attributes and so on may differ. Redundant data attributes are therefore screened out, the attributes common to the hydrological devices of the four telemetry stations are extracted, and one collection record is kept every five minutes, which markedly reduces the data volume and facilitates subsequent analysis and calculation. The extracted data attributes are as follows:
[Table 2: common data attributes extracted from the four telemetry stations]
After data extraction, some attributes are found not to change for long periods under normal conditions, and such attributes are judged not to meet the requirements of the experimental study; for example, for the rainfall attribute, if no rainfall occurs for several consecutive days, the attribute value remains 0 for a long time. In addition, data analysis shows that, because the water level varies only slightly, the current-water-level value and the 5-minute water-level value are always identical, so the redundant one is screened out. The hydrological telemetry data targeted by the invention therefore mainly contain the following attributes:
No. | Name | Note
1 | Sid | Telemetry terminal SID code
2 | Flow | Flow rate
3 | Tem | Ambient temperature
4 | Cwl | Current water level
5 | Vol | Supply voltage
6 | Ifr | Index flow velocity
7 | Iwt | Instantaneous water temperature
8 | Cifr | Current instantaneous flow rate
TABLE 3
S1.2: the data are normalized using the following formula:
x = (X − min) / (max − min)
where X is the data before normalization, max and min are respectively the maximum and minimum value of the attribute data, and x is the data after normalization. After the data of all attributes are normalized, the values of all attributes at the same time point are taken as the features of that time point; that is, each time point contains the flow, ambient temperature, current water level, supply voltage, index flow velocity, instantaneous water temperature and current instantaneous flow rate at that time. In order to better extract sequence features, a window of two hours, i.e. 24 time points, is taken as the sequence length, and each matrix sequence F_T is constructed accordingly.
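A minimal NumPy sketch of this normalization and windowing step is shown below; the function name is an assumption, and the two-hour window of 24 five-minute points follows the embodiment.

```python
import numpy as np


def build_sequences(raw: np.ndarray, window: int = 24):
    """S1.2 sketch: min-max normalize each attribute and cut the record into matrix sequences F_T.

    raw: (n_points, n_features) array holding the Table 3 attributes per five-minute time point.
    """
    mins, maxs = raw.min(axis=0), raw.max(axis=0)
    normalized = (raw - mins) / (maxs - mins + 1e-12)          # x = (X - min) / (max - min)
    n_seq = len(normalized) // window
    sequences = normalized[: n_seq * window].reshape(n_seq, window, raw.shape[1])
    return sequences, mins, maxs                                # keep mins/maxs for later de-normalization
```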
S1.3: to test the performance of the method of the invention, a test set is built by artificially dirtying the matrix sequences F_T constructed in S1.2, as follows: 1. about 10% of the data F_T are selected for artificial dirtying; 2. a masking matrix F_anomaly is applied to F_T to generate missing-value anomalies; 3. offset anomalies, sequence anomalies and extreme-value anomalies are added to F_T. A sketch of this step is given below.
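The artificial dirtying of S1.3 could be sketched as follows; the 10% ratio follows the embodiment, while the anomaly magnitudes and the omission of sequence-type anomalies (e.g. shuffling a short sub-window) are simplifications made for the example.

```python
import numpy as np


def make_dirty(sequences: np.ndarray, ratio: float = 0.10, seed: int = 0):
    """S1.3 sketch: artificially dirty about `ratio` of the normalized sequences to build the test set."""
    rng = np.random.default_rng(seed)
    dirty = sequences.copy()
    picked = rng.choice(len(dirty), size=max(1, int(ratio * len(dirty))), replace=False)
    for i in picked:
        kind = rng.choice(["missing", "offset", "extremum"])
        t = rng.integers(0, dirty.shape[1])
        if kind == "missing":                  # masking: simulate a lost record
            dirty[i, t, :] = 0.0
        elif kind == "offset":                 # offset anomaly: shift one time point
            dirty[i, t, :] += 0.5
        else:                                  # extreme-value anomaly
            dirty[i, t, :] = 1.5
    return dirty, picked
```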
S2: building the federated learning architecture and initializing the model parameters, with the following specific steps:
S2.1: the four hydrological telemetry stations in Hangzhou City, Jinhua City, Shaoxing City and Lishui City of Zhejiang Province are used as local clients, and a cloud server is used as the trusted third party to build the federated learning framework. The server configuration is: host operating system Ubuntu 18.04, 128 GB of memory, an Intel(R) Xeon(R) Gold CPU with 16 cores and two threads per core, and an NVIDIA Quadro P6000 graphics card.
S2.2: the data set of each client is the hydrological telemetry data collected over the 90 days from 1 January 2022 to 31 March 2022. The data of the first 30 days are used as the training set and the data of the whole 90 days as the test set, and about 10% of the test data, i.e. 9 days of data, are processed into abnormal data F_anomaly. Defining the data set size of the k-th client as Data_k, the total data set size used for local training is Data = Data_1 + Data_2 + Data_3 + Data_4.
S2.3: the server initializes the model parameters W and sends the global model to each client for their respective training of their own data set.
The model comprises a generator and a discriminator, wherein the generator comprises an LSTM network and a full connection layer, and the discriminator comprises a bidirectional LSTM network with an attention mechanism and a full connection layer; the specific process of extracting the time sequence characteristic information by the model is as follows:
First, each client receives the global model sent by the server, and then training on the local data sets begins; the training process is completed synchronously by the four local clients. Meanwhile, the loss gradient of the local data is calculated (where s(·) is a regularization function, w is the local weight parameter, and λ ∈ [0,1]):
L_k(w) = (1/Data_k) Σ_{j=1}^{Data_k} l_j(x_j; w) + λ·s(w)
The processed matrix sequences F_T serve as the input of the model. Hydrological data tend to have significant time-series characteristics, since they are a collection of hydrological elements such as water level, flow and voltage observed over time. Clearly, neither the generator nor the discriminator originally has a specific structure for processing time-series data; the characteristic time-series properties of the hydrological data are therefore not considered during training, which is one of the main reasons why the generator fits real data poorly and the discriminator identifies anomalies with insufficient precision. To learn the underlying spatial distribution of the data, temporal feature information is extracted using a bidirectional LSTM with an attention mechanism. For the generator, as shown in Fig. 3, with F_T as the input sequence, the data first pass through the three gates that control the cell state, namely the input gate, the forget gate and the output gate; then, according to the gate functions and the formula:
h_t = o_t · tanh(C_t)
the hidden-layer state h_t at time t is finally calculated.
For the arbiter, in order to better learn the characteristics of the data, as shown in fig. 4, a reverse LSTM layer is added on the original forward LSTM network layer, and the available information of the network is increased by considering the context of two directions. The network structure comprising two forwarding elements
Figure BDA0003977884620000101
And passed backwards>
Figure BDA0003977884620000102
Then the hidden state h t Is calculated by the following formula, wherein->
Figure BDA0003977884620000103
For the Concat () function, combine the forward and backward hidden layer state information: />
Figure BDA0003977884620000104
Furthermore, in order to improve the learning ability of the discriminator, attention mechanism is introduced. This layer extracts the weight matrix by the following formula:
α=softmax(w T tanh(H))
and the product r of H and the weight matrix alpha is taken as the output of the attention layer:
r=Hα T
where H is the output of the bi-directional LSTM layer, i.e. all timesHidden layer state information of intermediate point { h } 1 ,...,h T H, size v x T, where v is hidden layer information h T T is the length of the sequence, w T It is the transpose of a parameter vector matrix, and is continuously optimized by the training of the model, where α is the weight matrix and r is the output of the layer. The value is then fixed to 0,1 at the time of entering the full connection layer network Linear () and the activation function sigmoid]Interval:
PSY T =Sigmoid(Linear(r))
obtaining the sequence F T Is true per timestamp t T Wherein PSY T Are all in the size of [0,1]]In between. The above is the overall structure of the discriminator network, in the sequence F T As input, PSY T As a final output, anomaly detection is performed on the sequence time stamp by time stamp. By adding the above structural improvements to the generation of the countermeasure network, the ability of the discriminators to detect anomalies and the generator to fit data can be enhanced simultaneously, thereby improving the performance of the model as a whole.
S3: the clients optimize the anomaly detection and data repair functions through adversarial learning, with the following specific steps:
S3.1: after the time-series features are extracted, the adversarial learning process between the generator and the discriminator in the generative adversarial network must be balanced in order to optimize the anomaly detection and data repair functions. The generative adversarial network is optimized through the idea of a two-player game, and the discriminator is required to be stronger than the generator, otherwise the gradient easily vanishes; the generator is therefore usually trained once after the discriminator D has been trained several times. The generator G is first initialized and fixed, and training of the discriminator D starts: the real data F_T and the data F_T' forged by G are used as the inputs of D, pass through the bidirectional LSTM layer, the attention layer and the fully connected layer in turn, and the discrimination result PSY_T is finally output. If the input of the discriminator is F_T, i.e. real and normal hydrological data, the output judgment values at all time points should be as close to 1 as possible; otherwise the output should tend to 0. Clearly, for the discriminator D, whether the input is generated data or abnormal data, the output result is desired to be as close to 0 as possible.
S3.2: the optimization of the generator is similar to the traditional training process of a generative adversarial network model: with F_T as input, G(F_T) = F_T' is output through the LSTM layer and the fully connected layer. For the generator, the generated data should deceive the discriminator as far as possible; taking this as the optimization target, the generated data gradually approach the original data, achieving the effect of data repair. Finally, the global optimal solution is obtained if and only if P_{F_T} = P_{F_T'}.
S4: the method comprises the following steps of interacting a local client with a server to update global parameters, and specifically comprising the following steps:
after the first round of local training of each client is completed, the weights w obtained by the respective training are used k Send to server to update the global model parameters W of the new round z+1 . Different from the traditional centralized training method, the Federal learning updates the global training model through a safe parameter aggregation mechanism, and in addition, in order to reduce the communication overhead in the model parameter transmission process, a Federal averaging algorithm is adopted to accelerate the convergence of the model. I.e. the server is based on:
Figure BDA0003977884620000111
to update the global model parameters. Obtaining the latest W z+1 And then, sending the updated global model to the client k for the next round of optimization updating.
The overall generative adversarial network model is shown in Fig. 5.
The identification and repair process comprises the following steps: each client processes its own hydrological data into matrix sequences F_kT, where the test set contains abnormal values. Subsequently, the global weight parameters W updated in the last round are downloaded from the server; at this time, the generator G and the discriminator D at the local end both have the optimal data repair capability and the optimal anomaly discrimination capability. The client therefore first inputs the matrix sequence F_kT into the discriminator of the local model to obtain D(F_kT), i.e. the probability that each timestamp t in the sequence is normal (between 0 and 1, with values above 0.5 considered normal). If abnormal points are included, F_kT is input into the generator and the data are regenerated, i.e. data repair G(F_kT) = F_kT'; the parts finally identified as abnormal are replaced by the repaired data. Meanwhile, the repaired sequence F_kT' is restored to the original data by inverse normalization. At this point, the data anomaly detection and repair is finished.
This application case shows that the federated-learning-based method for anomaly detection and repair of hydrological telemetry data is effective. Compared with other designs, the method of the invention adopts a federated learning architecture for data privacy protection, and the discriminator and generator in the generative adversarial network are used for data anomaly detection and data repair respectively. In order to improve the model's ability to extract time-series features, an attention-based bidirectional long short-term memory network and an ordinary long short-term memory network are embedded in the discriminator and the generator of the model respectively. The model processes the hydrological data of the telemetry devices into time-series matrix sequences that serve as input; the bidirectional LSTM layer in the discriminator extracts the relevant time-series information, the result (the hidden-layer state) serves as the input of the attention layer to obtain the weight matrix, and finally the identification result is output through the fully connected layer to complete anomaly identification of the data. In addition, matrix sequences judged by the discriminator to contain abnormal data are also input into the generator, and data repair is completed using its ability to fit the data distribution. The experiments use the real hydrological data sets of the four telemetry stations in Hangzhou, Jinhua, Shaoxing and Lishui provided by the hydrological communication platform of Zhejiang Province, and the results fully demonstrate the feasibility and superiority of the model.
While the invention has been described in connection with specific embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A hydrological data anomaly identification and repair method based on a federated learning framework, comprising a model training process and an identification and repair process, wherein the model training process comprises the following steps:
S1: preprocessing training hydrological data and performing anomaly processing;
S2: building a federated learning architecture and initializing model parameters;
S3: the clients optimize the anomaly detection and data repair functions through adversarial learning;
S4: the local clients interact with the server to update the global parameters;
the identification and repair process specifically comprises: preprocessing the original hydrological data and inputting them into the trained model, the output being the repaired data.
2. The hydrological data anomaly identification and repair method based on the federated learning framework according to claim 1, wherein step S1 comprises the following steps:
S1.1: screening the hydrological data and removing noise data and repeated data;
S1.2: normalizing the screened hydrological data and processing them into matrix sequences F_T with the same time slot;
S1.3: artificially dirtying a set proportion of the processed matrix sequences F_T into abnormal data.
3. The hydrological data anomaly identification and repair method based on the federated learning framework according to claim 1, wherein step S2 comprises the following steps:
S2.1: taking K hydrological telemetry stations as clients and a cloud server as the server to build the federated learning framework, K being the total number of hydrological telemetry stations;
S2.2: defining the data set size of the k-th client as Data_k, 1 ≤ k ≤ K; the total data set size used for local training is then
Data = Σ_{k=1}^{K} Data_k
S2.3: the server initializes the global model parameters, namely the training parameters of the generative adversarial network and the LSTM network, and sends the global model and the initial parameters to each client.
4. The hydrological data anomaly identification and repair method based on the federated learning framework according to claim 2 or 3, wherein the model comprises a generator and a discriminator, the generator comprising an LSTM network and a fully connected layer, and the discriminator comprising a bidirectional LSTM network with an attention mechanism and a fully connected layer;
when predicting the hydrological data at time T+1, the hydrological data before time T+1 need to be processed into a matrix sequence F_T; the matrix sequence input to the generator first passes through a neuron comprising three gates that control the cell state, namely:
a forget gate, which obtains the information f_t to be discarded:
f_t = σ(W_f x_t + W_f h_{t-1} + b_f)
an input gate, which obtains the information i_t to be memorized and the candidate cell state C̃_t:
i_t = σ(W_i x_t + W_i h_{t-1} + b_i)
C̃_t = tanh(W_c x_t + W_c h_{t-1} + b_c)
according to the forgotten information f_t and the update information i_t, a new cell state is obtained:
C_t = f_t × C_{t-1} + i_t × C̃_t
and an output gate:
o_t = σ(W_o x_t + W_o h_{t-1} + b_o)
h_t = o_t · tanh(C_t)
the hidden-layer state h_t output by the generator at time t is finally calculated; C_t is used to carry long-term memory information and h_t to carry short-term memory information, and at initialization C_0 and h_0 default to all-zero matrices; x_t is the input of the matrix sequence F_T at the t-th time point, h_{t-1} denotes the hidden-layer state at time t-1, W denotes the weight vector of the gating unit or cell state indicated by its subscript, b denotes the corresponding bias, and σ is the activation function;
the final LSTM network output is the hidden state h_T of the last time step T, which integrates all useful information before it and is then sent to a fully connected layer network:
x_{T+1} = Linear(h_T)
giving the value x_{T+1} to be predicted at the next time;
The bidirectional LSTM network of the discriminator comprises a forward LSTM network layer and a reverse LSTM network layer, the structures of the forward LSTM network layer and the reverse LSTM network layer are the same as the structure of the LSTM network layer in the generator, a matrix sequence input to the forward LSTM network layer is a forward input, a forward output hidden layer state at the time t is recorded as a forward output hidden layer state
Figure FDA0003977884610000024
The matrix sequence input into the reverse LSTM network is reverse input, and the forward output hidden state at the time t is recorded as->
Figure FDA0003977884610000025
Hidden state h of bidirectional LSTM t Is calculated by the following formula, wherein->
Figure FDA0003977884610000026
For the Concat () function, combining the forward and backward hidden layer state information:
Figure FDA0003977884610000031
the attention tier extracts the weight matrix by the following formula:
α=softmax(w T tanh(H))
and the product r of H and the weight matrix alpha is taken as the output of the attention layer:
r=Hα T
where H is the output of the LSTM layer, i.e. hidden layer state information { H) at all time points 1 ,...,h T V, size v T, where v is hidden layer state information h T T is the length of the sequence, w T Then, the parameter vector obtained by training and learning is transposed, alpha is a weight matrix, and r is the output of the layer; then entering a full-connection layer network Linear () and an activation function sigmoid to fix the value to 0,1]Interval:
PSY T =Sigmoid(Linear(r))
obtaining the sequence F T Is true per timestamp t T Wherein PSY T Are in the size T1 and all lie in [0,1]]In the meantime.
5. The hydrological data anomaly identification and repair method based on the federated learning framework according to claim 4, wherein step S3 comprises the following steps:
S3.1: initializing and fixing the generator G, and starting to train the discriminator D; the real data F_T and the data F_T' forged by G are used as the inputs of D, pass through the bidirectional LSTM layer, the attention layer and the fully connected layer in turn, and the discrimination result is finally output; if the input of the discriminator is F_T, i.e. real and normal hydrological data, the output result is 1, otherwise the output is 0; the training loss function of the discriminator D is as follows:
L_D = −E_{x∼P_{F_T}}[log D(x)] − E_{x∼P_{F_T'}}[log(1 − D(x))]
where P_{F_T} is the real data distribution, P_{F_T'} is the generated or abnormal data distribution, and E_{x∼P}[·] denotes the expectation over x drawn from distribution P;
S3.2: optimization of the generator: with the sequence F_T to be repaired as input, G(F_T) = F_T' is output through the LSTM layer and the fully connected layer; the training loss function of the generator is as follows:
L_G = E_{x∼P_{F_T'}}[log(1 − D(x))]
finally, the global optimal solution is obtained if and only if P_{F_T} = P_{F_T'};
S3.3: the K clients compute their respective loss gradients L_k(w) to update the local models:
L_k(w) = (1/Data_k) Σ_{j=1}^{Data_k} l_j(x_j; w) + λ·s(w)
where s(·) is a regularization function, l_j(x_j; w) denotes the loss of the j-th sample, w is the local weight parameter, and λ ∈ [0,1] is used to balance the losses.
6. The hydrological data anomaly identification and repair method based on the federated learning framework according to claim 1, wherein step S4 comprises the following steps:
all K clients send their own local model training parameters w_k to the server to update the global parameter W_{z+1}:
W_{z+1} = Σ_{k=1}^{K} (n_k / n) · w_k
where n_k and w_k are respectively the number of samples on client k and its local weights, n is the total number of samples over all selected clients, and z denotes the training round; after the latest W_{z+1} is obtained, the updated global model is sent to client k again for the next round of optimization and updating.
7. The hydrological data anomaly identification and repair method based on the federated learning framework according to claim 4, wherein the identification and repair process comprises the following steps:
the k-th client processes its hydrological data into a matrix sequence F_kT and then downloads from the server the global weight parameters W updated in the last round; the k-th client first inputs the matrix sequence F_kT into the discriminator of the local model to obtain D(F_kT) = PSY_T, i.e. the probability in [0,1] that each timestamp t in the sequence F_kT is normal, with values above 0.5 considered normal; if abnormal points are included, the matrix sequence F_kT to be repaired is input into the generator and the data are reconstructed, i.e. data repair G(F_kT) = F_kT'; the parts identified as abnormal are replaced by the repaired data, and finally the repaired sequence F_kT' is restored to the original data by inverse normalization.
CN202211546472.8A 2022-12-02 2022-12-02 Hydrological data anomaly identification and repair method based on federated learning framework Pending CN115982658A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211546472.8A CN115982658A (en) 2022-12-02 2022-12-02 Hydrological data anomaly identification and repair method based on federated learning framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211546472.8A CN115982658A (en) 2022-12-02 2022-12-02 Hydrological data anomaly identification and repair method based on federated learning framework

Publications (1)

Publication Number Publication Date
CN115982658A true CN115982658A (en) 2023-04-18

Family

ID=85972963

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211546472.8A Pending CN115982658A (en) 2022-12-02 2022-12-02 Hydrological data anomaly identification and repair method based on federated learning framework

Country Status (1)

Country Link
CN (1) CN115982658A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI832767B (en) * 2023-05-19 2024-02-11 逢甲大學 Hydrological data analysis method and hydrological data analysis system
CN116821838A (en) * 2023-08-31 2023-09-29 浙江大学 Privacy protection abnormal transaction detection method and device
CN116821838B (en) * 2023-08-31 2023-12-29 浙江大学 Privacy protection abnormal transaction detection method and device


Legal Events

Code | Title/Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination