CN115982658A - Hydrological data anomaly identification and repair method based on federated learning framework - Google Patents

Hydrological data anomaly identification and repair method based on federated learning framework

Info

Publication number
CN115982658A
Authority
CN
China
Prior art keywords
data
hydrological
model
layer
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211546472.8A
Other languages
Chinese (zh)
Inventor
陈浙梁
童增来
姚东
李歆遒
言薇
徐斌
沈凯华
钱克宠
刘林海
张紫琳
王玉明
倪宪汉
李欢
吕耀光
金晶
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Zhejiang Hydrological Management Center
Original Assignee
Zhejiang Hydrological Management Center
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Zhejiang Hydrological Management Center
Priority to CN202211546472.8A
Publication of CN115982658A
Legal status: Pending

Classifications

    • Y — GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 — TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A — TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A 10/00 — TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE at coastal zones; at river basins
    • Y02A 10/40 — Controlling or monitoring, e.g. of flood or hurricane; Forecasting, e.g. risk assessment or mapping

Landscapes

  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a hydrological data anomaly identification and repair method based on a federated learning framework, comprising a model training process and an identification-and-repair process. In the model training process, the training hydrological data are first preprocessed and anomaly-processed; a federated learning architecture is then built, the server initializes the model parameters and sends the global model to each client. After receiving the model, each client learns the features of its local data: the model extracts context information with a bidirectional LSTM equipped with an attention mechanism, then optimizes the anomaly detection and data repair functions simultaneously through adversarial learning, and finally the model is updated through iterative interaction between the clients and the server. With this new method, anomaly identification and repair are carried out simultaneously while the privacy of the hydrological telemetry data is protected, providing support for improving hydrological forecasting performance and reducing losses caused by uncertain disasters.

Description

Hydrological data anomaly identification and repair method based on federated learning framework
Technical Field
The invention relates to the field of hydrological data processing, and in particular to a hydrological data anomaly identification and repair method based on a federated learning framework.
Background
With the increasing uncertainty of global natural disasters, the construction of intelligent hydrology is receiving more and more attention. Its aim is to build an integrated air-space-ground hydrological telemetry system with technologies such as cloud computing and big data at its core, so that hydrological phenomena occurring in nature can be observed and recorded more accurately and in real time, providing a data foundation for hydrological research. As the main source of hydrological data, hydrological telemetry equipment clearly carries the burden of data acquisition and storage. Whether the telemetry equipment can reliably provide true and accurate hydrological data directly affects basic decisions such as flood-control and drought-relief scheduling, ecological environment protection, and comprehensive development of water resources. However, during actual operation of the telemetry equipment, the acquired hydrological data often exhibit anomalies such as numerical errors, partial loss and long data gaps, caused by system faults, equipment aging, weak signals at remote sites and the like. This seriously affects the integrity, authenticity and accuracy of the hydrological data and directly degrades the statistical-analysis capability of the various hydrological models. Therefore, mining the potential features of the data through anomaly identification of hydrological data, while repairing the abnormal data at the same time, is of great significance for improving hydrological forecasting performance and reducing losses caused by uncertain disasters.
However, existing methods for anomaly identification and repair of hydrological data mainly have the following problems: 1) in practice, identification and repair of abnormal data usually need to be solved together, but most research focuses on anomaly detection and ignores the importance of repairing the abnormal data; 2) most models do not consider the latent time-series information of features such as water level, rainfall and flow in hydrological telemetry data, so the accuracy of anomaly identification is low and the quality of data repair is poor; 3) the privacy issues contained in the telemetry data are ignored.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides a generative adversarial network model based on a federated learning framework and a long short-term memory (LSTM) network, so that anomaly identification and repair of hydrological telemetry data can be completed simultaneously on the premise of protecting data privacy.
In the method, the original hydrological data of each client of the federated learning framework (i.e. each hydrological telemetry device) are first structurally modelled into corresponding time-series data; the server then initializes the parameters and sends the generative adversarial network model and the global model parameters to be optimized to each client. After receiving them, the client inputs the processed time series into the generative adversarial network: the discriminator network is used to identify abnormal sequences, the generator network reconstructs and repairs the abnormal sequences, and the two networks are trained adversarially and optimized step by step. Meanwhile, a long short-term memory network is embedded in the model and an attention mechanism is introduced to learn the potential characteristics of the data and capture its time dependencies. The client then sends the trained local model parameters to the server, which aggregates them into new global model parameters and sends these back to the clients. Finally, while the data privacy of the hydrological telemetry stations is protected, the hydrological telemetry data can be repaired and, at the same time, abnormal data can be identified.
The invention achieves the above aim through the following technical scheme: a hydrological data anomaly identification and repair method based on a federated learning framework comprises a model training process and an identification-and-repair process, wherein the model training process comprises the following steps:
S1: preprocessing training hydrological data and performing anomaly processing;
S2: building a federated learning architecture and initializing model parameters;
S3: the clients optimize the anomaly detection and data repair functions through adversarial learning;
S4: the local clients interact with the server to update the global parameters;
the identification and repair process specifically comprises: preprocessing the original hydrological data and inputting them into the trained model, the output being the repaired data.
Preferably, step S1 specifically comprises the following steps:
S1.1: screening the hydrological data and removing noise data and repeated data, i.e. retaining only data of research significance;
S1.2: normalizing the screened hydrological data and processing them into matrix sequences F_T with the same time slot;
S1.3: artificially dirtying a set proportion (about 10%) of the processed matrix sequences F_T into abnormal data, used for testing the anomaly detection and data repair functions. Dirtying includes adding any one or several of offset anomalies, sequence anomalies and extreme-value anomalies.
During formal identification and repair, the preprocessing of the data includes S1.1 and S1.2 but not S1.3.
Preferably, step S2 specifically includes the steps of:
S2.1: taking K hydrological telemetry stations as clients and a cloud server as the server (i.e. a trusted third party) to build the federated learning framework, K being the total number of hydrological telemetry stations;
S2.2: defining the data set size of the k-th client as Data_k, 1 ≤ k ≤ K; the total data set size used for local training is then
Data = Σ_{k=1}^{K} Data_k
S2.3: the server initializes the global model parameters, namely generates training parameters for the resisting network and the LSTM network, and sends the global model and the initial parameters to each client.
Preferably, the model comprises a generator and a discriminator, wherein the generator comprises an LSTM network and a full connection layer, and the discriminator comprises a bidirectional LSTM network with attention mechanism and a full connection layer;
when predicting the hydrological data at time T+1, the hydrological data before time T+1 need to be processed into a matrix sequence F_T; the matrix sequence input to the generator first passes through a neuron comprising three gates that control the cell state, namely:
a forget gate, which obtains the information f_t to be discarded:
f_t = σ(W_f x_t + W_f h_{t-1} + b_f)
an input gate, which obtains the information i_t to be memorized and the candidate cell state C̃_t (an intermediate variable):
i_t = σ(W_i x_t + W_i h_{t-1} + b_i)
C̃_t = tanh(W_c x_t + W_c h_{t-1} + b_c)
according to the forgotten information f_t and the update information i_t, the new cell state C_t is obtained (f_t × C_{t-1} denotes selective discarding of past information, and i_t × C̃_t denotes selective retention):
C_t = f_t × C_{t-1} + i_t × C̃_t
and an output gate:
o_t = σ(W_o x_t + W_o h_{t-1} + b_o)
h_t = o_t · tanh(C_t)
The hidden-layer state h_t output by the generator at time t is finally calculated. C_t is used to carry long-term memory information and h_t to carry short-term memory information; at initialization, C_0 and h_0 default to all-zero matrices. Here x_t is the input of the matrix sequence F_T at the t-th time point, h_{t-1} denotes the hidden-layer state at time t-1, W denotes the weight vector of the gating unit or cell state indicated by its subscript, and b denotes the corresponding bias: W_f and b_f are the weight vector and bias of the forget gate, W_i and b_i those of the input gate, W_c and b_c those of the candidate cell state, and W_o and b_o those of the output gate; σ is the activation function, generally the Sigmoid function, and tanh is also an activation function. In this process there are three inputs in total, namely the input value x_t at time t, the hidden state h_{t-1} output by the generator at the previous time, and the cell state C_{t-1} of the neuron; the outputs at the current time are the hidden state h_t of the generator and the cell state C_t. Finally, the LSTM sends the hidden state h_T of the last time step T, which integrates all useful information before it, to a fully connected layer network:
x_{T+1} = Linear(h_T)
and the value x_{T+1} to be predicted at the next time is obtained. Linear denotes a fully connected layer. The above is the network structure of the generator: it uses the gating mechanism to control the transmission state, memorizes information that needs to be remembered for a long time, forgets unimportant content, and mines time-series patterns of F_T such as relatively long intervals and delays. With the LSTM network structure, suppose a segment x_{t1}, ..., x_{tu} (tu < T) of the sequence F_T is missing or otherwise abnormal; all node information before that segment can be used for cyclic regression prediction, i.e. x_{t1} is predicted from the data nodes before time t1, x_{t1} is then filled into the originally abnormal position, x_{t2} is predicted again in combination with the preceding node information, and so on in a loop until the data repair process is finished;
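A possible PyTorch sketch of the generator described above is given below. The class name, the hidden size and the repair/predict_next helpers are assumptions made for this example; nn.LSTM is used in place of the hand-written gate equations, which it implements internally.

```python
import torch
import torch.nn as nn


class Generator(nn.Module):
    """Generator sketch: LSTM network + fully connected layer (sizes are assumptions)."""

    def __init__(self, n_features: int = 7, hidden_size: int = 64):
        super().__init__()
        # nn.LSTM already contains the forget gate f_t, input gate i_t, output gate o_t and cell state C_t.
        self.lstm = nn.LSTM(n_features, hidden_size, batch_first=True)
        self.fc = nn.Linear(hidden_size, n_features)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, T, n_features) matrix sequence F_T; returns a reconstruction F_T' of the same shape.
        h_all, _ = self.lstm(seq)
        return self.fc(h_all)

    def predict_next(self, seq: torch.Tensor) -> torch.Tensor:
        # x_{T+1} = Linear(h_T): predict the next time point from the last hidden state.
        _, (h_last, _) = self.lstm(seq)
        return self.fc(h_last[-1])

    @torch.no_grad()
    def repair(self, seq: torch.Tensor, anomaly_idx) -> torch.Tensor:
        # Cyclic regression repair: predict each abnormal point from the (already repaired) points
        # before it, write the prediction back, then move on to the next abnormal point.
        repaired = seq.clone()
        for t in sorted(anomaly_idx):
            if t > 0:                                   # at least one preceding point is needed
                repaired[:, t, :] = self.predict_next(repaired[:, :t, :])
        return repaired
```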
The bidirectional LSTM network of the discriminator comprises a forward LSTM network layer and a backward LSTM network layer, both with the same structure as the LSTM network layer in the generator. The matrix sequence fed to the forward LSTM network layer is the forward input, and the forward output hidden state at time t is denoted h_t^→; the matrix sequence fed to the backward LSTM network is the backward input, and the backward output hidden state at time t is denoted h_t^←. The hidden state h_t of the bidirectional LSTM is calculated by the following formula, where Concat() combines the forward and backward hidden-layer state information:
h_t = Concat(h_t^→, h_t^←)
Furthermore, in order to improve the learning ability of the discriminator, an attention mechanism is introduced; the attention layer extracts a weight matrix by the following formula:
α = softmax(w^T tanh(H))
and the product r of H and the weight matrix α is taken as the output of the attention layer:
r = H α^T
where H is the output of the bidirectional LSTM layer, i.e. the hidden-layer state information at all time points {h_1, ..., h_T}, of size v × T, v being the dimension of the hidden-layer state information h_T and T the length of the sequence; w^T is the transpose of a parameter vector obtained by training and learning, which is continuously optimized as the model is trained; α is the weight matrix and r is the output of this layer. The output then enters a fully connected layer network Linear() and an activation function Sigmoid, which fixes the value to the [0,1] interval:
PSY_T = Sigmoid(Linear(r))
giving the probability PSY_T that each timestamp t of the sequence F_T is real, where PSY_T is of size T × 1 and all its entries lie in [0,1]. The above is the overall structure of the discriminator network: it takes the sequence F_T as input and PSY_T as final output, and performs anomaly detection on the sequence timestamp by timestamp. By adding the above structural improvements to the generative adversarial network, the ability of the discriminator to detect anomalies and of the generator to fit the data can be enhanced simultaneously, thereby improving the performance of the model as a whole.
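A corresponding sketch of the discriminator, again with assumed layer sizes, follows. The attention weights are applied per timestamp here so that the output keeps the T-dimensional shape of PSY_T; this is one possible reading of the formulas above, not the only one.

```python
import torch
import torch.nn as nn


class Discriminator(nn.Module):
    """Discriminator sketch: bidirectional LSTM + attention + fully connected layer (sizes are assumptions)."""

    def __init__(self, n_features: int = 7, hidden_size: int = 64):
        super().__init__()
        self.bilstm = nn.LSTM(n_features, hidden_size, batch_first=True, bidirectional=True)
        self.w = nn.Parameter(torch.randn(2 * hidden_size))    # trainable attention vector w
        self.fc = nn.Linear(2 * hidden_size, 1)                 # Linear() before the Sigmoid

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # seq: (batch, T, n_features); H concatenates the forward and backward hidden states.
        h, _ = self.bilstm(seq)                                  # (batch, T, 2*hidden)
        # alpha = softmax(w^T tanh(H)): one attention weight per timestamp.
        alpha = torch.softmax(torch.tanh(h) @ self.w, dim=1)     # (batch, T)
        # Weight every hidden state by its attention score (the product of H and alpha).
        r = h * alpha.unsqueeze(-1)                              # (batch, T, 2*hidden)
        # PSY_T = Sigmoid(Linear(r)): probability that each timestamp is normal/real.
        return torch.sigmoid(self.fc(r)).squeeze(-1)             # (batch, T)
```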
Preferably, step S3 specifically comprises the following steps:
S3.1: initializing and fixing the generator G, and starting to train the discriminator D. The real data F_T and the data F_T' forged by G are used as the inputs of D and pass through the bidirectional LSTM layer, the attention layer and the fully connected layer in turn, and the discrimination result is finally output. If the input of the discriminator is F_T, i.e. real and normal hydrological data, the output judgment values at all time points should be as close to 1 as possible; otherwise the output should tend to 0. Since the classifier of D generally uses the Sigmoid function, training the discriminator D is a process of minimizing its cross entropy, with the following loss function:
L_D = −E_{x∼P_{F_T}}[log D(x)] − E_{x∼P_{F_T'}}[log(1 − D(x))]
where P_{F_T} is the distribution of the real data samples, P_{F_T'} is the distribution of the generated or abnormal data samples, and E_{x∼P}[·] denotes the expectation over x drawn from distribution P. Clearly, for the discriminator D, whether the input is generated data or abnormal data, the output result is desired to be as close to 0 as possible;
S3.2: optimization of the generator: with the sequence F_T to be repaired as input, G(F_T) = F_T' is output through the LSTM layer and the fully connected layer. For the generator, the generated data should deceive the discriminator as far as possible, so its training is a process of maximizing the cross entropy, with the following loss function:
L_G = E_{x∼P_{F_T'}}[log(1 − D(x))]
Finally, the global optimal solution is obtained if and only if P_{F_T} = P_{F_T'};
S3.3: the K clients compute their respective loss gradients L_k(w) to update the local models:
L_k(w) = (1/Data_k) Σ_{j=1}^{Data_k} l_j(x_j; w) + λ·s(w)
where s(·) is a regularization function, l_j(x_j; w) denotes the loss of the j-th sample, w is the local weight parameter, and λ ∈ [0,1] is used to balance the losses.
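The adversarial optimization of S3.1–S3.3 could be organised roughly as in the sketch below, assuming the generator maps a sequence F_T to a reconstruction F_T' of the same shape and the discriminator returns a per-timestamp probability. The learning rate, the number of discriminator steps per generator step, and the use of an L2 penalty as the regularization term s(·) are illustrative choices, not values fixed by the invention.

```python
import torch
import torch.nn as nn


def local_update(generator, discriminator, loader, lam=0.1, d_steps=3, lr=1e-3):
    """One round of local adversarial training (S3.1-S3.3), returning the updated local parameters w_k."""
    bce = nn.BCELoss()
    opt_d = torch.optim.Adam(discriminator.parameters(), lr=lr)
    opt_g = torch.optim.Adam(generator.parameters(), lr=lr)

    for real in loader:                                    # real: (batch, T, n_features) = F_T
        # S3.1: train D several times per G step; D should output ~1 on F_T and ~0 on F_T'.
        for _ in range(d_steps):
            fake = generator(real).detach()
            d_real, d_fake = discriminator(real), discriminator(fake)
            loss_d = bce(d_real, torch.ones_like(d_real)) + bce(d_fake, torch.zeros_like(d_fake))
            opt_d.zero_grad()
            loss_d.backward()
            opt_d.step()

        # S3.2: train G so that D(F_T') is pushed towards 1 ("fool the discriminator").
        fake = generator(real)
        d_fake = discriminator(fake)
        loss_g = bce(d_fake, torch.ones_like(d_fake))

        # S3.3: add lambda * s(w); an L2 penalty stands in for the regularization function s(.) here.
        reg = sum(p.pow(2).sum() for p in generator.parameters())
        loss_g = loss_g + lam * reg
        opt_g.zero_grad()
        loss_g.backward()
        opt_g.step()

    return generator.state_dict(), discriminator.state_dict()
```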
Preferably, step S4 specifically includes the steps of:
All K clients send their own local model training parameters w_k to the server to update the global parameter W_{z+1}. Unlike traditional centralized training methods, federated learning updates the global training model through a secure parameter-aggregation mechanism; in addition, to reduce the communication overhead during the transmission of model parameters, the federated averaging algorithm is adopted to accelerate the convergence of the model. That is, the server updates the global model parameters according to:
W_{z+1} = Σ_{k=1}^{K} (n_k / n) · w_k
where n_k and w_k are respectively the number of samples on client k and its local weights, and n is the total number of samples over all selected clients. After the latest global weights W_{z+1} are obtained, the updated global model is sent to client k again for the next round of optimization and updating.
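A minimal sketch of this aggregation step is given below, assuming the clients exchange PyTorch state_dicts; the function name is an assumption for the example.

```python
import copy


def fedavg(client_states, client_sizes):
    """Federated averaging (S4): W_{z+1} = sum_k (n_k / n) * w_k."""
    n = float(sum(client_sizes))                      # total number of samples over all selected clients
    global_state = copy.deepcopy(client_states[0])
    for key in global_state:
        global_state[key] = sum(
            (n_k / n) * state[key].float()            # weight each client's parameters by n_k / n
            for state, n_k in zip(client_states, client_sizes)
        )
    return global_state
```

In this sketch the averaging would be applied separately to the generator and discriminator state_dicts, after which the server loads the result back into the global model and broadcasts it for the next round.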
Preferably, the identification and repair process comprises the following steps:
the k-th client processes its hydrological data into a matrix sequence F_kT and then downloads from the server the global weight parameters W updated in the last round; at this point, the generator G and the discriminator D at the local end both have the optimal data repair capability and the optimal anomaly discrimination capability. The k-th client therefore first inputs the matrix sequence F_kT into the discriminator of the local model to obtain D(F_kT) = PSY_T, i.e. the probability in [0,1] that each timestamp t in the sequence F_kT is normal, with values above 0.5 considered normal;
if abnormal points are included, the matrix sequence F_kT to be repaired is input into the generator and the data are reconstructed, i.e. data repair G(F_kT) = F_kT'; the parts finally identified as abnormal are replaced by the repaired data, and the repaired sequence F_kT' is finally restored to the original data by inverse normalization. At this point, the data anomaly detection and repair is finished.
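Putting the pieces together, the identification-and-repair pass for one client might look roughly like the following sketch, reusing the assumed Generator and Discriminator interfaces from the earlier examples; the 0.5 threshold follows the description above.

```python
import torch


@torch.no_grad()
def identify_and_repair(generator, discriminator, f_kt, data_min, data_max, threshold=0.5):
    """Flag abnormal timestamps in F_kT, replace them with the generator's reconstruction, de-normalize."""
    psy = discriminator(f_kt).squeeze(0)                       # PSY_T: probability each timestamp is normal
    anomaly_idx = (psy <= threshold).nonzero(as_tuple=True)[0].tolist()

    repaired = f_kt.clone()
    if anomaly_idx:
        reconstruction = generator(f_kt)                       # G(F_kT) = F_kT'
        repaired[:, anomaly_idx, :] = reconstruction[:, anomaly_idx, :]

    # Inverse min-max normalization back to the original units.
    mins = torch.as_tensor(data_min, dtype=repaired.dtype)
    maxs = torch.as_tensor(data_max, dtype=repaired.dtype)
    return repaired * (maxs - mins) + mins
```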
The invention has the advantages that the anomaly identification and repair of the hydrological telemetering data can be completed simultaneously on the premise of protecting the data privacy, and the accuracy and the reliability are high.
Drawings
FIG. 1 is a flow chart of model training according to the present invention;
FIG. 2 is a diagram of a federated learning framework training architecture of the present invention;
FIG. 3 is a diagram of an internal structure of a long-short term memory network according to the present invention;
FIG. 4 is a schematic diagram of the attention-based bidirectional long short-term memory network model of the present invention;
FIG. 5 is a diagram of the generative adversarial network model architecture of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in further detail below.
The embodiment of the invention provides a hydrological data anomaly detection and repair method based on a federated learning framework, comprising a model training process and an identification-and-repair process; as shown in Fig. 1 and Fig. 2, the training process comprises the following steps:
S1: preprocessing the training hydrological data and performing anomaly processing, i.e. extracting useful data from the original data set, with the following steps:
S1.1: the original data are cleaned. The data are hydrological telemetry data acquired over the 90 days from 1 January 2022 to 31 March 2022 by four hydrological telemetry stations in Hangzhou City, Jinhua City, Shaoxing City and Lishui City of Zhejiang Province, which ensures that the data source for model training and testing is true and reliable. The attributes of the data include:
No. | Name | Note
1 | sid | Telemetry terminal SID code
2 | Longitude_hy | Longitude of the hydrological telemetry station
3 | Latitude_hy | Latitude of the hydrological telemetry station
4 | Terminal_na | Terminal name
5 | Current_sta | Current state
6 | Telemetry_pro | Telemetry project
7 | Monitoring_ele | Monitoring element
8 | Watershed_na | Name of the river basin
9 | Rain_area | Rain-collecting area
10 | Responsible_un | Responsible unit
11 | Sensor_model | Sensor model
12 | Device_In | Device information
13 | …… | ……
TABLE 1
As can be seen from the table, the original data set is rich in content, involves a complex amount of information, and covers a great deal of private information. Clearly, for the goals of data anomaly detection and repair, the sensor monitoring data are the target. In addition, because the devices at different telemetry sites differ in model and geographical location, the data-recording intervals, the collected data attributes and so on may differ. Redundant data attributes are therefore screened out, the attributes common to the hydrological devices of the four telemetry stations are extracted, and one collection record is kept every five minutes, which markedly reduces the data volume and facilitates subsequent analysis and calculation. The extracted data attributes are as follows:
[Table 2: common data attributes extracted from the four telemetry stations]
After data extraction, some attributes are found not to change for long periods under normal conditions, and such attributes are judged not to meet the requirements of the experimental study; for example, for the rainfall attribute, if no rainfall occurs for several consecutive days, the attribute value remains 0 for a long time. In addition, data analysis shows that, because the water level varies only slightly, the current-water-level value and the 5-minute water-level value are always identical, so the redundant one is screened out. The hydrological telemetry data targeted by the invention therefore mainly contain the following attributes:
No. | Name | Note
1 | Sid | Telemetry terminal SID code
2 | Flow | Flow rate
3 | Tem | Ambient temperature
4 | Cwl | Current water level
5 | Vol | Supply voltage
6 | Ifr | Index flow velocity
7 | Iwt | Instantaneous water temperature
8 | Cifr | Current instantaneous flow rate
TABLE 3
S1.2: the data are normalized using the following formula:
x = (X − min) / (max − min)
where X is the data before normalization, max and min are respectively the maximum and minimum value of the attribute data, and x is the data after normalization. After the data of all attributes are normalized, the values of all attributes at the same time point are taken as the features of that time point; that is, each time point contains the flow, ambient temperature, current water level, supply voltage, index flow velocity, instantaneous water temperature and current instantaneous flow rate at that time. In order to better extract sequence features, a window of two hours, i.e. 24 time points, is taken as the sequence length, and each matrix sequence F_T is constructed accordingly.
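A minimal NumPy sketch of this normalization and windowing step is shown below; the function name is an assumption, and the two-hour window of 24 five-minute points follows the embodiment.

```python
import numpy as np


def build_sequences(raw: np.ndarray, window: int = 24):
    """S1.2 sketch: min-max normalize each attribute and cut the record into matrix sequences F_T.

    raw: (n_points, n_features) array holding the Table 3 attributes per five-minute time point.
    """
    mins, maxs = raw.min(axis=0), raw.max(axis=0)
    normalized = (raw - mins) / (maxs - mins + 1e-12)          # x = (X - min) / (max - min)
    n_seq = len(normalized) // window
    sequences = normalized[: n_seq * window].reshape(n_seq, window, raw.shape[1])
    return sequences, mins, maxs                                # keep mins/maxs for later de-normalization
```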
S1.3: to test the performance of the method of the invention, a test set is built by artificially dirtying the matrix sequences F_T constructed in S1.2, as follows: 1. about 10% of the data F_T are selected for artificial dirtying; 2. a masking matrix F_anomaly is applied to F_T to generate missing-value anomalies; 3. offset anomalies, sequence anomalies and extreme-value anomalies are added to F_T. A sketch of this step is given below.
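The artificial dirtying of S1.3 could be sketched as follows; the 10% ratio follows the embodiment, while the anomaly magnitudes and the omission of sequence-type anomalies (e.g. shuffling a short sub-window) are simplifications made for the example.

```python
import numpy as np


def make_dirty(sequences: np.ndarray, ratio: float = 0.10, seed: int = 0):
    """S1.3 sketch: artificially dirty about `ratio` of the normalized sequences to build the test set."""
    rng = np.random.default_rng(seed)
    dirty = sequences.copy()
    picked = rng.choice(len(dirty), size=max(1, int(ratio * len(dirty))), replace=False)
    for i in picked:
        kind = rng.choice(["missing", "offset", "extremum"])
        t = rng.integers(0, dirty.shape[1])
        if kind == "missing":                  # masking: simulate a lost record
            dirty[i, t, :] = 0.0
        elif kind == "offset":                 # offset anomaly: shift one time point
            dirty[i, t, :] += 0.5
        else:                                  # extreme-value anomaly
            dirty[i, t, :] = 1.5
    return dirty, picked
```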
S2: building the federated learning architecture and initializing the model parameters, with the following specific steps:
S2.1: the four hydrological telemetry stations in Hangzhou City, Jinhua City, Shaoxing City and Lishui City of Zhejiang Province are used as local clients, and a cloud server is used as the trusted third party to build the federated learning framework. The server configuration is: host operating system Ubuntu 18.04, 128 GB of memory, an Intel(R) Xeon(R) Gold CPU with 16 cores and two threads per core, and an NVIDIA Quadro P6000 graphics card.
S2.2: the data set of each client is the hydrological telemetry data collected over the 90 days from 1 January 2022 to 31 March 2022. The data of the first 30 days are used as the training set and the data of the whole 90 days as the test set, and about 10% of the test data, i.e. 9 days of data, are processed into abnormal data F_anomaly. Defining the data set size of the k-th client as Data_k, the total data set size used for local training is Data = Data_1 + Data_2 + Data_3 + Data_4.
S2.3: the server initializes the model parameters W and sends the global model to each client for their respective training of their own data set.
The model comprises a generator and a discriminator, wherein the generator comprises an LSTM network and a full connection layer, and the discriminator comprises a bidirectional LSTM network with an attention mechanism and a full connection layer; the specific process of extracting the time sequence characteristic information by the model is as follows:
First, each client receives the global model sent by the server, and then training on the local data sets begins; the training process is completed synchronously by the four local clients. Meanwhile, the loss gradient of the local data is calculated (where s(·) is a regularization function, w is the local weight parameter, and λ ∈ [0,1]):
L_k(w) = (1/Data_k) Σ_{j=1}^{Data_k} l_j(x_j; w) + λ·s(w)
The processed matrix sequences F_T serve as the input of the model. Hydrological data tend to have significant time-series characteristics, since they are a collection of hydrological elements such as water level, flow and voltage observed over time. Clearly, neither the generator nor the discriminator originally has a specific structure for processing time-series data; the characteristic time-series properties of the hydrological data are therefore not considered during training, which is one of the main reasons why the generator fits real data poorly and the discriminator identifies anomalies with insufficient precision. To learn the underlying spatial distribution of the data, temporal feature information is extracted using a bidirectional LSTM with an attention mechanism. For the generator, as shown in Fig. 3, with F_T as the input sequence, the data first pass through the three gates that control the cell state, namely the input gate, the forget gate and the output gate; then, according to the gate functions and the formula:
h_t = o_t · tanh(C_t)
the hidden-layer state h_t at time t is finally calculated.
For the arbiter, in order to better learn the characteristics of the data, as shown in fig. 4, a reverse LSTM layer is added on the original forward LSTM network layer, and the available information of the network is increased by considering the context of two directions. The network structure comprising two forwarding elements
Figure BDA0003977884620000101
And passed backwards>
Figure BDA0003977884620000102
Then the hidden state h t Is calculated by the following formula, wherein->
Figure BDA0003977884620000103
For the Concat () function, combine the forward and backward hidden layer state information: />
Figure BDA0003977884620000104
Furthermore, in order to improve the learning ability of the discriminator, attention mechanism is introduced. This layer extracts the weight matrix by the following formula:
α=softmax(w T tanh(H))
and the product r of H and the weight matrix alpha is taken as the output of the attention layer:
r=Hα T
where H is the output of the bi-directional LSTM layer, i.e. all timesHidden layer state information of intermediate point { h } 1 ,...,h T H, size v x T, where v is hidden layer information h T T is the length of the sequence, w T It is the transpose of a parameter vector matrix, and is continuously optimized by the training of the model, where α is the weight matrix and r is the output of the layer. The value is then fixed to 0,1 at the time of entering the full connection layer network Linear () and the activation function sigmoid]Interval:
PSY T =Sigmoid(Linear(r))
obtaining the sequence F T Is true per timestamp t T Wherein PSY T Are all in the size of [0,1]]In between. The above is the overall structure of the discriminator network, in the sequence F T As input, PSY T As a final output, anomaly detection is performed on the sequence time stamp by time stamp. By adding the above structural improvements to the generation of the countermeasure network, the ability of the discriminators to detect anomalies and the generator to fit data can be enhanced simultaneously, thereby improving the performance of the model as a whole.
S3: the clients optimize the anomaly detection and data repair functions through adversarial learning, with the following specific steps:
S3.1: after the time-series features are extracted, the adversarial learning process between the generator and the discriminator in the generative adversarial network must be balanced in order to optimize the anomaly detection and data repair functions. The generative adversarial network is optimized through the idea of a two-player game, and the discriminator is required to be stronger than the generator, otherwise the gradient easily vanishes; the generator is therefore usually trained once after the discriminator D has been trained several times. The generator G is first initialized and fixed, and training of the discriminator D starts: the real data F_T and the data F_T' forged by G are used as the inputs of D, pass through the bidirectional LSTM layer, the attention layer and the fully connected layer in turn, and the discrimination result PSY_T is finally output. If the input of the discriminator is F_T, i.e. real and normal hydrological data, the output judgment values at all time points should be as close to 1 as possible; otherwise the output should tend to 0. Clearly, for the discriminator D, whether the input is generated data or abnormal data, the output result is desired to be as close to 0 as possible.
S3.2: the optimization of the generator is similar to the traditional training process of a generative adversarial network model: with F_T as input, G(F_T) = F_T' is output through the LSTM layer and the fully connected layer. For the generator, the generated data should deceive the discriminator as far as possible; taking this as the optimization target, the generated data gradually approach the original data, achieving the effect of data repair. Finally, the global optimal solution is obtained if and only if P_{F_T} = P_{F_T'}.
S4: the method comprises the following steps of interacting a local client with a server to update global parameters, and specifically comprising the following steps:
after the first round of local training of each client is completed, the weights w obtained by the respective training are used k Send to server to update the global model parameters W of the new round z+1 . Different from the traditional centralized training method, the Federal learning updates the global training model through a safe parameter aggregation mechanism, and in addition, in order to reduce the communication overhead in the model parameter transmission process, a Federal averaging algorithm is adopted to accelerate the convergence of the model. I.e. the server is based on:
Figure BDA0003977884620000111
to update the global model parameters. Obtaining the latest W z+1 And then, sending the updated global model to the client k for the next round of optimization updating.
The overall generative adversarial network model is shown in Fig. 5.
The identification and repair process comprises the following steps: each client processes its own hydrological data into matrix sequences F_kT, where the test set contains abnormal values. Subsequently, the global weight parameters W updated in the last round are downloaded from the server; at this time, the generator G and the discriminator D at the local end both have the optimal data repair capability and the optimal anomaly discrimination capability. The client therefore first inputs the matrix sequence F_kT into the discriminator of the local model to obtain D(F_kT), i.e. the probability that each timestamp t in the sequence is normal (between 0 and 1, with values above 0.5 considered normal). If abnormal points are included, F_kT is input into the generator and the data are regenerated, i.e. data repair G(F_kT) = F_kT'; the parts finally identified as abnormal are replaced by the repaired data. Meanwhile, the repaired sequence F_kT' is restored to the original data by inverse normalization. At this point, the data anomaly detection and repair is finished.
This application case shows that the federated-learning-based method for anomaly detection and repair of hydrological telemetry data is effective. Compared with other designs, the method of the invention adopts a federated learning architecture for data privacy protection, and the discriminator and generator in the generative adversarial network are used for data anomaly detection and data repair respectively. In order to improve the model's ability to extract time-series features, an attention-based bidirectional long short-term memory network and an ordinary long short-term memory network are embedded in the discriminator and the generator of the model respectively. The model processes the hydrological data of the telemetry devices into time-series matrix sequences that serve as input; the bidirectional LSTM layer in the discriminator extracts the relevant time-series information, the result (the hidden-layer state) serves as the input of the attention layer to obtain the weight matrix, and finally the identification result is output through the fully connected layer to complete anomaly identification of the data. In addition, matrix sequences judged by the discriminator to contain abnormal data are also input into the generator, and data repair is completed using its ability to fit the data distribution. The experiments use the real hydrological data sets of the four telemetry stations in Hangzhou, Jinhua, Shaoxing and Lishui provided by the hydrological communication platform of Zhejiang Province, and the results fully demonstrate the feasibility and superiority of the model.
While the invention has been described in connection with specific embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the invention as defined by the appended claims.

Claims (7)

1. A hydrological data anomaly identification and repair method based on a federated learning framework, comprising a model training process and an identification and repair process, wherein the model training process comprises the following steps:
S1: preprocessing training hydrological data and performing anomaly processing;
S2: building a federated learning architecture and initializing model parameters;
S3: the clients optimize the anomaly detection and data repair functions through adversarial learning;
S4: the local clients interact with the server to update the global parameters;
the identification and repair process specifically comprises: preprocessing the original hydrological data and inputting them into the trained model, the output being the repaired data.
2. The hydrological data anomaly identification and repair method based on the federated learning framework according to claim 1, wherein step S1 comprises the following steps:
S1.1: screening the hydrological data and removing noise data and repeated data;
S1.2: normalizing the screened hydrological data and processing them into matrix sequences F_T with the same time slot;
S1.3: artificially dirtying a set proportion of the processed matrix sequences F_T into abnormal data.
3. The hydrological data anomaly identification and repair method based on the federated learning framework according to claim 1, wherein step S2 comprises the following steps:
S2.1: taking K hydrological telemetry stations as clients and a cloud server as the server to build the federated learning framework, K being the total number of hydrological telemetry stations;
S2.2: defining the data set size of the k-th client as Data_k, 1 ≤ k ≤ K; the total data set size used for local training is then
Data = Σ_{k=1}^{K} Data_k
S2.3: the server initializes the global model parameters, namely the training parameters of the generative adversarial network and the LSTM network, and sends the global model and the initial parameters to each client.
4. The hydrological data anomaly identification and repair method based on the federated learning framework according to claim 2 or 3, wherein the model comprises a generator and a discriminator, the generator comprising an LSTM network and a fully connected layer, and the discriminator comprising a bidirectional LSTM network with an attention mechanism and a fully connected layer;
when predicting the hydrological data at time T+1, the hydrological data before time T+1 need to be processed into a matrix sequence F_T; the matrix sequence input to the generator first passes through a neuron comprising three gates that control the cell state, namely:
a forget gate, which obtains the information f_t to be discarded:
f_t = σ(W_f x_t + W_f h_{t-1} + b_f)
an input gate, which obtains the information i_t to be memorized and the candidate cell state C̃_t:
i_t = σ(W_i x_t + W_i h_{t-1} + b_i)
C̃_t = tanh(W_c x_t + W_c h_{t-1} + b_c)
according to the forgotten information f_t and the update information i_t, a new cell state is obtained:
C_t = f_t × C_{t-1} + i_t × C̃_t
and an output gate:
o_t = σ(W_o x_t + W_o h_{t-1} + b_o)
h_t = o_t · tanh(C_t)
the hidden-layer state h_t output by the generator at time t is finally calculated; C_t is used to carry long-term memory information and h_t to carry short-term memory information, and at initialization C_0 and h_0 default to all-zero matrices; x_t is the input of the matrix sequence F_T at the t-th time point, h_{t-1} denotes the hidden-layer state at time t-1, W denotes the weight vector of the gating unit or cell state indicated by its subscript, b denotes the corresponding bias, and σ is the activation function;
the final LSTM network output is the hidden state h_T of the last time step T, which integrates all useful information before it and is then sent to a fully connected layer network:
x_{T+1} = Linear(h_T)
giving the value x_{T+1} to be predicted at the next time;
The bidirectional LSTM network of the discriminator comprises a forward LSTM network layer and a reverse LSTM network layer, the structures of the forward LSTM network layer and the reverse LSTM network layer are the same as the structure of the LSTM network layer in the generator, a matrix sequence input to the forward LSTM network layer is a forward input, a forward output hidden layer state at the time t is recorded as a forward output hidden layer state
Figure FDA0003977884610000024
The matrix sequence input into the reverse LSTM network is reverse input, and the forward output hidden state at the time t is recorded as->
Figure FDA0003977884610000025
Hidden state h of bidirectional LSTM t Is calculated by the following formula, wherein->
Figure FDA0003977884610000026
For the Concat () function, combining the forward and backward hidden layer state information:
Figure FDA0003977884610000031
the attention tier extracts the weight matrix by the following formula:
α=softmax(w T tanh(H))
and the product r of H and the weight matrix alpha is taken as the output of the attention layer:
r=Hα T
where H is the output of the LSTM layer, i.e. hidden layer state information { H) at all time points 1 ,...,h T V, size v T, where v is hidden layer state information h T T is the length of the sequence, w T Then, the parameter vector obtained by training and learning is transposed, alpha is a weight matrix, and r is the output of the layer; then entering a full-connection layer network Linear () and an activation function sigmoid to fix the value to 0,1]Interval:
PSY T =Sigmoid(Linear(r))
obtaining the sequence F T Is true per timestamp t T Wherein PSY T Are in the size T1 and all lie in [0,1]]In the meantime.
5. The hydrological data anomaly identification and repair method based on the federated learning framework according to claim 4, wherein step S3 comprises the following steps:
S3.1: initializing and fixing the generator G, and starting to train the discriminator D; the real data F_T and the data F_T' forged by G are used as the inputs of D, pass through the bidirectional LSTM layer, the attention layer and the fully connected layer in turn, and the discrimination result is finally output; if the input of the discriminator is F_T, i.e. real and normal hydrological data, the output result is 1, otherwise the output is 0; the training loss function of the discriminator D is as follows:
L_D = −E_{x∼P_{F_T}}[log D(x)] − E_{x∼P_{F_T'}}[log(1 − D(x))]
where P_{F_T} is the real data distribution, P_{F_T'} is the generated or abnormal data distribution, and E_{x∼P}[·] denotes the expectation over x drawn from distribution P;
S3.2: optimization of the generator: with the sequence F_T to be repaired as input, G(F_T) = F_T' is output through the LSTM layer and the fully connected layer; the training loss function of the generator is as follows:
L_G = E_{x∼P_{F_T'}}[log(1 − D(x))]
finally, the global optimal solution is obtained if and only if P_{F_T} = P_{F_T'};
S3.3: the K clients compute their respective loss gradients L_k(w) to update the local models:
L_k(w) = (1/Data_k) Σ_{j=1}^{Data_k} l_j(x_j; w) + λ·s(w)
where s(·) is a regularization function, l_j(x_j; w) denotes the loss of the j-th sample, w is the local weight parameter, and λ ∈ [0,1] is used to balance the losses.
6. The hydrological data anomaly identification and repair method based on the federated learning framework according to claim 1, wherein step S4 comprises the following steps:
all K clients send their own local model training parameters w_k to the server to update the global parameter W_{z+1}:
W_{z+1} = Σ_{k=1}^{K} (n_k / n) · w_k
where n_k and w_k are respectively the number of samples on client k and its local weights, n is the total number of samples over all selected clients, and z denotes the training round; after the latest W_{z+1} is obtained, the updated global model is sent to client k again for the next round of optimization and updating.
7. The hydrological data anomaly identification and repair method based on the federated learning framework according to claim 4, wherein the identification and repair process comprises the following steps:
the k-th client processes its hydrological data into a matrix sequence F_kT and then downloads from the server the global weight parameters W updated in the last round; the k-th client first inputs the matrix sequence F_kT into the discriminator of the local model to obtain D(F_kT) = PSY_T, i.e. the probability in [0,1] that each timestamp t in the sequence F_kT is normal, with values above 0.5 considered normal; if abnormal points are included, the matrix sequence F_kT to be repaired is input into the generator and the data are reconstructed, i.e. data repair G(F_kT) = F_kT'; the parts identified as abnormal are replaced by the repaired data, and finally the repaired sequence F_kT' is restored to the original data by inverse normalization.
CN202211546472.8A 2022-12-02 2022-12-02 Hydrological data anomaly identification and repair method based on federated learning framework Pending CN115982658A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211546472.8A CN115982658A (en) 2022-12-02 2022-12-02 Hydrological data anomaly identification and repair method based on federated learning framework

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211546472.8A CN115982658A (en) 2022-12-02 2022-12-02 Hydrological data anomaly identification and repair method based on federated learning framework

Publications (1)

Publication Number Publication Date
CN115982658A true CN115982658A (en) 2023-04-18

Family

ID=85972963

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211546472.8A Pending CN115982658A (en) 2022-12-02 2022-12-02 Hydrological data anomaly identification and repair method based on federated learning framework

Country Status (1)

Country Link
CN (1) CN115982658A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI832767B (en) * 2023-05-19 2024-02-11 逢甲大學 Hydrological data analysis method and hydrological data analysis system
CN116821838A (en) * 2023-08-31 2023-09-29 浙江大学 Privacy protection abnormal transaction detection method and device
CN116821838B (en) * 2023-08-31 2023-12-29 浙江大学 Privacy protection abnormal transaction detection method and device


Legal Events

Code | Title/Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination