CN113010504A - Electric power data anomaly detection method and system based on LSTM and improved K-means algorithm

Info

Publication number
CN113010504A
Authority
CN
China
Prior art keywords
data
time
power
value
power sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202110239950.XA
Other languages
Chinese (zh)
Other versions
CN113010504B (en)
Inventor
王子涵
仲春林
刘述波
王国际
方超
郑安宁
张凡
姚鹏
姜宇轩
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangsu Fangtian Power Technology Co Ltd
Original Assignee
Jiangsu Fangtian Power Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangsu Fangtian Power Technology Co Ltd
Priority to CN202110239950.XA
Publication of CN113010504A
Application granted
Publication of CN113010504B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474 Sequence data queries, e.g. querying versioned data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/048 Activation functions
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The invention discloses a method and a system for detecting power data anomalies based on LSTM and an improved K-means algorithm, in the technical field of power data analysis. The method comprises the following steps: inputting the collected power sequence data of a user into a trained LSTM model, extracting the time-sequence characteristics of the power sequence data, and constructing a sample data set; and, based on an improved K-means algorithm, detecting abnormal data in the power sequence data by taking the constructed sample data set as input. The method and the system extract the time-sequence characteristics of the power data and, at the same time, identify abnormal power data efficiently when the volume of power data is large.

Description

Electric power data anomaly detection method and system based on LSTM and improved K-means algorithm
Technical Field
The invention belongs to the technical field of electric power data analysis, and particularly relates to an electric power data anomaly detection method and system based on LSTM and an improved K-means algorithm.
Background
Because the current production management system relies on a very large number of acquisition terminals, must collect a large volume of power data at high frequency, and uses a variety of transmission modes, the quality of the power data it gathers is uneven. Moreover, information such as equipment replacement and transmission conditions cannot be obtained in real time, so it cannot be judged whether the quality of the collected power data is up to standard, and the production power consumption cannot be determined accurately. The quality of the power data is therefore the basis of power consumption analysis: the power data acquired by the system must first be checked for quality, data of unqualified quality must be identified, and the power data must be acquired in a timely manner. Because checking transmission problems, acquisition terminals and the like consumes a large amount of manpower and material resources, checking the quality of power data with machine learning methods is gradually becoming mainstream. Data quality anomaly detection with machine learning mostly adopts clustering-based outlier detection algorithms. However, this approach has two problems: 1) the volume of power data is large, and general clustering methods converge slowly; 2) power data is time-sequential, and existing methods cannot effectively extract its time-sequence characteristics. In addition, existing outlier detection methods suffer from low clustering efficiency and slow convergence caused by improper selection of the initial cluster centers.
Disclosure of Invention
In order to overcome the defects in the prior art, the invention provides a power data anomaly detection method and system based on LSTM and an improved K-means algorithm, which extract the time-sequence characteristics of the power data and efficiently identify abnormal power data when the volume of power data is large.
To achieve this purpose, the technical scheme adopted by the invention is as follows: a power data anomaly detection method includes: inputting the collected power sequence data of a user into a trained LSTM model, extracting the time-sequence characteristics of the power sequence data, and constructing a sample data set; and, based on an improved K-means algorithm, detecting abnormal data in the power sequence data by taking the constructed sample data set as input.
Further, inputting the collected power sequence data of the user into the trained LSTM model, extracting the time-sequence characteristics of the power sequence data, and constructing a sample data set includes: inputting the collected power sequence data of the user into the trained LSTM model to obtain a predicted value of the power sequence data; and comparing the predicted value with the true value of the power sequence data to obtain a difference value, which is used as the time-sequence characteristic describing the power data and from which the sample data set is constructed.
Further, in the trained LSTM model, the hidden layer at each time t receives, in addition to $x_t$ (the power sequence data at time t), the memory state $C_{t-1}$ of the hidden layer at time t-1; by processing these inputs it outputs $h_t$, the output value of the hidden layer at time t, and passes $C_t$, the memory state of the hidden layer at time t, to the hidden layer at the next time.
Further, in the trained LSTM model, a memory unit examines $h_{t-1}$ and $x_t$ through a forgetting gate, where $h_{t-1}$ is the output value of the hidden layer at time t-1, and outputs a number between 0 and 1 for each number in $C_{t-1}$, the memory state of the hidden layer at time t-1; 1 represents complete retention and 0 represents complete deletion. Specifically:
$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$ (2)
wherein $f_t$ is the value of the forgetting gate at time t, $\sigma$ is the sigmoid function, $W_f$ is the weight of the forgetting gate f, $b_f$ is the offset of the forgetting gate f, and $h_{t-1}$ is the output value of the hidden layer at time t-1;
a "memory gate" $i_t$, the state of the memory gate i at time t, is used to control the influence of the current data input on the state value of the memory unit; the tanh function is used to create a candidate value vector $\tilde{C}_t$ at time t, which is added into the state of the memory unit. The specific calculation steps are as follows:
$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$ (3)
$\tilde{C}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$ (4)
wherein $W_i$ is the update weight of the memory gate i, $b_i$ is the offset of the memory gate i, $W_c$ is the weight of the candidate value, $b_c$ is the offset of the candidate value, and $\tilde{C}_t$ is the candidate value vector at time t;
using the candidate value vector $\tilde{C}_t$ together with the state $C_{t-1}$ of the memory unit at the previous time, the state of the memory unit at the current time is updated:
$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$ (5)
the output of each memory unit is controlled by an output gate $o_t$, calculated as follows:
$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$ (6)
$h_t = o_t * \tanh(C_t)$ (7)
wherein $o_t$ is the value of the output gate at time t, $W_o$ is the weight of the output gate, $b_o$ is the offset of the output gate, and $h_t$ is the output of the hidden layer at time t.
Further, detecting abnormal data in the power sequence data by taking the constructed sample data set as input, based on the improved K-means algorithm, comprises the following steps: calculating the compactness of all data points in the sample data set, obtaining the data-dense region, and thereby determining the initial cluster centers; and calculating the Euclidean distances between all data points in the sample data set and each initial cluster center and dividing the data points into K clusters, wherein in each iteration a data point whose distance to the center of its own cluster is greater than the average distance has its abnormality count increased, and the data point is judged to be an abnormal data point when this count is greater than or equal to a set value, thereby detecting abnormal data in the power sequence data.
Further, determining the initial cluster centers includes: selecting, within the data-dense region, the data point with the highest compactness as the first initial cluster center, and then selecting the data point in that region farthest from the first initial cluster center as the second initial cluster center; each subsequent initial cluster center is the data point whose closest distance to the already selected initial cluster centers is the largest.
Further, the compactness of the data points is obtained by:
[Formula image BDA0002961727490000041 not reproduced in the text]
wherein $x_i$ represents the i-th data point in the sample set, $x_j$ represents the j-th data point, $D(x_i, x_j)$ denotes the distance between $x_i$ and $x_j$, and $G_t(x_i)$ is the set of the t nearest neighbor data points of $x_i$.
A power data anomaly detection system comprises: a first module for inputting the collected power sequence data of a user into a trained LSTM model, extracting the time-sequence characteristics of the power sequence data, and constructing a sample data set; and a second module for detecting abnormal data in the power sequence data by taking the constructed sample data set as input, based on the improved K-means algorithm.
Compared with the prior art, the invention has the following beneficial effects. The more valuable time-sequence characteristics of the power data are extracted effectively through an LSTM (long short-term memory neural network) model, which yields a predicted value of the power sequence data; the absolute value of the difference between the predicted value and the real power data is used as the time-sequence characteristic of the power data. Using this characteristic, outliers are found by an improved K-means algorithm suited to big data when the volume of power data is large. Fusing the LSTM with the improved K-means anomaly detection method effectively improves the efficiency and accuracy of data anomaly detection: the time-sequence characteristics of the power data are extracted, and outliers of unqualified quality are identified efficiently even when the volume of power data is large.
Drawings
FIG. 1 is a flow chart of abnormal power usage data detection in an embodiment of the present invention;
FIG. 2 is a diagram of the LSTM model architecture in an embodiment of the present invention;
FIG. 3 is a diagram of the neuron structure of the LSTM in the embodiment of the present invention;
FIG. 4 is a flowchart of the K-means clustering algorithm in the embodiment of the present invention;
FIG. 5 is a flow chart of the data anomaly detection algorithm based on the improved K-means in the embodiment of the invention.
Detailed Description
The invention is further described below with reference to the accompanying drawings. The following examples are only for illustrating the technical solutions of the present invention more clearly, and the protection scope of the present invention is not limited thereby.
The first embodiment is as follows:
As shown in fig. 1 to 5, a power data anomaly detection method includes: inputting the collected power sequence data of a user into a trained LSTM model, extracting the time-sequence characteristics of the power sequence data, and constructing a sample data set, which comprises: inputting the collected power sequence data of the user into the trained LSTM model to obtain a predicted value of the power sequence data; comparing the predicted value with the true value of the power sequence data to obtain a difference value that serves as the time-sequence characteristic of the power sequence data; and constructing a sample data set from this characteristic. Abnormal data in the power sequence data is then detected, based on an improved K-means algorithm, by taking the constructed sample data set as input.
First, an LSTM model is trained with the power consumption data of a user so as to capture the time-sequence characteristics of that data. Next, the trained LSTM model is used to predict the power consumption data, and the difference between the predicted value and the actual value is taken as the characteristic value of the user's power consumption. Finally, abnormal data in the user's power consumption data is detected with an outlier detection algorithm based on the improved K-means algorithm. In this way, abnormal power consumption data is detected using the time-sequence characteristics of the user's power consumption data even when the volume of that data is large.
The LSTM (long short-term memory neural network) is a recurrent neural network for processing time-series data; its structure is shown in fig. 2. A recurrent neural network can be understood as a repeated stack of several feed-forward neural networks with the same structure and parameters, where the number of repetitions equals the length of the input sequence. In a recurrent neural network, the input is $\{x_0, x_1, \ldots, x_n\}$, the output is $\{h_0, h_1, \ldots, h_n\}$, and the output of the hidden layer is denoted $\{C_0, C_1, \ldots, C_n\}$. The hidden layer A, i.e. a neuron node in the LSTM network, at each time t receives, in addition to $x_t$ (the power sequence data at time t), the memory state $C_{t-1}$ of the hidden layer at time t-1; by processing these inputs it outputs $h_t$, the output value of the hidden layer at time t, and passes $C_t$, the memory state of the hidden layer at time t, to the hidden layer at the next time, thereby influencing the processing of the information at the next time:
[Formula image BDA0002961727490000061, equation (1), not reproduced in the text]
wherein U represents the weight of the input layer, W represents the weight of the hidden layer, V represents the weight of the output layer, $\sigma$ represents the sigmoid function, $b_i$ represents the offset of $C_t$, and $b_o$ represents the offset of $h_t$.
In a recurrent neural network formed in this way, the hidden-to-hidden weight W is the "memory controller" of the whole network: the weights connecting the hidden layers represent the influence of past information on the information at the current time, so that historical memory information is scheduled and the time-sequence information of the input sequence is "memorized" and "understood".
However, the user power consumption data to be processed may contain a large amount of history information, that is, the input sequence of user power consumption data may be long. To handle the case where the amount of history information is large, the input time-series data is long, and important information would otherwise be lost, the LSTM adopts a "gate" structure to decide which information to delete and which to memorize; the memory unit of the LSTM is shown in fig. 3. The first step in the hidden layer A is to decide what information to discard from the cell state. This decision is made by a sigmoid layer called the "forgetting gate", which looks at $h_{t-1}$ (the previous output) and $x_t$ (the current input) and outputs a number between 0 and 1 for each number in the cell state $C_{t-1}$ (the previous state); 1 represents complete retention and 0 represents complete deletion. This selective memorizing and forgetting lets the LSTM avoid the problem of information explosion and understand the information better. The forgetting gate processes the data as follows:
$f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)$ (2)
where $\sigma$ denotes the sigmoid function, $W_f$ denotes the forgetting gate weight, $b_f$ denotes the forgetting gate offset, $x_t$ denotes the data input at time t, and $h_{t-1}$ denotes the output value of hidden layer A at time t-1.
After determining the information to forget, the LSTM needs to decide what information to store. This part is divided into two steps: first, a "memory gate" $i_t$ is used to control the influence of the current data input on the state value of the memory unit; second, the tanh layer is used to create a new candidate value vector $\tilde{C}_t$, which is added to the state of the memory unit. The specific calculation steps are as follows:
$i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)$ (3)
$\tilde{C}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)$ (4)
wherein $W_i$ is the update weight of the memory gate i, $b_i$ is the offset of the memory gate i, $W_c$ is the weight of the candidate value, $b_c$ is the offset of the candidate value, and $\tilde{C}_t$ is the candidate value vector at time t;
after determining the information to forget and the information to remember, the candidate value vector $\tilde{C}_t$ is used together with the state $C_{t-1}$ of the memory unit at the previous time to update the state of the memory unit at the current time:
$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$ (5)
the output of each memory unit is controlled by an output gate $o_t$, calculated as follows:
$o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)$ (6)
$h_t = o_t * \tanh(C_t)$ (7)
wherein $o_t$ is the value of the output gate at time t, $W_o$ is the weight of the output gate, $b_o$ is the offset of the output gate, and $h_t$ is the output of the hidden layer at time t.
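The gate computations described above can be sketched as a single forward step in NumPy. This is a minimal illustration, not the patented implementation: the function names `lstm_step` and `init_params`, the parameter dictionary, and the random initialisation are hypothetical, and the candidate vector and state update use the standard LSTM forms (tanh candidate, forget-gated previous state plus memory-gated candidate), which is an assumption where the source formulas are images.

```python
import numpy as np

def sigmoid(z):
    """Logistic function used by the forgetting, memory and output gates."""
    return 1.0 / (1.0 + np.exp(-z))

def init_params(input_dim, hidden_dim, seed=0):
    """Hypothetical helper: small random weights, zero offsets."""
    rng = np.random.default_rng(seed)
    params = {}
    for gate in ("f", "i", "c", "o"):
        params[f"W_{gate}"] = rng.normal(0.0, 0.1, (hidden_dim, hidden_dim + input_dim))
        params[f"b_{gate}"] = np.zeros(hidden_dim)
    return params

def lstm_step(x_t, h_prev, c_prev, params):
    """One time step: returns (h_t, C_t) from x_t, h_{t-1}, C_{t-1}."""
    z = np.concatenate([h_prev, x_t])                      # [h_{t-1}, x_t]
    f_t = sigmoid(params["W_f"] @ z + params["b_f"])       # forgetting gate, eq. (2)
    i_t = sigmoid(params["W_i"] @ z + params["b_i"])       # memory gate, eq. (3)
    c_tilde = np.tanh(params["W_c"] @ z + params["b_c"])   # candidate vector (assumed standard form)
    c_t = f_t * c_prev + i_t * c_tilde                     # state update (assumed standard form)
    o_t = sigmoid(params["W_o"] @ z + params["b_o"])       # output gate, eq. (6)
    h_t = o_t * np.tanh(c_t)                               # hidden output, eq. (7)
    return h_t, c_t

# Usage: run a short power sequence through the cell, one reading per step.
params = init_params(input_dim=1, hidden_dim=4)
h, c = np.zeros(4), np.zeros(4)
for reading in [10.0, 10.5, 11.0]:
    h, c = lstm_step(np.array([reading]), h, c, params)
```

Because the output gate lies in (0, 1) and tanh lies in (-1, 1), every component of `h` stays strictly inside (-1, 1), so a real predictor would typically add an output layer and scale the readings first.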
The LSTM updates the weights of all layers by gradient descent so as to minimize the value of the cost function.
After the user's power consumption sequence data $\{x_0, x_1, \ldots, x_n\}$ is input into the LSTM, the predicted output $\{h_0, h_1, \ldots, h_n\}$ of the user's power consumption data is obtained. The difference between this output and the real power consumption data is computed and used as the feature vector of the outlier detection algorithm, from which the sample data set is constructed. Abnormal data in the user's power consumption is then detected from the extracted feature vector with an outlier detection algorithm based on the improved K-means algorithm.
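As an illustration of how the sample data set might be built from predictions, the sketch below takes the absolute difference between predicted and real values, one feature per time step. The function name `residual_features` and the example numbers are hypothetical; using the absolute value follows the beneficial-effects paragraph of this document, and the (n, 1) layout is an assumption made so the result can be fed to a clustering routine.

```python
import numpy as np

def residual_features(actual, predicted):
    """Absolute prediction error per time step, shaped (n, 1) for clustering."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if actual.shape != predicted.shape:
        raise ValueError("actual and predicted must have the same shape")
    return np.abs(predicted - actual).reshape(-1, 1)

# Illustrative values only: the last reading deviates strongly from its prediction.
actual = [10.0, 10.5, 11.0, 30.0]
predicted = [10.2, 10.4, 10.9, 11.1]
features = residual_features(actual, predicted)
```

A large residual marks a reading that the model could not anticipate from the preceding sequence, which is what makes it a candidate outlier for the clustering stage.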
The purpose of the K-means clustering algorithm is to partition an unlabeled data set
[Formula image BDA0002961727490000081 not reproduced in the text]
into K classes. The steps are shown in fig. 4:
1. Randomly select K sample points $\mu_i$ from the samples to serve as the center points of the clusters;
2. Calculate the distance between every sample point and the center of each cluster, and assign each sample point to the nearest cluster. The distance is calculated as follows:
$D = \|x - \mu_i\|_2$ (8)
where x is a sample point and $\mu_i$ is the center point of cluster $C_i$;
3. Recalculate each cluster center from the sample points currently in the cluster:
[Formula image BDA0002961727490000082 not reproduced in the text]
where $\mu_i$ is the center point of cluster $C_i$:
[Formula image BDA0002961727490000083 not reproduced in the text]
4. Repeat steps 2 and 3.
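The four steps above can be sketched as follows. This is a plain K-means illustration under stated assumptions: the function name `kmeans`, the `seed` and `n_iter` parameters, the stopping test, and the use of the arithmetic mean as the recomputed center are choices made for this example rather than details taken from the source formulas.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Basic K-means: random initial centers, nearest-center assignment, mean update."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Step 1: pick k distinct sample points as the initial centers.
    centers = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: squared Euclidean distance of every point to every center;
        # squaring does not change which center is nearest.
        d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Step 3: recompute each center as the mean of its members
        # (an empty cluster keeps its previous center).
        new_centers = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        # Step 4: repeat until the centers stop moving.
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return centers, labels

# Usage on two well-separated groups of points.
X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
centers, labels = kmeans(X, k=2)
```

On well-separated data like this the random start rarely matters; the dependence on initial centers becomes visible when clusters overlap or outliers are drawn as starting points.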
As an unsupervised partitioning clustering algorithm, K-means is widely used in the field of anomaly detection because it is efficient and simple. However, because the algorithm selects its initial cluster centers randomly, the clustering result is subject to uncertainty. When the algorithm starts iterating, the K initial cluster centers are chosen at random with no fixed rule, and different starting points lead to different search paths.
The clustering result therefore depends heavily on the initial cluster centers, so the final clustering tends to fall into a local optimum rather than the global optimum. As shown in fig. 1, if the selected initial cluster centers are close to the real cluster centers, the clustering result is objective and accurate; as shown in fig. 2, if the randomly selected initial cluster centers contain outliers, the final clustering result will have a large error.
Outliers also have a significant impact on the clustering result. In each iteration the algorithm recomputes the cluster centers from the feature attributes of all data points, so the presence of outliers inevitably disturbs the cluster centers and affects the clustering result.
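A small numeric illustration of this sensitivity, assuming the cluster center is the arithmetic mean of its members; the helper name `center_shift` and the values are hypothetical.

```python
import numpy as np

def center_shift(cluster, outlier):
    """Return the cluster mean without and with one extra outlying value."""
    cluster = np.asarray(cluster, dtype=float)
    clean_center = cluster.mean()
    shifted_center = np.append(cluster, outlier).mean()
    return clean_center, shifted_center

# Three ordinary residuals and one extreme residual.
clean_center, shifted_center = center_shift([0.1, 0.2, 0.3], 18.9)
```

Here the center moves from about 0.2 to about 4.9, far from every ordinary member, which is why the method removes detected outliers before recomputing centers.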
Therefore, this embodiment detects abnormal data in the power sequence data by taking the constructed sample data set as input, based on an improved K-means algorithm, which includes: calculating the compactness of all data points in the sample data set, obtaining the data-dense region, and thereby determining the initial cluster centers; and calculating the Euclidean distances between all data points in the sample data set and each initial cluster center and dividing the data points into K clusters, wherein in each iteration a data point whose distance to the center of its own cluster is greater than the average distance has its abnormality count increased, and the data point is judged to be an abnormal data point when this count is greater than or equal to a set value, thereby detecting abnormal data in the power sequence data. The embodiment changes the way the initial cluster centers are selected. Starting from the properties of an optimal cluster center, it removes the outlier regions and selects the initial cluster centers within the data-dense region according to the farthest-distance rule, which optimizes the initialization so that the algorithm has more reasonable initial cluster centers before iteration begins. A corresponding anomaly detection algorithm is built on this basis.
The principles for selecting the initial points in the improved K-means algorithm are as follows:
(1) Avoid selecting outliers. Satisfying this principle prevents the algorithm from starting off in error and makes its result more accurate;
(2) Select initial cluster centers that are evenly distributed over the high-density regions. The true cluster centers should lie where the data is densest and at some distance from each other. If the initial cluster centers are chosen close to the real cluster centers, the number of iterations is reduced, convergence is accelerated, and the accuracy of the clustering algorithm is improved.
Following these two principles, the K-means algorithm is improved as follows. The compactness of all data points in the data set is calculated first, and sparse data regions are removed to obtain a set of data points with high compactness, because sparse regions are both far from the optimal cluster centers and likely to contain outliers. Within the data-dense region, the data point with the highest compactness is selected as the first initial cluster center; the data point in that region farthest from the first initial cluster center is then selected as the second initial cluster center. Each subsequent initial cluster center is the data point whose closest distance to the already selected initial cluster centers is the largest, which ensures that the initial cluster centers are evenly distributed. The improved algorithm for selecting the initial cluster centers is described in detail below.
The initialization process optimization algorithm comprises the following steps:
1. For each data point $x_i$ in the spatial data set X, calculate its compactness $Tigh(x_i)$:
[Formula images BDA0002961727490000101 and BDA0002961727490000102 not reproduced in the text]
wherein $x_i$ represents the i-th data point in the sample set, $x_j$ represents the j-th data point, $D(x_i, x_j)$ denotes the distance between $x_i$ and $x_j$, and $G_t(x_i)$ is the set of the t nearest neighbor data points of $x_i$;
2. Delete from X all data points whose compactness satisfies the condition
[Formula image BDA0002961727490000103 not reproduced in the text]
to obtain the dense data point set X';
3. In X', take the data point x with the highest compactness, $Tigh_{\max}(x)$, as the first initial cluster center $c_1$; take the data point farthest from $c_1$ as the second initial cluster center $c_2$; the m-th ($3 \le m \le K$) initial cluster center $c_m$ is the data point $x_i \in X'$ satisfying $\max\big(D_{\min}(x_i, c_1), D_{\min}(x_i, c_2), \ldots, D_{\min}(x_i, c_{m-1})\big)$, $i = 1, 2, \ldots, n$, that is, the point whose closest distance to the already selected centers is the largest. This is repeated until the final K initial cluster centers are obtained.
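The three initialization steps can be sketched as below. Two things here are assumptions, because the corresponding formulas are images that are not reproduced in this text: compactness is taken as the inverse of the mean distance to the t nearest neighbors, and a point is treated as sparse when its compactness is below the mean compactness of the data set. The function names are hypothetical.

```python
import numpy as np

def compactness(X, t):
    """ASSUMED definition: inverse mean distance to the t nearest neighbors."""
    diff = X[:, None, :] - X[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=2))
    nearest = np.sort(dist, axis=1)[:, 1:t + 1]   # column 0 is the self-distance 0
    return 1.0 / (nearest.mean(axis=1) + 1e-12)

def select_initial_centers(X, k, t):
    """Pick k initial centers inside the dense region, spread far apart."""
    X = np.asarray(X, dtype=float)
    tigh = compactness(X, t)
    keep = tigh >= tigh.mean()                    # ASSUMED sparsity threshold
    dense, dense_tigh = X[keep], tigh[keep]
    if len(dense) < k:
        raise ValueError("dense region holds fewer than k points")
    # First center: the densest point.
    centers = [dense[dense_tigh.argmax()]]
    # Later centers: the point whose closest already-chosen center is farthest.
    # With one center chosen this is simply the point farthest from it.
    while len(centers) < k:
        d = np.linalg.norm(dense[:, None, :] - np.array(centers)[None, :, :], axis=2)
        centers.append(dense[d.min(axis=1).argmax()])
    return np.array(centers)

# Usage: two square groups plus one isolated point that must not become a center.
X = [[0, 0], [0, 1], [1, 0], [1, 1],
     [10, 10], [10, 11], [11, 10], [11, 11],
     [50, 50]]
init_centers = select_initial_centers(X, k=2, t=2)
```

In this example the isolated point has very low compactness and is dropped before selection, so the two chosen centers fall in the two dense groups.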
In this embodiment, first, outliers are excluded from the initial centers, which ensures that the starting point of the iteration does not deviate far from the centers of the real clusters; second, the compactness of the data points is the main basis for selecting the initial centers, which matches the characteristics of an optimal cluster center; finally, the principle of the largest closest distance ensures that the initial cluster centers are evenly distributed.
Because of the nature of the K-means algorithm, if outliers take part in computing the cluster centers in an iteration, they bias the clustering result. An abnormal point detection algorithm can therefore be built on this sensitivity of K-means to outliers, detecting and eliminating abnormal points during the iterations of the algorithm.
The algorithm is as follows:
inputting: d-dimensional data set
Figure BDA0002961727490000111
And finally, clustering number K, clustering function convergence precision epsilon and nearest neighbor number t.
And (3) outputting: k clustered cluster centers C ═ { C ═ C1,c2,...,cKAnd h, a class cluster label L to which the data xi belongs, and an abnormal point set U.
The method comprises the following steps:
1. setting initial clustering criterion function value J00, initial degree of abnormality Abn for each data point x in the datasetx=0;
2. For RdData set in space
Figure BDA0002961727490000112
Each data x in (2)iCalculating the tightness;
3. delete all compactities in X
Figure BDA0002961727490000113
Obtaining a dense data point set X';
4. in X', the one with the highest compactness, i.e. Tighmax(x) X as the first initial cluster center c1(ii) a Distance c1The farthest data point is taken as the second initial cluster center c2(ii) a M (3. ltoreq. m. ltoreq.K) th initial cluster center cmIs a data point x satisfying the following conditioni,xi∈X':max(Dmin(xi,c1),Dmin(xi,c2),...,Dmin(xi,cm-1) I ═ 1, 2.. times, n, until the final K initial cluster centers are obtained, representing K clusters w respectivelyj,j=1,2,...,K;
5. Calculating Euclidean distances between all data points in the X and each clustering center:
Figure BDA0002961727490000121
where i is 1,2, 3, …, m and j is 1,2, …, K. For data point x, if cjSuch that D (x, c)j)=minD(x,cj) J 1,2, K, then point x is divided into cjThe cluster represented, i.e. Lx=wj
6. If the distance between the data point x belonging to the cluster and the cluster center is larger than the average distance in the formed K clusters, namely
Figure BDA0002961727490000122
Wherein m isjIs cjRepresenting the total number of data points owned by the cluster, Abnx++;
7、AbnxIf the number X is more than or equal to 3, judging that X is an abnormal point, removing the abnormal point from the data set X, and merging the abnormal point into U;
8. Evaluate the clustering criterion function J (Figure BDA0002961727490000123) and check whether the convergence condition $|J' - J| \le \varepsilon$ is met, where J is the criterion function value of the previous iteration and J' is that of the current iteration. If it is not met, continue with step 9; if it is met, end the algorithm and output C, L and U;
9. Recalculate the center of each cluster:
$c_j = \frac{1}{m_j} \sum_{x_i \in w_j} x_i$
where $m_j$ is the total number of data points in the cluster represented by $c_j$; then go to step 5.
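The iterative part of the procedure (steps 5 to 9) can be sketched in Python with NumPy. This is an illustrative reading, not the patented implementation: the function name `improved_kmeans_iterate`, the `max_iter` safety cap, and the use of the sum of squared distances as the criterion function J are assumptions, since the formula for J is only given as a figure. The average distance in step 6 is taken per cluster, which is how the definition of $m_j$ reads.

```python
import numpy as np

def improved_kmeans_iterate(X, centers, eps=1e-4, abn_limit=3, max_iter=100):
    """Steps 5-9: assign, score abnormality, drop outliers, test convergence, update."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float).copy()
    active = np.ones(len(X), dtype=bool)   # False once a point has been moved to U
    abn = np.zeros(len(X), dtype=int)      # Abn_x counters, all 0 as in step 1
    labels = np.full(len(X), -1)
    J_prev = 0.0                           # J_0 = 0 as in step 1
    for _ in range(max_iter):              # max_iter is a safety cap (assumption)
        # Step 5: Euclidean distance to every center; the nearest center wins.
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dist.argmin(axis=1)
        own = dist[np.arange(len(X)), labels]
        # Step 6: flag points farther from their center than the cluster average.
        for j in range(len(centers)):
            members = active & (labels == j)
            if members.any():
                far = members & (own > own[members].mean())
                abn[far] += 1
        # Step 7: points flagged abn_limit times leave X and join U.
        active &= abn < abn_limit
        # Step 8: criterion function (assumed: sum of squared distances).
        J = float((own[active] ** 2).sum())
        if abs(J - J_prev) <= eps:
            break
        J_prev = J
        # Step 9: each center becomes the mean of its remaining members.
        for j in range(len(centers)):
            members = active & (labels == j)
            if members.any():
                centers[j] = X[members].mean(axis=0)
    labels = np.where(active, labels, -1)  # -1 marks points placed in U
    return centers, labels, np.flatnonzero(~active)
```

The function returns the cluster centers C, the labels L (with -1 for removed points) and the indices of the points in U. Because step 6 flags every point that lies beyond its cluster's average distance, the threshold of 3 in step 7 is what keeps ordinary points from being removed too early.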
The difference values are analyzed with this algorithm to obtain the abnormal points; combining a time-series algorithm with a clustering algorithm makes the monitoring of abnormal points efficient.
The method uses an LSTM (long short-term memory) neural network model to extract the more valuable time-series characteristics of the power data and thereby obtain a predicted value of the power sequence data. The absolute value of the difference between the predicted value and the real power data is taken as the time-series feature of the power data. Using this feature, an improved K-means algorithm suited to big data finds the outliers when the volume of power data is large. Fusing the LSTM with the improved K-means algorithm in this way can effectively improve the efficiency and accuracy of data anomaly detection: while extracting the time-series features of the power data, the method can efficiently identify outliers of unqualified quality even when the data volume is large.
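As a small illustration of the feature described above, the following Python sketch turns real and predicted power values into the absolute-difference feature that is fed to the clustering stage. The function name `residual_features` and the one-row-per-time-step layout are assumptions made for this example; the predicted values are taken as given, for instance from a trained LSTM model.

```python
import numpy as np

def residual_features(actual, predicted):
    """Return |predicted - actual| as a 2-D sample set, one row per time step."""
    actual = np.asarray(actual, dtype=float)
    predicted = np.asarray(predicted, dtype=float)
    if actual.shape != predicted.shape:
        raise ValueError("actual and predicted must have the same shape")
    # Clustering code usually expects shape (n_samples, n_features).
    return np.abs(predicted - actual).reshape(len(actual), -1)

# Example: three time steps of real and predicted power values.
features = residual_features([1.0, 2.0, 5.0], [1.5, 2.0, 3.0])
```

A large entry in `features` means the model's prediction and the measurement disagree strongly at that time step, which is what the clustering stage then evaluates.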
Embodiment two:
Based on the power data anomaly detection method of the first embodiment, this embodiment provides a power data anomaly detection system, comprising:
a first module, used for inputting the collected power sequence data of the user into a trained LSTM model, extracting the time-series characteristics of the power sequence data, and constructing a sample data set;
and a second module, used for detecting abnormal data in the power sequence data based on the improved K-means algorithm, taking the constructed sample data set as input.
The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and variations without departing from the technical principle of the present invention, and these modifications and variations should also be regarded as falling within the protection scope of the present invention.

Claims (8)

1. A power data anomaly detection method, characterized by comprising the following steps:
inputting the collected power sequence data of the user into a trained LSTM model, extracting the time sequence characteristics of the power sequence data, and constructing a sample data set;
and detecting abnormal data in the power sequence data by taking the constructed sample data set as input based on an improved K-means algorithm.
2. The method for detecting the power data abnormality according to claim 1, wherein the step of inputting the collected power sequence data of the user into a trained LSTM model, extracting a time-series characteristic of the power sequence data, and constructing a sample data set includes:
inputting the collected power sequence data of the user into a trained LSTM model to obtain a predicted value of the power sequence data;
and comparing the predicted value and the true value of the power sequence data to obtain a difference value, wherein the difference value is used as a time sequence characteristic of the power sequence data to describe the power data and construct a sample data set.
3. The method as claimed in claim 1, wherein in the trained LSTM model, the hidden layer at each time t receives, in addition to $x_t$, the power sequence data at time t, also $C_{t-1}$, the memory state of the hidden layer at time t-1; by processing these inputs it outputs $h_t$, the output value of the hidden layer at time t, and passes $C_t$, the memory state of the hidden layer at time t, to the hidden layer at the next time.
4. The method for detecting the abnormality of the electric power data as claimed in claim 1, wherein in the trained LSTM model, the memory unit uses a forget gate to examine $h_{t-1}$ and $x_t$, where $h_{t-1}$ is the output value of the hidden layer at time t-1, and outputs a number between 0 and 1 for each number in $C_{t-1}$, where $C_{t-1}$ is the memory state of the hidden layer at time t-1; 1 represents complete retention and 0 represents complete deletion; specifically:
$f_t = \sigma(W_f[h_{t-1}, x_t] + b_f)$ (2)
wherein $f_t$ is the value of the forget gate at time t, σ is the sigmoid function, $W_f$ is the weight of the forget gate f, $b_f$ is the bias of the forget gate f, and $h_{t-1}$ is the output value of the hidden layer at time t-1;
using a "memory gate" $i_t$ to control the influence of the current data input on the state value of the memory cell, where $i_t$ denotes the state of the memory gate i at time t; using the tanh function to create $\tilde{C}_t$, the candidate value vector at time t, which is added into the state of the memory unit; the specific calculation steps are as follows:
$i_t = \sigma(W_i[h_{t-1}, x_t] + b_i)$ (3)
$\tilde{C}_t = \tanh(W_c[h_{t-1}, x_t] + b_c)$ (4)
wherein $W_i$ is the update weight of the memory gate i, $b_i$ is the bias of the memory gate i, $W_c$ is the weight for updating the candidate value, $b_c$ is the bias for updating the candidate value, and $\tilde{C}_t$ is the candidate value vector at time t;
using the candidate value vector $\tilde{C}_t$ together with the state $C_{t-1}$ of the memory cell at the previous time to update the state of the memory cell at the current time:
$C_t = f_t * C_{t-1} + i_t * \tilde{C}_t$ (5)
the output of each memory cell is controlled by an output gate $o_t$, calculated as follows:
$o_t = \sigma(W_o[h_{t-1}, x_t] + b_o)$ (6)
$h_t = o_t * \tanh(C_t)$ (7)
wherein $o_t$ is the value of the output gate at time t, $W_o$ is the weight for updating the output value, $b_o$ is the bias for updating the output value, and $h_t$ is the output of the hidden layer at time t.
5. The method for detecting the abnormal data of the electric power data according to claim 1, wherein the detecting the abnormal data in the electric power sequence data by taking the constructed sample data set as an input based on the improved K-means algorithm comprises:
calculating the compactness of all data points in the sample data set, acquiring a data dense area, and further determining an initial clustering center;
and calculating the Euclidean distances between all data points in the sample data set and each initial cluster center and dividing the data points into K clusters; if the distance between a data point and the center of the cluster it belongs to is greater than the average distance, continuing the iteration; and judging the data point to be an abnormal data point when the number of iterations is greater than or equal to a set value, thereby detecting the abnormal data in the power sequence data.
6. The method according to claim 5, wherein the determining an initial clustering center includes: selecting the data point with the highest compactness in the data dense area as the first initial cluster center, and then selecting the data point in that area farthest from the first initial cluster center as the second initial cluster center; each subsequent initial cluster center is then selected as the data point whose closest distance to the already selected initial cluster centers is largest.
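The selection rule in this claim can be sketched in Python as follows. This is an illustrative reading under stated assumptions: the function name `select_initial_centers` is hypothetical, the compactness values are passed in as a precomputed array, and `X_dense` stands for the points of the data dense area.

```python
import numpy as np

def select_initial_centers(X_dense, compactness, K):
    """Pick K initial centers: highest compactness first, then farthest-first."""
    X_dense = np.asarray(X_dense, dtype=float)
    compactness = np.asarray(compactness, dtype=float)
    if not 1 <= K <= len(X_dense):
        raise ValueError("K must be between 1 and the number of dense points")
    chosen = [int(compactness.argmax())]          # first center: highest compactness
    # For every point, the distance to its closest already-selected center.
    nearest = np.linalg.norm(X_dense - X_dense[chosen[0]], axis=1)
    while len(chosen) < K:
        nxt = int(nearest.argmax())               # largest closest-distance wins
        chosen.append(nxt)
        d_new = np.linalg.norm(X_dense - X_dense[nxt], axis=1)
        nearest = np.minimum(nearest, d_new)
    return X_dense[chosen]
```

With only one center selected, the largest closest-distance is simply the point farthest from that center, so the same loop covers the second center and all later ones.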
7. The method for detecting an abnormality in electric power data according to claim 5, wherein the compactness of the data points is obtained by:
Figure FDA0002961727480000031
wherein $x_i$ represents the i-th data point in the sample set, $x_j$ represents the j-th data point, $D(x_i, x_j)$ represents the distance between $x_i$ and $x_j$, and $G_t(x_i)$ is the set of the t nearest neighbor data points of $x_i$.
8. An electric power data abnormality detection system characterized by comprising:
the first module is used for inputting the collected power sequence data of the user into a trained LSTM model, extracting the time sequence characteristics of the power sequence data and constructing a sample data set;
and the second module is used for detecting abnormal data in the power sequence data by taking the constructed sample data set as input based on the improved K-means algorithm.
CN202110239950.XA 2021-03-04 2021-03-04 Electric power data anomaly detection method and system based on LSTM and improved K-means algorithm Active CN113010504B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110239950.XA CN113010504B (en) 2021-03-04 2021-03-04 Electric power data anomaly detection method and system based on LSTM and improved K-means algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110239950.XA CN113010504B (en) 2021-03-04 2021-03-04 Electric power data anomaly detection method and system based on LSTM and improved K-means algorithm

Publications (2)

Publication Number Publication Date
CN113010504A true CN113010504A (en) 2021-06-22
CN113010504B CN113010504B (en) 2022-06-10

Family

ID=76405160

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110239950.XA Active CN113010504B (en) 2021-03-04 2021-03-04 Electric power data anomaly detection method and system based on LSTM and improved K-means algorithm

Country Status (1)

Country Link
CN (1) CN113010504B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109302410A (en) * 2018-11-01 2019-02-01 桂林电子科技大学 A kind of internal user anomaly detection method, system and computer storage medium
CN110334726A (en) * 2019-04-24 2019-10-15 华北电力大学 A kind of identification of the electric load abnormal data based on Density Clustering and LSTM and restorative procedure
CN110569925A (en) * 2019-09-18 2019-12-13 南京领智数据科技有限公司 LSTM-based time sequence abnormity detection method applied to electric power equipment operation detection

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113780440A (en) * 2021-09-15 2021-12-10 江苏方天电力技术有限公司 Low-voltage station area phase identification method for improving data disturbance resistance
CN115834424A (en) * 2022-10-09 2023-03-21 国网甘肃省电力公司临夏供电公司 Method for identifying and correcting abnormal data of line loss of power distribution network
CN115834424B (en) * 2022-10-09 2023-11-21 国网甘肃省电力公司临夏供电公司 Identification and correction method for abnormal data of power distribution network line loss
CN117371996A (en) * 2023-12-06 2024-01-09 北京中能亿信软件有限公司 Electric power communication analysis method based on big data
CN117371996B (en) * 2023-12-06 2024-03-19 北京中能亿信软件有限公司 Electric power communication analysis method based on big data

Also Published As

Publication number Publication date
CN113010504B (en) 2022-06-10

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant