CN117633688A - Large-scale power data anomaly detection method based on ridge regression-k-means clustering-LOF-LSTM fusion algorithm - Google Patents

Large-scale power data anomaly detection method based on ridge regression-k-means clustering-LOF-LSTM fusion algorithm

Info

Publication number
CN117633688A
CN117633688A
Authority
CN
China
Prior art keywords
power data
data
model
lof
lstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311635028.8A
Other languages
Chinese (zh)
Inventor
季一木
李海天
李玲娟
刘尚东
徐驰
万玲莉
李昆珈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202311635028.8A
Publication of CN117633688A
Legal status: Pending

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/10 Pre-processing; Data cleansing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention belongs to the technical field of power data anomaly detection and discloses a large-scale power data anomaly detection method based on a ridge regression-k-means clustering-LOF-LSTM fusion algorithm, which specifically comprises the following steps: before anomaly detection of the power data, the mean and variance characteristics of the original large-scale power data are analyzed through an OVO SVMs model, the original power data are divided into four types, namely linear trend type, stable type, periodic type and random type, and for the different types a ridge regression, k-means, LOF and LSTM fusion algorithm is constructed to perform anomaly detection. The method can realize rapid division of large-scale power data, can effectively avoid the problem that a single anomaly detection algorithm cannot detect all power data, and improves the accuracy and efficiency of large-scale power data anomaly detection.

Description

Large-scale power data anomaly detection method based on ridge regression-k-means clustering-LOF-LSTM fusion algorithm
Technical Field
The invention belongs to the technical field of power data anomaly detection, and particularly relates to a large-scale power data anomaly detection method based on a ridge regression-k-means clustering-LOF-LSTM fusion algorithm.
Background
With the continuous expansion of power systems, the generation and accumulation of large-scale power data has shown a trend of explosive growth. The data comprise various key information in the operation process of the power system, and each link from power generation to power transmission to power distribution comprises various parameters such as voltage, current, frequency, load and the like. However, as the complexity of the power system increases, abnormal situations in the power data are increasingly frequent due to factors such as equipment failure, human misoperation, external environment change, etc., and efficient detection and diagnosis of abnormalities in the power data become important tasks for ensuring stable operation and power supply quality of the power system. Different power data types may have different characteristics, distributions, and anomaly patterns, and applying a single algorithm to all types of power data may be difficult or even inefficient.
At present, large-scale power data anomaly detection methods can be classified, according to their working principles, into anomaly detection algorithms based on regression models, anomaly detection algorithms based on clustering, and anomaly detection algorithms based on deep learning.
An anomaly detection algorithm based on a regression model detects anomalies by building a regression model of time series data, predicting the values of data points, and comparing the differences between the predicted values and the actual values. An anomaly is typically a large deviation between a predicted value and an actual value. This approach is applicable to time series data that have trending or periodic patterns, as regression models can capture these patterns. Common methods include linear regression, ARIMA (autoregressive integrated moving average model), ridge regression, and the like.
The cluster-based anomaly detection algorithm divides the time series data into different clusters and then detects anomalies by measuring the distance or similarity between the data points and the clusters to which they belong. Outliers may be points that are far from other data points or that do not fit the primary distribution within a cluster. This method is particularly applicable to time series data having similar patterns over certain time periods. Common methods are K-means clustering algorithm, DBSCAN algorithm and the like.
Deep learning based anomaly detection algorithms utilize deep neural networks to capture complex relationships and patterns in time series data. They can automatically learn the feature representations from the data and identify anomalies that do not match the expected patterns. This approach is particularly applicable to time series data that have highly complex and abstract features, as deep learning models are able to handle more complex data patterns. Common approaches include anomaly detection models based on Recurrent Neural Networks (RNNs), long short-term memory networks (LSTM), convolutional Neural Networks (CNNs), and the like.
The diversity of power data leads to different data distributions and variation patterns across types, and it is difficult to apply a single anomaly detection algorithm to all power data types; for example, an anomaly detection algorithm based on a regression model detects linear trend type power data well but performs poorly on random type power data.
Disclosure of Invention
In order to solve the problems in the prior art that large-scale power data are difficult to classify and that a single anomaly detection algorithm cannot detect all power data, the invention provides a large-scale power data anomaly detection method based on a ridge regression-k-means clustering-LOF-LSTM fusion algorithm. The method can realize rapid division of large-scale power data, can effectively avoid the problem that a single anomaly detection algorithm cannot detect all power data, and improves the accuracy and efficiency of large-scale power data anomaly detection.
In order to achieve the above purpose, the invention is realized by the following technical scheme:
the invention discloses a large-scale power data anomaly detection method based on a ridge regression-k-means clustering-LOF-LSTM fusion algorithm, which comprises the following steps of:
step 1: preprocessing the original large-scale power data, extracting the mean value and variance characteristics of the power data every day, and dividing the power data after the characteristic extraction into training set data and testing set data;
step 2: constructing an OVO SVMs model, obtaining a classification result by the OVO SVMs model by adopting the training set obtained in the step 1, comparing the classification result with the classification result of the test set, and obtaining a final OVO SVMs model after parameter optimization of the OVO SVMs model;
step 3: classifying the power data through a final OVO SVMs model;
step 4: building a ridge regression, k-means, LOF and LSTM fusion algorithm;
step 5: abnormality detection is carried out on the electric power data classified in the step 3 by adopting the ridge regression, k-means, LOF and LSTM fusion algorithm constructed in the step 4
Further, in step 1, the original large-scale power data are preprocessed, the daily mean and variance characteristics of the power data are extracted, and the feature-extracted power data are divided into training set data and test set data, specifically:
step 1.1: preprocessing original large-scale power data, wherein the preprocessing comprises missing value processing, outlier processing, data normalization and dimension reduction;
step 1.2: calculating the mean and variance characteristics of the power data according to the date;
step 1.3: and (3) the power data after the feature extraction is processed according to 7: the scale of 3 is divided into training set data and test set data.
Further, in step 2, an OVO SVMs model is constructed, the training set obtained in step 1 is adopted to obtain a classification result by the OVO SVMs model, the classification result is compared with the classification result of the test set, and after parameter optimization is performed on the OVO SVMs model, a final OVO SVMs model is obtained, specifically:
step 2.1: an SVM classification model is designed between any two kinds of power data samples in four kinds of linear trend type, stable type, periodic type and random type, and the objective function of the SVM classification model is as follows:
wherein,is the normal vector of the hyperplane, representing the interface in the sample space, C is a regularization parameter that controls the degree of penalty in classification errors, ζ i Is a relaxation variable indicating the allowed error level for each sample,/->Class label of the i-th sample, +.>Is a feature vector of a sample, which contains feature information extracted from the data, and b is a deviation, which represents the distance between the hyperplane and the origin.
The decision function of the SVM classification model is:
where f (x) is the output of the decision function, alpha, for predicting the class of data points x i Is the coefficient obtained in the training process of the support vector machine,class label of the i-th sample, +.>Is the value of the kernel function, representing the feature vector +.>And the inner product of the data points x to be predicted after mapping to the high-dimensional feature space.
Step 2.2: the method comprises the steps of training a corresponding SVM classification model by a test set consisting of any two kinds of power data of linear trend type, stable type, periodic type and random type, testing the SVM classification model on the test set, and obtaining the final classification result with the highest vote number in the classification results in a voting mode by the test result.
Further, in step 3, the power data is classified by the final OVO SVMs model, specifically:
step 3.1: the unclassified power data are classified into preset categories, namely linear trend type, stable type, random type and periodic type by using the trained OVO SVMs.
Further, in step 4, a ridge regression, k-means, LOF and LSTM fusion algorithm is constructed, specifically:
step 4.1: building a ridge regression algorithm, wherein the loss function of the ridge regression algorithm is as follows:
wherein y is i Is the actual observation value, w is the model parameter vector, lambda is the regularized strength hyper-parameter, x i Is the actual observation time, w T Is a model parameterTranspose of the number vector, N, represents the number of samples.
Step 4.2: constructing a k-means algorithm, wherein the used distance measurement method is Euclidean distance, and the expression is:
where X represents a data point and Y represents a cluster center.
Using the silhouette coefficient as an evaluation index for the k-means algorithm, the silhouette coefficient s(i) of data point i is expressed as:

s(i)=\frac{b(i)-a(i)}{\max\{a(i),b(i)\}}

where a(i) represents the average distance from data point i to the other points in the same cluster, and b(i) represents the average distance from data point i to all data points in the nearest other cluster.

The final silhouette coefficient is the average of the silhouette coefficients of all data points, expressed as:

S=\frac{1}{N}\sum_{i=1}^{N}s(i)
step 4.3: constructing an LOF algorithm, using an LOF value as an evaluation index of the LOF algorithm, wherein the LOF value expression of the data point p (i) is as follows:
where k is the domain size, LRD (p (j)) is the local reachable density of each data point p (j), LRD (p (i)) is the local reachable density of each data point p (i), expressed as:
wherein, reach-dist (p (i), p (j)) is the reachable distance from data point p (i) to data point p (j), and the expression is:
reach-dist(p(i),p(j))=max{k-distance(p(j)),dist(p(i),p(j))}
where dist (p (i), p (j)) represents the Euclidean distance between data point p (i) to data point p (j).
Step 4.4: an LSTM model is constructed, and the LSTM model comprises an LSTM layer with 4 memory units, a first random activation layer, a first full connection layer, a second random deactivation layer and a second full connection layer.
Further, in step 5, abnormality detection is performed on the power data classified in step 3 by using the ridge regression, k-means, LOF and LSTM fusion algorithm constructed in step 4, specifically:
step 5.1: the ridge regression model is applied to the classified linear trend type power data, the prediction residual error of each sample is calculated, and based on the distribution condition of the residual error, an abnormal threshold value is defined by using a statistical method. The residual is compared to a predefined threshold and samples exceeding the threshold are identified as anomalous data.
Step 5.2: the k-means model is applied to the classified stable power data, k clustering centers are initialized randomly, the distance between each sample and each clustering center point is calculated, the samples are divided into the nearest center points, the mean value of all sample characteristics divided into each class is calculated, the mean value is used as the new clustering center of each class, the above process is calculated repeatedly until the clustering centers are not changed, and the final clustering center and the class to which each sample belongs are output.
Step 5.3: applying the LOF model to the classified random power data, calculating the local reachable density and local outlier factor of each data point, determining a threshold value of the LOF value, and marking the data points with the LOF value exceeding the threshold value as abnormal points.
Step 5.4: dividing the classified periodic power data into a training set and a testing set, training an LSTM model by adopting the training set, and optimizing the LSTM model after comparing the training set with the testing value of the testing set to obtain a final LSTM model, and predicting the periodic power data by the final LSTM model.
The beneficial effects of the invention are as follows:
1. According to the large-scale power data anomaly detection method, typical linear trend, stable, periodic and random data in the power data are selected to train the OVO SVMs model, and the trained OVO SVMs model can realize rapid division of the power data with high precision and high efficiency.
2. The large-scale power data anomaly detection method can be used for anomaly detection of classified linear trend type, stable type, random type and periodic type power data by using ridge regression, k-means, LOF and LSTM algorithms respectively, can effectively avoid the problem that a single anomaly detection algorithm cannot detect all types of power data, and greatly improves anomaly detection efficiency.
Drawings
Fig. 1 is a flow chart of the large-scale power data anomaly detection method of the present invention.
FIG. 2 is a flowchart of a K-means clustering algorithm.
Fig. 3 is a flowchart of the LSTM algorithm.
Detailed Description
Embodiments of the invention are disclosed in the drawings, and for purposes of explanation, numerous practical details are set forth in the following description. However, it should be understood that these practical details are not to be taken as limiting the invention. That is, in some embodiments of the invention, these practical details are unnecessary.
As shown in fig. 1-3, the present invention is a large-scale power data anomaly detection method based on a ridge regression-k-means clustering-LOF-LSTM fusion algorithm, comprising the steps of,
step 1: preprocessing the original large-scale power data, extracting the mean value and variance characteristics of the power data every day, and dividing the power data after the characteristic extraction into training set data and testing set data;
step 1.1: preprocessing the original large-scale power data, wherein the preprocessing comprises missing value processing, abnormal value processing and the like;
In step 1.1, missing values may bias the analysis and modeling when the power data are examined, and if they are not processed the anomaly detection result may be greatly affected, so missing values are filled using a mean filling method; outliers may be the result of data acquisition errors or other anomalies and may affect the accuracy of the analysis and the stability of the results, so outliers are eliminated.
Step 1.2: calculating the mean and variance characteristics of the power data according to the date;
In step 1.2, the mean and variance characteristics reflect the central location and the degree of dispersion of the data population. Define the time-series raw power data as X_{T}=\{x_{1},x_{2},\dots,x_{T}\}, where T represents the time series length; the mean and variance of the original power data are respectively expressed as follows:

Mean: \mu=\frac{1}{T}\sum_{t=1}^{T}x_{t}, wherein T is the total length of the time-series original power data and \mu is the mean of the time-series original power data. The mean is the most common statistical feature in data analysis and reflects the central trend of the time series.

Variance: \sigma^{2}=\frac{1}{T}\sum_{t=1}^{T}\left(x_{t}-\mu\right)^{2}, wherein \mu is the mean of the time-series original power data and \sigma^{2} is the variance of the time series. The variance reflects the degree of dispersion of the time series; two groups of data with similar means do not necessarily have similar variances, so the variance feature is needed to supplement the mean feature.
Step 1.3: and (3) the power data after the feature extraction is processed according to 7: the scale of 3 is divided into training set data and test set data.
Step 2: and (3) constructing an OVO SVMs model, obtaining a classification result by the OVO SVMs model by adopting the training set obtained in the step (1), comparing the classification result with the classification result of the test set, and obtaining a final OVO SVMs model after parameter optimization of the OVO SVMs model.
Step 2.1: an SVM classification model is designed between any two kinds of power data samples in four kinds of linear trend type, stable type, periodic type and random type, and the objective function of the SVM classification model is as follows:
wherein,is the normal vector of the hyperplane, representing the interface in the sample space, C is a regularization parameter that controls the degree of penalty in classification errors, ζ i Is a relaxation variable indicating the allowed error level for each sample,/->Class label of the i-th sample, +.>Is a feature vector of a sample, which contains feature information extracted from the data, and b is a deviation, which represents the distance between the hyperplane and the origin.
The decision function of the SVM classification model is:
where f (x) is the output of the decision function, alpha, for predicting the class of data points x i Is the coefficient obtained in the training process of the support vector machine,class label of ith sample,/>Is the value of the kernel function, representing the feature vector +.>And the inner product of the data points x to be predicted after mapping to the high-dimensional feature space.
Step 2.2: and training a corresponding SVM classification model by using the electric power data test set, testing the test set, and obtaining the final classification result with the highest number of votes in the classification result in a voting mode.
In step 2.2, the power data are preset into four types, linear trend type, stable type, periodic type and random type, which are respectively marked as 01, 02, 03 and 04; during training, the vectors corresponding to the pairs (01, 02), (01, 03), (01, 04), (02, 03), (02, 04) and (03, 04) are selected as training sets, and 6 different SVMs are trained to obtain 6 trained SVM classification models; when unknown power data are classified, each classifier judges the class of the unknown power data and casts a vote for the corresponding class, and the class receiving the most votes is finally taken as the class of the power data.
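A minimal sketch of the one-versus-one classification of step 2, using scikit-learn's SVC, which internally trains one binary SVM per class pair (six models for four classes) and predicts by majority voting; the RBF kernel, the hyperparameter values and the label encoding 1 to 4 are assumptions.

```python
# Sketch of the OVO SVMs classifier of step 2; kernel and hyperparameters are illustrative.
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def train_ovo_svms(X_train, y_train, X_test, y_test, C=1.0, gamma="scale"):
    # X_*: rows of [daily mean, daily variance] features;
    # y_*: labels 1-4 for linear trend / stable / periodic / random power data.
    # Multi-class SVC trains one binary SVM per class pair and votes, as in step 2.2.
    model = SVC(C=C, gamma=gamma, kernel="rbf")
    model.fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
    return model
```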
Step 3: the power data were classified by the final OVO SVMs model.
Step 3.1: the unclassified power data are classified into preset categories, namely linear trend type, stable type, random type and periodic type by using the trained OVO SVMs, and the specific steps are as follows: the trained OVO SVMs are used to classify the power data after feature extraction, each of which is classified into one of linear trend type, stationary type, random type and periodic type.
Step 4: Constructing a ridge regression, k-means, LOF and LSTM fusion algorithm.
Step 4.1: building a ridge regression algorithm, wherein the loss function of the ridge regression algorithm is as follows:
wherein y is i Is the actual observation value, w is the model parameter vector, lambda is the regularized strength hyper-parameter, x i Is the actual observation time, w T Is a transpose of the model parameter vector, N representing the number of samples.
Step 4.2: constructing a k-means algorithm, wherein the used distance measurement method is Euclidean distance, and the expression is:
where X represents a data point and Y represents a cluster center.
Using the silhouette coefficient as an evaluation index for the k-means algorithm, the silhouette coefficient s(i) of data point i is expressed as:

s(i)=\frac{b(i)-a(i)}{\max\{a(i),b(i)\}}

where a(i) represents the average distance from data point i to the other points in the same cluster, and b(i) represents the average distance from data point i to all data points in the nearest other cluster.

The final silhouette coefficient is the average of the silhouette coefficients of all data points, expressed as:

S=\frac{1}{N}\sum_{i=1}^{N}s(i)
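As an illustration of how the silhouette coefficient above can be used in practice, the sketch below scores k-means clusterings with scikit-learn's silhouette_score (the average of s(i) over all points); the candidate values of k are an assumption.

```python
# Sketch: choose k for k-means by the average silhouette coefficient of step 4.2.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_candidates=(2, 3, 4, 5)):
    """Return the k whose clustering maximizes the average silhouette coefficient."""
    scores = {}
    for k in k_candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)  # average of s(i) over all data points
    best_k = max(scores, key=scores.get)
    return best_k, scores
```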
step 4.3: constructing an LOF algorithm, using an LOF value as an evaluation index of the LOF algorithm, wherein the LOF value expression of the data point p (i) is as follows:
where k is the domain size, LRD (p (j)) is the local reachable density of each data point p (j), LRD (p (i)) is the local reachable density of each data point p (i), expressed as:
wherein, reach-dist (p (i), p (j)) is the reachable distance from data point p (i) to data point p (j), and the expression is:
reach-dist(p(i),p(j))=max{k-distance(p(j)),dist(p(i),p(j))}
where dist (p (i), p (j)) represents the Euclidean distance between data point p (i) to data point p (j).
Step 4.4: an LSTM model is constructed, and the LSTM model comprises an LSTM layer with 4 memory units, a first random activation layer, a first full connection layer, a second random deactivation layer and a second full connection layer.
Step 5: Performing anomaly detection on the power data classified in step 3 by adopting the ridge regression, k-means, LOF and LSTM fusion algorithm constructed in step 4.
Step 5.1: the ridge regression model is applied to the classified linear trend type power data, the prediction residual error of each sample is calculated, and an abnormal threshold value is defined by using a statistical method based on the distribution condition of the residual error. The residual is compared to a predefined threshold and samples exceeding the threshold are identified as anomalous data.
In step 5.1, the preprocessed power data are divided into a training set and a testing set, and the training set is used to build a ridge regression model. The ridge regression model adds a regularization term to linear regression to control model complexity. The model is trained with the training set; during each training step it adjusts the regression coefficients to minimize the loss function, and the regularization term penalizes large regression coefficients to prevent overfitting. The performance of the trained ridge regression model is evaluated with the test set. The power data to be tested are input into the trained ridge regression model to obtain predicted values. By comparing the predicted values with the actual values, the residual of each data point is calculated. A suitable anomaly threshold is set according to the distribution of the residuals, and data points whose residuals exceed this threshold are considered outliers.
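A minimal sketch of the residual-threshold detection described above, using scikit-learn's Ridge; treating the time index as the single input x_i and using a 3-sigma threshold on the residuals are assumptions consistent with, but not dictated by, the description.

```python
# Sketch of step 5.1: ridge regression plus residual threshold for linear trend data.
import numpy as np
from sklearn.linear_model import Ridge

def ridge_anomalies(values, alpha=1.0, n_sigma=3.0):
    """Fit value ~ time with ridge regression and flag samples with large residuals."""
    values = np.asarray(values, dtype=float)
    t = np.arange(len(values)).reshape(-1, 1)      # observation time as the model input
    model = Ridge(alpha=alpha).fit(t, values)       # alpha plays the role of lambda
    residuals = values - model.predict(t)
    threshold = n_sigma * residuals.std()            # statistical anomaly threshold
    return np.abs(residuals) > threshold             # True where the sample is anomalous
```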
Step 5.2: the k-means model is applied to the classified stable power data, k clustering centers are initialized randomly, the distance between each sample and each clustering center point is calculated, the samples are divided into the nearest center points, the mean value of all sample characteristics divided into each class is calculated, the mean value is used as the new clustering center of each class, the above process is calculated repeatedly until the clustering centers are not changed, and the final clustering center and the class to which each sample belongs are output.
In step 5.2, the data is divided into 2 clusters, and 2 data points are randomly selected as initial cluster center points. For each data point, its distance from the center of the respective cluster is calculated, and then the data point is assigned to the cluster closest to it. For each cluster, the mean of all data points in the cluster is calculated and then taken as the new cluster center. The above process is repeated until no significant change in the cluster center occurs or a predetermined number of iterations is reached. After K-means converges, for each data point, its distance from the center of the cluster to which it belongs is calculated, and if the distance exceeds a certain threshold, the data point can be marked as abnormal.
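A minimal sketch of the k-means based detection described above, with the 2 clusters mentioned in this step; the percentile used to set the distance threshold is an assumption, since the patent leaves the threshold unspecified.

```python
# Sketch of step 5.2: cluster stable power data and flag points far from their center.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_anomalies(X, n_clusters=2, pct=99.0):
    """Flag samples whose distance to their own cluster center exceeds a percentile threshold."""
    X = np.asarray(X, dtype=float)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    # Distance of each sample to the center of the cluster it belongs to
    dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    threshold = np.percentile(dists, pct)   # assumed threshold: 99th percentile of distances
    return dists > threshold
```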
Step 5.3: applying the LOF model to the classified random power data, calculating the local reachable density and local outlier factor of each data point, determining a threshold value of the LOF value, and marking the data points with the LOF value exceeding the threshold value as abnormal points.
In step 5.3, for each data point, the Euclidean distances between the data point and the other data points are calculated; a predefined parameter k, namely the number of neighbors, is selected, the distances are sorted in ascending order, and the k-th distance is taken as the k-distance of the data point. For each data point, its reachable distance and local reachable density are calculated. A local outlier factor (LOF) is then calculated for each data point, which is the ratio of the average local reachable density of its neighbors to the local reachable density of that data point. Based on the calculated LOF values, an appropriate threshold, typically greater than 1, is set to determine whether a data point is an outlier. If the LOF value of a data point exceeds the set threshold, it is marked as an outlier.
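A minimal sketch of the LOF detection described above, using scikit-learn's LocalOutlierFactor; the neighborhood size k and the threshold value above 1 are assumptions consistent with the description.

```python
# Sketch of step 5.3: flag random-type power data points whose LOF exceeds a threshold.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def lof_anomalies(X, k=20, threshold=1.5):
    """Mark points whose local outlier factor exceeds the given threshold."""
    X = np.asarray(X, dtype=float)
    lof = LocalOutlierFactor(n_neighbors=k)
    lof.fit(X)
    scores = -lof.negative_outlier_factor_  # negate to recover the LOF values
    return scores > threshold
```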
Step 5.4: dividing the classified periodic power data into a training set and a testing set, training an LSTM model by adopting the training set, and optimizing the LSTM model after comparing the training set with the testing value of the testing set to obtain a final LSTM model, and predicting the periodic power data by the final LSTM model.
According to the invention, typical linear trend, stable, periodic and random data in the power data are selected to train the OVO SVMs model, and the trained OVO SVMs model can realize rapid division of the power data with high precision and high efficiency; the classified linear trend, stable, random and periodic power data are then subjected to anomaly detection using ridge regression, k-means, LOF and LSTM algorithms respectively, which effectively avoids the problem that a single anomaly detection algorithm cannot detect all types of power data and greatly improves anomaly detection efficiency.
The foregoing is only a preferred embodiment of the invention, it being noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the invention.

Claims (6)

1. A large-scale power data anomaly detection method based on a ridge regression-k-means clustering-LOF-LSTM fusion algorithm is characterized by comprising the following steps of: the large-scale power data anomaly detection method comprises the following steps:
step 1, preprocessing original large-scale power data, extracting the mean value and variance characteristics of the power data every day, and dividing the power data after characteristic extraction into a training set and a testing set;
step 2, constructing an OVO SVMs model, obtaining a classification result by the OVO SVMs model by adopting the training set obtained in the step 1, comparing the classification result with the classification result of the test set, and obtaining a final OVO SVMs model after parameter optimization of the OVO SVMs model;
step 3, classifying the electric power data through the final OVO SVMs obtained in the step 2;
step 4, constructing a ridge regression, k-means, LOF and LSTM fusion algorithm;
and 5, performing anomaly detection on the classified power data in the step 3 by adopting a ridge regression, k-means, LOF and LSTM fusion algorithm constructed in the step 4.
2. The large-scale power data anomaly detection method based on the ridge regression-k-means clustering-LOF-LSTM fusion algorithm, which is characterized by comprising the following steps of: in step 1, preprocessing original large-scale power data, extracting the mean value and variance characteristics of the power data every day, and dividing the power data after the characteristic extraction into training set and test set data, wherein the method specifically comprises the following steps:
step 1.1, preprocessing original large-scale power data, wherein the preprocessing comprises missing value processing, outlier processing, data normalization and dimension reduction;
step 1.2, calculating the mean value and variance characteristics of the power data according to the date;
step 1.3, dividing the feature-extracted power data into training set data and test set data at a ratio of 7:3.
3. The large-scale power data anomaly detection method based on the ridge regression-k-means clustering-LOF-LSTM fusion algorithm, which is characterized by comprising the following steps of: in step 2, an OVO SVMs model is constructed, a training set obtained in step 1 is adopted to obtain a classification result by the OVO SVMs model, the classification result is compared with a classification result of a test set, and after parameter optimization is carried out on the OVO SVMs model, a final OVO SVMs model is obtained, and the method specifically comprises the following steps:
step 2.1, designing an SVM classification model between any two of the four kinds of power data samples (linear trend type, stable type, periodic type and random type), wherein the objective function of the SVM classification model is as follows:

\min_{w,b,\xi}\ \frac{1}{2}\|w\|^{2}+C\sum_{i=1}^{N}\xi_{i},\quad \text{s.t. } y_{i}\left(w^{T}x_{i}+b\right)\ge 1-\xi_{i},\ \xi_{i}\ge 0,\ i=1,\dots,N

wherein w is the normal vector of the hyperplane, representing the separating boundary in the sample space, C is a regularization parameter that controls the degree of penalty on classification errors, \xi_{i} is a relaxation variable indicating the allowed error level for each sample, y_{i} is the class label of the i-th sample, x_{i} is the feature vector of the sample, which contains the feature information extracted from the data, and b is the deviation, which represents the distance between the hyperplane and the origin;

the decision function of the SVM classification model is:

f(x)=\operatorname{sign}\left(\sum_{i=1}^{N}\alpha_{i}y_{i}K\left(x_{i},x\right)+b\right)

where f(x) is the output of the decision function used to predict the class of data point x, \alpha_{i} is the coefficient obtained during training of the support vector machine, y_{i} is the class label of the i-th sample, K(x_{i},x) is the kernel function value, representing the inner product of the feature vector x_{i} and the data point x to be predicted after mapping into the high-dimensional feature space, and N represents the number of samples;
and 2.2, training a corresponding SVM classification model with a training set consisting of any two of the four kinds of power data (linear trend type, stable type, periodic type and random type) and testing it on the test set, wherein the test results are combined by voting and the class with the highest number of votes among the classification results is the final classification result.
4. The large-scale power data anomaly detection method based on the ridge regression-k-means clustering-LOF-LSTM fusion algorithm, which is characterized by comprising the following steps of: in step 3, classifying the electric power data through an OVO SVMs model, specifically: the unclassified power data are classified into preset categories, namely linear trend type, stable type, periodic type and random type by using the trained OVO SVMs.
5. The large-scale power data anomaly detection method based on the ridge regression-k-means clustering-LOF-LSTM fusion algorithm, which is characterized by comprising the following steps of: in step 4, a ridge regression, k-means, LOF and LSTM fusion algorithm is constructed, which specifically comprises the following steps:
step 4.1, constructing a ridge regression algorithm, wherein the loss function of the ridge regression algorithm is:

L(w)=\sum_{i=1}^{N}\left(y_{i}-w^{T}x_{i}\right)^{2}+\lambda\|w\|^{2}

wherein y_{i} is the actual observation value, w is the model parameter vector, \lambda is the regularization strength hyperparameter, x_{i} is the actual observation time, w^{T} is the transpose of the model parameter vector, and N represents the number of samples;
step 4.2, constructing a k-means algorithm, wherein the distance metric used is the Euclidean distance, whose expression is:

d(X,Y)=\sqrt{\sum_{j=1}^{n}\left(X_{j}-Y_{j}\right)^{2}}

wherein X represents a data point, Y represents a cluster center, and n is the dimension of the data;
using the silhouette coefficient as an evaluation index for the k-means algorithm, wherein the silhouette coefficient s(i) of data point i is expressed as:

s(i)=\frac{b(i)-a(i)}{\max\{a(i),b(i)\}}

where a(i) represents the average distance from data point i to the other points in the same cluster, and b(i) represents the average distance from data point i to all data points in the nearest other cluster,

the final silhouette coefficient is the average of the silhouette coefficients of all data points, expressed as:

S=\frac{1}{N}\sum_{i=1}^{N}s(i)
step 4.3, constructing an LOF algorithm, wherein the LOF value is used as the evaluation index of the LOF algorithm and the LOF value of data point p(i) is expressed as:

LOF_{k}(p(i))=\frac{1}{k}\sum_{p(j)\in N_{k}(p(i))}\frac{LRD(p(j))}{LRD(p(i))}

where k is the neighborhood size, N_{k}(p(i)) is the set of the k nearest neighbors of p(i), LRD(p(j)) is the local reachable density of each data point p(j), and LRD(p(i)) is the local reachable density of data point p(i), expressed as:

LRD(p(i))=\frac{k}{\sum_{p(j)\in N_{k}(p(i))}\text{reach-dist}(p(i),p(j))}

wherein reach-dist(p(i), p(j)) is the reachable distance from data point p(i) to data point p(j), with the expression:

reach-dist(p(i),p(j))=max{k-distance(p(j)),dist(p(i),p(j))}

wherein dist(p(i), p(j)) represents the Euclidean distance between data point p(i) and data point p(j);
and 4.4, constructing an LSTM model, wherein the LSTM model comprises an LSTM layer with 4 memory units, a first dropout (random deactivation) layer, a first fully connected layer, a second dropout layer and a second fully connected layer.
6. The large-scale power data anomaly detection method based on the ridge regression-k-means clustering-LOF-LSTM fusion algorithm, which is characterized by comprising the following steps of: in step 5, abnormal detection is performed on the classified power data in step 3 by adopting the ridge regression, k-means, LOF and LSTM fusion algorithm constructed in step 4, and the method specifically comprises the following steps:
step 5.1: the ridge regression model is applied to the classified linear trend type power data, the prediction residual error of each sample is calculated, and based on the distribution condition of the residual error, an abnormal threshold value is defined by using a statistical method. The residual is compared to a predefined threshold and samples exceeding the threshold are identified as anomalous data.
Step 5.2, applying a k-means model to the classified stable power data, randomly initializing k clustering centers, calculating the distance between each sample and each clustering center point, dividing the sample to the nearest center point, calculating the mean value of all sample characteristics divided into each category, taking the mean value as a new clustering center of each category, repeating the calculation step 5.2 until the clustering center is not changed, and outputting the final clustering center and the category to which each sample belongs;
step 5.3, applying an LOF model to the classified random power data, calculating local reachable density and local outlier factor of each data point, determining a threshold value of an LOF value, and marking the data points with the LOF values exceeding the threshold value as abnormal points;
and 5.4, dividing the classified periodic power data into a training set and a testing set, training an LSTM model with the training set, optimizing the LSTM model by comparing its predictions with the test set values to obtain a final LSTM model, and predicting the periodic power data with the final LSTM model.
CN202311635028.8A 2023-12-01 2023-12-01 Large-scale power data anomaly detection method based on ridge regression-k-means clustering-LOF-LSTM fusion algorithm Pending CN117633688A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311635028.8A CN117633688A (en) 2023-12-01 2023-12-01 Large-scale power data anomaly detection method based on ridge regression-k-means clustering-LOF-LSTM fusion algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311635028.8A CN117633688A (en) 2023-12-01 2023-12-01 Large-scale power data anomaly detection method based on ridge regression-k-means clustering-LOF-LSTM fusion algorithm

Publications (1)

Publication Number Publication Date
CN117633688A true CN117633688A (en) 2024-03-01

Family

ID=90021166

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311635028.8A Pending CN117633688A (en) 2023-12-01 2023-12-01 Large-scale power data anomaly detection method based on ridge regression-k-means clustering-LOF-LSTM fusion algorithm

Country Status (1)

Country Link
CN (1) CN117633688A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118133435A (en) * 2024-05-08 2024-06-04 北京理工大学长三角研究院(嘉兴) SVR and clustering-based complex spacecraft on-orbit anomaly detection method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination