CN117633688A - Large-scale power data anomaly detection method based on ridge regression-k-means clustering-LOF-LSTM fusion algorithm - Google Patents

Large-scale power data anomaly detection method based on ridge regression-k-means clustering-LOF-LSTM fusion algorithm

Info

Publication number
CN117633688A
CN117633688A
Authority
CN
China
Prior art keywords
power data
data
model
lof
lstm
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202311635028.8A
Other languages
Chinese (zh)
Inventor
季一木
李海天
李玲娟
刘尚东
徐驰
万玲莉
李昆珈
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Posts and Telecommunications
Original Assignee
Nanjing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Posts and Telecommunications filed Critical Nanjing University of Posts and Telecommunications
Priority to CN202311635028.8A
Publication of CN117633688A
Legal status: Pending

Links

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/243 Classification techniques relating to the number of classes
    • G06F18/2433 Single-class perspective, e.g. one-against-all classification; Novelty detection; Outlier detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/10 Pre-processing; Data cleansing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/213 Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques
    • G06F18/232 Non-hierarchical techniques
    • G06F18/2321 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions
    • G06F18/23213 Non-hierarchical techniques using statistics or function optimisation, e.g. modelling of probability density functions with fixed number of clusters, e.g. K-means clustering
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/24 Classification techniques
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06F18/2411 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches based on the proximity to a decision surface, e.g. support vector machines
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/25 Fusion techniques
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/044 Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Biology (AREA)
  • Biophysics (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Molecular Biology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Health & Medical Sciences (AREA)
  • Probability & Statistics with Applications (AREA)
  • Supply And Distribution Of Alternating Current (AREA)

Abstract

The invention belongs to the technical field of power data anomaly detection and discloses a large-scale power data anomaly detection method based on a ridge regression-k-means clustering-LOF-LSTM fusion algorithm, which specifically comprises the following steps: before anomaly detection of the power data, the mean and variance characteristics of the original large-scale power data are analyzed through an OVO SVMs model, the original power data are divided into four types, namely linear trend type, stable type, periodic type and random type, and for the different types a ridge regression, k-means, LOF and LSTM fusion algorithm is constructed to perform anomaly detection. The method can realize rapid division of large-scale power data, can effectively avoid the problem that a single anomaly detection algorithm cannot detect all power data, and improves the accuracy and efficiency of large-scale power data anomaly detection.

Description

Large-scale power data anomaly detection method based on ridge regression-k-means clustering-LOF-LSTM fusion algorithm
Technical Field
The invention belongs to the technical field of power data anomaly detection, and particularly relates to a large-scale power data anomaly detection method based on a ridge regression-k-means clustering-LOF-LSTM fusion algorithm.
Background
With the continuous expansion of power systems, the generation and accumulation of large-scale power data has shown a trend of explosive growth. The data comprise various key information in the operation process of the power system, and each link from power generation to power transmission to power distribution comprises various parameters such as voltage, current, frequency, load and the like. However, as the complexity of the power system increases, abnormal situations in the power data are increasingly frequent due to factors such as equipment failure, human misoperation, external environment change, etc., and efficient detection and diagnosis of abnormalities in the power data become important tasks for ensuring stable operation and power supply quality of the power system. Different power data types may have different characteristics, distributions, and anomaly patterns, and applying a single algorithm to all types of power data may be difficult or even inefficient.
At present, large-scale power data anomaly detection methods can be classified, according to their working principles, into anomaly detection algorithms based on regression models, anomaly detection algorithms based on clustering, and anomaly detection algorithms based on deep learning.
An anomaly detection algorithm based on a regression model detects anomalies by building a regression model of time series data, predicting the values of data points, and comparing the differences between the predicted values and the actual values. An anomaly is typically a large deviation between a predicted value and an actual value. This approach is applicable to time series data that have trending or periodic patterns, as regression models can capture these patterns. Common methods include linear regression, ARIMA (autoregressive integrated moving average model), ridge regression, and the like.
The cluster-based anomaly detection algorithm divides the time series data into different clusters and then detects anomalies by measuring the distance or similarity between the data points and the clusters to which they belong. Outliers may be points that are far from other data points or that do not fit the primary distribution within a cluster. This method is particularly applicable to time series data having similar patterns over certain time periods. Common methods are K-means clustering algorithm, DBSCAN algorithm and the like.
Deep learning based anomaly detection algorithms utilize deep neural networks to capture complex relationships and patterns in time series data. They can automatically learn the feature representations from the data and identify anomalies that do not match the expected patterns. This approach is particularly applicable to time series data that have highly complex and abstract features, as deep learning models are able to handle more complex data patterns. Common approaches include anomaly detection models based on Recurrent Neural Networks (RNNs), long short-term memory networks (LSTM), convolutional Neural Networks (CNNs), and the like.
The diversity of power data leads to different data distributions and variation patterns across types, and it is difficult to apply a single anomaly detection algorithm to all power data types; for example, an anomaly detection algorithm based on a regression model detects linear trend type power data well but performs poorly on random type power data.
Disclosure of Invention
In order to solve the problems in the prior art that large-scale power data are difficult to classify and that a single anomaly detection algorithm cannot detect all power data, the invention provides a large-scale power data anomaly detection method based on a ridge regression-k-means clustering-LOF-LSTM fusion algorithm. The method can realize rapid division of large-scale power data, can effectively avoid the problem that a single anomaly detection algorithm cannot detect all power data, and improves the accuracy and efficiency of large-scale power data anomaly detection.
In order to achieve the above purpose, the invention is realized by the following technical scheme:
the invention discloses a large-scale power data anomaly detection method based on a ridge regression-k-means clustering-LOF-LSTM fusion algorithm, which comprises the following steps of:
step 1: preprocessing the original large-scale power data, extracting the mean value and variance characteristics of the power data every day, and dividing the power data after the characteristic extraction into training set data and testing set data;
step 2: constructing an OVO SVMs model, obtaining a classification result by the OVO SVMs model by adopting the training set obtained in the step 1, comparing the classification result with the classification result of the test set, and obtaining a final OVO SVMs model after parameter optimization of the OVO SVMs model;
step 3: classifying the power data through a final OVO SVMs model;
step 4: building a ridge regression, k-means, LOF and LSTM fusion algorithm;
step 5: abnormality detection is carried out on the electric power data classified in the step 3 by adopting the ridge regression, k-means, LOF and LSTM fusion algorithm constructed in the step 4
Further, in step 1, the original large-scale power data are preprocessed, the daily mean and variance characteristics of the power data are extracted, and the feature-extracted power data are divided into training set data and test set data, specifically:
step 1.1: preprocessing original large-scale power data, wherein the preprocessing comprises missing value processing, outlier processing, data normalization and dimension reduction;
step 1.2: calculating the mean and variance characteristics of the power data according to the date;
step 1.3: and (3) the power data after the feature extraction is processed according to 7: the scale of 3 is divided into training set data and test set data.
Further, in step 2, an OVO SVMs model is constructed, the training set obtained in step 1 is adopted to obtain a classification result by the OVO SVMs model, the classification result is compared with the classification result of the test set, and after parameter optimization is performed on the OVO SVMs model, a final OVO SVMs model is obtained, specifically:
step 2.1: an SVM classification model is designed between any two kinds of power data samples in four kinds of linear trend type, stable type, periodic type and random type, and the objective function of the SVM classification model is as follows:
wherein,is the normal vector of the hyperplane, representing the interface in the sample space, C is a regularization parameter that controls the degree of penalty in classification errors, ζ i Is a relaxation variable indicating the allowed error level for each sample,/->Class label of the i-th sample, +.>Is a feature vector of a sample, which contains feature information extracted from the data, and b is a deviation, which represents the distance between the hyperplane and the origin.
The decision function of the SVM classification model is:
where f (x) is the output of the decision function, alpha, for predicting the class of data points x i Is the coefficient obtained in the training process of the support vector machine,class label of the i-th sample, +.>Is the value of the kernel function, representing the feature vector +.>And the inner product of the data points x to be predicted after mapping to the high-dimensional feature space.
Step 2.2: the method comprises the steps of training a corresponding SVM classification model by a test set consisting of any two kinds of power data of linear trend type, stable type, periodic type and random type, testing the SVM classification model on the test set, and obtaining the final classification result with the highest vote number in the classification results in a voting mode by the test result.
Further, in step 3, the power data is classified by the final OVO SVMs model, specifically:
step 3.1: the unclassified power data are classified into preset categories, namely linear trend type, stable type, random type and periodic type by using the trained OVO SVMs.
Further, in step 4, a ridge regression, k-means, LOF and LSTM fusion algorithm is constructed, specifically:
step 4.1: building a ridge regression algorithm, wherein the loss function of the ridge regression algorithm is as follows:
wherein y is i Is the actual observation value, w is the model parameter vector, lambda is the regularized strength hyper-parameter, x i Is the actual observation time, w T Is a model parameterTranspose of the number vector, N, represents the number of samples.
Step 4.2: constructing a k-means algorithm, wherein the used distance measurement method is Euclidean distance, and the expression is:
where X represents a data point and Y represents a cluster center.
Using the silhouette coefficient as an evaluation index for the k-means algorithm, the silhouette coefficient s(i) of data point i is expressed as:

s(i)=\frac{b(i)-a(i)}{\max\{a(i),b(i)\}}

where a(i) represents the average distance from data point i to the other points in the same cluster, and b(i) represents the average distance from data point i to all data points in the nearest other cluster.

The final silhouette coefficient is the average of the silhouette coefficients of all data points, expressed as:

S=\frac{1}{N}\sum_{i=1}^{N}s(i)
step 4.3: constructing an LOF algorithm, using an LOF value as an evaluation index of the LOF algorithm, wherein the LOF value expression of the data point p (i) is as follows:
where k is the domain size, LRD (p (j)) is the local reachable density of each data point p (j), LRD (p (i)) is the local reachable density of each data point p (i), expressed as:
wherein, reach-dist (p (i), p (j)) is the reachable distance from data point p (i) to data point p (j), and the expression is:
reach-dist(p(i),p(j))=max{k-distance(p(j)),dist(p(i),p(j))}
where dist (p (i), p (j)) represents the Euclidean distance between data point p (i) to data point p (j).
Step 4.4: an LSTM model is constructed, and the LSTM model comprises an LSTM layer with 4 memory units, a first random activation layer, a first full connection layer, a second random deactivation layer and a second full connection layer.
Further, in step 5, abnormality detection is performed on the power data classified in step 3 by using the ridge regression, k-means, LOF and LSTM fusion algorithm constructed in step 4, specifically:
step 5.1: the ridge regression model is applied to the classified linear trend type power data, the prediction residual error of each sample is calculated, and based on the distribution condition of the residual error, an abnormal threshold value is defined by using a statistical method. The residual is compared to a predefined threshold and samples exceeding the threshold are identified as anomalous data.
Step 5.2: the k-means model is applied to the classified stable power data, k clustering centers are initialized randomly, the distance between each sample and each clustering center point is calculated, the samples are divided into the nearest center points, the mean value of all sample characteristics divided into each class is calculated, the mean value is used as the new clustering center of each class, the above process is calculated repeatedly until the clustering centers are not changed, and the final clustering center and the class to which each sample belongs are output.
Step 5.3: applying the LOF model to the classified random power data, calculating the local reachable density and local outlier factor of each data point, determining a threshold value of the LOF value, and marking the data points with the LOF value exceeding the threshold value as abnormal points.
Step 5.4: dividing the classified periodic power data into a training set and a testing set, training an LSTM model by adopting the training set, and optimizing the LSTM model after comparing the training set with the testing value of the testing set to obtain a final LSTM model, and predicting the periodic power data by the final LSTM model.
The beneficial effects of the invention are as follows:
1. According to the large-scale power data anomaly detection method, typical linear trend, stable, periodic and random data in the power data are selected to train the OVO SVMs model, and the trained OVO SVMs model can realize rapid division of the power data with high precision and high efficiency.
2. The large-scale power data anomaly detection method can be used for anomaly detection of classified linear trend type, stable type, random type and periodic type power data by using ridge regression, k-means, LOF and LSTM algorithms respectively, can effectively avoid the problem that a single anomaly detection algorithm cannot detect all types of power data, and greatly improves anomaly detection efficiency.
Drawings
Fig. 1 is a flow chart of the large-scale power data anomaly detection method of the present invention.
FIG. 2 is a flowchart of a K-means clustering algorithm.
Fig. 3 is a flowchart of the LSTM algorithm.
Detailed Description
Embodiments of the invention are disclosed in the drawings, and for purposes of explanation, numerous practical details are set forth in the following description. However, it should be understood that these practical details are not to be taken as limiting the invention. That is, in some embodiments of the invention, these practical details are unnecessary.
As shown in fig. 1-3, the present invention is a large-scale power data anomaly detection method based on a ridge regression-k-means clustering-LOF-LSTM fusion algorithm, comprising the steps of,
step 1: preprocessing the original large-scale power data, extracting the mean value and variance characteristics of the power data every day, and dividing the power data after the characteristic extraction into training set data and testing set data;
step 1.1: preprocessing the original large-scale power data, wherein the preprocessing comprises missing value processing, abnormal value processing and the like;
In step 1.1, missing values may bias the analysis and modeling when the power data are examined, and if they are not processed the anomaly detection result may be greatly affected, so missing values are filled using a mean filling method; outliers may be the result of data acquisition errors or other anomalies and may affect the accuracy of the analysis and the stability of the results, so outliers are eliminated.
Step 1.2: calculating the mean and variance characteristics of the power data according to the date;
In step 1.2, the mean and variance characteristics reflect the central location and the degree of dispersion of the data population. Define the time-series raw power data as X_{T}=\{x_{1},x_{2},\dots,x_{T}\}, where T represents the time series length; the mean and variance of the original power data are respectively expressed as follows:

Mean: \mu=\frac{1}{T}\sum_{t=1}^{T}x_{t}, wherein T is the total length of the time-series original power data and \mu is the mean of the time-series original power data. The mean is the most common statistical feature in data analysis and reflects the central trend of the time series.

Variance: \sigma^{2}=\frac{1}{T}\sum_{t=1}^{T}\left(x_{t}-\mu\right)^{2}, wherein \mu is the mean of the time-series original power data and \sigma^{2} is the variance of the time series. The variance reflects the degree of dispersion of the time series; two groups of data with similar means do not necessarily have similar variances, so the variance feature is needed to supplement the mean feature.
Step 1.3: and (3) the power data after the feature extraction is processed according to 7: the scale of 3 is divided into training set data and test set data.
Step 2: and (3) constructing an OVO SVMs model, obtaining a classification result by the OVO SVMs model by adopting the training set obtained in the step (1), comparing the classification result with the classification result of the test set, and obtaining a final OVO SVMs model after parameter optimization of the OVO SVMs model.
Step 2.1: an SVM classification model is designed between any two kinds of power data samples in four kinds of linear trend type, stable type, periodic type and random type, and the objective function of the SVM classification model is as follows:
wherein,is the normal vector of the hyperplane, representing the interface in the sample space, C is a regularization parameter that controls the degree of penalty in classification errors, ζ i Is a relaxation variable indicating the allowed error level for each sample,/->Class label of the i-th sample, +.>Is a feature vector of a sample, which contains feature information extracted from the data, and b is a deviation, which represents the distance between the hyperplane and the origin.
The decision function of the SVM classification model is:
where f (x) is the output of the decision function, alpha, for predicting the class of data points x i Is the coefficient obtained in the training process of the support vector machine,class label of ith sample,/>Is the value of the kernel function, representing the feature vector +.>And the inner product of the data points x to be predicted after mapping to the high-dimensional feature space.
Step 2.2: and training a corresponding SVM classification model by using the electric power data test set, testing the test set, and obtaining the final classification result with the highest number of votes in the classification result in a voting mode.
In step 2.2, the power data are preset into four types, linear trend type, stable type, periodic type and random type, which are respectively marked as 01, 02, 03 and 04; during training, the vectors corresponding to the pairs (01, 02), (01, 03), (01, 04), (02, 03), (02, 04) and (03, 04) are selected as training sets, and 6 different SVMs are trained to obtain 6 trained SVM classification models; when unknown power data are classified, each classifier judges the class of the unknown power data and casts a vote for the corresponding class, and the class receiving the most votes is finally taken as the class of the power data.
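A minimal sketch of the one-versus-one classification of step 2, using scikit-learn's SVC, which internally trains one binary SVM per class pair (six models for four classes) and predicts by majority voting; the RBF kernel, the hyperparameter values and the label encoding 1 to 4 are assumptions.

```python
# Sketch of the OVO SVMs classifier of step 2; kernel and hyperparameters are illustrative.
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def train_ovo_svms(X_train, y_train, X_test, y_test, C=1.0, gamma="scale"):
    # X_*: rows of [daily mean, daily variance] features;
    # y_*: labels 1-4 for linear trend / stable / periodic / random power data.
    # Multi-class SVC trains one binary SVM per class pair and votes, as in step 2.2.
    model = SVC(C=C, gamma=gamma, kernel="rbf")
    model.fit(X_train, y_train)
    print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
    return model
```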
Step 3: the power data were classified by the final OVO SVMs model.
Step 3.1: the unclassified power data are classified into preset categories, namely linear trend type, stable type, random type and periodic type by using the trained OVO SVMs, and the specific steps are as follows: the trained OVO SVMs are used to classify the power data after feature extraction, each of which is classified into one of linear trend type, stationary type, random type and periodic type.
Step 4: Constructing a ridge regression, k-means, LOF and LSTM fusion algorithm.
Step 4.1: building a ridge regression algorithm, wherein the loss function of the ridge regression algorithm is as follows:
wherein y is i Is the actual observation value, w is the model parameter vector, lambda is the regularized strength hyper-parameter, x i Is the actual observation time, w T Is a transpose of the model parameter vector, N representing the number of samples.
Step 4.2: constructing a k-means algorithm, wherein the used distance measurement method is Euclidean distance, and the expression is:
where X represents a data point and Y represents a cluster center.
Using the silhouette coefficient as an evaluation index for the k-means algorithm, the silhouette coefficient s(i) of data point i is expressed as:

s(i)=\frac{b(i)-a(i)}{\max\{a(i),b(i)\}}

where a(i) represents the average distance from data point i to the other points in the same cluster, and b(i) represents the average distance from data point i to all data points in the nearest other cluster.

The final silhouette coefficient is the average of the silhouette coefficients of all data points, expressed as:

S=\frac{1}{N}\sum_{i=1}^{N}s(i)
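As an illustration of how the silhouette coefficient above can be used in practice, the sketch below scores k-means clusterings with scikit-learn's silhouette_score (the average of s(i) over all points); the candidate values of k are an assumption.

```python
# Sketch: choose k for k-means by the average silhouette coefficient of step 4.2.
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

def best_k_by_silhouette(X, k_candidates=(2, 3, 4, 5)):
    """Return the k whose clustering maximizes the average silhouette coefficient."""
    scores = {}
    for k in k_candidates:
        labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
        scores[k] = silhouette_score(X, labels)  # average of s(i) over all data points
    best_k = max(scores, key=scores.get)
    return best_k, scores
```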
step 4.3: constructing an LOF algorithm, using an LOF value as an evaluation index of the LOF algorithm, wherein the LOF value expression of the data point p (i) is as follows:
where k is the domain size, LRD (p (j)) is the local reachable density of each data point p (j), LRD (p (i)) is the local reachable density of each data point p (i), expressed as:
wherein, reach-dist (p (i), p (j)) is the reachable distance from data point p (i) to data point p (j), and the expression is:
reach-dist(p(i),p(j))=max{k-distance(p(j)),dist(p(i),p(j))}
where dist (p (i), p (j)) represents the Euclidean distance between data point p (i) to data point p (j).
Step 4.4: an LSTM model is constructed, and the LSTM model comprises an LSTM layer with 4 memory units, a first random activation layer, a first full connection layer, a second random deactivation layer and a second full connection layer.
Step 5: Performing anomaly detection on the power data classified in step 3 by adopting the ridge regression, k-means, LOF and LSTM fusion algorithm constructed in step 4.
Step 5.1: the ridge regression model is applied to the classified linear trend type power data, the prediction residual error of each sample is calculated, and an abnormal threshold value is defined by using a statistical method based on the distribution condition of the residual error. The residual is compared to a predefined threshold and samples exceeding the threshold are identified as anomalous data.
In step 5.1, the preprocessed power data are divided into a training set and a testing set, and the training set is used to build a ridge regression model. The ridge regression model adds a regularization term to linear regression to control model complexity. The model is trained with the training set; during each training step it adjusts the regression coefficients to minimize the loss function, and the regularization term penalizes large regression coefficients to prevent overfitting. The performance of the trained ridge regression model is evaluated with the test set. The power data to be tested are input into the trained ridge regression model to obtain predicted values. By comparing the predicted values with the actual values, the residual of each data point is calculated. A suitable anomaly threshold is set according to the distribution of the residuals, and data points whose residuals exceed this threshold are considered outliers.
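A minimal sketch of the residual-threshold detection described above, using scikit-learn's Ridge; treating the time index as the single input x_i and using a 3-sigma threshold on the residuals are assumptions consistent with, but not dictated by, the description.

```python
# Sketch of step 5.1: ridge regression plus residual threshold for linear trend data.
import numpy as np
from sklearn.linear_model import Ridge

def ridge_anomalies(values, alpha=1.0, n_sigma=3.0):
    """Fit value ~ time with ridge regression and flag samples with large residuals."""
    values = np.asarray(values, dtype=float)
    t = np.arange(len(values)).reshape(-1, 1)      # observation time as the model input
    model = Ridge(alpha=alpha).fit(t, values)       # alpha plays the role of lambda
    residuals = values - model.predict(t)
    threshold = n_sigma * residuals.std()            # statistical anomaly threshold
    return np.abs(residuals) > threshold             # True where the sample is anomalous
```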
Step 5.2: the k-means model is applied to the classified stable power data, k clustering centers are initialized randomly, the distance between each sample and each clustering center point is calculated, the samples are divided into the nearest center points, the mean value of all sample characteristics divided into each class is calculated, the mean value is used as the new clustering center of each class, the above process is calculated repeatedly until the clustering centers are not changed, and the final clustering center and the class to which each sample belongs are output.
In step 5.2, the data is divided into 2 clusters, and 2 data points are randomly selected as initial cluster center points. For each data point, its distance from the center of the respective cluster is calculated, and then the data point is assigned to the cluster closest to it. For each cluster, the mean of all data points in the cluster is calculated and then taken as the new cluster center. The above process is repeated until no significant change in the cluster center occurs or a predetermined number of iterations is reached. After K-means converges, for each data point, its distance from the center of the cluster to which it belongs is calculated, and if the distance exceeds a certain threshold, the data point can be marked as abnormal.
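A minimal sketch of the k-means based detection described above, with the 2 clusters mentioned in this step; the percentile used to set the distance threshold is an assumption, since the patent leaves the threshold unspecified.

```python
# Sketch of step 5.2: cluster stable power data and flag points far from their center.
import numpy as np
from sklearn.cluster import KMeans

def kmeans_anomalies(X, n_clusters=2, pct=99.0):
    """Flag samples whose distance to their own cluster center exceeds a percentile threshold."""
    X = np.asarray(X, dtype=float)
    km = KMeans(n_clusters=n_clusters, n_init=10, random_state=0).fit(X)
    # Distance of each sample to the center of the cluster it belongs to
    dists = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)
    threshold = np.percentile(dists, pct)   # assumed threshold: 99th percentile of distances
    return dists > threshold
```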
Step 5.3: applying the LOF model to the classified random power data, calculating the local reachable density and local outlier factor of each data point, determining a threshold value of the LOF value, and marking the data points with the LOF value exceeding the threshold value as abnormal points.
In step 5.3, for each data point, the Euclidean distances between the data point and the other data points are calculated; a predefined parameter k, namely the number of neighbors, is selected, the distances are sorted in ascending order, and the k-th distance is taken as the k-distance of the data point. For each data point, its reachable distance and local reachable density are calculated. A local outlier factor (LOF) is then calculated for each data point, which is the ratio of the average local reachable density of its neighbors to the local reachable density of that data point. Based on the calculated LOF values, an appropriate threshold, typically greater than 1, is set to determine whether a data point is an outlier. If the LOF value of a data point exceeds the set threshold, it is marked as an outlier.
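A minimal sketch of the LOF detection described above, using scikit-learn's LocalOutlierFactor; the neighborhood size k and the threshold value above 1 are assumptions consistent with the description.

```python
# Sketch of step 5.3: flag random-type power data points whose LOF exceeds a threshold.
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

def lof_anomalies(X, k=20, threshold=1.5):
    """Mark points whose local outlier factor exceeds the given threshold."""
    X = np.asarray(X, dtype=float)
    lof = LocalOutlierFactor(n_neighbors=k)
    lof.fit(X)
    scores = -lof.negative_outlier_factor_  # negate to recover the LOF values
    return scores > threshold
```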
Step 5.4: dividing the classified periodic power data into a training set and a testing set, training an LSTM model by adopting the training set, and optimizing the LSTM model after comparing the training set with the testing value of the testing set to obtain a final LSTM model, and predicting the periodic power data by the final LSTM model.
According to the invention, typical linear trend, stable, periodic and random data in the power data are selected to train the OVO SVMs model, and the trained OVO SVMs model can realize rapid division of the power data with high precision and high efficiency; the classified linear trend, stable, random and periodic power data are then subjected to anomaly detection using ridge regression, k-means, LOF and LSTM algorithms respectively, which effectively avoids the problem that a single anomaly detection algorithm cannot detect all types of power data and greatly improves anomaly detection efficiency.
The foregoing is only a preferred embodiment of the invention, it being noted that: it will be apparent to those skilled in the art that various modifications and adaptations can be made without departing from the principles of the present invention, and such modifications and adaptations are intended to be comprehended within the scope of the invention.

Claims (6)

1. A large-scale power data anomaly detection method based on a ridge regression-k-means clustering-LOF-LSTM fusion algorithm is characterized by comprising the following steps of: the large-scale power data anomaly detection method comprises the following steps:
step 1, preprocessing original large-scale power data, extracting the mean value and variance characteristics of the power data every day, and dividing the power data after characteristic extraction into a training set and a testing set;
step 2, constructing an OVO SVMs model, obtaining a classification result by the OVO SVMs model by adopting the training set obtained in the step 1, comparing the classification result with the classification result of the test set, and obtaining a final OVO SVMs model after parameter optimization of the OVO SVMs model;
step 3, classifying the electric power data through the final OVO SVMs obtained in the step 2;
step 4, constructing a ridge regression, k-means, LOF and LSTM fusion algorithm;
and 5, performing anomaly detection on the classified power data in the step 3 by adopting a ridge regression, k-means, LOF and LSTM fusion algorithm constructed in the step 4.
2. The large-scale power data anomaly detection method based on the ridge regression-k-means clustering-LOF-LSTM fusion algorithm, which is characterized by comprising the following steps of: in step 1, preprocessing original large-scale power data, extracting the mean value and variance characteristics of the power data every day, and dividing the power data after the characteristic extraction into training set and test set data, wherein the method specifically comprises the following steps:
step 1.1, preprocessing original large-scale power data, wherein the preprocessing comprises missing value processing, outlier processing, data normalization and dimension reduction;
step 1.2, calculating the mean value and variance characteristics of the power data according to the date;
step 1.3, dividing the feature-extracted power data into training set data and test set data at a ratio of 7:3.
3. The large-scale power data anomaly detection method based on the ridge regression-k-means clustering-LOF-LSTM fusion algorithm, which is characterized by comprising the following steps of: in step 2, an OVO SVMs model is constructed, a training set obtained in step 1 is adopted to obtain a classification result by the OVO SVMs model, the classification result is compared with a classification result of a test set, and after parameter optimization is carried out on the OVO SVMs model, a final OVO SVMs model is obtained, and the method specifically comprises the following steps:
step 2.1, designing an SVM classification model between any two of the four kinds of power data samples (linear trend type, stable type, periodic type and random type), wherein the objective function of the SVM classification model is as follows:

\min_{w,b,\xi}\ \frac{1}{2}\|w\|^{2}+C\sum_{i=1}^{N}\xi_{i},\quad \text{s.t. } y_{i}\left(w^{T}x_{i}+b\right)\ge 1-\xi_{i},\ \xi_{i}\ge 0,\ i=1,\dots,N

wherein w is the normal vector of the hyperplane, representing the separating boundary in the sample space, C is a regularization parameter that controls the degree of penalty on classification errors, \xi_{i} is a relaxation variable indicating the allowed error level for each sample, y_{i} is the class label of the i-th sample, x_{i} is the feature vector of the sample, which contains the feature information extracted from the data, and b is the deviation, which represents the distance between the hyperplane and the origin;

the decision function of the SVM classification model is:

f(x)=\operatorname{sign}\left(\sum_{i=1}^{N}\alpha_{i}y_{i}K\left(x_{i},x\right)+b\right)

where f(x) is the output of the decision function used to predict the class of data point x, \alpha_{i} is the coefficient obtained during training of the support vector machine, y_{i} is the class label of the i-th sample, K(x_{i},x) is the kernel function value, representing the inner product of the feature vector x_{i} and the data point x to be predicted after mapping into the high-dimensional feature space, and N represents the number of samples;
and 2.2, training a corresponding SVM classification model with a training set consisting of any two of the four kinds of power data (linear trend type, stable type, periodic type and random type) and testing it on the test set, wherein the test results are combined by voting and the class with the highest number of votes among the classification results is the final classification result.
4. The large-scale power data anomaly detection method based on the ridge regression-k-means clustering-LOF-LSTM fusion algorithm, which is characterized by comprising the following steps of: in step 3, classifying the electric power data through an OVO SVMs model, specifically: the unclassified power data are classified into preset categories, namely linear trend type, stable type, periodic type and random type by using the trained OVO SVMs.
5. The large-scale power data anomaly detection method based on the ridge regression-k-means clustering-LOF-LSTM fusion algorithm, which is characterized by comprising the following steps of: in step 4, a ridge regression, k-means, LOF and LSTM fusion algorithm is constructed, which specifically comprises the following steps:
step 4.1, constructing a ridge regression algorithm, wherein the loss function of the ridge regression algorithm is:

L(w)=\sum_{i=1}^{N}\left(y_{i}-w^{T}x_{i}\right)^{2}+\lambda\|w\|^{2}

wherein y_{i} is the actual observation value, w is the model parameter vector, \lambda is the regularization strength hyperparameter, x_{i} is the actual observation time, w^{T} is the transpose of the model parameter vector, and N represents the number of samples;
step 4.2, constructing a k-means algorithm, wherein the distance metric used is the Euclidean distance, whose expression is:

d(X,Y)=\sqrt{\sum_{j=1}^{n}\left(X_{j}-Y_{j}\right)^{2}}

wherein X represents a data point, Y represents a cluster center, and n is the dimension of the data;
using the silhouette coefficient as an evaluation index for the k-means algorithm, wherein the silhouette coefficient s(i) of data point i is expressed as:

s(i)=\frac{b(i)-a(i)}{\max\{a(i),b(i)\}}

where a(i) represents the average distance from data point i to the other points in the same cluster, and b(i) represents the average distance from data point i to all data points in the nearest other cluster,

the final silhouette coefficient is the average of the silhouette coefficients of all data points, expressed as:

S=\frac{1}{N}\sum_{i=1}^{N}s(i)
step 4.3, constructing an LOF algorithm, wherein the LOF value is used as the evaluation index of the LOF algorithm and the LOF value of data point p(i) is expressed as:

LOF_{k}(p(i))=\frac{1}{k}\sum_{p(j)\in N_{k}(p(i))}\frac{LRD(p(j))}{LRD(p(i))}

where k is the neighborhood size, N_{k}(p(i)) is the set of the k nearest neighbors of p(i), LRD(p(j)) is the local reachable density of each data point p(j), and LRD(p(i)) is the local reachable density of data point p(i), expressed as:

LRD(p(i))=\frac{k}{\sum_{p(j)\in N_{k}(p(i))}\text{reach-dist}(p(i),p(j))}

wherein reach-dist(p(i), p(j)) is the reachable distance from data point p(i) to data point p(j), with the expression:

reach-dist(p(i),p(j))=max{k-distance(p(j)),dist(p(i),p(j))}

wherein dist(p(i), p(j)) represents the Euclidean distance between data point p(i) and data point p(j);
and 4.4, constructing an LSTM model, wherein the LSTM model comprises an LSTM layer with 4 memory units, a first dropout (random deactivation) layer, a first fully connected layer, a second dropout layer and a second fully connected layer.
6. The large-scale power data anomaly detection method based on the ridge regression-k-means clustering-LOF-LSTM fusion algorithm, which is characterized by comprising the following steps of: in step 5, abnormal detection is performed on the classified power data in step 3 by adopting the ridge regression, k-means, LOF and LSTM fusion algorithm constructed in step 4, and the method specifically comprises the following steps:
step 5.1: the ridge regression model is applied to the classified linear trend type power data, the prediction residual error of each sample is calculated, and based on the distribution condition of the residual error, an abnormal threshold value is defined by using a statistical method. The residual is compared to a predefined threshold and samples exceeding the threshold are identified as anomalous data.
Step 5.2, applying a k-means model to the classified stable power data, randomly initializing k clustering centers, calculating the distance between each sample and each clustering center point, dividing the sample to the nearest center point, calculating the mean value of all sample characteristics divided into each category, taking the mean value as a new clustering center of each category, repeating the calculation step 5.2 until the clustering center is not changed, and outputting the final clustering center and the category to which each sample belongs;
step 5.3, applying an LOF model to the classified random power data, calculating local reachable density and local outlier factor of each data point, determining a threshold value of an LOF value, and marking the data points with the LOF values exceeding the threshold value as abnormal points;
and 5.4, dividing the classified periodic power data into a training set and a testing set, training an LSTM model with the training set, optimizing the LSTM model by comparing its predictions with the test set values to obtain a final LSTM model, and predicting the periodic power data with the final LSTM model.
CN202311635028.8A 2023-12-01 2023-12-01 Large-scale power data anomaly detection method based on ridge regression-k-means clustering-LOF-LSTM fusion algorithm Pending CN117633688A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311635028.8A CN117633688A (en) 2023-12-01 2023-12-01 Large-scale power data anomaly detection method based on ridge regression-k-means clustering-LOF-LSTM fusion algorithm

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202311635028.8A CN117633688A (en) 2023-12-01 2023-12-01 Large-scale power data anomaly detection method based on ridge regression-k-means clustering-LOF-LSTM fusion algorithm

Publications (1)

Publication Number Publication Date
CN117633688A true CN117633688A (en) 2024-03-01

Family

ID=90021166

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311635028.8A Pending CN117633688A (en) 2023-12-01 2023-12-01 Large-scale power data anomaly detection method based on ridge regression-k-means clustering-LOF-LSTM fusion algorithm

Country Status (1)

Country Link
CN (1) CN117633688A (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118133435A (en) * 2024-05-08 2024-06-04 北京理工大学长三角研究院(嘉兴) SVR and clustering-based complex spacecraft on-orbit anomaly detection method


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination