CN111767273B

CN111767273B - Data intelligent detection method and device based on improved SOM algorithm

Info

Publication number: CN111767273B
Application number: CN202010575124.8A
Authority: CN
Inventors: 胡伟; 郭秋婷; 黄建平; 陈浩; 盛银波
Original assignee: Tsinghua University; State Grid Corp of China SGCC; Jiaxing Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Current assignee: Tsinghua University; State Grid Corp of China SGCC; Jiaxing Power Supply Co of State Grid Zhejiang Electric Power Co Ltd
Priority date: 2020-06-22
Filing date: 2020-06-22
Publication date: 2023-05-23
Anticipated expiration: 2040-06-22
Also published as: CN111767273A

Abstract

The invention discloses a data intelligent detection method and device based on an improved SOM algorithm. The method comprises the following steps: acquiring a sample set, decomposing the sample set by dimension, performing density-based one-dimensional isolated point detection dimension by dimension so as to preliminarily screen the multi-dimensional sample set for outliers, and removing the outliers; clustering the sample set with a self-organizing feature mapping (SOM) algorithm, and removing abnormal data points; improving the self-organizing feature mapping algorithm by a kernel function method, clustering the sample set with the improved algorithm, and removing abnormal data points; and removing abnormal data points in the sample set according to expert experience to complete intelligent detection of the data. Density-based one-dimensional isolated point detection removes abnormal data and improves data quality, and introducing a kernel function into the weight update of the self-organizing map algorithm reduces the influence of sample data nonlinearity and improves the clustering effect of the SOM algorithm.

Description

Data intelligent detection method and device based on improved SOM algorithm

Technical Field

The invention relates to the technical field of big data intelligent detection, in particular to a data intelligent detection method and device based on an improved SOM algorithm.

Background

Data acquisition devices such as sensors are installed on equipment so that operating data can be collected and the operating state of the equipment can be monitored. Because the systems are complex and the operating environment is harsh, the collected data are strongly nonlinear, noisy, and highly unstable. Abnormal data detection is therefore one of the important steps in data preprocessing.

Abnormal data are also called outliers. They may be erroneous data caused by equipment failure, measurement error, or the like, or they may correspond to a meaningful real-world event. Erroneous data adversely affect system operation: if they are not found and removed in time, equipment may be damaged and hidden dangers are introduced into system operation.

Interruptions, missing values, and acquisition deviations are common in the data collected by existing automated service systems. The basic data detection workload is heavy, yet current systems have insufficient self-detection capability and rely mainly on manual checking. Manual checking consumes a great deal of labor and time, and its accuracy cannot be guaranteed.

Disclosure of Invention

The present invention aims to solve at least one of the technical problems in the related art to some extent.

Therefore, an object of the present invention is to provide a data intelligent detection method based on an improved SOM algorithm, which removes abnormal data by density-based one-dimensional isolated point detection to improve data quality, and introduces a kernel function into the weight update of the self-organizing map algorithm to reduce the influence of sample data nonlinearity and improve the clustering effect of the SOM algorithm.

Another object of the present invention is to provide a data intelligent detection device based on an improved SOM algorithm.

In order to achieve the above objective, an embodiment of an aspect of the present invention provides a data intelligent detection method based on an improved SOM algorithm, including:

obtaining a sample set, decomposing the sample set by dimension, performing density-based one-dimensional isolated point detection dimension by dimension so as to preliminarily screen the multi-dimensional sample set for outliers, and removing the outliers;

clustering the sample set based on a self-organizing feature mapping algorithm, and removing abnormal data points;

improving the self-organizing feature mapping algorithm by a kernel function method, clustering the sample set with the improved self-organizing feature mapping algorithm, and removing abnormal data points;

and removing abnormal data points in the sample set according to expert experience to complete intelligent detection of the data.

In order to achieve the above object, another embodiment of the present invention provides an intelligent data detection device based on an improved SOM algorithm, including:

the first rejecting module is used for acquiring a sample set, decomposing the sample set by dimension, performing density-based one-dimensional isolated point detection dimension by dimension so as to preliminarily screen the multi-dimensional sample set for outliers, and rejecting the outliers;

the second rejecting module is used for rejecting abnormal data points by clustering the sample set based on a self-organizing feature mapping algorithm;

the third rejecting module is used for improving the self-organizing feature mapping algorithm through a kernel function method, clustering a sample set through the improved self-organizing feature mapping algorithm, and rejecting abnormal data points;

and the intelligent detection module is used for eliminating abnormal data points in the sample set according to expert experience to finish intelligent detection of the data.

The technical scheme of the invention has the following technical effects:

(1) The intelligent data detection method based on the density and improved self-organizing feature mapping algorithm is established, abnormal data can be removed, and the data quality is improved.

(2) The kernel function is introduced into the weight updating of the self-organizing map algorithm, so that the influence of sample data nonlinearity can be reduced, and the clustering effect of the SOM algorithm is improved.

Additional aspects and advantages of the invention will be set forth in part in the description which follows and, in part, will be obvious from the description, or may be learned by practice of the invention.

Drawings

The foregoing and/or additional aspects and advantages of the invention will become apparent and readily appreciated from the following description of the embodiments, taken in conjunction with the accompanying drawings, in which:

FIG. 1 is a flow chart of a method for intelligent detection of data based on an improved SOM algorithm according to one embodiment of the present invention;

FIG. 2 is a flow chart of a method of improved SOM algorithm data detection in accordance with one embodiment of the present invention;

FIG. 3 is a schematic structural diagram of a data intelligent detection device based on an improved SOM algorithm according to an embodiment of the present invention.

Detailed Description

Embodiments of the present invention are described in detail below, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to like or similar elements or elements having like or similar functions throughout. The embodiments described below by referring to the drawings are illustrative and intended to explain the present invention and should not be construed as limiting the invention.

The following describes a data intelligent detection method and device based on an improved SOM algorithm according to an embodiment of the present invention with reference to the accompanying drawings.

First, a data intelligent detection method based on an improved SOM algorithm according to an embodiment of the present invention will be described with reference to the accompanying drawings.

Fig. 1 is a flowchart of a method for intelligently detecting data based on an improved SOM algorithm according to an embodiment of the present invention.

As shown in fig. 1, the data intelligent detection method based on the improved SOM algorithm comprises the following steps:

Step S1: a sample set is obtained and decomposed by dimension, density-based one-dimensional isolated point detection is carried out dimension by dimension so that the multi-dimensional sample set is preliminarily screened for outliers, and the outliers are removed.

Further, the outliers are removed by eliminating sample points whose Euclidean distance to neighboring points exceeds the preset neighborhood radius.

In particular, the DBSCAN algorithm is one of the most widely used density-based clustering algorithms. The basic idea of the algorithm is: for each object in the cluster, the number of objects contained in a given epsilon neighborhood must not be less than a given value (MinPts), that is, the density of its neighborhood must not be less than a certain threshold. The algorithm utilizes the high-density connectivity of the classes to divide the areas with high enough density into the classes, and can find clusters with arbitrary shapes in the noisy spatial database.

Referring to the DBSCAN algorithm, the embodiment of the invention primarily screens outliers through a density-based one-dimensional isolated point detection algorithm. The algorithm comprises the following steps:

(1) Input the sample set $x$, which has dimension $M$ and sample size $n$. Set the two parameters of the algorithm: the $\varepsilon$-neighborhood radius $\varepsilon$ and the threshold MinPts;

(2) Let the variable $I = 1$, where $I$ denotes the current dimension;

(3) Take the $I$-th dimension of $x$, denoted $x_I = [x_{I1}, x_{I2}, \ldots, x_{In}]$;

(4) Sort $x_{I1}, x_{I2}, \ldots, x_{In}$ in ascending order to obtain a new sequence $y_I = [y_{I1}, y_{I2}, \ldots, y_{In}]$;

(5) Let $k = 1$ and mark all data as "undetected";

(6) Calculate the distance $D_i = \|y_{Ik} - y_{Ii}\|$ between $y_{Ik}$ and $y_{Ii}$, $i = 1, 2, \ldots, n$, and count the number $N$ of samples satisfying $D_i \le \varepsilon$, i.e. the number of samples falling in the $\varepsilon$-neighborhood of $y_{Ik}$;

i) If $N = 1$, i.e. the $\varepsilon$-neighborhood of $y_{Ik}$ contains no sample point other than $y_{Ik}$ itself, mark $y_{Ik}$ as "detected" and mark its value in the corresponding sequence as "outlier";

ii) If $1 < N < \text{MinPts} + 1$, the number of samples in the $\varepsilon$-neighborhood of $y_{Ik}$ is below the threshold and the requirement for forming a cluster is not met; mark $y_{Ik}$ as "detected" and mark its value in the corresponding sequence as "outlier". It should be noted that in this case a boundary point may be misjudged, but the misjudgment can be corrected when a subsequent data point is processed;

iii) If $N \ge \text{MinPts} + 1$, the objects in the $\varepsilon$-neighborhood of $y_{Ik}$ satisfy the threshold condition, and $y_{Ik}$ and the samples in its $\varepsilon$-neighborhood belong to the same cluster, so neither $y_{Ik}$ nor the sample points in its $\varepsilon$-neighborhood are outliers; mark $y_{Ik}$ and the sample points in its $\varepsilon$-neighborhood as "detected" and mark their values in the corresponding sequence as "normal points";

(7) Let $k$ be the index of the smallest value still marked "undetected", and repeat step (6) until all values are marked "detected";

(8) Let $I = I + 1$ and repeat steps (3) to (7) until $I > M$.
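The steps above can be sketched in Python as follows. This is a minimal illustration rather than the patented implementation: the function names `detect_1d_outliers` and `screen_samples` are hypothetical, and dropping a sample as soon as any one of its dimensions is flagged is an assumed policy that the text does not spell out.

```python
from bisect import bisect_left, bisect_right


def detect_1d_outliers(values, eps, min_pts):
    """Flag outliers in one dimension; returns booleans in the input order."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    y = [values[i] for i in order]          # step (4): ascending sequence
    detected = [False] * len(y)             # step (5): everything "undetected"
    outlier = [False] * len(y)
    for k in range(len(y)):                 # step (7): next smallest undetected value
        if detected[k]:
            continue
        # step (6): count samples with |y[k] - y[i]| <= eps (y[k] itself included)
        lo = bisect_left(y, y[k] - eps)
        hi = bisect_right(y, y[k] + eps)
        if hi - lo >= min_pts + 1:
            # case iii: dense neighborhood, so every point in it is normal;
            # this also clears an earlier flag on a boundary point
            for i in range(lo, hi):
                detected[i] = True
                outlier[i] = False
        else:
            # cases i and ii: too few neighbors to form a cluster
            detected[k] = True
            outlier[k] = True
    flags = [False] * len(values)
    for pos, original_index in enumerate(order):
        flags[original_index] = outlier[pos]
    return flags


def screen_samples(samples, eps, min_pts):
    """Drop every sample flagged in at least one dimension (assumed policy)."""
    flagged = [False] * len(samples)
    for dim in range(len(samples[0])):      # steps (2), (3), (8): loop over dimensions
        column = [s[dim] for s in samples]
        for i, is_outlier in enumerate(detect_1d_outliers(column, eps, min_pts)):
            flagged[i] = flagged[i] or is_outlier
    return [s for s, bad in zip(samples, flagged) if not bad]
```

Because the sequence is sorted, the neighborhood count reduces to two binary searches, and visiting indices in ascending order while skipping detected ones is equivalent to always picking the smallest undetected value.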

In summary, step S1 uses a density-based one-dimensional isolated point detection method to preliminarily screen the multidimensional data for outliers dimension by dimension, eliminating obvious outliers and reducing the influence of data noise on the clustering effect of the second-stage self-organizing feature mapping (SOM) algorithm.

Step S2: the sample set is clustered based on a self-organizing feature mapping algorithm, and abnormal data points are removed.

The self-organizing map (SOM) learning algorithm is an unsupervised competitive learning algorithm. An SOM network consists of an input layer and an output layer. The input layer consists of N neurons that receive an external N-dimensional input vector. The output layer (competition layer) consists of M neurons, typically arranged in a one-dimensional or two-dimensional grid, onto which the nodes of the input layer are mapped. Every input-layer node is connected to every competition-layer node by a weight $w_{ij}$ ($i = 1, 2, \ldots, N$; $j = 1, 2, \ldots, M$), and the connection weights are updated dynamically during network training.

For each input vector, competition is generated between neurons by comparison between the input vector value and the weight value, and the neuron with the weight vector closest to the input pattern is considered to be most responsive to the input pattern, and is designated as the winning neuron. The winning neuron not only strengthens itself, but also brings surrounding adjacent neurons to be strengthened, while suppressing surrounding farther neurons.

For $L$ $N$-dimensional input vectors $x_k = (x_{1k}, x_{2k}, \ldots, x_{Nk})^T$, $k = 1, 2, \ldots, L$, the specific steps of the algorithm are as follows:

(1) Determine the SOM network topology: the number of input-layer neurons is N and the number of output-layer neurons is M.

(2) Set $t = 0$ and initialize the weight vectors $w_j(0)$ ($j = 1, 2, \ldots, M$) with random values. The only restriction is that the $w_j(0)$ differ from one another. It is generally desirable to keep the weights small. An alternative initialization is to select the weight vectors at random from the available set of input vectors.

(3) Present an input vector $x_k(t) = (x_{1k}, x_{2k}, x_{3k}, \ldots, x_{Nk})^T$, $1 \le k \le L$, to the network. To eliminate the influence of differing units of measurement, the input data should first be normalized.

(4) Calculate the distance between the current input vector and each competition-layer neuron, and select the neuron with the smallest distance as the winning neuron.

(5) The weight vector of the winning neuron and the neurons in the neighborhood range is adjusted as follows:

Here $\eta(t)$ is the learning rate parameter, with $0 < \eta(t) < 1$, and decreases with time; $N_q(t)$ is the neighborhood radius of the winning neuron $q$ and also decreases with time. The direct result of the update in equation (1) is that the weight vector of the winning neuron $q$ moves toward the input vector, and the neighboring neurons $j$ within the neighborhood range are moved as well.

(6) Determine whether all input vectors have been presented to the network; if so, go to the next step; otherwise return to step (3).

(7) Update the learning rate and the neighborhood radius.

Here $\eta(0)$ is the initial learning rate and $N_q(0)$ is the initial neighborhood radius.

(8) Let $t = t + 1$ and determine whether the iteration count has reached the predetermined total number of iterations $T$; if so, the algorithm ends; otherwise return to step (3).

In summary, step S2 adopts a self-organizing feature mapping algorithm to cluster the data, and further screens out abnormal data points.
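A compact Python sketch of the training loop in steps (1) to (8) is given below. Several details are assumptions: the output layer is a one-dimensional chain of neurons, the learning rate and neighborhood radius decay linearly with the epoch index (the text only says that both decrease with time), the weight update is the standard SOM rule that moves a weight toward the input, and min-max scaling stands in for the unspecified normalization. The names `normalize`, `winner`, and `train_som` are hypothetical.

```python
import math
import random


def normalize(data):
    """Min-max scale each dimension to [0, 1] (one common choice of normalization)."""
    lows = [min(col) for col in zip(*data)]
    highs = [max(col) for col in zip(*data)]
    return [
        [(v - lo) / (hi - lo) if hi > lo else 0.0 for v, lo, hi in zip(x, lows, highs)]
        for x in data
    ]


def winner(weights, x):
    """Step (4): index of the neuron whose weight vector is closest to x."""
    return min(range(len(weights)), key=lambda j: math.dist(x, weights[j]))


def train_som(data, n_neurons, n_iters, eta0=0.5, radius0=1.0, init=None, seed=0):
    """Train a one-dimensional SOM and return the list of weight vectors."""
    rng = random.Random(seed)
    # step (2): start from randomly chosen input vectors unless weights are supplied
    start = init if init is not None else rng.sample(list(data), n_neurons)
    weights = [list(w) for w in start]
    for t in range(n_iters):
        decay = 1.0 - t / n_iters            # assumed linear schedule, step (7)
        eta, radius = eta0 * decay, radius0 * decay
        for x in data:                       # steps (3) and (6): present every input
            q = winner(weights, x)
            for j, w in enumerate(weights):  # step (5): winner and its grid neighbors
                if abs(j - q) <= radius:
                    for i in range(len(w)):
                        w[i] += eta * (x[i] - w[i])
    return weights
```

After training, `winner(weights, x)` assigns each sample to a cluster; which clusters count as abnormal is left to the later steps of the method.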

Step S3: the self-organizing feature mapping algorithm is improved by a kernel function method, the sample set is clustered with the improved self-organizing feature mapping algorithm, and abnormal data points are removed.

Further, improving the self-organizing feature mapping algorithm by a kernel function includes changing the neuron winning rule and the weight adjustment formula of the algorithm.

Specifically, the neuron winning rule is changed by changing how the distance between the current input vector and the output-layer neurons is calculated, and the weight adjustment formula is then derived using a kernel function.

The self-organizing feature mapping (SOM) algorithm converges quickly and can converge to a small error. However, as equation (1) shows, in the SOM algorithm the adjustment of the winning neuron $q$ and its neighborhood depends on the Euclidean distance $\|X - w_j\|$ between the input $X$ and the neuron weights $w_j$. Thus, when the boundaries between input samples are not linearly separable and the class distributions are non-Gaussian or non-elliptical, the SOM classifier performs poorly. The kernel method offers a way to solve this problem.

The kernel method can effectively solve the problem of nonlinearity of the input sample. The essence of kernel-based learning is to transform the non-linear problem in the low-dimensional input space into a more easily solved linear problem in the high-dimensional (even infinite-dimensional) feature space by means of kernel-induced implicit mapping, and to characterize it in the form of an inner product.
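The idea of an implicit mapping expressed as an inner product can be checked numerically. The sketch below uses the degree-2 polynomial kernel on two-dimensional inputs, for which an explicit feature map is known; the helper names `poly_kernel`, `phi`, and `dot` and the sample vectors are illustrative assumptions.

```python
import math


def poly_kernel(x, y, d=2):
    """Polynomial kernel K(x, y) = (x^T y)^d."""
    return sum(a * b for a, b in zip(x, y)) ** d


def phi(x):
    """Explicit feature map of the d = 2 kernel for two-dimensional inputs."""
    x1, x2 = x
    return (x1 * x1, math.sqrt(2.0) * x1 * x2, x2 * x2)


def dot(u, v):
    """Ordinary inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


x, w = (1.0, 2.0), (3.0, -1.0)
# The kernel value agrees with the inner product in the feature space
# up to floating-point rounding, without phi being evaluated by the kernel.
print(poly_kernel(x, w), dot(phi(x), phi(w)))
```

The kernel is evaluated entirely in the two-dimensional input space, while `phi` lives in three dimensions; for higher degrees or other kernels the feature space grows or becomes infinite-dimensional, which is why only the kernel form is practical.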

The kernel method is introduced into both the distance metric used to determine the winning neuron and the weight update formula. Because kernels are flexible and diverse, SOM algorithms based on different distance metrics and weight update formulas can be derived.

Define a nonlinear mapping $\Phi: X \to \Phi(X) \in F$, where $X \in R$, $R$ is the sample set, and $F$ is the feature space. Minimizing the Euclidean distance can then be replaced by finding the minimum of the formalized objective function

$$J(w_j) = \|\Phi(X) - \Phi(w_j)\|^2 \tag{4}$$

where the norm in equation (4) can be written as:

$$\|\Phi(X) - \Phi(w_j)\|^2 = \Phi(X)^T \Phi(X) + \Phi(w_j)^T \Phi(w_j) - 2\,\Phi(X)^T \Phi(w_j) \tag{5}$$

Each of these terms can be seen as an inner product in the feature space. By the definition of a kernel that satisfies the Mercer condition:

$$K(x_i, x_j) = \Phi(x_i)^T \Phi(x_j) \tag{6}$$

Substituting equation (6) into equation (5) gives:

$$J(w_j) = \|\Phi(X) - \Phi(w_j)\|^2 = K(X, X) + K(w_j, w_j) - 2K(X, w_j) \tag{7}$$

The minimum of the function $J(w_j)$ can be found by the gradient descent method, which yields a new adjustment formula for $w_j$:
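As a worked step, differentiating equation (7) with respect to $w_j$ gives the generic gradient-descent update sketched below. The term $K(X, X)$ does not depend on $w_j$ and drops out. Reusing the SOM learning rate $\eta(t)$ as the step size is an assumption made here for illustration; the last line specializes the derivative to the polynomial kernel $K(x, y) = (x^T y)^d$.

```latex
\begin{aligned}
\frac{\partial J}{\partial w_j}
  &= \frac{\partial K(w_j, w_j)}{\partial w_j}
     - 2\,\frac{\partial K(X, w_j)}{\partial w_j}, \\
w_j(t+1)
  &= w_j(t) - \eta(t)\,
     \left.\frac{\partial J}{\partial w_j}\right|_{w_j = w_j(t)}, \\
K(x, y) = (x^T y)^d:\quad
\frac{\partial J}{\partial w_j}
  &= 2d\left[(w_j^T w_j)^{d-1}\, w_j - (X^T w_j)^{d-1}\, X\right].
\end{aligned}
```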

According to the flexibility of kernel mapping, different kernel functions can induce different distance metrics, and different kernel functions determine different neuron winning rules and weight adjustment formulas. The following are 4 classical kernel functions that meet the Mercer condition:

Polynomial: $K(x, y) = (x^T y)^d,\ d \ge 2$ (9)

Radial basis

Cauchy

Logarithmic: $K(x, y) = \log\left(1 + \|x - y\|^2 / \sigma^2\right)$ (12)

By substituting equations (9) to (12) into equation (8), a KSOM weight adjustment equation based on the above four kernel functions can be obtained:

$$w_j(t+1) = w_j(t) - 2d\,\eta(t)\left[\left(w_j(t)^T w_j(t)\right)^{d-1} w_j(t) - \left(x^T w_j(t)\right)^{d-1} x\right] \tag{13}$$

Under the new distance metric, the winning neuron q is redefined:

The improved SOM algorithm differs from the original only in the neuron winning rule and the weight adjustment formula; the rest of the algorithm is unchanged. The specific flow is shown in FIG. 2.
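The two changed ingredients can be sketched for the polynomial kernel as follows. The kernel-induced distance is equation (7), and the update is the gradient step obtained by differentiating equation (7) for $K(x, y) = (x^T y)^d$. The function names are hypothetical, and choosing the winner as the neuron with the smallest kernel-induced distance is an assumption consistent with the description, since the redefined winning rule is not reproduced in the text.

```python
def poly_kernel(x, y, d=2):
    """Polynomial kernel K(x, y) = (x^T y)^d, equation (9)."""
    return sum(a * b for a, b in zip(x, y)) ** d


def kernel_distance_sq(x, w, kernel=poly_kernel):
    """Equation (7): squared feature-space distance written with kernels only."""
    return kernel(x, x) + kernel(w, w) - 2.0 * kernel(x, w)


def kernel_winner(weights, x, kernel=poly_kernel):
    """Assumed winning rule: neuron with the smallest kernel-induced distance."""
    return min(range(len(weights)),
               key=lambda j: kernel_distance_sq(x, weights[j], kernel))


def poly_update(w, x, eta, d=2):
    """One gradient-descent step on J(w) for the polynomial kernel."""
    ww = sum(a * a for a in w)                 # w^T w
    xw = sum(a * b for a, b in zip(x, w))      # x^T w
    return [wi - eta * 2 * d * (ww ** (d - 1) * wi - xw ** (d - 1) * xi)
            for wi, xi in zip(w, x)]


x = (1.0, 2.0)
weights = [[3.0, -1.0], [1.0, 1.9]]
q = kernel_winner(weights, x)                  # neuron 1 is closer under the kernel metric
stepped = poly_update(weights[0], x, eta=0.001)
```

The gradient of the polynomial objective grows quickly with the size of the weights, so a small learning rate is used in this sketch. In a full training loop these two functions would replace the Euclidean winner search and the linear weight update, with the neighborhood handling left as in the original algorithm.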

In summary, introducing a kernel function improves the weight calculation of the original self-organizing feature mapping algorithm and can improve the data clustering effect. The kernel-improved SOM algorithm can also converge rapidly and can detect and identify abrupt values in the data.

Step S4: abnormal data points in the sample set are removed according to expert experience, completing intelligent detection of the data.

The abnormal classes are identified according to expert experience; the members of an abnormal class are abnormal data, which are removed from the data to complete intelligent detection of the data.

According to the data intelligent detection method based on the improved SOM algorithm, the sample set is decomposed by dimension, density-based isolated point detection is carried out dimension by dimension, and sample points whose Euclidean distance to neighboring points exceeds the set neighborhood radius are removed; abnormal data are detected by cluster analysis with an SOM algorithm improved by a kernel function method, in which the kernel method is introduced into the neural network weight adjustment formula to reduce the influence of sample data nonlinearity; and the abnormal classes are identified according to expert experience, the members of an abnormal class being abnormal data. Thus, the density-based detection and the improved self-organizing feature mapping algorithm together remove abnormal data and improve data quality, and introducing a kernel function into the weight update of the self-organizing feature mapping algorithm reduces the influence of sample data nonlinearity and improves the clustering effect of the SOM algorithm.

Next, a data intelligent detection device based on an improved SOM algorithm according to an embodiment of the present invention will be described with reference to the accompanying drawings.

As shown in fig. 3, the data intelligent detection device based on the improved SOM algorithm includes: the system comprises a first rejection module 100, a second rejection module 200, a third rejection module 300 and an intelligent detection module 400.

The first rejecting module 100 is configured to obtain a sample set, decompose the sample set by dimension, perform density-based one-dimensional isolated point detection dimension by dimension so as to preliminarily screen the multi-dimensional sample set for outliers, and reject the outliers.

The second rejecting module 200 is configured to cluster the sample set based on a self-organizing feature mapping algorithm and reject abnormal data points.

The third rejecting module 300 is configured to improve the self-organizing feature mapping algorithm by a kernel function method, cluster the sample set with the improved self-organizing feature mapping algorithm, and reject abnormal data points.

The intelligent detection module 400 is configured to reject abnormal data points in the sample set according to expert experience, and complete intelligent detection of data.
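The four modules act as successive filters on the sample set, which can be sketched as a simple chain of stages. Everything in this sketch is illustrative: `run_detection` and `drop_above` are hypothetical names, and the threshold filters are stand-ins for the real density screening, SOM clustering, kernel SOM clustering, and expert review.

```python
def run_detection(samples, stages):
    """Apply the rejection stages in order; each stage returns the samples it keeps."""
    history = [len(samples)]
    for stage in stages:
        samples = stage(samples)
        history.append(len(samples))
    return samples, history


def drop_above(limit):
    """Hypothetical stand-in stage: keep samples whose largest value is <= limit."""
    return lambda samples: [s for s in samples if max(s) <= limit]


# Stand-ins for modules 100, 200, 300 and 400, applied in that order.
stages = [drop_above(100.0), drop_above(50.0), drop_above(20.0), drop_above(10.0)]
raw = [(1.0, 2.0), (15.0, 3.0), (30.0, 1.0), (70.0, 2.0), (500.0, 1.0)]
kept, history = run_detection(raw, stages)
```

Recording the sample count after each stage makes it easy to see how much each module removes, which can help an expert reviewing the final stage.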

Further, in one embodiment of the present invention, rejecting the outliers includes rejecting sample points whose Euclidean distance to neighboring points exceeds the preset neighborhood radius.

Further, in one embodiment of the present invention, improving the self-organizing feature mapping algorithm by a kernel function method includes changing the neuron winning rule and the weight adjustment formula of the self-organizing feature mapping algorithm.

Further, in one embodiment of the invention, different kernel functions determine different neuron winning rules and weight adjustment formulas.

It should be noted that the foregoing explanation of the embodiment of the data intelligent detection method based on the improved SOM algorithm is also applicable to the apparatus of this embodiment, and will not be repeated here.

According to the data intelligent detection device based on the improved SOM algorithm, the sample set is decomposed by dimension, density-based isolated point detection is carried out dimension by dimension, and sample points whose Euclidean distance to neighboring points exceeds the set neighborhood radius are removed; abnormal data are detected by cluster analysis with an SOM algorithm improved by a kernel function method, in which the kernel method is introduced into the neural network weight adjustment formula to reduce the influence of sample data nonlinearity; and the abnormal classes are identified according to expert experience, the members of an abnormal class being abnormal data. Thus, the density-based detection and the improved self-organizing feature mapping algorithm together remove abnormal data and improve data quality, and introducing a kernel function into the weight update of the self-organizing feature mapping algorithm reduces the influence of sample data nonlinearity and improves the clustering effect of the SOM algorithm.

Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present invention, the meaning of "plurality" means at least two, for example, two, three, etc., unless specifically defined otherwise.

In the description of the present specification, a description referring to terms "one embodiment," "some embodiments," "examples," "specific examples," or "some examples," etc., means that a particular feature, structure, material, or characteristic described in connection with the embodiment or example is included in at least one embodiment or example of the present invention. In this specification, schematic representations of the above terms are not necessarily directed to the same embodiment or example. Furthermore, the particular features, structures, materials, or characteristics described may be combined in any suitable manner in any one or more embodiments or examples. Furthermore, the different embodiments or examples described in this specification and the features of the different embodiments or examples may be combined and combined by those skilled in the art without contradiction.

While embodiments of the present invention have been shown and described above, it will be understood that the above embodiments are illustrative and not to be construed as limiting the invention, and that variations, modifications, alternatives and variations may be made to the above embodiments by one of ordinary skill in the art within the scope of the invention.

Claims

1. The intelligent data detection method based on the improved self-organizing feature mapping algorithm is characterized by comprising the following steps of:

acquiring a nonlinear sample set through a sensor arranged on equipment, decomposing the sample set according to dimensions, detecting one-dimensional isolated points based on density by dimension, primarily screening outliers according to dimensions by the multidimensional sample set, and eliminating the outliers, wherein the sample set comprises equipment operation data, and eliminating the outliers comprises eliminating sample points with Euclidean distance between two points larger than a preset neighborhood;

clustering the sample set after outlier removal by a self-organizing feature mapping algorithm, and removing a first abnormal data point;

changing a winning rule of neurons in the self-organizing feature mapping algorithm by changing a calculation method of distances between a current input vector and neurons of an output layer based on the self-organizing feature mapping algorithm, improving a weight adjustment formula in the self-organizing feature mapping algorithm by a kernel function method, clustering a sample set with first abnormal data points removed by the improved self-organizing feature mapping algorithm, and removing second abnormal data points, wherein the distances between the current input vector and neurons of the output layer based on the self-organizing feature mapping algorithm are calculated in the self-organizing feature mapping algorithm, the neurons with the smallest distances are winning neurons, and the expression of the winning neurons is as follows:

wherein $x_k(t)$ is the input vector, $k = 1, 2, \ldots, L$, $L$ is the number of input vectors, $w_j(t)$ is the weight matrix, $j = 1, 2, \ldots, M$, and $M$ is the number of neurons in the output layer;

the weight adjustment formula is:

wherein $\eta(t)$ is a learning rate parameter with range $0 < \eta(t) < 1$, and $N_q(t)$ is the neighborhood radius of the winning neuron $q$;

and removing the abnormal data points in the sample set from which the second abnormal data points are removed according to expert experience so as to finish intelligent detection of the data.

2. The method for intelligent detection of data based on improved self-organizing feature mapping algorithms of claim 1, wherein different kernel functions determine different neuron winning rules and weight adjustment formulas.

3. An intelligent data detection device based on an improved self-organizing feature mapping algorithm is characterized by comprising:

the first rejecting module is used for acquiring a nonlinear sample set through a sensor arranged on equipment, decomposing the sample set according to dimensions, detecting one-dimensional isolated points based on density by one dimension, primarily screening outliers according to dimensions by the multidimensional sample set, and rejecting the outliers, wherein the sample set comprises equipment operation data, and rejecting the outliers comprises rejecting sample points with Euclidean distance between two points larger than a preset neighborhood;

the second eliminating module is used for eliminating the first abnormal data points by clustering the sample set after the outlier is eliminated based on the self-organizing feature mapping algorithm;

a third rejecting module, configured to change a winning rule of neurons in the self-organizing feature mapping algorithm by changing a calculation method of a distance between a current input vector and neurons of an output layer of the self-organizing feature mapping algorithm, improve a weight adjustment formula in the self-organizing feature mapping algorithm by a kernel function method, cluster a sample set from which a first abnormal data point is rejected by the improved self-organizing feature mapping algorithm, and reject a second abnormal data point, wherein in the self-organizing feature mapping algorithm, the distance between the current input vector and the neurons of the output layer based on the self-organizing feature mapping algorithm is calculated, the neuron with the smallest distance is a winning neuron, and the expression of the winning neuron is as follows:

the weight adjustment formula is:

and the intelligent detection module is used for removing the abnormal data points in the sample set from which the second abnormal data points are removed according to expert experience so as to finish intelligent detection of the data.

4. The intelligent data detection device based on improved self-organizing feature mapping algorithm as recited in claim 3, wherein different kernel functions determine different neuron winning rules and weight adjustment formulas.