CN109492767A

CN109492767A - An autoencoder-based anomaly detection method for unsupervised settings

Info

Publication number: CN109492767A
Application number: CN201811330477.0A
Authority: CN
Inventors: 李锐; 于治楼; 尹青山; 安程治; 段强
Original assignee: Jinan Inspur Hi Tech Investment and Development Co Ltd
Current assignee: Jinan Inspur Hi Tech Investment and Development Co Ltd
Priority date: 2018-11-09
Filing date: 2018-11-09
Publication date: 2019-03-19

Abstract

The present invention provides an autoencoder-based anomaly detection method for unsupervised settings and belongs to the technical field of anomaly detection. The invention uses the neural network in an autoencoder to perform unsupervised training on existing data. The resulting model can be used to compress newly input data, and the compressed data are compared with the previously compressed training data (normal data). If the error after compression exceeds a threshold, the data are judged to be abnormal. Compressed, encoded data better reflect the essential characteristics of the data and capture its feature patterns, so the detection is more accurate.

Description

An autoencoder-based anomaly detection method for unsupervised settings

Technical field

The present invention relates to anomaly detection technology, and in particular to an autoencoder-based anomaly detection method for unsupervised settings.

Background art

When processing large volumes of high-dimensional data, on the one hand, the time cost is very high because the data volume is large and the variables are numerous. On the other hand, because there are so many variables, the features of certain key variables may be masked by the features of a large number of other variables, so that some key variable features cannot play their due role in anomaly handling.

Anomaly detection is a widely used class of algorithm. It is mainly used to determine whether a given data item is abnormal. Many anomaly detection algorithms exist.

Anomaly detection is a research direction with very broad application prospects, with good application scenarios in fault detection in engineering, fraud detection in finance, and intrusion detection in security. Anomaly detection identifies data or behavior that does not conform to expectations. In the Internet era, however, information is complex and varied, and a single data item may have hundreds of variables, so the difficulty of anomaly detection grows geometrically. The time cost is very high, which is a significant obstacle to handling abnormal conditions promptly and may cause very large losses.

An autoencoder is an unsupervised deep learning method that is also often used to compress data. Unlike classical principal component analysis (PCA), an autoencoder is a nonlinear compression method and can extract nonlinear information from the data. In most uses of autoencoders, the compression and decompression functions are implemented by neural networks.

Setting the error threshold is the key to anomaly handling. If the threshold is set too low, many normal data items may be mistaken for abnormal data; conversely, if the threshold is set too high, some abnormal data may be mistaken for normal data.
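The trade-off can be illustrated with a small sketch. The scores below are invented toy values, not data from the invention, and the helper `count_errors` is a hypothetical name introduced only for this illustration.

```python
def count_errors(normal_scores, abnormal_scores, threshold):
    """Return (false_alarms, misses) for a rule that flags score > threshold."""
    false_alarms = sum(1 for s in normal_scores if s > threshold)
    misses = sum(1 for s in abnormal_scores if s <= threshold)
    return false_alarms, misses

# Toy error scores (assumed values for illustration only).
normal_scores = [0.10, 0.12, 0.15, 0.20, 0.22, 0.30]
abnormal_scores = [0.28, 0.45, 0.60]

# A low, a moderate, and a high threshold.
for threshold in (0.14, 0.25, 0.50):
    false_alarms, misses = count_errors(normal_scores, abnormal_scores, threshold)
    print(f"threshold={threshold:.2f} false_alarms={false_alarms} misses={misses}")
```

With these toy values, the lowest threshold flags most normal items as abnormal, while the highest threshold lets two of the three abnormal items pass as normal.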

Summary of the invention

Based on the above, the present invention proposes an autoencoder-based anomaly detection method for unsupervised settings, which is better suited to anomaly detection in unsupervised environments where the data have many variables and no labels.

In the present invention, the algorithm parameters of the autoencoder may be set to default values or adjusted based on experience. The autoencoder also has many derivative algorithms, and the method described here can likewise use such algorithms.

Using the autoencoder, the data are encoded and then decoded, and the result is compared with the original data. When the error reaches the threshold, the data differ substantially from the bulk of the data on which the autoencoder was built and can be judged to be abnormal.
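In symbols, this criterion might be written as follows. The notation is an assumption introduced here and does not appear in the original text: $f$ is the encoder, $g$ the decoder, $\tau$ the threshold, and the Euclidean norm is taken as the error measure.

```latex
e(x) = \lVert x - g(f(x)) \rVert_2 ,
\qquad
x \text{ is judged abnormal if } e(x) \ge \tau .
```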

Further,

First, using the neural network in the autoencoder, the original normal data are used to train the autoencoder without supervision.

The resulting model can be used to compress newly input data, and the compressed data are compared with the compressed training data (normal data).

If the error after compression exceeds the threshold, the data are judged to be abnormal. Compressed, encoded data better reflect the essential characteristics of the data and capture its feature patterns, so the detection is more accurate.

Further,

The operating procedure is:

1) first, train the autoencoder model on a portion of historical normal data;

2) use the trained model to perform anomaly detection on the data to be tested, and output the result.

Further,

The operating procedure is mainly divided into two aspects: 1) error threshold setting, and 2) detection basis.

The error threshold setting means that, after the model has been trained, the encoding operation is performed on each item of the training samples to obtain the corresponding encoded data; the average encoded data are calculated from these encoded data; the Euclidean distance between each encoded training sample and this average is then calculated, giving a set of distance values equal in number to the training samples; the mean and standard deviation of these distances are then calculated, and the threshold is finally obtained. The threshold is the mean plus 3 or 6 standard deviations.

The detection basis means that, once the threshold has been obtained, the next step is to judge whether new data are abnormal: the model performs the encoding operation on each newly arriving sample, the Euclidean distance between the resulting encoded data and the previously obtained average data is calculated, and this distance is compared with the threshold to obtain the result.
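The two aspects can be summarized in formulas. The symbols are assumptions introduced for illustration: $f$ is the trained encoder, $x_1, \dots, x_N$ are the training samples, and $k$ is 3 or 6. The text does not say whether the population or the sample standard deviation is intended; the population form is shown.

```latex
z_i = f(x_i), \qquad
\bar{z} = \frac{1}{N} \sum_{i=1}^{N} z_i, \qquad
d_i = \lVert z_i - \bar{z} \rVert_2

\mu_d = \frac{1}{N} \sum_{i=1}^{N} d_i, \qquad
\sigma_d = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (d_i - \mu_d)^2}, \qquad
\tau = \mu_d + k\,\sigma_d, \quad k \in \{3, 6\}

x_{\text{new}} \text{ is judged abnormal if } \lVert f(x_{\text{new}}) - \bar{z} \rVert_2 > \tau
```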

Beneficial effects of the invention

The invention improves the anomaly detection models used in current industrial applications. By using deep learning, it is better suited to big-data settings, allowing anomaly detection to be applied in big-data scenarios.

The algorithm can be implemented in mainstream programming languages. Anomaly detection is one of the most important applications in Industry 4.0 and the industrial Internet, and provides important technical support for the company's industrial Internet applications.

Description of the drawings

Fig. 1 is a schematic diagram of the workflow of the present invention.

Specific embodiment

The present invention is described in more detail below:

The application scenario of the present invention is unsupervised, so the threshold parameter needs to be adjusted gradually according to the actual situation.

An autoencoder is a data compression algorithm in which the compression and decompression functions are data-specific, lossy, and learned automatically from samples. In most contexts where autoencoders are mentioned, the compression and decompression functions are implemented by neural networks.

1) An autoencoder is data-specific (data-dependent), meaning that it can only compress data similar to its training data. For example, an autoencoder trained on faces performs poorly when compressing other images, such as trees, because the features it has learned are specific to faces.

2) An autoencoder is lossy, meaning that the decompressed output is degraded compared with the original input, as with compression algorithms such as MP3 and JPEG. This differs from lossless compression algorithms.

3) An autoencoder is learned automatically from data samples, meaning that it is easy to train a specialized encoder for a given class of input without any new engineering work.

Anomaly detection is performed using the autoencoder, an unsupervised deep learning method. In this method, the software layer can be solidified in hardware. Applying the method to edge computing devices or embedded models constitutes an innovative application.

The operating steps are:

1: First, train the autoencoder model on a portion of historical normal data. These data do not need labels.

2: Use the trained model to perform anomaly detection on the data to be tested, and output the result.
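The two steps can be sketched with a deliberately simplified stand-in for the model. Everything here is an assumption made for illustration: the model is a linear autoencoder with tied weights and a one-dimensional code rather than the neural network of the invention, the training data are synthetic, the score shown is the encode-then-decode reconstruction error, and the names `encode`, `decode`, `train`, and `reconstruction_error` are hypothetical.

```python
import random

def encode(w, x):
    """Compress a 2-D point to a single code value (dot product with w)."""
    return w[0] * x[0] + w[1] * x[1]

def decode(w, z):
    """Expand a code value back to a 2-D point."""
    return (z * w[0], z * w[1])

def reconstruction_error(w, x):
    """Euclidean distance between x and its encode-then-decode result."""
    x_hat = decode(w, encode(w, x))
    return ((x[0] - x_hat[0]) ** 2 + (x[1] - x_hat[1]) ** 2) ** 0.5

def train(samples, epochs=500, lr=0.1):
    """Fit w by batch gradient descent on the mean squared reconstruction error."""
    w = [0.5, 0.1]  # arbitrary starting weights
    n = len(samples)
    for _ in range(epochs):
        grad = [0.0, 0.0]
        for x in samples:
            z = encode(w, x)
            r = (x[0] - z * w[0], x[1] - z * w[1])  # residual x - x_hat
            rw = r[0] * w[0] + r[1] * w[1]
            # Gradient of ||x - (w.x) w||^2 with respect to w is -2 * ((r.w) x + z r).
            grad[0] += -2.0 * (rw * x[0] + z * r[0]) / n
            grad[1] += -2.0 * (rw * x[1] + z * r[1]) / n
        w[0] -= lr * grad[0]
        w[1] -= lr * grad[1]
    return w

# Step 1: unlabeled "historical normal data" (synthetic: points near the line y = x).
rng = random.Random(0)
normal_data = []
for _ in range(200):
    t = rng.uniform(-1.0, 1.0)
    normal_data.append((t + rng.gauss(0.0, 0.05), t + rng.gauss(0.0, 0.05)))
w = train(normal_data)

# Step 2: score data to be tested; a larger error means less like the training data.
print(reconstruction_error(w, (0.5, 0.5)))   # lies on the training pattern
print(reconstruction_error(w, (0.5, -0.5)))  # lies off the training pattern
```

No labels are used anywhere in the training loop, which is what makes the procedure unsupervised. A practical implementation would replace the linear model with a neural network autoencoder.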

The specific basis for judgment is as follows:

After the model has been trained, the encoding (dimensionality reduction) operation is performed on each item of the training samples (the historical normal data used for training) to obtain the corresponding encoded data. The average encoded data (a column vector or row vector) are calculated from these encoded data. The Euclidean distance between each encoded training sample and this average is then calculated, giving a set of distance values equal in number to the training samples. The mean and standard deviation of these distances are then calculated. The threshold is finally obtained as the mean plus 3 (or 6) standard deviations. This threshold is used to judge whether future data are abnormal: data exceeding the threshold are abnormal.

Once the threshold has been obtained, the next step is to judge whether new data are abnormal. The model performs the encoding (dimensionality reduction) operation on each newly arriving sample, and the Euclidean distance between the resulting encoded data and the previously obtained average data is calculated. This distance is compared with the threshold to obtain the result.
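The threshold setting and the detection rule can be sketched as follows. This is a minimal illustration under stated assumptions: the encoded vectors are toy values standing in for encoder output, the function names `mean_code`, `fit_threshold`, and `is_abnormal` are hypothetical, and the population standard deviation is used because the text does not specify which form is intended.

```python
import math
import statistics

def mean_code(codes):
    """Component-wise average of the encoded training samples."""
    n = len(codes)
    return [sum(c[j] for c in codes) / n for j in range(len(codes[0]))]

def fit_threshold(codes, k=3.0):
    """Return (average code, threshold), with threshold = mean + k * std of the distances."""
    center = mean_code(codes)
    distances = [math.dist(c, center) for c in codes]
    # Assumption: population standard deviation (statistics.pstdev).
    threshold = statistics.fmean(distances) + k * statistics.pstdev(distances)
    return center, threshold

def is_abnormal(code, center, threshold):
    """A new sample is abnormal if its code lies farther from the average than the threshold."""
    return math.dist(code, center) > threshold

# Toy encoded training samples (assumed values; a real run would take them from the encoder).
train_codes = [[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0], [0.0, -1.0], [2.0, 0.0], [-2.0, 0.0]]
center, threshold = fit_threshold(train_codes, k=3.0)

print(is_abnormal([0.5, 0.5], center, threshold))
print(is_abnormal([10.0, 0.0], center, threshold))
```

Passing `k=6.0` gives the looser of the two thresholds mentioned in the text, which flags fewer samples as abnormal.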

The foregoing are merely preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent substitution, or improvement made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims

1. An autoencoder-based anomaly detection method for unsupervised settings, characterized in that

using an autoencoder, data are encoded and then decoded, and the result is compared with the original data; when the error reaches a threshold, the data differ substantially from the majority of the data on which the autoencoder was built and are judged to be abnormal.

2. The method according to claim 1, characterized in that

3. The method according to claim 2, characterized in that

the model obtained after training is used to compress newly input data to obtain new compressed data, and the compressed data are compared with the compressed training data; if the error after compression exceeds the threshold, the data are judged to be abnormal.

4. The method according to claim 3, characterized in that

the operating procedure is:

2) use the trained model to perform anomaly detection on the data to be tested, and output the result.

5. The method according to claim 4, characterized in that

6. The method according to claim 5, characterized in that

the error threshold setting means that, after the model has been trained, the encoding operation is performed on each item of the training samples to obtain the corresponding encoded data; the average encoded data are calculated from these encoded data; the Euclidean distance between each encoded training sample and this average is then calculated, giving a set of distance values equal in number to the training samples; the mean and standard deviation of these distances are then calculated, and the threshold is finally obtained.

7. The method according to claim 6, characterized in that

the threshold is the mean plus 3 or 6 standard deviations.

8. The method according to claim 7, characterized in that

the detection basis means that, once the threshold has been obtained, the next step is to judge whether new data are abnormal: the model performs the encoding operation on each newly arriving sample, the Euclidean distance between the resulting encoded data and the previously obtained average data is calculated, and this distance is compared with the threshold to obtain the result.