CN109508733A

CN109508733A - A kind of method for detecting abnormality based on distribution probability measuring similarity

Info

Publication number: CN109508733A
Application number: CN201811233705.2A
Authority: CN
Inventors: 高欣; 井潇; 何杨; 查森; 纪维佳; 任昺
Original assignee: Beijing University of Posts and Telecommunications
Current assignee: Beijing University of Posts and Telecommunications
Priority date: 2018-10-23
Filing date: 2018-10-23
Publication date: 2019-03-22

Abstract

The embodiment of the present invention proposes a kind of method for detecting abnormality based on distribution probability measuring similarity, it include: multiple subsets that multiple stochastical sampling obtains normal sample data, the random isolation processes that each subset is saved with full binary tree structure delimit the threshold depth of backtracking according to drift ratio；External leaf node position and the threshold depth that each tree is fallen according to test point, leaf node where it trace back to the ancestor node of threshold depth, extract training data of all data as measurement and test point similarity under the node；With certain point in test point and training dataset for endpoint, the probability that remaining data points occur between this two o'clock is calculated separately in each attribute dimensions and obtains the exceptional value of the point in conjunction with the dissimilar degree of all the points in Min Shi distance calculating test point and data set.Technical solution provided in an embodiment of the present invention can effectively solve training data and concentrate data without exception and local anomaly test problems.

Description

A kind of method for detecting abnormality based on distribution probability measuring similarity

[technical field]

The present invention relates to machine learning field method for detecting abnormality, more particularly to one kind to be based on distribution probability measuring similarity Method for detecting abnormality.

[background technique]

When solving the problems, such as abnormality detection using machine learning method, there are no abnormal datas to be trained, part is different Often, each dimension dimension different distributions of data differ larger problem.The unsupervised part of data is solved according to suitable sorting algorithm Abnormality detection problem is one of the hot spot studied now to improve model to normal and whole exceptional sample discrimination.It solves At present for abnormality detection problem, conventional method is broadly divided into three types.The first is the method by mathematical statistics, is passed through The probability size that each dimension in statistical number strong point or each dimension various combination of data point occur judges whether exception.Second for based on The method for detecting abnormality of distance mainly judges to be by calculating test point with the degree of closeness of normal data is locally or globally gone up No is abnormal.The third is to judge whether it is abnormal by clustering or calculating the modes such as distribution relative density based on data distribution. But these methods have certain supposed premise: abnormal point and normal distribution cluster meet farther out, abnormal data density it is far low In normal data density etc..But in actual application environment, being distributed with for abnormal data may very be concentrated or away from normal number It is closer according to distribution cluster, or even have abnormal data and be wrapped in inside normal data cluster.The hypotheses that above-mentioned algorithm is done It excessively idealizes, this does not always set up even possibility very little in actual application environment, causes model recognition effect unstable It is fixed.In addition, will appear the situation of cluster exception in the events such as Epidemic outbreak of disease, network attack.It is a large amount of different when occurring extremely Regular data distribution concentrates on one or more distributed areas, that is to say, that spatial abnormal feature feature is more concentrated, is densely distributed, base Correct judgement can not be made to this situation in the serial algorithm of density.And in a practical situation, the case where anomalous concentration is that have Very big researching value, discrete abnormal point illustrates that such abnormal probability of happening is lower, and for the region of anomalous concentration Probability of happening is relatively high, detects that this exception can reduce loss to the full extent.Such as in network attack, if can be in time It detects cluster exception, finds out its attack mode, then can provide effective Informational support for network O&M worker, avoid system It is broken.

In conclusion for abnormality detection problem, there are following difficult points at present: abnormal data acquisition is more difficult；Abnormal number More closely there is local anomaly according to away from normal data, existing major part algorithm only judges abnormal journey by global or local data distribution Degree, can not comprehensively consider global and local distributed intelligence；Abnormal data distribution is more intensive, and distribution density is close with normal data Even higher, highdensity abnormal data cluster is easily judged as normal data by the algorithm based on relative density；Each attribute of data point Cloth range difference is larger, and by traditional distance calculating method such as Euclidean distance, weight has big difference between different dimensions, and early period Normalization, standardization can adapt data original distribution state again；Abnormal data is distributed in inside normal data, and is distributed Range has certain intersection；Have label training data concentrate data point distribution will not completely high density concentrate on a region, need to sieve The training data for having information redundancy is selected, guarantees that subsequent processing is interference-free.

[summary of the invention]

In view of this, the embodiment of the present invention proposes a kind of method for detecting abnormality based on distribution probability measuring similarity, To improve disaggregated model to the discrimination of positive negative sample entirety.

A kind of method for detecting abnormality based on distribution probability measuring similarity that the embodiment of the present invention proposes, comprising:

Multiple stochastical sampling obtains multiple subsets of normal sample data, complete with binary tree structure save each subset with Machine isolation processes delimit the threshold depth of backtracking according to drift ratio；

External leaf node position and the threshold depth that each tree is fallen according to test point, leaf node is recalled where it To the ancestor node of threshold depth, training data of all data as measurement and test point similarity under the node is extracted；

Remaining data points appearance is calculated separately in each attribute dimensions for endpoint with certain point in test point and training dataset Probability between this two o'clock calculates the dissimilar degree of all the points in test point and data set in conjunction with Min Shi distance, obtains this The exceptional value of point.

In the above method, multiple stochastical sampling obtains multiple subsets of normal sample data, with the preservation of full binary tree structure The random isolation processes of each subset, according to drift ratio delimit backtracking threshold depth method are as follows: by training dataset D with Machine samples to obtain several training subsets X_all, and each subset X contains m sample X={ X₁, X₂..., X_m, m is less than training The positive integer of data set D size, can select appropriate value according to the actual situation, and each sample point contains n dimension, i.e., i-th SampleRandomly select dimension and isolation threshold, isolation threshold be subset in certain dimension between Random value between its maximum value and minimum value；Continuous iteration, until meet following three conditions one of them, then terminate and change Generation: (1) spatially only one sample of each isolation；(2) spatially each sample point is identical in the dimension values；(3) reach Iteration limit number；By this process record in a tree structure, a complete binary tree is formed, each node can contain There is zero or two child node, what is saved in leaf node is the sample in each insulating space, and what internal node saved is isolation Dimension and corresponding threshold value, depth threshold of the Dt as the retrospective search neighbours training points in tree need according to each training points The average value E (h (x)) of depth h (x) is determined in each tree, as follows:

Wherein, E (h (x)) is the mean depth after sample x is traversed on all t isolation trees, and t is selected according to the actual situation Select suitable positive integer, l_iIt (x) is the pathdepth of i-th tree；

Need to be arranged a drift ratio r, 0≤r≤1, the i.e. relative depature in all normal training dataset D for Dt The ratio data of normal data distribution, r setting need to according in the dispersion degree and actual conditions of data distribution to model Indices demand is measured, and is selected before each training sample mean depth in ((1-r) * 100) % according to the drift ratio r of setting Minimum value as tree in retrospective search part training points depth threshold Dt.

In the above method, external leaf node position and the threshold depth of each tree are fallen according to test point, where it Leaf node traces back to the ancestor node of threshold depth, extracts all data under the node as measurement and test point similarity The method of training data are as follows: test point is sent into every one tree, if test sample falls in certain node under Dt depth, by test point Place node is recalled upwards, and until forefathers' node of Dt depth, training sample Ltd all under forefathers' node are extracted work For the data for next calculating test sample intensity of anomaly, the data extracted in all trees are incorporated as next instruction Practice sample, if test sample more than Dt depth, using data all in Ltd as next training sample, is surveyed this For sample sheet, farther out, local training data in its vicinity is less for the distribution in the overall situation away from normal data, therefore by Dt Under all training samples extract as next training sample, the intensity of anomaly of further validation test data.

In the above method, it is calculated separately in each attribute dimensions for endpoint with certain point in test point and training dataset There is the probability between this two o'clock in remainder strong point, and the dissmilarity of all the points in test point and data set is calculated in conjunction with Min Shi distance Degree, the method for obtaining the exceptional value of the point are as follows: firstly, by Rⁱ(x, y) is defined as sample x and sample y in i-th dimension xⁱAnd yⁱTwo Region between value, at this time x ∈ Ltd.If S is the subspace set where all data of Ltd, SⁱFor space S i-th dimension sky Between distribution, ifSelect SⁱMost value be boundary, then Rⁱ(x, y) is converted to Rⁱ(x, S), Numⁱ(x, y, d, S) is I-th dimension dⁱWhether in RⁱBoolean in (x, y) range, wherein d is other samples in Ltd in addition to x, Mⁱ(x,y|Ltd,S) It is as follows for the training points number in Ltd in i-th dimension between x and y:

Wherein, I () is indicator function, and condition in bracket is otherwise 0 if true, its value is 1；Then with Mⁱ(x,y| Ltd, S) different degree of the ratio as x and y in i-th dimension in Ltd is accounted for, calculate different degree of the x and y in all dimensions D ' (x, y), as follows:

Wherein, p is the index value in Minkowski Distance, finally, the abnormality score p (y) of test point y is as follows:

Wherein, p (y) be the similarity at test point y and all midpoints Ltd and, test point is ranked up, p (y) is got over Greatly, intensity of anomaly is higher.

[Detailed description of the invention]

In order to illustrate the technical solution of the embodiments of the present invention more clearly, below will be to needed in the embodiment attached Figure is briefly described, it should be apparent that, drawings in the following description are only some embodiments of the invention, for this field For those of ordinary skill, without any creative labor, it can also be obtained according to these attached drawings other attached Figure.

Fig. 1 is that the process for the method for detecting abnormality based on distribution probability measuring similarity that the embodiment of the present invention is proposed is shown It is intended to.

[specific embodiment]

For a better understanding of the technical solution of the present invention, being retouched in detail to the embodiment of the present invention with reference to the accompanying drawing It states.

It will be appreciated that described embodiments are only a part of the embodiments of the present invention, instead of all the embodiments.Base Embodiment in the present invention, it is obtained by those of ordinary skill in the art without making creative efforts it is all its Its embodiment, shall fall within the protection scope of the present invention.

The embodiment of the present invention provides the method for detecting abnormality based on distribution probability measuring similarity, referring to FIG. 1, it is this The flow diagram for the method for detecting abnormality based on distribution probability measuring similarity that inventive embodiments are proposed, as shown in Figure 1, Method includes the following steps:

Step 101, multiple stochastical sampling obtains multiple subsets of normal sample data, is saved with full binary tree structure each The random isolation processes of subset delimit the threshold depth of backtracking according to drift ratio.

Specifically, obtaining several training subsets X_all by training dataset D stochastical sampling, each subset X contains m Sample X={ X₁, X₂..., X_m, m is the positive integer less than training dataset D size, can select suitable number according to the actual situation Value, each sample point contain n dimension, i.e. i-th of sample Randomly select dimension and isolation threshold Value, isolation threshold are random value of the subset in certain dimension between its maximum value and minimum value；Continuous iteration, until meeting Below three conditions one of them, then terminate iteration: (1) spatially only one sample of each isolation；(2) spatially each Sample point is identical in the dimension values；(3) reach iteration limit number；By this process record in a tree structure, formed One complete binary tree, each node can contain zero or two child node, and what is saved in leaf node is each insulating space In sample, what internal node saved is the dimension and corresponding threshold value of isolation, and Dt is used as the retrospective search neighbours training points in tree Depth threshold, need the average value E (h (x)) of the depth h (x) in each tree according to each training points to determine, following institute Show:

Algorithm 1 and algorithm 2 are the isolation processes of step 101 and the pseudocode of depth threshold setting method:

Step 102, external leaf node position and the threshold depth that each tree is fallen according to test point, leaf where it Node traces back to the ancestor node of threshold depth, extracts training of all data as measurement and test point similarity under the node Data.

Specifically, test point is sent into every one tree, if test sample falls in certain node under Dt depth, by test point institute Recall upwards in node, until forefathers' node of Dt depth, training sample Ltd all under forefathers' node are extracted into conduct The data extracted in all trees are incorporated as next training by the data for next calculating test sample intensity of anomaly Sample, if test sample more than Dt depth, using data all in Ltd as next training sample, tests this For sample, farther out, local training data in its vicinity is less, therefore will be under Dt for the distribution in the overall situation away from normal data All training samples are extracted as next training sample, the intensity of anomaly of further validation test data.

Algorithm 3 and algorithm 4 are the pseudocode that local training data method is extracted in step 102:

Step 103, its remainder is calculated separately for endpoint in each attribute dimensions with certain point in test point and training dataset There is the probability between this two o'clock in strong point, and the dissimilar journey of all the points in test point and data set is calculated in conjunction with Min Shi distance Degree, obtains the exceptional value of the point.

Specifically, firstly, by Rⁱ(x, y) is defined as sample x and sample y in i-th dimension xⁱAnd yⁱRegion between two values, this When x ∈ Ltd.If S is the subspace set where all data of Ltd, SⁱFor space S i-th dimension spatial distribution range, ifSelect SⁱMost value be boundary, then Rⁱ(x, y) is converted to Rⁱ(x, S), Numⁱ(x, y, d, S) is i-th dimension dⁱWhether RⁱBoolean in (x, y) range, wherein d is other samples in Ltd in addition to x, Mⁱ(x, y | Ltd, S) it is in Ltd i-th The training points number between x and y is tieed up, as follows:

Table solves 10 public affairs first is that the embodiment of the present invention provides the method for detecting abnormality based on distribution probability measuring similarity When opening data set abnormality detection task, the pretreatment of data set is divided into normal data and abnormal data for all kinds of in data set.

Table one

Table solves 10 public affairs second is that the embodiment of the present invention provides the method for detecting abnormality based on distribution probability measuring similarity When opening data set abnormality detection task, AUC value (ranking of random selection positive sample is higher than the probability of random selection negative sample) Contrast and experiment, wherein in the embodiment of the present invention control methods be the type solution never KNN of balanced sort problem, Eight kinds of methods of iForest, SCiForest, iNNE, ALSH, L1SH, L2SH, KLSH.By table one, it can be concluded that, the present invention is mentioned Method DPSM out is concentrated in public data and is significantly improved in AUC value compared to control methods.It is especially real in the first seven group Have greatly improved in testing, proposition method is highest level in eight groups of methods, remaining two groups also close with highest level. The method that the embodiment of the present invention is proposed achieves certain breakthrough in method for detecting abnormality.

Table two

In conclusion the embodiment of the present invention has the advantages that

In the technical solution that the present invention is implemented, multiple stochastical sampling obtains multiple subsets of normal sample data, with complete two Fork tree construction saves the random isolation processes of each subset, and the threshold depth of backtracking delimited according to drift ratio；According to test point External leaf node position and the threshold depth for falling in each tree, the ancestors that leaf node where it traces back to threshold depth save Point extracts training data of all data as measurement and test point similarity under the node；With test point and training dataset Certain interior point is endpoint, probability of the remaining data points appearance between this two o'clock is calculated separately in each attribute dimensions, in conjunction with Min Shi Distance calculates the dissimilar degree of all the points in test point and data set, obtains the exceptional value of the point.According to embodiments of the present invention The technical solution of offer can effectively solve local anomaly test problems, can be using original training data according to where test point The distribution of regional area normal data obtains its intensity of anomaly, improve the local anomaly of abnormality detection model detectability and its Overall target.

The foregoing is merely illustrative of the preferred embodiments of the present invention, is not intended to limit the invention, all in essence of the invention Within mind and principle, any modification, equivalent substitution, improvement and etc. done be should be included within the scope of the present invention.

Claims

1. a kind of method for detecting abnormality based on distribution probability measuring similarity, which is characterized in that the method step includes:

(1) repeatedly stochastical sampling obtain normal sample data multiple subsets, complete with binary tree structure save each subset with Machine isolation processes delimit the threshold depth of backtracking according to drift ratio；

(2) external leaf node position and the threshold depth that each tree is fallen according to test point, leaf node is recalled where it To the ancestor node of threshold depth, training data of all data as measurement and test point similarity under the node is extracted；

(3) remaining data points appearance is calculated separately for endpoint in each attribute dimensions with certain point in test point and training dataset Probability between this two o'clock calculates the dissimilar degree of all the points in test point and data set in conjunction with Min Shi distance, obtains this The exceptional value of point.

2. the method according to claim 1, wherein repeatedly stochastical sampling obtains multiple sons of normal sample data Collection, the random isolation processes of each subset are saved with full binary tree structure, and the threshold depth of backtracking, tool delimited according to drift ratio Body is described as follows: obtaining several training subsets X_all by training dataset D stochastical sampling, each subset X contains m sample X ={ X₁, X₂..., X_m, m is the positive integer less than training dataset D size, can select appropriate value according to the actual situation, each Sample point contains n dimension, i.e. i-th of sampleDimension and isolation threshold are randomly selected, is isolated Threshold value is random value of the subset in certain dimension between its maximum value and minimum value；Continuous iteration, until meeting following three A condition one of them, then terminate iteration: (1) spatially only one sample of each isolation；(2) spatially each sample point It is identical in the dimension values；(3) reach iteration limit number；By this process record in a tree structure, formation one is complete Full binary tree, each node can contain zero or two child node, and what is saved in leaf node is the sample in each insulating space This, what internal node saved is the dimension and corresponding threshold value of isolation, depth of the Dt as the retrospective search neighbours training points in tree Threshold value needs the average value E (h (x)) of the depth h (x) in each tree according to each training points to determine, as follows:

Wherein, E (h (x)) is the mean depth after sample x is traversed on all t isolation trees, and t selects to close according to the actual situation Suitable positive integer, l_iIt (x) is the pathdepth of i-th tree；

Need to be arranged a drift ratio r for Dt, 0≤r≤1, i.e., relative depature is normal in all normal training dataset D The setting of the ratio data of data distribution range, r need to be according to every to model in the dispersion degree and actual conditions of data distribution Index demand is measured, and is selected before each training sample mean depth in ((1-r) * 100) % most according to the drift ratio r of setting Depth threshold Dt of the small value as retrospective search part training points in tree.

3. the method according to claim 1, wherein falling in the external leaf node position of each tree according to test point It sets and threshold depth, leaf node where it traces back to the ancestor node of threshold depth, extract all data under the node and make To measure the training data with test point similarity, illustrate are as follows: test point is sent into every one tree, if test sample is fallen in Certain node under Dt depth, the node where test point are recalled upwards, will be under forefathers' node until forefathers' node of Dt depth All training sample Ltd are extracted as the following data for calculating test sample intensity of anomaly, will be extracted in all trees To data be incorporated as next training sample, if test sample more than Dt depth, using data all in Ltd as Next training sample, for this test sample, the distribution in the overall situation away from normal data farther out, in its vicinity Local training data is less, therefore training samples all under Dt are extracted as next training sample, further tests Demonstrate,prove the intensity of anomaly of test data.

4. the method according to claim 1, wherein with certain point in test point and training dataset for endpoint, Calculate separately remaining data points in each attribute dimensions and probability between this two o'clock occur, in conjunction with Min Shi distance calculate test point with The dissimilar degree of all the points, obtains the exceptional value of the point, illustrates in data set are as follows: firstly, by Rⁱ(x, y) is defined as sample This x and sample y are in i-th dimension xⁱAnd yⁱRegion between two values, at this time x ∈ Ltd.If S is the subspace where all data of Ltd Set, SⁱFor space S i-th dimension spatial distribution range, ifSelect SⁱMost value be boundary, then Rⁱ(x, y) turns It is changed to Rⁱ(x, S), Numⁱ(x, y, d, S) is i-th dimension dⁱWhether in RⁱBoolean in (x, y) range, wherein d is that y is removed in Ltd Other samples in addition, Mⁱ(x, y | Ltd, S) is the training points number in Ltd in i-th dimension positioned at x and y between, as follows:

Wherein, I () is indicator function, and condition in bracket is otherwise 0 if true, its value is 1；Then with Mⁱ(x, y | Ltd, S) it accounts for Different degree of the ratio as x and y in i-th dimension in Ltd calculates different degree D ' (x, y) of the x and y in all dimensions, It is as follows:

Wherein, p (y) be the similarity at test point y and all midpoints Ltd and, test point is ranked up, p (y) is bigger, different Chang Chengdu is higher.