CN111740991A

CN111740991A - Anomaly detection method and system

Info

Publication number: CN111740991A
Application number: CN202010567982.8A
Authority: CN
Inventors: 张鹏飞
Original assignee: Inesa R&d Center
Current assignee: Inesa R&d Center
Priority date: 2020-06-19
Filing date: 2020-06-19
Publication date: 2020-10-02
Anticipated expiration: 2040-06-19
Also published as: CN111740991B

Abstract

The invention relates to the technical field of information data processing, and in particular to an anomaly detection method in which an unsupervised model and a supervised model assign pseudo-labels for each other, starting from a small labeled set, to obtain positive and negative label sets, and the process is iterated until the positive and negative label sets converge. An anomaly detection system is designed for the method. The system comprises a data acquisition unit for acquiring a data set; a model prediction unit for training and fitting the unsupervised model and the supervised model and for predicting unlabeled data; a training set updating unit for merging the predicted positive and negative sample sets and returning them to the data acquisition unit to update the data set; a judgment unit for judging whether the positive and negative sample sets have converged; and a detection unit for detecting abnormal points in a test set. By dividing the labeling work between the two models, the method improves indexes such as the accuracy, recall and precision of the assigned labels, thereby addressing the low confidence and poor accuracy of abnormal point detection when the number of labels is limited.

Description

Anomaly detection method and system

Technical Field

The invention relates to the technical field of information data processing, in particular to an anomaly detection method and an anomaly detection system.

Background

Anomaly detection, also known as outlier detection, refers to the task of finding data points that differ significantly from normal data.

Outliers usually account for a small share of the overall data, but they carry distinctive information compared with normal points. Anomaly detection can therefore often address important problems in the relevant field and lead to significant discoveries, for example in new disease monitoring, credit card fraud identification, network security attack detection, traffic anomaly detection, and planetary detection.

Detection methods include unsupervised, supervised and semi-supervised methods; which one is used is usually determined by how the training samples are labeled.

Unsupervised learning has the advantage of not requiring data labels, but its performance is limited. Supervised learning is difficult to apply to monitoring tasks such as detecting novel infectious diseases or unknown faults. Semi-supervised learning places low demands on data labeling and can make full use of the information in unlabeled data to improve detection accuracy, but its results are unstable when the number of labels is very small.

Therefore, when accurate and representative labels are difficult to obtain, maximizing the accuracy of abnormal point detection has important practical significance.

Disclosure of Invention

The invention overcomes these difficulties in the prior art and provides a detection method and system that can detect abnormal points stably and accurately when available labeled data are extremely scarce.

In order to achieve the above object, the present invention provides an anomaly detection method, as follows: receiving an anomaly detection request sent by a terminal device together with a small labeled set on which anomaly detection is to be performed; according to the condition of the small labeled set, having the unsupervised model and the supervised model assign pseudo-labels for each other on the small labeled set to form positive and negative label sets; then repeating the mutual pseudo-labeling by the unsupervised model and the supervised model on the positive and negative label sets until they converge; and obtaining a detected and labeled anomaly result data set.

Further, the positive and negative label sets are a sample set marked as "0" after unsupervised prediction and a sample set marked as "1" after supervised prediction.

Further, the specific steps of the mutual pseudo-labeling by the unsupervised model and the supervised model are as follows:

S1, setting an abnormal point proportion parameter, taking the entire data set as the training set, and training the unsupervised model;

S2, performing unsupervised model prediction on the unlabeled data set U, labeling the normal samples as '0', and denoting the set of normal samples so labeled as L0;

S3, when the number of labels meets the training requirement of the supervised model, using the small labeled data set L in the data set as the training set, improving the classification capability of the supervised model by increasing sample weights, and setting the class_weight parameter to 'balanced' to train the supervised model;

S4, performing supervised model prediction on the unlabeled data set U, labeling the abnormal samples as '1', and denoting the set of abnormal samples so labeled as L1;

S5, adding L0 and L1, together called the positive and negative label sets, to the training set, updating the labeled training set L to Li and the unlabeled training set U to Ui.
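As an illustration of the 'balanced' setting in step S3, the sketch below computes class weights with the convention documented by scikit-learn and by LightGBM's scikit-learn interface, n_samples / (n_classes * class_count). The function name and the example label counts are assumptions made for illustration and do not come from the patent.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


# Hypothetical labeled set L: 18 normal samples ('0') and 2 abnormal samples ('1').
weights = balanced_class_weights([0] * 18 + [1] * 2)
```

With these counts the rare abnormal class receives a weight nine times that of the normal class, which is how the weighting offsets the class imbalance mentioned in the text.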

Further, the anomaly detection method further comprises test set anomaly point detection.

Further, the specific method for unsupervised model prediction in S2 is as follows: the trained unsupervised model predicts the unlabeled data set U; according to the set abnormal point proportion parameter, samples whose abnormal point score exceeds a certain threshold are judged to be abnormal samples and labeled '1', and the remaining samples are the normal samples predicted by the unsupervised model, which are labeled '0' and merged into the data set L0.
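A minimal sketch of how a threshold might be derived from the abnormal point proportion parameter is given below. It assumes that a higher score means a more abnormal sample (some libraries use the opposite sign), and the function name, the tie handling and the example scores are hypothetical.

```python
def label_by_proportion(scores, proportion):
    """Label the top `proportion` fraction of scores 1 (abnormal), the rest 0."""
    if not 0.0 < proportion < 1.0:
        raise ValueError("proportion must lie strictly between 0 and 1")
    n_abnormal = max(1, round(len(scores) * proportion))
    # The threshold is the n_abnormal-th largest score; ties at the
    # threshold are all labeled abnormal in this sketch.
    threshold = sorted(scores, reverse=True)[n_abnormal - 1]
    return [1 if s >= threshold else 0 for s in scores]


# Hypothetical abnormal point scores for ten unlabeled samples.
scores = [0.41, 0.39, 0.93, 0.44, 0.40, 0.42, 0.38, 0.45, 0.43, 0.37]
labels = label_by_proportion(scores, proportion=0.1)

# Only the samples predicted normal are kept as the pseudo-labeled set L0.
L0_indices = [i for i, lab in enumerate(labels) if lab == 0]
```

Only the '0' predictions feed L0; the samples flagged as abnormal here are not given a '1' pseudo-label by the unsupervised model and are left for the supervised model.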

Further, the specific method for supervised model prediction in S4 is as follows: the trained supervised model predicts the unlabeled data set U. Because labeled samples are few, the supervised model is under-fitted, and the confusion matrix of its classification result shows high precision and low recall. Samples predicted to be abnormal therefore have high confidence, are labeled '1' and are merged into the data set L1, whereas samples predicted to be normal have lower confidence.

Further, the method for detecting abnormal points in the test set is as follows: with the unlabeled data fully utilized, the training set is L + U, the supervised model is trained on it, and prediction is performed on the test set.

The invention also provides an anomaly detection system, characterized in that it comprises a data acquisition unit for acquiring a data set; a model prediction unit for training and fitting an unsupervised model and a supervised model and for predicting unlabeled data; a training set updating unit for merging the predicted positive and negative sample sets and returning them to the data acquisition unit to update the data set; a judgment unit for judging whether the positive and negative sample sets have converged; and a detection unit for detecting abnormal points in a test set.

Furthermore, the system of the invention further comprises a request receiving module for receiving a request, sent by the terminal device, to perform anomaly detection on the data set to be predicted; and a control module for sending the anomaly detection result to the terminal device and controlling the terminal device to display the anomaly detection result.

The invention also provides a computer device, characterized in that it comprises a processor and a memory for storing processor-executable instructions, wherein the processor is configured to perform the following steps: receiving an anomaly detection request sent by a terminal device together with a small labeled set on which anomaly detection is to be performed; according to the condition of the small labeled set, having the unsupervised model and the supervised model assign pseudo-labels for each other on the small labeled set to form positive and negative label sets; then repeating the mutual pseudo-labeling by the unsupervised model and the supervised model on the positive and negative label sets until they converge; and obtaining a detected and labeled anomaly result data set.

The invention also provides a computer-readable storage medium, characterized in that the computer-readable storage medium stores computer-executable instructions that, when executed by a processor, cause the processor to perform the following steps: receiving an anomaly detection request sent by a terminal device together with a small labeled set on which anomaly detection is to be performed; according to the condition of the small labeled set, having the unsupervised model and the supervised model assign pseudo-labels for each other on the small labeled set to form positive and negative label sets; then repeating the mutual pseudo-labeling by the unsupervised model and the supervised model on the positive and negative label sets until they converge; and obtaining a detected and labeled anomaly result data set.

Compared with the prior art, the invention predicts the unlabeled data with the unsupervised model and the supervised model separately, so that dividing the labeling work between them improves the accuracy of the assigned labels; the positive and negative data sets obtained by prediction are then trained on iteratively, which improves indexes such as recall and precision. The problems of low confidence and poor precision of abnormal point detection when the number of labels is limited are thereby addressed.

Drawings

Fig. 1 is a schematic flow chart of an anomaly detection method according to an embodiment of the present invention.

Fig. 2 is a schematic structural diagram of an anomaly detection system according to an embodiment of the present invention.

Fig. 3 is a line chart comparing an anomaly detection method with a conventional method according to an embodiment of the present invention.

Wherein, 1 is a data acquisition unit, 2 is a judgment unit, 3 is a model prediction unit, 4 is a training set updating unit, 5 is a detection unit, 3-1 is an unsupervised model, and 3-2 is a supervised model.

Detailed Description

The invention will be further described with reference to the accompanying drawings, but is not to be construed as being limited thereto.

Referring to Fig. 1, in an embodiment of the present invention, the unsupervised model 3-1 is an Isolation Forest (IForest) model and the supervised model 3-2 is a LightGBM model; other existing, commonly used unsupervised models 3-1 and supervised models 3-2 may be selected according to the actual situation.

The confidence of newly added labels is improved by having the unsupervised model 3-1 (IForest) and the supervised model 3-2 (LightGBM) assign pseudo-labels for each other. The unsupervised IForest first performs abnormal point detection; after suspicious samples are excluded, the remaining normal samples are given the normal label ('0'). The supervised model 3-2 (LightGBM) then gives suspicious samples the abnormal label ('1'). After iterative training, indexes such as recall and precision are improved compared with semi-supervised learning based on LightGBM self-training.

In this embodiment, after an anomaly detection request sent by a terminal device and a small labeled set to be detected are received, the anomaly detection flow comprises the following steps:

S1, training the unsupervised model 3-1. It is first assumed that abnormal points account for a small proportion of the whole and that their feature values differ significantly from those of normal points. Based on this assumption, and on the advantage that the unsupervised model 3-1 requires no labels, the full data set is used to train the Isolation Forest (IForest). An abnormal point proportion parameter is set in advance; the closer it is to the proportion of abnormal points in the actual scene, the better the model performs;

S2, performing unsupervised model 3-1 prediction on the unlabeled data set U. The fitted IForest model predicts the unlabeled data set (initially U); according to the abnormal point proportion parameter that was set, samples whose abnormal point score exceeds a certain threshold are judged to be abnormal samples, and the remaining samples are the normal samples (label '0') predicted by the unsupervised model 3-1, denoted L0;

S3, when the number of labels meets the training requirement of the supervised model 3-2, the abnormal point detection task can be regarded as a supervised classification task, but two problems must be faced: sample labels are few and the sample classes are unbalanced. Therefore, when the small labeled data set (initially L) in the data set is used to train the supervised model 3-2 (LightGBM), the classification capability of the supervised model is improved by increasing sample weights, and the class_weight parameter is set to 'balanced';

S4, performing supervised model 3-2 prediction on the unlabeled data set U. The trained LightGBM predicts the unlabeled data set (initially U). Because the number of labeled samples is small, the model is under-fitted, and the confusion matrix of its classification result shows high precision and low recall. Samples predicted to be abnormal therefore have high confidence (label '1') and form the data set L1, whereas samples predicted to be normal have lower confidence;

S5, adding L0 and L1, together called the positive and negative label sets, to the training set, updating the labeled training set L to Li and the unlabeled training set U to Ui;

S6, judging whether the positive and negative sample sets have converged: if the number of samples added in this update of the training set does not exceed 10% of the labeled amount, convergence is judged and the next step is executed; otherwise the state of the unlabeled data set Ui is checked: if it is empty, convergence is judged and the next step is executed; otherwise steps S2 to S5 are repeated and iterated until the sample sets converge;

S7, with the unlabeled data fully utilized, taking L + U as the training data, training the LightGBM model, and performing test set prediction.
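The flow of steps S1 to S7 can be sketched as follows. This is an illustrative outline rather than the patented implementation: the two model functions are simple hypothetical stand-ins for IForest and LightGBM on one-dimensional data, "10% of the labeled amount" is read as 10% of the labeled set size before the update, and a supervised '1' is assumed to override an unsupervised '0' when both models label the same sample. None of these choices is specified in the text.

```python
def make_unsupervised(X, proportion):
    """Hypothetical stand-in for a fitted IForest (S1): distance from the median."""
    center = sorted(X)[len(X) // 2]
    distances = sorted((abs(x - center) for x in X), reverse=True)
    n_abnormal = max(1, round(len(X) * proportion))
    cutoff = distances[n_abnormal - 1]
    return lambda x: abs(x - center) < cutoff  # True means "predicted normal"


def fit_supervised(samples, labels):
    """Hypothetical stand-in for LightGBM (S3): midpoint between class means."""
    pos = [s for s, lab in zip(samples, labels) if lab == 1]
    neg = [s for s, lab in zip(samples, labels) if lab == 0]
    cut = (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2
    return lambda x: 1 if x > cut else 0


def mutual_pseudo_label(X, y, is_normal, fit, ratio=0.10, max_iter=20):
    """y holds 0/1 for labeled samples and None for unlabeled ones."""
    y = list(y)
    for _ in range(max_iter):
        unlabeled = [i for i, lab in enumerate(y) if lab is None]
        if not unlabeled:  # S6: the unlabeled set Ui is empty
            break
        labeled = [i for i, lab in enumerate(y) if lab is not None]
        # S2: the unsupervised model contributes only '0' pseudo-labels (L0).
        new = {i: 0 for i in unlabeled if is_normal(X[i])}
        # S3: train the supervised model on the current labeled set Li.
        predict = fit([X[i] for i in labeled], [y[i] for i in labeled])
        # S4: the supervised model contributes only '1' pseudo-labels (L1).
        # Assumption: a supervised '1' overrides an unsupervised '0'.
        new.update({i: 1 for i in unlabeled if predict(X[i]) == 1})
        # S5: move the newly labeled samples from Ui to Li.
        for i, lab in new.items():
            y[i] = lab
        # S6: few new labels relative to the labeled amount means convergence.
        if len(new) <= ratio * len(labeled):
            break
    return y


# Hypothetical data: values near 1 are normal, values near 9 are abnormal.
X = [1.0, 1.1, 0.9, 1.2, 0.8, 1.05, 0.95, 1.15, 9.0, 9.5]
y = [0, None, None, None, None, None, None, None, 1, None]

is_normal = make_unsupervised(X, proportion=0.2)
completed = mutual_pseudo_label(X, y, is_normal, fit_supervised)

# S7: train the final supervised model on all labeled data, then predict a test set.
pairs = [(x, lab) for x, lab in zip(X, completed) if lab is not None]
final_model = fit_supervised([x for x, _ in pairs], [lab for _, lab in pairs])
test_predictions = [final_model(x) for x in [1.0, 8.7]]
```

In a real setting the stand-ins could be replaced by, for example, scikit-learn's IsolationForest and LightGBM's LGBMClassifier with class_weight='balanced'; the loop structure would stay the same.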

Specifically, the steps of IForest anomaly detection are as follows:

a) randomly selecting m sample points from the training data as a subsample and placing them in the root node of a tree;

b) randomly selecting a dimension (attribute) and randomly generating a cut point p within the current node's data, the cut point lying between the maximum and minimum values of the selected dimension in the current node's data;

c) the cut point generates a hyperplane that divides the data space of the current node into 2 subspaces: data smaller than p in the selected dimension are placed in the left child of the current node, and data greater than or equal to p are placed in the right child of the current node;

d) steps b and c are applied recursively in the child nodes, continuously constructing new child nodes until a child node contains only one data point (and cannot be cut further) or the child node reaches the defined height;

e) after t trees are obtained, for a training sample x, each tree is traversed, the layer at which x finally lands in each tree is calculated, and the average height of x over the t trees, namely the average path length (APL), is obtained;

f) after the APL of each test sample is obtained, a threshold can be set, and test samples whose APL is lower than the threshold are abnormal.
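Steps a) to f) can be sketched in plain Python as below. The sketch follows the simplified description given here (raw path depth, without the leaf-size correction used in the original Isolation Forest paper); the function names, default parameters, example data and the threshold of 2.0 are illustrative assumptions.

```python
import random


def build_tree(points, depth, max_depth, rng):
    """Steps b-d: split recursively on a random dimension at a random cut point."""
    if len(points) <= 1 or depth >= max_depth:
        return {"size": len(points)}
    dim = rng.randrange(len(points[0]))
    lo = min(p[dim] for p in points)
    hi = max(p[dim] for p in points)
    if lo == hi:  # simplification: stop when the chosen dimension cannot be cut
        return {"size": len(points)}
    cut = rng.uniform(lo, hi)
    left = [p for p in points if p[dim] < cut]
    right = [p for p in points if p[dim] >= cut]
    return {
        "dim": dim,
        "cut": cut,
        "left": build_tree(left, depth + 1, max_depth, rng),
        "right": build_tree(right, depth + 1, max_depth, rng),
    }


def path_length(tree, x, depth=0):
    """Step e: depth of the leaf that x falls into."""
    if "dim" not in tree:
        return depth
    child = tree["left"] if x[tree["dim"]] < tree["cut"] else tree["right"]
    return path_length(child, x, depth + 1)


def average_path_length(data, x, n_trees=100, m=None, max_depth=8, seed=0):
    """Steps a and e: build n_trees trees on subsamples of size m, average the depth of x."""
    rng = random.Random(seed)
    m = len(data) if m is None else min(m, len(data))
    total = 0
    for _ in range(n_trees):
        sample = rng.sample(data, m)
        total += path_length(build_tree(sample, 0, max_depth, rng), x)
    return total / n_trees


# Hypothetical data: a 5 x 5 grid of normal points plus one distant point.
cluster = [(i * 0.1, j * 0.1) for i in range(-2, 3) for j in range(-2, 3)]
outlier = (5.0, 5.0)
data = cluster + [outlier]

apl_outlier = average_path_length(data, outlier)
apl_inlier = average_path_length(data, (0.0, 0.0))

# Step f: samples whose APL is below a chosen threshold are flagged as abnormal.
threshold = 2.0
outlier_flagged = apl_outlier < threshold
inlier_flagged = apl_inlier < threshold
```

The distant point is usually separated by the first cut, so its average path length is short, while a point in the middle of the cluster needs several cuts before it is isolated.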

Referring to Fig. 2, the present embodiment further provides an anomaly detection system, which includes a data acquisition unit 1 for acquiring a data set; a model prediction unit 3 for training and fitting the unsupervised model 3-1 and the supervised model 3-2 and for predicting unlabeled data; a training set updating unit 4 for merging the predicted positive and negative sample sets and returning them to the data acquisition unit 1 to update the data set; a judgment unit 2 for judging whether the positive and negative sample sets have converged; and a detection unit 5 for detecting abnormal points in the test set.

In one embodiment, a computer device is provided, comprising a processor and a memory for storing processor-executable instructions, wherein the processor is configured to perform the following steps: receiving an anomaly detection request sent by a terminal device together with a small labeled set on which anomaly detection is to be performed; according to the condition of the small labeled set, having the unsupervised model and the supervised model assign pseudo-labels for each other on the small labeled set to form positive and negative label sets; then repeating the mutual pseudo-labeling by the unsupervised model and the supervised model on the positive and negative label sets until they converge; and obtaining a detected and labeled anomaly result data set.

In one embodiment, the specific anomaly detection flow steps are as follows:

S6, judging whether the positive and negative sample sets have converged: if the number of samples added in this update of the training set does not exceed 10% of the labeled amount, convergence is judged and the next step is executed; otherwise the state of the unlabeled data set Ui is checked: if it is empty, convergence is judged and the next step is executed; otherwise steps S2 to S5 are repeated and iterated until the sample sets converge.
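The convergence test of step S6 might be written as a small predicate like the one below. It assumes that "10% of the labeled amount" refers to the size of the labeled set before the update, which the text does not state explicitly; the function and argument names are hypothetical.

```python
def has_converged(n_new_labels, n_labeled_before, n_unlabeled_left, ratio=0.10):
    """Return True when iteration should stop according to step S6."""
    if n_unlabeled_left == 0:
        return True  # nothing is left to pseudo-label
    return n_new_labels <= ratio * n_labeled_before


few_new = has_converged(n_new_labels=5, n_labeled_before=100, n_unlabeled_left=40)
many_new = has_converged(n_new_labels=30, n_labeled_before=100, n_unlabeled_left=40)
empty_pool = has_converged(n_new_labels=30, n_labeled_before=100, n_unlabeled_left=0)
```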

In one embodiment, the computer executable instructions, when executed by the processor, further cause the processor to perform the steps of: on the basis of fully utilizing the label-free data, training data are L + U, and a LightGBM model is trained to predict a test set.

In one embodiment, the present invention also provides a computer-readable storage medium configured on a server, the computer-readable storage medium storing computer-executable instructions that, when executed by a processor, cause the processor to perform the following steps: receiving an anomaly detection request sent by a terminal device together with a small labeled set on which anomaly detection is to be performed; according to the condition of the small labeled set, having the unsupervised model and the supervised model assign pseudo-labels for each other on the small labeled set to form positive and negative label sets; then repeating the mutual pseudo-labeling by the unsupervised model and the supervised model on the positive and negative label sets until they converge; and obtaining a detected and labeled anomaly result data set.

In order to illustrate that the anomaly detection effect of the invention is clearly superior to that of existing supervised and unsupervised models, a comparative anomaly detection test was carried out on river water quality data, as shown in Fig. 3.

In the graph, the horizontal axis x is the number of abnormal labels as a percentage of the total number of samples; a smaller x indicates that fewer samples are used for training. The vertical axis y is the F1 value of the training result; the F1 value is an index that jointly considers precision and recall and better reflects the actual effect.
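For reference, the F1 value is the harmonic mean of precision and recall. The sketch below computes it from true and predicted labels, treating '1' (abnormal) as the positive class; the example labels are hypothetical and chosen to mimic a classifier with high precision and low recall.

```python
def f1_value(y_true, y_pred):
    """F1 = 2 * precision * recall / (precision + recall), positive class = 1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0  # no true positives: precision or recall is zero
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical case: 4 true anomalies, the classifier finds only 1, with no false alarms.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 0]
f1 = f1_value(y_true, y_pred)  # precision 1.0, recall 0.25
```

Here precision is perfect but the F1 value is only 0.4, which shows why the F1 value penalizes a model that misses most anomalies.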

As can be seen in the figure, line a represents the detection result of the unsupervised isolation forest model and line b represents the detection result of the supervised LightGBM (LGB) model; both predict very poorly when the sample size is small. Lines c and d represent the prediction effect after fine adjustment by the method of the present invention, which achieves a good anomaly detection effect even with a very small amount of data.

The invention is a semi-supervised anomaly detection framework based on unsupervised and supervised models. Data sets with different label proportions are generated by randomly removing labels; the unsupervised model 3-1 (IForest) and the supervised model 3-2 (LightGBM) each predict the unlabeled data, which improves the accuracy of the assigned labels; the experiment is repeated ten times for each label proportion, and the average of the classification performance indexes is taken. Compared with the traditional unsupervised model 3-1, the supervised model 3-2 and a self-training model, the method performs clearly better on the classification indexes for abnormal point detection, and its performance remains more stable when the amount of labeled data is extremely small.
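The experimental protocol described above (random label removal, ten repetitions per label proportion, averaged metric) could be organized roughly as follows. The helper names are hypothetical, labels are removed uniformly at random regardless of class, and the per-run evaluation function is a placeholder supplied by the caller; the patent does not give these details.

```python
import random


def remove_labels(y, keep_fraction, seed):
    """Keep a random `keep_fraction` of the labels and replace the rest with None."""
    rng = random.Random(seed)
    n_keep = max(1, round(len(y) * keep_fraction))
    keep = set(rng.sample(range(len(y)), n_keep))
    return [lab if i in keep else None for i, lab in enumerate(y)]


def mean_over_runs(run_once, n_runs=10):
    """Average a metric over n_runs repetitions, one seed per repetition."""
    return sum(run_once(seed) for seed in range(n_runs)) / n_runs


# Hypothetical fully labeled data: 95 normal and 5 abnormal samples.
y_full = [0] * 95 + [1] * 5
y_partial = remove_labels(y_full, keep_fraction=0.05, seed=0)
n_kept = sum(lab is not None for lab in y_partial)

# Placeholder metric: a real run would train on the partial labels and return F1.
average_metric = mean_over_runs(lambda seed: float(seed))
```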

It should be noted that, those skilled in the art can understand that the unsupervised model and the supervised model in the above embodiment method can use not only the IForest model and the LightGBM model, but also any unsupervised model and supervised model, so as to achieve the purpose and effect of the present invention.

It will be understood by those skilled in the art that all or part of the processes of the methods of the above embodiments may be implemented by a computer program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, for example in the storage medium of a computer system, and executed by at least one processor in the computer system to implement the processes of the method embodiments described above. The storage medium may be a magnetic disk, an optical disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), or the like.

The technical features of the embodiments described above may be arbitrarily combined, and for the sake of brevity, all possible combinations of the technical features in the embodiments described above are not described, but should be considered as being within the scope of the present specification as long as there is no contradiction between the combinations of the technical features.

The above embodiments express only several implementations of the present invention, and although their description is specific and detailed, they should not be understood as limiting the scope of the present invention. It should be noted that those skilled in the art can make many variations and modifications without departing from the concept of the present invention, and these all fall within the protection scope of the present invention.

Claims

1. An abnormality detection method, characterized in that the method is as follows: receiving an anomaly detection request sent by a terminal device together with a small labeled set on which anomaly detection is to be performed; according to the condition of the small labeled set, having the unsupervised model (3-1) and the supervised model (3-2) assign pseudo-labels for each other on the small labeled set to form positive and negative label sets; then repeating the mutual pseudo-labeling by the unsupervised model (3-1) and the supervised model (3-2) on the positive and negative label sets until they converge; and obtaining a detected and labeled anomaly result data set;

the positive and negative label sets are a sample set marked as '0' after unsupervised prediction and a sample set marked as '1' after supervised prediction.

2. The abnormality detection method according to claim 1, characterized in that: the mutual pseudo-labeling by the unsupervised model (3-1) and the supervised model (3-2) comprises the following specific steps:

S1, setting an abnormal point proportion parameter, taking the entire data set as the training set, and training the unsupervised model (3-1);

S2, performing unsupervised model (3-1) prediction on the unlabeled data set U, labeling the normal samples as '0', and denoting the set of normal samples so labeled as L0;

S3, when the number of labels meets the training requirement of the supervised model (3-2), using the small labeled data set L in the data set as the training set, improving the classification capability of the supervised model (3-2) by increasing sample weights, and setting the class_weight parameter to 'balanced' to train the supervised model (3-2);

S4, performing supervised model (3-2) prediction on the unlabeled data set U, labeling the abnormal samples as '1', and denoting the set of abnormal samples so labeled as L1;

3. The abnormality detection method according to claim 1, characterized in that: the anomaly detection method also comprises test set anomaly point detection.

4. An abnormality detection method according to claim 2, characterized in that: the specific method for the unsupervised model (3-1) prediction in S2 is as follows: the trained unsupervised model (3-1) predicts the unlabeled data set U; according to the set abnormal point proportion parameter, samples whose abnormal point score exceeds a certain threshold are judged to be abnormal samples and labeled '1', and the remaining samples are the normal samples predicted by the unsupervised model (3-1), which are labeled '0' and merged into the data set L0.

5. An abnormality detection method according to claim 2, characterized in that: the specific method for the supervised model (3-2) prediction in S4 is as follows: the trained supervised model (3-2) predicts the unlabeled data set U; because labeled samples are few, the supervised model (3-2) is under-fitted, and the confusion matrix of its classification result shows high precision and low recall, so that samples predicted to be abnormal have high confidence, are labeled '1' and are merged into the data set L1, whereas samples predicted to be normal have lower confidence.

6. An abnormality detection method according to claim 3, characterized in that: the method for detecting abnormal points in the test set is as follows: with the unlabeled data fully utilized, the training set is L + U, the supervised model (3-2) is trained on it, and test set prediction is performed.

7. An anomaly detection system, characterized in that: the system comprises a data acquisition unit (1) for acquiring a data set; a model prediction unit (3) for training and fitting an unsupervised model (3-1) and a supervised model (3-2) and for predicting unlabeled data; a training set updating unit (4) for merging the predicted positive and negative sample sets and returning them to the data acquisition unit (1) to update the data set; a judgment unit (2) for judging whether the positive and negative sample sets have converged; and a detection unit (5) for detecting abnormal points in a test set.

8. The detection system of claim 7, wherein: the system further comprises a request receiving module for receiving a request, sent by the terminal device, to perform anomaly detection on the data set to be predicted;

and a control module for sending the anomaly detection result to the terminal device and controlling the terminal device to display the anomaly detection result.

9. A computer device, characterized by: comprises a processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to perform the method of any one of claims 1-6.

10. A computer-readable storage medium characterized by: the computer-readable storage medium has stored therein computer-executable instructions that, when executed by a processor, cause the processor to perform the steps of the method of any one of claims 1-6.