CN112749035A

CN112749035A - Anomaly detection method, device and computer readable medium

Info

Publication number: CN112749035A
Application number: CN201911056820.1A
Authority: CN
Inventors: 王梦天; 王梦杰; 莫登耀
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2019-10-31
Filing date: 2019-10-31
Publication date: 2021-05-04
Anticipated expiration: 2039-10-31
Also published as: CN112749035B

Abstract

The application provides an anomaly detection scheme that first fits a baseline using historical data of a system, then checks a test sample against the baseline to determine whether the system running state corresponding to the test sample is abnormal. During training, abnormal samples are removed from an initial training sample set to obtain a first training sample set, unequal-probability sampling is then performed on the first training sample set to obtain a second training sample set, and baseline fitting is performed based on the second training sample set. Because the abnormal samples are removed, the training samples are all normal samples, which avoids the influence of abnormal samples on the baseline fitting. The unequal-probability sampling retains as many of the scarce high-pressure-interval samples as possible, so that such samples are not missing from the training samples and the fitted baseline has more accurate detection capability.

Description

Anomaly detection method, device and computer readable medium

Technical Field

The present application relates to the field of information technology, and in particular, to an anomaly detection method, an anomaly detection device, and a computer readable medium.

Background

Read-write delay is an important index in the operation of a cloud computing system, and problems of the whole system are reflected in it. For example, when the system is abnormal, this index usually increases markedly. Accurate anomaly detection for the system is therefore very important.

Many key indexes of the system change in association with one another. For example, the read-write delay index is influenced by the read-write size index and the index of the number of reads and writes per second. An increase of the read-write delay index does not necessarily mean that an abnormality has occurred; it may be a normal increase caused by an increase of indexes such as the read-write size and the number of reads and writes per second. A spike in an index that can be explained is not an abnormality, whereas an index fluctuation that cannot be explained is an abnormality whose cause needs to be investigated and handled by operation and maintenance. Because a complex nonlinear linkage exists among multiple indexes, alarming can hardly be achieved by setting rules. Therefore, a scheme capable of accurately detecting system abnormality is currently lacking.

Summary of the Application

An object of the present application is to provide an anomaly detection scheme to solve the problem that a system anomaly cannot be accurately detected.

To achieve the above object, some embodiments of the present application provide an abnormality detection method including:

removing abnormal samples from an initial training sample set to obtain a first training sample set, wherein the samples in the initial training sample set comprise historical data of response indexes and pressure indexes of a system;

performing unequal probability sampling on the first training sample set to obtain a second training sample set, wherein, in the unequal probability sampling, the inclusion probability (the probability that a sample enters the sampled set) of samples collected during high-pressure-interval operation is greater than that of samples collected during low-pressure-interval operation;

performing baseline fitting based on the response indexes and the pressure indexes of the samples in the second training sample set to obtain a baseline for anomaly detection;

and detecting a test sample according to the baseline, and determining whether the system running state corresponding to the test sample is abnormal, wherein the test sample comprises a response index and a pressure index to be detected when the system runs.

Some embodiments of the present application also provide an abnormality detection apparatus, including:

the cleaning module is used for removing abnormal samples from an initial training sample set to obtain a first training sample set, wherein the samples in the initial training sample set comprise historical data of response indexes and pressure indexes of a system;

the sampling module is used for carrying out unequal probability sampling on the first training sample set to obtain a second training sample set, wherein, in the unequal probability sampling, the inclusion probability of samples collected during high-pressure-interval operation is greater than that of samples collected during low-pressure-interval operation;

the training module is used for performing baseline fitting on the basis of the response indexes and the pressure indexes of the samples in the second training sample set to obtain a baseline for anomaly detection;

and the detection module is used for detecting a test sample according to the baseline and determining whether the system running state corresponding to the test sample is abnormal or not, wherein the test sample comprises a response index to be detected and a pressure index to be detected during system running.

Further, some embodiments of the present application also provide a computing apparatus comprising a memory for storing computer program instructions and a processor for executing the computer program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform the anomaly detection method.

Further embodiments of the present application also provide a computer readable medium having stored thereon computer program instructions executable by a processor to implement the anomaly detection method.

In the anomaly detection scheme provided by the embodiments of the application, a baseline is first fitted using historical data of the system, and a test sample is then checked against the baseline to determine whether the system running state corresponding to the test sample is abnormal. During training, abnormal samples are removed from an initial training sample set to obtain a first training sample set, unequal-probability sampling is then performed on the first training sample set to obtain a second training sample set, and baseline fitting is performed based on the second training sample set. Because the abnormal samples are removed, the training samples are all normal samples, which avoids the influence of abnormal samples on the baseline fitting. The unequal-probability sampling retains as many of the scarce high-pressure-interval samples as possible, so that such samples are not missing from the training samples and the fitted baseline has more accurate detection capability.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is a processing flowchart of an anomaly detection method according to an embodiment of the present application;

FIG. 2 is a flowchart of removing abnormal samples in an embodiment of the present application;

FIG. 3 is a schematic diagram of a clustering result in an embodiment of the present application;

FIG. 4 is a flowchart illustrating a process for adjusting hyper-parameters in an embodiment of the present application;

FIG. 5 is a schematic structural diagram of an anomaly detection device according to an embodiment of the present application;

FIG. 6 is a schematic structural diagram of a computing apparatus for implementing anomaly detection according to an embodiment of the present application.

The same or similar reference numbers in the drawings identify the same or similar elements.

Detailed Description

The present application is described in further detail below with reference to the attached figures.

In a typical configuration of the present application, the terminal and the devices of the service network each include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Disks (DVD) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device.

The embodiment of the application provides an anomaly detection method that removes the abnormal samples from the initial training sample set, so that the training samples are all normal samples and the influence of abnormal samples on baseline fitting is avoided. Unequal-probability sampling retains as many of the scarce high-pressure-interval samples as possible, so that such samples are not missing from the training samples and the fitted baseline has better detection capability.

In a practical scenario, the execution subject of the method may be a network device, or an apparatus formed by integrating a user equipment and a network device through a network. The user equipment includes but is not limited to terminal devices such as a personal computer, a smart phone, and a tablet computer, and the network device includes but is not limited to a network host, a single network server, a set of multiple network servers, or a cloud computing-based computer cluster.

Fig. 1 illustrates a processing flow of an anomaly detection method provided in an embodiment of the present application, where the method may include the following steps:

Step S101, removing abnormal samples from the initial training sample set to obtain a first training sample set. The samples in the initial training sample set include historical data of the response index and the pressure indexes of the system; for example, the response index and the pressure indexes recorded while the system was running during the last half month can be taken as training samples. The response index may be a key index that reflects the overall operation condition of the system, for example read-write delay (IoLatency). A pressure index may be an index that is correlated with the response index and can cause the response index to change; for example, for the response index IoLatency, the related pressure indexes may be read-write size (IoSize), number of reads and writes per second (Iops), throughput (Throughput), and the like.

During actual operation of the system, an increase of the response index does not necessarily mean that an abnormality has occurred: it may be caused by an abnormality of the system, or by an increase of the corresponding pressure index. Because the samples in the initial training sample set are historical data of the response index and the pressure indexes of the system, a sample with a high response index may be a normal sample (the response index rises normally because the pressure index rises) or an abnormal sample (the response index rises because the system is abnormal). Removing the abnormal samples from the initial training sample set makes the training samples all normal samples, so that the influence of abnormal samples on the baseline fitting is avoided.

In some embodiments of the present application, the abnormal samples may be eliminated by using the processing method shown in fig. 2, and the processing flow includes the following processing steps:

Step S201, clustering the samples in the initial training sample set based on the response index and the pressure indexes, and determining a plurality of sample classes. For example, the samples may be clustered using a clustering algorithm such as K-means. In an actual scenario, if the number of indexes is large, the dimension processed during clustering is high and the complexity increases accordingly, so the number of pressure indexes can be reduced before clustering to lower the clustering complexity.
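The clustering in step S201 can be illustrated with a minimal K-means (Lloyd's algorithm) sketch in plain Python. The function name `kmeans`, the caller-supplied initial centers, and the toy data are assumptions made for illustration; the source only names K-means as one possible clustering algorithm.

```python
import math

def kmeans(points, centers, max_iter=100):
    """Lloyd's k-means on a list of (pressure, response) tuples.

    `centers` holds the initial class centers (an assumption of this sketch;
    the source does not say how the centers are initialized).
    Returns (labels, centers).
    """
    centers = [tuple(c) for c in centers]
    labels = [0] * len(points)
    for _ in range(max_iter):
        # Assignment step: each sample joins the class with the nearest center.
        new_labels = [
            min(range(len(centers)), key=lambda k: math.dist(p, centers[k]))
            for p in points
        ]
        # Update step: each center moves to the mean of its member samples.
        new_centers = []
        for k, old in enumerate(centers):
            members = [p for p, lab in zip(points, new_labels) if lab == k]
            if members:
                new_centers.append(tuple(sum(v) / len(members) for v in zip(*members)))
            else:
                new_centers.append(old)  # an empty class keeps its old center
        if new_labels == labels and new_centers == centers:
            break  # nothing changed, so the clustering has converged
        labels, centers = new_labels, new_centers
    return labels, centers

# Toy data: three low-pressure samples and three high-pressure samples.
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]
labels, centers = kmeans(points, centers=[points[0], points[3]])
```

In this toy run the first three samples end up in class 0 and the last three in class 1, with class centers near (1, 1) and (5, 5).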

The clustering processing scheme provided by the embodiment of the application is as follows. First, when the number of pressure indexes is greater than a preset value, the associated indexes most related to the response index are determined from the pressure indexes. The number of associated indexes is less than or equal to the preset value, and the preset value can be set according to different requirements; for example, it can be set to 1. When the pressure indexes are the read-write size, the number of reads and writes per second, and the throughput, the number of pressure indexes exceeds one, so the single associated index most related to the response index can be determined based on these pressure indexes. K-means clustering is then carried out on 1 response index and 1 associated index, that is, in a two-dimensional space only, which greatly reduces the processing complexity.

Depending on the processing mode, the associated index can be selected directly from the pressure indexes, or can be a new index calculated from the pressure indexes. For example, principal component analysis may be performed on the plurality of pressure indexes to combine them into one principal component that serves as the associated index; alternatively, a Spearman correlation coefficient may be calculated between each pressure index and the response index, and the most correlated pressure index may be selected as the associated index.

After the associated index is determined, the samples in the training sample set may be clustered based on the associated index and the response index to determine a plurality of sample classes. When there are too many pressure indexes, the associated indexes obtained by processing them are few in number, so the complexity of the clustering is reduced and the processing efficiency is improved.
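The Spearman-based selection of a single associated index can be sketched as follows. The helper names, the use of the absolute value of the coefficient as the ranking criterion, and the toy measurements are all assumptions for illustration; the source only says that the most correlated pressure index is chosen.

```python
def ranks(values):
    """Average ranks (1-based); tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean rank of the tied run order[i..j]
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman coefficient, computed as the Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))

def most_associated(pressure, response):
    """Name of the pressure index with the largest |Spearman| to the response."""
    return max(pressure, key=lambda name: abs(spearman(pressure[name], response)))

# Hypothetical history: one response index and three pressure indexes.
io_latency = [1.0, 1.5, 2.1, 3.0, 4.2, 6.0]
pressure = {
    "IoSize": [4, 8, 4, 16, 8, 16],
    "Iops": [100, 300, 200, 500, 400, 900],
    "Throughput": [10, 20, 35, 50, 80, 120],
}
associated = most_associated(pressure, io_latency)
```

With this made-up data the throughput rises monotonically with the latency, so its Spearman coefficient is 1 and it is selected as the associated index.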

Step S202, after clustering is finished, removing the sample classes whose class centers do not meet a preset condition as abnormal samples, to obtain the first training sample set. The preset condition is used to exclude rises of the response index caused by system abnormality. For example, in the embodiment of the present application, when the response index is greater than Y while the pressure index or the associated index is smaller than X, the high response index cannot be explained as a normal rise caused by a rise of the pressure index or the associated index. The preset condition can therefore be defined by a threshold on the class center of the response index and a threshold on the class center of the pressure index or the associated index: when the class center of the response index is greater than one threshold and the class center of the pressure index or the associated index is smaller than the other threshold, the class center of the sample class is considered not to meet the preset condition, and the samples contained in that sample class are removed as abnormal samples.

FIG. 3 shows a schematic diagram of a clustering result obtained by clustering on the response index read-write delay and the pressure index throughput. The clustering result comprises 5 sample classes G0 to G4, and the samples of sample classes G2 and G4, which lie in the region Area, are determined to be abnormal samples according to the relation between their class centers and the thresholds Y and X in the preset condition.

In some embodiments of the present application, the samples in the initial training sample set may be normalized before the abnormal samples are removed. In this embodiment, for example, the normalization eliminates the dimension (unit) of each index and makes the data more concentrated, which facilitates subsequent processing. For the normalized samples, when the abnormal samples are removed, the samples of sample classes whose response-index class center exceeds 3 sigma while the pressure-index class center is within 3 sigma (or, more conservatively, 2 or 1.5 sigma) may be regarded as abnormal samples and removed, and the samples in the remaining sample classes form the first training sample set.
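The normalization and the class-center rule can be put together in a short sketch. It assumes z-score normalization with the population standard deviation and reads "within 3 sigma" as "below the limit"; the function names, default limits, and toy data are hypothetical, and the class labels are taken as already produced by the clustering step.

```python
from statistics import mean, pstdev

def zscore(values):
    """Standardize one index to zero mean and unit standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def remove_abnormal_classes(pressure, response, labels, y_limit=3.0, x_limit=3.0):
    """Return the indices of the samples kept for the first training sample set.

    A sample class is dropped when its class center has a normalized response
    above y_limit while its normalized pressure stays below x_limit.
    """
    zp, zr = zscore(pressure), zscore(response)
    dropped = set()
    for k in set(labels):
        members = [i for i, lab in enumerate(labels) if lab == k]
        center_p = mean(zp[i] for i in members)
        center_r = mean(zr[i] for i in members)
        if center_r > y_limit and center_p < x_limit:
            dropped.add(k)  # high response that the pressure cannot explain
    return [i for i, lab in enumerate(labels) if lab not in dropped]

# Toy data: 40 ordinary samples (class 0) and 2 samples with a very high
# response at low pressure (class 1).
pressure = [float(i % 10) for i in range(40)] + [1.0, 2.0]
response = [1.0 + 0.1 * (i % 10) for i in range(40)] + [50.0, 50.0]
labels = [0] * 40 + [1] * 2
kept = remove_abnormal_classes(pressure, response, labels)
```

Here only class 1 violates the condition, so the 40 samples of class 0 are kept.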

Step S102, performing unequal probability sampling on the first training sample set to obtain a second training sample set. The purpose of the scheme is to detect abnormality of the system running state with a certain degree of real-time performance, so the sampling interval of the monitoring data is short and a large number of samples is generated. Processing all of these samples directly would incur a huge calculation cost, so sampling is required to reduce the calculation cost.

In the scheme of the embodiment of the application, the sampling is unequal-probability sampling, in which the inclusion probability of samples collected during high-pressure-interval operation is greater than that of samples collected during low-pressure-interval operation. The high-pressure interval corresponds to the system running with a high pressure index, and the low-pressure interval corresponds to the system running with a low pressure index. The two intervals can be determined according to how the system runs in the actual scenario; for example, distinguishing thresholds can be set manually, so that the running state of the system is divided into several pressure intervals of different levels according to the pressure index at the time.

In an actual scenario, the system is seldom in a high-pressure state, so the number of high-pressure-interval samples in the historical data is usually markedly smaller than the number of low-pressure-interval samples. If simple random sampling were adopted, too many low-pressure-interval samples would probably be retained in the training sample set and only a few high-pressure-interval samples would be drawn, so the training samples would be unbalanced and the baseline fitting would be affected.

In some embodiments of the present application, in combination with the aforementioned clustering of samples, the unequal-probability sampling may be performed as follows. First, the sampling weight of each sample class is determined according to the number of samples of that class in the first training sample set. The sampling weight of a sample class is negatively correlated with the number of samples in the class and positively correlated with the inclusion probability of the samples in the class; that is, the more samples a class contains, the smaller its sampling weight and the smaller the probability that each of its samples is drawn. For example, in this embodiment the sampling weight may be set to 1/sqrt(group_size), the reciprocal of the square root of the number of samples, where group_size is the number of samples in the sample class. For a sample class G1 with 10000 samples the sampling weight is 1/100, and for a sample class G2 with 100 samples the sampling weight is 1/10. The inclusion probability may be proportional to the corresponding sampling weight; here the inclusion probability of sample class G2 is 10 times that of sample class G1, that is, during sampling a sample in G2 is 10 times as likely to be drawn as a sample in G1.
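The weight rule 1/sqrt(group_size) can be checked with a small sketch that also draws a weighted sample without replacement. The weight formula is from the text; the sampler itself (Efraimidis-Spirakis keys), the function names, the fixed random seed, and the target size of 500 are assumptions, since the source does not say how the weighted draw is implemented.

```python
import math
import random

def sampling_weight(group_size):
    """Sampling weight of a sample class: 1 / sqrt(number of samples)."""
    return 1.0 / math.sqrt(group_size)

def weighted_sample(items, weights, k, rng):
    """Draw k items without replacement, favoring larger weights.

    Uses Efraimidis-Spirakis keys u ** (1 / w): the k largest keys win.
    This sampler is an assumption of the sketch, not taken from the source.
    """
    keyed = [(rng.random() ** (1.0 / w), item) for item, w in zip(items, weights)]
    keyed.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in keyed[:k]]

# The two classes from the text: G1 with 10000 samples, G2 with 100 samples.
w_g1 = sampling_weight(10000)  # 1/100
w_g2 = sampling_weight(100)    # 1/10

samples = [("G1", i) for i in range(10000)] + [("G2", i) for i in range(100)]
weights = [w_g1] * 10000 + [w_g2] * 100
second_set = weighted_sample(samples, weights, 500, random.Random(0))

g1_kept = sum(1 for name, _ in second_set if name == "G1")
g2_kept = len(second_set) - g1_kept
```

Because every G2 sample carries ten times the weight of a G1 sample, the small class retains a much larger share of its samples than the large class does.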

Based on the sampling weights, each sample class in the first training sample set is sampled to obtain the second training sample set. Table 1 shows the sampling results after several sample classes were sampled with unequal probability in the above manner:

Class number	0	1	2	3	4
Number of samples	15122	4707	3115	7056	235
Sampling result	3860	1979	1507	2440	214

TABLE 1

The clustering result reflects, to a certain extent, how the system actually runs in a real scenario, that is, samples in the high-pressure interval are usually far fewer than samples in the low-pressure interval. For example, the sample class numbered 4 above lies in a higher pressure interval and has the fewest samples in the first training sample set. When the sampling weights are determined in the above manner and unequal-probability sampling is performed with the corresponding inclusion probabilities, the sample class numbered 4 has the highest retention ratio, so the second training sample set retains as many high-pressure-interval samples as possible and the accuracy of the baseline fitting is ensured.
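The retention ratios implied by Table 1 can be computed directly. The numbers are taken from the table; the variable names are only illustrative.

```python
# Number of samples and sampling result per class number, from Table 1.
class_sizes = {0: 15122, 1: 4707, 2: 3115, 3: 7056, 4: 235}
sampled = {0: 3860, 1: 1979, 2: 1507, 3: 2440, 4: 214}

# Retention ratio = sampled count / original count for each sample class.
retention = {k: sampled[k] / class_sizes[k] for k in class_sizes}
highest = max(retention, key=retention.get)

for k, r in retention.items():
    print(f"class {k}: kept {r:.1%}")
```

Class 4 keeps about 91% of its samples while class 0, the largest class, keeps about 26%, which matches the statement that the smallest class has the highest retention ratio.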

Step S103, performing baseline fitting based on the response indexes and the pressure indexes of the samples in the second training sample set to obtain a baseline for anomaly detection. In other words, historical data is used to train a regression model that predicts the response index whose anomalies are to be detected. For example, the pressure indexes read-write size, number of reads and writes per second, and throughput are used to predict the response index read-write delay. The predicted values of the response index under different pressure indexes form the baseline for anomaly detection.

Step S104, detecting a test sample according to the baseline, and determining whether the system running state corresponding to the test sample is abnormal.

The test sample comprises a response index and a pressure index to be detected when the system runs. Based on the pressure index in the test sample and the baseline, the predicted value of the response index under that pressure index, denoted y_hat, can be obtained, while the response index in the test sample is the true value of the response index under that pressure index, denoted y.

Therefore, in some embodiments of the application, when a test sample is detected according to the baseline to determine whether the corresponding system running state is abnormal, it may be determined, according to the baseline, whether the response index corresponding to the pressure index to be detected in the test sample exceeds an alarm threshold of the baseline. If the response index exceeds the alarm threshold, the system running state corresponding to the test sample is determined to be abnormal. Otherwise, if the response index does not exceed the alarm threshold, the system running state corresponding to the test sample is considered to be normal.
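As a minimal sketch of baseline fitting, the following fits an ordinary least squares line from a single pressure index to the response index. The source does not specify the type of regression model; the linear model, the function name `fit_baseline`, and the toy numbers are assumptions for illustration only.

```python
def fit_baseline(pressure, response):
    """Ordinary least squares fit of response = slope * pressure + intercept.

    Returns a function that maps a pressure value to the predicted response,
    i.e. the baseline value y_hat. A linear model is an assumption here.
    """
    n = len(pressure)
    mx, my = sum(pressure) / n, sum(response) / n
    sxx = sum((x - mx) ** 2 for x in pressure)
    sxy = sum((x - mx) * (y - my) for x, y in zip(pressure, response))
    slope = sxy / sxx
    intercept = my - slope * mx
    return lambda x: slope * x + intercept

# Hypothetical second training sample set: throughput vs. read-write delay.
throughput = [10.0, 20.0, 30.0, 40.0, 50.0]
io_latency = [2.0, 3.0, 4.0, 5.0, 6.0]
baseline = fit_baseline(throughput, io_latency)
y_hat = baseline(60.0)  # predicted delay at a throughput of 60
```

For this perfectly linear toy data the fitted line has slope 0.1 and intercept 1, so the baseline value at a throughput of 60 is 7.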

The alarm threshold of the baseline can be set according to the requirements of the actual application scenario, and the following several alarm thresholds of the baseline can be adopted:

(1) Check whether the true value of the response index deviates from the baseline by more than a times the normal jitter amplitude. In this case, the alarm threshold is y_limit_1 = y_hat_Q3 + a × IQR, where y_hat_Q3 is the third quartile of the response index in the second training sample set, that is, the value at the 75th percentile of the response index, and IQR is the difference between the third quartile and the first quartile of the response index in the second training sample set, that is, the difference between the value at the 75th percentile and the value at the 25th percentile.

(2) Check whether the true value of the response index exceeds the baseline by more than an absolute value b. In this case, the alarm threshold is y_limit_2 = y_hat + b.

(3) Check whether the true value of the response index exceeds the baseline by more than a ratio c. In this case, the alarm threshold is y_limit_3 = y_hat × c.

Here a, b and c are adjustable hyper-parameters. For example, their initial values may be set to 1.5, 0 and 1 respectively, and they may then be adjusted continuously to minimize the target loss function of the baseline model, so that the detection accuracy is better.

In practical use, the alarm threshold of the baseline may be determined as the maximum of a first threshold y_limit_1, a second threshold y_limit_2 and a third threshold y_limit_3. The first threshold y_limit_1 is the sum of the third quartile y_hat_Q3 of the response index in the second training sample set and an operation value, the operation value being the product of the interquartile range IQR of the response index in the second training sample set and a first hyper-parameter a. The second threshold y_limit_2 is the sum of the value y_hat corresponding to the pressure index to be detected on the baseline and a second hyper-parameter b. The third threshold y_limit_3 is the product of the value y_hat corresponding to the pressure index to be detected on the baseline and a third hyper-parameter c. That is, the alarm threshold actually used for detection is y_limit = max(y_limit_1, y_limit_2, y_limit_3), which means the system running state corresponding to a test sample is considered abnormal only when the test sample exceeds all three alarm thresholds.
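The three thresholds and their maximum can be sketched as below, using the initial hyper-parameter values 1.5, 0 and 1 mentioned in the text as defaults. The quartile convention (the "inclusive" method of Python's `statistics.quantiles`), the function names, and the toy training data are assumptions; the source does not say how the quartiles are computed.

```python
from statistics import quantiles

def alarm_threshold(train_response, y_hat, a=1.5, b=0.0, c=1.0):
    """y_limit = max(y_limit_1, y_limit_2, y_limit_3) for one test sample."""
    # Quartiles of the response index in the second training sample set.
    q1, _, q3 = quantiles(train_response, n=4, method="inclusive")
    y_limit_1 = q3 + a * (q3 - q1)  # third quartile + a * IQR
    y_limit_2 = y_hat + b           # baseline value + absolute margin
    y_limit_3 = y_hat * c           # baseline value * ratio
    return max(y_limit_1, y_limit_2, y_limit_3)

def is_abnormal(y, train_response, y_hat, a=1.5, b=0.0, c=1.0):
    """True when the true response y exceeds the combined alarm threshold."""
    return y > alarm_threshold(train_response, y_hat, a, b, c)

# Toy training responses 1..9: Q1 = 3, Q3 = 7, IQR = 4, so y_limit_1 = 13.
train_response = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
limit_low_baseline = alarm_threshold(train_response, y_hat=10.0)   # 13.0
limit_high_baseline = alarm_threshold(train_response, y_hat=20.0)  # 20.0
```

With a baseline value of 10 the quartile-based threshold dominates, so a true response of 12 is not flagged even though it is above the baseline; with a baseline value of 20 the baseline-based thresholds dominate instead.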

In an actual scenario, the collected sample data often has unequal variance (heteroscedasticity); for example, an index fluctuates strongly in the high-pressure interval and weakly in the low-pressure interval. Moreover, because a cloud computing system provides services for different businesses, the index fluctuation of each business cluster is different; for example, some business clusters fluctuate strongly in the high-pressure interval while others fluctuate strongly in the low-pressure interval. To solve this problem, in the anomaly detection method provided in the embodiment of the present application, the test samples are numerically scaled before they are detected, so that the variances of the test samples are equal.

In some embodiments of the present application, the scaling may be performed by applying a Box-Cox transformation to the test sample. The Box-Cox transformation is a generalized power transformation commonly used in statistical modeling. Its formula can be set as follows: y = (x^λ − 1)/λ when λ is not equal to 0, and y = log(x) when λ is equal to 0, where λ is a parameter that controls how the values are compressed: it determines whether the transformation compresses high-value points or low-value points, and to what degree. The most appropriate value of λ can be estimated by the maximum likelihood method according to the characteristics of the index values in each business cluster, so that an optimal λ is determined for each business cluster and the variances of the test samples from different business clusters are equal after the Box-Cox transformation. Therefore, anomaly detection can use a uniform global alarm threshold, without setting a separate adapted alarm threshold for each business cluster.
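The transformation and a simple maximum likelihood estimate of λ can be sketched in plain Python. The transform follows the formula in the text; estimating λ by a grid search over the standard Box-Cox profile log-likelihood, the grid from -2 to 2 in steps of 0.1, and the function names are assumptions of this sketch.

```python
import math

def box_cox(x, lam):
    """Box-Cox transform of a positive value x with parameter lam."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1.0) / lam

def log_likelihood(values, lam):
    """Profile log-likelihood of lam under a normal model for transformed data."""
    y = [box_cox(v, lam) for v in values]
    n = len(y)
    mu = sum(y) / n
    var = sum((t - mu) ** 2 for t in y) / n  # maximum likelihood variance
    return -0.5 * n * math.log(var) + (lam - 1.0) * sum(math.log(v) for v in values)

def estimate_lambda(values, grid=None):
    """Pick the lam on a grid that maximizes the log-likelihood."""
    grid = grid or [i / 10 for i in range(-20, 21)]
    return max(grid, key=lambda lam: log_likelihood(values, lam))

# Toy index values whose logarithms are evenly spaced from -2 to 2, so a
# logarithmic compression (lam near 0) is expected to fit best.
values = [math.exp(z / 2) for z in range(-4, 5)]
lam = estimate_lambda(values)
scaled = [box_cox(v, lam) for v in values]
```

In practice one λ would be estimated per business cluster from that cluster's own index values, and the test samples of the cluster would be transformed with it before being compared against the global alarm threshold.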

Since the anomaly detection performed by the method provided in the above embodiment is completely unsupervised, the detected anomalies are anomalies in the statistical sense. In an actual scenario, such statistical anomalies sometimes differ from the judgment of operation and maintenance personnel and from the tolerance of the system, so false alarms occur. Additional information can be obtained by introducing human knowledge through labeling, and this information can be used to optimize the accuracy of anomaly detection. Therefore, in the anomaly detection method provided in some embodiments of the present application, manual labeling results of a part of the test samples may also be obtained, and the hyper-parameters are then adjusted according to the detection results of the manually labeled test samples and their manual labeling results. Because the user only needs to label part of the test samples manually, rather than all samples participating in detection, the labeling workload is very limited and labor cost is effectively saved.

When the hyper-parameters are adjusted according to the detection results of the manually labeled test samples and the manual labeling results, the processing flow shown in FIG. 4 may be adopted, which includes:

Step S401, calculating the cost value of the hyper-parameters adjusted by a search algorithm, according to the detection results and manual labeling results of the manually labeled test samples and the detection results of the test samples that are not manually labeled.

Assume that there are N1 test samples that are not labeled manually, and N2 test samples that are labeled manually, for example, N1 is 10000 and N2 is 10.

For the N1 samples that were not manually labeled, the results of the anomaly detection were all considered correct. If the alarm threshold y _ limit is changed after the hyper-parameters a, b, and c are changed, the detection result may also be changed accordingly, for example, a test sample that was previously determined to be normal may be determined to be abnormal after the hyper-parameters are changed, or a test sample that was previously determined to be abnormal may be determined to be normal after the hyper-parameters are changed. In this case, the number of samples detected as abnormal in the samples that are not artificially labeled before the super-parameter adjustment is denoted by NP1, and the number of samples detected as normal in the samples that are not artificially labeled before the super-parameter adjustment is denoted by NN1, for example, NP1 is 50, NN1 is 9950, and N1 is NP1+ NN1 in this embodiment. After the hyper-parameter is changed, it is possible to change the partial sample detection result, and the sample whose detection result is changed from abnormal to normal may be represented by FNN1, and the sample whose detection result is changed from normal to abnormal may be represented by FNP1, for example, FNN1 is 2 and FNP1 is 5 in the present embodiment. The cost value C1 + FNN1/NP1+ FNP1/NN 1+ 2/50+5/9950 for the unlabeled sample after the adjustment of the hyper-parameters can thus be calculated.

For the N2 manually labeled samples, the detection result obtained with the adjusted hyper-parameters a, b, c may not be consistent with the manual labeling result. Let NP2 denote the number of test samples whose manual labeling result is normal, and NN2 the number of test samples whose manual labeling result is abnormal; in this embodiment, for example, NP2 = 6, NN2 = 4, and N2 = NP2 + NN2. After the hyper-parameters are changed, some detection results may not be consistent with the manual labeling results. Let FNN2 denote the number of samples whose manual labeling result is abnormal but whose detection result is normal, and FNP2 the number of samples whose manual labeling result is normal but whose detection result is abnormal; in this embodiment, for example, FNN2 = 1 and FNP2 = 2. The cost value for the labeled samples after the hyper-parameter adjustment can thus be calculated as C2 = FNN2/NP2 + FNP2/NN2 = 1/6 + 2/4.

By means of weighted summation, the cost value C of the hyper-parameters adjusted by the search algorithm can be calculated from C1 and C2, that is, C = C1/N1 + C2/N2.
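As a minimal illustrative sketch, the cost calculation of step S401 can be written out in Python using the example figures of this embodiment. The function and argument names are assumptions introduced here for illustration and do not appear in the application.

```python
def cost_value(np_count, nn_count, fnn, fnp):
    """Cost of one sample group, computed as FNN/NP + FNP/NN."""
    return fnn / np_count + fnp / nn_count


def combined_cost(c1, n1, c2, n2):
    """Weighted sum of the two group costs, C = C1/N1 + C2/N2."""
    return c1 / n1 + c2 / n2


# Example figures from the embodiment.
c1 = cost_value(np_count=50, nn_count=9950, fnn=2, fnp=5)  # unlabeled samples
c2 = cost_value(np_count=6, nn_count=4, fnn=1, fnp=2)      # labeled samples
c = combined_cost(c1, n1=10000, c2=c2, n2=10)
print(c1, c2, c)
```

Because C1 is divided by the large N1 and C2 by the small N2, a disagreement with a manual label contributes far more to C than a changed result on an unlabeled sample.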

Step S402, a target loss function is set, wherein the target loss function is related to the cost value of the hyperparameter adjusted by adopting a search algorithm.

Step S403, determining a hyper-parameter that minimizes the target loss function according to the target loss function. For example, in the embodiment of the present application, a target loss function related to the cost value C after the hyper-parameter is adjusted by using the search algorithm may be set as:

wherein w is an adjustment value which can be set to a constant between 10^-10 and 10^-6, a0, b0, c0 are the hyper-parameters before adjustment, and y_mean is the arithmetic mean of the response indicators in all samples.

Based on the target loss function, a search algorithm may be used to find a set of optimal hyper-parameters that minimizes the target loss function. The optimal hyper-parameters found can be used to determine the alarm threshold for the next detection, so as to achieve more accurate anomaly detection.

In an actual scenario, any applicable search algorithm may be used. For the target loss function set in this embodiment, the gradient cannot be calculated, so the Nelder-Mead algorithm may be used to search for the optimal hyper-parameters. Because the Nelder-Mead algorithm easily falls into a local optimum, the optimization may be run from a plurality of initial values. For example, for each sample defined as normal, including labeled samples and unlabeled samples, the equation y = y_Q3 + a × IQR can be solved for a; N samples yield N values of a, and the maximum of these values is selected as the initial value of a in the search algorithm. When implementing the search algorithm, each search for the optimal value may start from an initial value for one or two hyper-parameters, and from either the default value [1.5, 0, 1] or the last value of the hyper-parameters. In this way, the optimal hyper-parameters can be searched for more accurately and efficiently.
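The choice of the initial value of a can be sketched as follows. This is an illustration only: the function name, the sample values, and the quartile convention (the "inclusive" method of Python's statistics module) are assumptions, since the application does not state how y_Q3 and IQR are computed.

```python
import statistics


def initial_a(normal_responses):
    """Largest a that solves y = y_Q3 + a * IQR over the samples defined as normal."""
    q1, _, q3 = statistics.quantiles(normal_responses, n=4, method="inclusive")
    iqr = q3 - q1
    if iqr == 0:
        raise ValueError("IQR is zero; a is undefined for constant data")
    return max((y - q3) / iqr for y in normal_responses)


# Hypothetical response values (for example, read-write delay) of normal samples.
a0 = initial_a([10, 12, 14, 16, 18])

# Candidate [a, b, c] starting points for a multi-start search:
# the data-driven start and the default value mentioned in the text.
start_points = [[a0, 0, 1], [1.5, 0, 1]]
```

Each entry of start_points could then be passed as the starting point of a derivative-free optimizer such as Nelder-Mead, keeping the run that yields the lowest loss.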

Based on the same inventive concept, the embodiment of the application also provides an abnormality detection device, the corresponding method of the device is the abnormality detection method in the previous embodiment, and the principle of solving the problem is similar to the method.

When implementing anomaly detection, the anomaly detection device provided by the embodiment of the application removes the abnormal samples from the initial training sample set, so that the samples used for training are all normal samples and the influence of abnormal samples on baseline fitting is avoided. In addition, the unequal-probability sampling retains as many as possible of the relatively few samples in the high-pressure interval, so that the training samples do not lack high-pressure samples and the fitted baseline has better detection capability.

In an actual scenario, the anomaly detection apparatus may be a network device, or an apparatus formed by integrating a user equipment and a network device through a network. The user equipment includes but is not limited to various terminal devices such as a personal computer, a smart phone, and a tablet computer, and the network device includes but is not limited to implementations such as a network host, a single network server, a set of multiple network servers, or a set of computers based on cloud computing. Here, the cloud is made up of a large number of hosts or network servers based on cloud computing, which is a type of distributed computing in which a collection of loosely coupled computers forms one virtual computer.

Fig. 5 shows a structure of an anomaly detection apparatus provided in an embodiment of the present application, which includes a cleaning module 510, a sampling module 520, a training module 530, and a detection module 540. The cleaning module 510 is configured to remove abnormal samples from the initial training sample set to obtain a first training sample set. Wherein, the samples in the initial training sample set include historical data of the response index and the pressure index of the system, for example, the response index and the pressure index of the system when operating in the last half month can be taken as training samples. The response index may generally be a key index capable of reflecting the overall operation condition of the system, for example, the response index may be read-write delay, and the pressure index may be an index having a certain correlation with the response index and capable of causing a change in the response index, for example, for the read-write delay of the response index, the pressure index related thereto may be read-write size, read-write times per second, throughput, or the like.

In some embodiments of the present application, the cleaning module 510 may adopt a processing manner as shown in fig. 2 to remove the abnormal sample, and the processing flow includes the following processing steps:

step S201, a clustering unit of the cleaning module clusters the samples in the initial training sample set based on the response index and the pressure index, and determines a plurality of sample classes. For example, the samples may be clustered using a clustering algorithm such as K-means. In an actual scene, if the number of the indexes is large, the processing dimension is also high during clustering, and the complexity is also increased, so that the number of the pressure indexes can be reduced before clustering, and the complexity of clustering is reduced.

Step S202, after the clustering is finished, a cleaning unit of the cleaning module removes, as abnormal samples, each sample class whose class center does not meet the preset condition, to obtain a first training sample set. The preset condition is used to exclude rises in the response index that are caused by system abnormality. For example, in the embodiment of the present application, it is considered that when the response index is greater than Y while the pressure index or the related index is smaller than X, the high response index cannot be interpreted as a normal rise caused by a rise in the pressure index or the related index. The preset condition can therefore be defined in terms of thresholds on the class center of the response index and on the class center of the pressure index or the related index: when the class center of the response index is larger than one threshold and the class center of the pressure index or the related index is smaller than the other threshold, the class center of the sample class is considered not to meet the preset condition, and the samples contained in the sample class are removed as abnormal samples.

In some embodiments of the present application, the apparatus provided by the present application may further include a preprocessing module, and the preprocessing module may perform normalization processing on the samples in the initial training sample set before the abnormal samples are removed. For example, in this embodiment, the normalization process eliminates the dimension of each index and makes the data more concentrated, which facilitates subsequent processing. For the normalized samples, when the abnormal samples are removed, the samples of any sample class in which the class center of the response index exceeds 3 sigma and the class center of the pressure index is within 3 sigma (or, more conservatively, 2 or 1.5 sigma) may be regarded as abnormal samples. Such sample classes are removed, and the samples in the remaining sample classes form the first training sample set.
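The class-center rule described above can be sketched as follows. The sketch assumes that the samples are already standardized and that cluster labels have already been produced (for example by K-means); the function name, argument names, and sample values are hypothetical, and a single pressure index is used for brevity.

```python
from collections import defaultdict


def remove_abnormal_classes(samples, labels, response_limit=3.0, pressure_limit=3.0):
    """Drop every class whose center shows a high response at unremarkable pressure.

    samples: list of (pressure_z, response_z) pairs, already standardized.
    labels:  cluster label of each sample.
    """
    groups = defaultdict(list)
    for sample, label in zip(samples, labels):
        groups[label].append(sample)

    kept = []
    for members in groups.values():
        pressure_center = sum(p for p, _ in members) / len(members)
        response_center = sum(r for _, r in members) / len(members)
        # Abnormal: response center beyond the limit while the pressure
        # center stays within its limit, so pressure cannot explain the rise.
        abnormal = (response_center > response_limit
                    and abs(pressure_center) <= pressure_limit)
        if not abnormal:
            kept.extend(members)
    return kept


samples = [(0.1, 0.2), (-0.2, -0.1), (0.3, 4.0), (0.1, 5.0), (3.5, 4.2)]
labels = [0, 0, 1, 1, 2]
first_training_set = remove_abnormal_classes(samples, labels)
```

In this example class 1 is removed, while class 2 is kept because its high response coincides with high pressure.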

The sampling module 520 is configured to perform unequal sampling on the first training sample set to obtain a second training sample set. The purpose of the scheme is to detect the abnormality of the system running state and ensure certain real-time performance, so that the sampling time interval of the monitoring data is short, a large number of samples can be generated, and if the monitoring data is directly processed based on all the samples, huge calculation cost is required, so that sampling is required to reduce the calculation cost.

In the scheme of the embodiment of the application, under the unequal-probability sampling, the probability that a sample from operation in the high-pressure interval is selected is greater than the probability that a sample from operation in the low-pressure interval is selected. The high-pressure interval corresponds to the condition in which the pressure index is high during system operation, and conversely, the low-pressure interval corresponds to the condition in which the pressure index is low during system operation. In an actual scenario, the system is rarely in a high-pressure state during operation, so the number of samples in the high-pressure interval in the historical data is often significantly smaller than the number of samples in the low-pressure interval. If simple random sampling were adopted, too many low-pressure samples would probably be retained in the training sample set and only a small number of high-pressure samples would be extracted, making the training samples unbalanced and affecting the baseline fitting.

Based on the sampling weight, sampling is carried out on each sample class in the first training sample set, and then a second training sample set can be obtained. Table 1 shows the sampling results obtained by sampling several sample classes in the manner described above.
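A minimal sketch of such class-wise unequal-probability sampling is given below. The weight rule used here, min(1, per_class_target / class size), is an assumption chosen only because it decreases as the class grows; it is not the weight formula of the application, and all names are hypothetical.

```python
import random


def unequal_probability_sample(classes, per_class_target, seed=0):
    """Sample each class with probability min(1, per_class_target / class_size).

    classes: dict mapping class label -> non-empty list of samples.
    Smaller (typically high-pressure) classes get a larger sampling probability.
    """
    rng = random.Random(seed)
    sampled = {}
    for label, members in classes.items():
        weight = min(1.0, per_class_target / len(members))  # hypothetical weight
        k = round(weight * len(members))
        sampled[label] = rng.sample(members, k)
    return sampled


# Hypothetical class sizes: many low-pressure samples, few high-pressure samples.
classes = {"low_pressure": list(range(1000)), "high_pressure": list(range(20))}
second_training_set = unequal_probability_sample(classes, per_class_target=50)
```

With these figures the small high-pressure class is kept in full, while the large low-pressure class is reduced to 50 samples.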

The training module 530 is configured to perform baseline fitting based on the response index and the pressure index of the samples in the second training sample set, and obtain a baseline for anomaly detection. That is, a regression model is trained on historical data to predict the response index for which abnormality is to be detected. For example, the pressure indexes read-write size, read-write times per second, and throughput are used to predict the response index read-write delay, and the predicted values of the response index under different pressure indexes form the baseline for anomaly detection.
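For illustration, baseline fitting can be sketched with an ordinary least squares line over a single pressure index. The linear model, the function name, and the data are assumptions made here for brevity; this passage does not specify which regression model is used, and a practical baseline would typically use several pressure indexes.

```python
def fit_linear_baseline(pressure, response):
    """Ordinary least squares fit of response = slope * pressure + intercept."""
    n = len(pressure)
    mean_x = sum(pressure) / n
    mean_y = sum(response) / n
    sxx = sum((x - mean_x) ** 2 for x in pressure)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(pressure, response))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # The returned function is the baseline: predicted response for a given pressure.
    return lambda x: slope * x + intercept


# Hypothetical training data: throughput (pressure) vs. read-write delay (response).
baseline = fit_linear_baseline([100, 200, 300, 400], [2.0, 3.0, 4.0, 5.0])
predicted_delay = baseline(500)
```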

The detection module 540 is configured to detect a test sample according to the baseline, and determine whether a system operating state corresponding to the test sample is abnormal.

In some embodiments of the application, when the detection module detects the test sample according to the baseline and determines whether the system operation state corresponding to the test sample is abnormal, it may determine, according to the baseline, whether the response index corresponding to the pressure index to be detected in the test sample exceeds the alarm threshold of the baseline. If the response index exceeds the alarm threshold, the system operation state corresponding to the test sample is determined to be abnormal. Otherwise, if the response index does not exceed the alarm threshold, the system operation state corresponding to the test sample is considered to be normal.
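The threshold check can be sketched as follows, using the rule recited in the claims: the alarm threshold is the maximum of y_Q3 + a × IQR, the baseline value plus b, and the baseline value multiplied by c. The function names and the numeric inputs are assumptions for illustration; the defaults a = 1.5, b = 0, c = 1 follow the default value [1.5, 0, 1] mentioned for the search algorithm.

```python
def alarm_threshold(y_q3, iqr, y_baseline, a=1.5, b=0.0, c=1.0):
    """y_limit = max(y_Q3 + a * IQR, y_baseline + b, y_baseline * c)."""
    return max(y_q3 + a * iqr, y_baseline + b, y_baseline * c)


def is_abnormal(response, y_q3, iqr, y_baseline, a=1.5, b=0.0, c=1.0):
    """A test sample is abnormal when its response index exceeds the alarm threshold."""
    return response > alarm_threshold(y_q3, iqr, y_baseline, a, b, c)


# Hypothetical values: quartile statistics from the training set and the
# baseline prediction for the pressure index of the test sample.
limit = alarm_threshold(y_q3=16, iqr=4, y_baseline=20)
flag_high = is_abnormal(25, y_q3=16, iqr=4, y_baseline=20)
flag_low = is_abnormal(21, y_q3=16, iqr=4, y_baseline=20)
```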

In an actual scene, the collected sample data is often accompanied with the problem of unequal variance, for example, the fluctuation of the index is large in a high-pressure interval, and the fluctuation is small in a low-pressure interval. Meanwhile, for the cloud computing system, due to the need of providing services for different businesses, the index fluctuation conditions of each business cluster are different, for example, the fluctuation of some business clusters is large in a high-pressure interval, and the fluctuation of some business clusters is large in a low-pressure interval. In order to solve the problem, in the anomaly detection apparatus provided in the embodiment of the present application, the preprocessing module may also be configured to perform numerical scaling on the test samples before detecting the test samples, so as to equalize variances of the test samples.

In some embodiments of the present application, the numerical scaling may be performed by applying a Box-Cox transformation to the test samples. The Box-Cox transformation is a generalized power transformation commonly used in statistical modeling, and the transformation formula can be set as follows: when λ ≠ 0, y = (x^λ - 1)/λ; when λ = 0, y = log(x). Here λ is a parameter indicating the manner of numerical compression: it determines whether the transformation compresses high-value points or low-value points, and the degree of compression. The most appropriate value of λ can be estimated by the maximum likelihood method according to the characteristics of the index values in each service cluster, so that an optimal λ can be determined for each service cluster and the numerical variances of test samples from different service clusters are equal after the Box-Cox transformation. Therefore, when anomaly detection is performed, a uniform global alarm threshold can be adopted, without setting a separately adapted alarm threshold for each service cluster.
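The transformation and a simple maximum likelihood estimate of λ can be sketched as follows. The coarse grid search and all function names are assumptions made for illustration; the log-likelihood is the standard profile form for the Box-Cox model with normally distributed transformed data, and inputs must be positive and not all identical.

```python
import math


def box_cox(x, lam):
    """Box-Cox transform of one positive value."""
    if lam == 0:
        return math.log(x)
    return (x ** lam - 1) / lam


def log_likelihood(values, lam):
    """Profile log-likelihood of lambda, assuming normal transformed data."""
    y = [box_cox(x, lam) for x in values]
    mean_y = sum(y) / len(y)
    variance = sum((v - mean_y) ** 2 for v in y) / len(y)
    return (lam - 1) * sum(math.log(x) for x in values) - len(y) / 2 * math.log(variance)


def estimate_lambda(values, grid=None):
    """Pick the lambda on a coarse grid that maximizes the log-likelihood."""
    if grid is None:
        grid = [i / 10 for i in range(-20, 21)]
    return max(grid, key=lambda lam: log_likelihood(values, lam))


# Hypothetical index values whose logarithms are symmetric around zero,
# so the log transform (lambda = 0) is the best choice on the grid.
cluster_values = [math.exp(z) for z in (-2, -1, 0, 1, 2)]
best_lambda = estimate_lambda(cluster_values)
scaled = [box_cox(x, best_lambda) for x in cluster_values]
```

Running estimate_lambda separately on the index values of each service cluster yields one λ per cluster, after which a single global alarm threshold can be applied to the transformed values.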

Since the anomaly detection performed by the device provided in the above embodiment is completely unsupervised, the anomalies detected are statistical anomalies. In an actual scenario, a statistical anomaly sometimes differs from the understanding of operation and maintenance personnel and from the tolerance of the system, so that false alarms occur. Additional information can be obtained by introducing human knowledge through labeling, and this information can be used to optimize the accuracy of anomaly detection. Therefore, in some embodiments of the present application, the anomaly detection device may further include a closed-loop optimization module, which is configured to obtain the manual labeling results of a part of the test samples, and then adjust the hyper-parameters according to the detection results and the manual labeling results of the manually labeled test samples. Because the user only needs to manually label a part of the samples to be detected, rather than all samples participating in detection, the labeling workload is very limited, and labor cost can be effectively saved.

When the hyper-parameters are adjusted according to the artificially labeled test sample detection result and the artificially labeled result, the closed-loop optimization module may adopt a processing flow as shown in fig. 4, including:

wherein w is an adjustment value which can be set to a constant between 10^-10 and 10^-6, a0, b0, c0 are the hyper-parameters before adjustment, and y_mean is the arithmetic mean of the response indicators in all samples.

In summary, in the anomaly detection scheme provided in the embodiment of the present application, the historical data of the system is first used to perform baseline fitting, and the test sample is then detected according to the baseline to determine whether the system operating state corresponding to the test sample is abnormal. During training, abnormal samples are removed from an initial training sample set to obtain a first training sample set, unequal-probability sampling is then performed on the first training sample set to obtain a second training sample set, and baseline fitting is performed based on the second training sample set. Because the abnormal samples are removed, the training samples are all normal samples, and the influence of abnormal samples on the baseline fitting is avoided. Because the unequal-probability sampling retains as many as possible of the relatively few samples in the high-pressure interval, the training samples do not lack high-pressure samples, and the fitted baseline has more accurate detection capability.

In addition, some of the present application may be implemented as a computer program product, such as computer program instructions, which when executed by a computer, may invoke or provide methods and/or techniques in accordance with the present application through the operation of the computer. Program instructions which invoke the methods of the present application may be stored on a fixed or removable recording medium and/or transmitted via a data stream on a broadcast or other signal-bearing medium and/or stored within a working memory of a computer device operating in accordance with the program instructions. Some embodiments according to the present application include a computing device as shown in fig. 6, which includes one or more memories 610 storing computer-readable instructions and a processor 620 for executing the computer-readable instructions, wherein when the computer-readable instructions are executed by the processor, the device is caused to perform the method and/or the technical solution according to the embodiments of the present application.

Furthermore, some embodiments of the present application also provide a computer readable medium, on which computer program instructions are stored, the computer readable instructions being executable by a processor to implement the methods and/or aspects of the foregoing embodiments of the present application.

It should be noted that the present application may be implemented in software and/or a combination of software and hardware, for example, implemented using Application Specific Integrated Circuits (ASICs), general purpose computers or any other similar hardware devices. In some embodiments, the software programs of the present application may be executed by a processor to implement the above steps or functions. Likewise, the software programs (including associated data structures) of the present application may be stored in a computer readable recording medium, such as RAM memory, magnetic or optical drive or diskette and the like. Additionally, some of the steps or functions of the present application may be implemented in hardware, for example, as circuitry that cooperates with the processor to perform various steps or functions.

It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned. Furthermore, it is obvious that the word "comprising" does not exclude other elements or steps, and the singular does not exclude the plural. A plurality of units or means recited in the apparatus claims may also be implemented by one unit or means in software or hardware. The terms first, second, etc. are used to denote names, but not any particular order.

Claims

1. An anomaly detection method, the method comprising:

2. The method of claim 1, wherein removing outlier samples from the initial set of training samples to obtain a first set of training samples comprises:

clustering samples in the initial training sample set based on the response indexes and the pressure indexes to determine a plurality of sample classes;

and removing the sample class of which the class center does not meet the preset condition as an abnormal sample to obtain a first training sample set.

3. The method of claim 2, wherein clustering samples in the initial training sample set based on the response indicator and the stress indicator to determine a plurality of sample classes comprises:

when the number of the pressure indexes is larger than a preset value, determining the most relevant associated indexes with the response indexes according to the pressure indexes, wherein the number of the associated indexes is smaller than or equal to the preset value;

and clustering the samples in the training sample set based on the correlation indexes and the pressure indexes to determine a plurality of sample classes.

4. The method of claim 2, wherein unequal sampling of the first set of training samples to obtain a second set of training samples comprises:

determining the sampling weight of the sample class according to the number of samples of each sample class in a first training sample set, wherein the sampling weight of the sample class is in negative correlation with the number of samples of the sample class and is in positive correlation with the sampling probability;

and sampling each sample class in the first training sample set based on the sampling weight to obtain a second training sample set.

5. The method of any of claims 1 to 4, wherein the method further comprises:

and carrying out standardization processing on the samples in the initial training sample set.

6. The method of claim 1, wherein detecting a test sample according to the baseline and determining whether a system operating state corresponding to the test sample is abnormal comprises:

judging whether a response index corresponding to the pressure index to be detected in the test sample exceeds an alarm threshold value of the baseline or not according to the baseline;

and if the response index exceeds the alarm threshold, determining that the system running state corresponding to the test sample is abnormal.

7. The method according to claim 6, wherein the alarm threshold value of the baseline is determined according to the maximum value of a first threshold value, a second threshold value and a third threshold value, the first threshold value is the sum of a quartile value of the response indicators in the second training sample set and a calculated value, the calculated value is the product of the interquartile range of the response indicators in the second training sample set and a first hyperparameter, the second threshold value is the sum of the value on the baseline corresponding to the pressure indicator to be detected and a second hyperparameter, and the third threshold value is the product of the value on the baseline corresponding to the pressure indicator to be detected and a third hyperparameter.

8. The method of claim 7, wherein the method further comprises:

acquiring a manual labeling result of a part of test samples;

and adjusting the hyper-parameters according to the artificially marked test sample detection result and the artificially marked result.

9. The method of claim 8, wherein adjusting the hyper-parameter based on the manually labeled test sample detection results and the manually labeled results comprises:

calculating the cost value of the test sample which is not manually marked after the hyper-parameters are adjusted by adopting a search algorithm according to the test sample detection result and the manual marking result which are manually marked and the test sample detection result which is not manually marked;

setting a target loss function, wherein the target loss function is related to the cost value of the hyperparameter adjusted by adopting a search algorithm;

and determining a hyperparameter which enables the target loss function to be minimum according to the target loss function.

10. The method of any of claims 6 to 9, wherein the method further comprises:

the test samples are numerically scaled to equalize the variances of the test samples.

11. An abnormality detection device comprising:

12. The apparatus of claim 11, wherein the cleaning module comprises:

the clustering unit is used for clustering the samples in the initial training sample set based on the response indexes and the pressure indexes to determine a plurality of sample classes;

and the cleaning unit is used for removing the sample class of which the class center does not accord with the preset condition as an abnormal sample to obtain a first training sample set.

13. The apparatus of claim 12, wherein the clustering unit is configured to determine a correlation index most relevant to the response index according to the pressure indexes when the number of the pressure indexes is greater than a preset value, wherein the number of the correlation indexes is less than or equal to the preset value; and clustering the samples in the training sample set based on the correlation index and the pressure index to determine a plurality of sample classes.

14. The apparatus of claim 12, wherein the sampling module is configured to determine a sampling weight for each sample class according to a number of samples of the sample class in the first training sample set, wherein the sampling weight for the sample class is negatively correlated to the number of samples of the sample class and positively correlated to the sampling probability; and sampling each sample class in the first training sample set based on the sampling weight to obtain a second training sample set.

15. The apparatus of any one of claims 11 to 14, wherein the apparatus further comprises:

and the preprocessing module is used for carrying out standardization processing on the samples in the initial training sample set.

16. The device according to claim 11, wherein the detecting module is configured to determine, according to the baseline, whether a response indicator corresponding to a pressure indicator to be detected in the test sample exceeds an alarm threshold of the baseline; and if the response indicator exceeds the alarm threshold, determine that the system running state corresponding to the test sample is abnormal.

17. The apparatus according to claim 16, wherein the alarm threshold of the baseline is determined according to the maximum value of a first threshold, a second threshold and a third threshold, the first threshold is the sum of a quartile value of the response indicators in the second training sample set and a calculated value, the calculated value is the product of the interquartile range of the response indicators in the second training sample set and a first hyperparameter, the second threshold is the sum of the value on the baseline corresponding to the pressure indicator to be detected and a second hyperparameter, and the third threshold is the product of the value on the baseline corresponding to the pressure indicator to be detected and a third hyperparameter.

18. The apparatus of claim 17, wherein the apparatus further comprises:

the closed loop optimization module is used for acquiring the manual labeling result of the sample with the abnormal detection result; and adjusting the hyper-parameters according to the artificially labeled test sample detection result and the artificially labeled result.

19. The apparatus of claim 18, wherein the closed-loop optimization module is configured to calculate a cost value of the artificially labeled test sample detection result and the artificially labeled test sample detection result after the search algorithm is adopted to adjust the hyper-parameters according to the artificially labeled test sample detection result and the artificially labeled test sample detection result; setting a target loss function, wherein the target loss function is related to the cost value of the hyperparameter adjusted by adopting a search algorithm; and determining a hyper-parameter that minimizes the objective loss function according to the objective loss function.

20. The apparatus of any of claims 16 to 19, wherein the apparatus further comprises:

and the preprocessing module is used for carrying out numerical value scaling on the test samples so as to enable the variances of the test samples to be equal.

21. A computing device, wherein the device comprises a memory for storing computer program instructions and a processor for executing the computer program instructions, wherein the computer program instructions, when executed by the processor, trigger the apparatus to perform the method of any of claims 1 to 10.

22. A computer readable medium having stored thereon computer program instructions executable by a processor to implement the method of any one of claims 1 to 10.