CN111984442A

CN111984442A - Method and device for detecting abnormality of computer cluster system, and storage medium

Info

Publication number: CN111984442A
Application number: CN201910432460.4A
Authority: CN
Inventors: 刘文琴; 陈力; 刘建伟
Original assignee: ZTE Corp
Current assignee: ZTE Corp
Priority date: 2019-05-22
Filing date: 2019-05-22
Publication date: 2020-11-24

Abstract

The invention provides an anomaly detection method and device for a computer cluster system, and a storage medium. The method comprises: collecting a first log generated by a computer cluster system; analyzing error-level logs in the first log through an error log model to determine whether the computer cluster system is abnormal, obtaining a first determination result; obtaining a frequency component of the first log and determining whether the computer cluster system is abnormal according to whether the frequency component is within a preset range, obtaining a second determination result, wherein the error log model is trained through machine learning using multiple groups of data, each group comprising error-level logs; and determining whether the computer cluster system is abnormal according to the first determination result and the second determination result. This technical scheme solves problems in the related art such as detection relying on labeling anomaly types in advance and the detection result being limited.

Description

Method and device for detecting abnormality of computer cluster system, and storage medium

Technical Field

The present invention relates to the field of computers, and in particular, to a method and an apparatus for detecting an anomaly of a computer cluster system, and a storage medium.

Background

In a cloud computing environment, cloud services are built on virtual resources, numerous hardware devices and software systems work cooperatively, and multi-tenancy adds diverse and flexible user requirements. As a result, a cloud computing system far exceeds a single system in scale and complexity, and even exceeds a traditional distributed system. In addition, the complex component structure and software layers in a cloud computing environment give rise to many system faults. How to identify and detect these faults quickly and accurately, so as to guarantee the availability and reliability of upper-layer services, has become a key problem.

During operation, a system generates a large amount of performance data and logs. Logs record the execution trace of the system and exist in all of its components; they are the data that most directly represent the operating state of the system and the behavior of its users, and they contain a large amount of valuable information. System fault detection based on log analysis has become an important means of guaranteeing the availability of software systems and is one of the important applications of log data. In the related art, features that characterize abnormal system operation are generally extracted, a fault detection model is built on those features, and system fault detection is then performed. This approach needs enough labeled abnormal data for training, can detect only the labeled anomaly types, and cannot detect anomalies that are absent from the training data.

No effective solution currently exists for the problems in the related art that detection relies on labeling anomaly types in advance and that the detection result is limited.

Disclosure of Invention

The embodiments of the invention provide an anomaly detection method and device for a computer cluster system, and a storage medium, to solve the problems in the related art that detection relies on labeling anomaly types in advance and that the detection result is limited.

According to an embodiment of the present invention, there is provided an abnormality detection method of a computer cluster system, including: collecting a first log generated by a computer cluster system; analyzing error logs in the first log through an error log model, determining whether the computer cluster system is abnormal or not to obtain a first determination result, obtaining a frequency component of the first log, determining whether the computer cluster system is abnormal or not according to whether the frequency component is within a preset range or not to obtain a second determination result, wherein the error log model is trained through machine learning by using multiple groups of data, and each group of data in the multiple groups of data comprises: a log of error levels; and determining whether the computer cluster system is abnormal according to the first determination result and the second determination result.

According to another embodiment of the present invention, there is also provided an abnormality detection apparatus of a computer cluster system, including: the acquisition module is used for acquiring a first log generated by the computer cluster system; the first determining module is configured to analyze logs of error categories in the first log through an error log model, determine whether the computer cluster system is abnormal, obtain a first determining result, obtain a frequency component of the first log, determine whether the computer cluster system is abnormal according to whether the frequency component is within a preset range, and obtain a second determining result, where the error log model is trained through machine learning by using multiple groups of data, and each group of data in the multiple groups of data includes: a log of error levels; and the second determining module is used for determining whether the computer cluster system is abnormal or not according to the first determining result and the second determining result.

According to a further embodiment of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.

According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.

According to the invention, error-level logs in the collected first log are analyzed through the error log model to determine whether the computer cluster system is abnormal, obtaining a first determination result; a frequency component of the first log is obtained, and whether the computer cluster system is abnormal is determined according to whether the frequency component is within a preset range, obtaining a second determination result, wherein the error log model is trained through machine learning using multiple groups of data, each group comprising error-level logs; and whether the computer cluster system is abnormal is determined according to the first determination result and the second determination result. This technical scheme solves the problems in the related art that detection relies on labeling anomaly types in advance and that the detection result is limited, so that whether the computer cluster system is abnormal can be determined comprehensively.

Drawings

The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:

FIG. 1 is a block diagram of a hardware configuration of a terminal of an anomaly detection method of a computer cluster system according to an embodiment of the present invention;

FIG. 2 is a flow chart of an anomaly detection method of a computer cluster system according to an embodiment of the present invention;

FIG. 3 is a block diagram of the structure of an abnormality detection apparatus of a computer cluster system according to an embodiment of the present invention;

FIG. 4 is another block diagram of an abnormality detection apparatus of a computer cluster system according to an embodiment of the present invention;

FIG. 5 is yet another flowchart of an anomaly detection method of a computer cluster system according to a preferred embodiment of the present invention;

FIG. 6 is a schematic structural diagram of an abnormality detection apparatus of a computer cluster system according to a preferred embodiment of the present invention;

FIG. 7 is a flow chart of an alternative method of detecting an error level log in accordance with a preferred embodiment of the present invention;

FIG. 8 is a diagram illustrating the results of an error log detection section in accordance with a preferred embodiment of the present invention;

FIG. 9 is a diagram illustrating still another result of the error log detection section in accordance with the preferred embodiment of the present invention;

FIG. 10 is a diagram illustrating the results of the log frequency detection section in accordance with a preferred embodiment of the present invention;

FIG. 11 is a diagram illustrating still another result of the log frequency detection section in accordance with a preferred embodiment of the present invention.

Detailed Description

The invention will be described in detail hereinafter with reference to the accompanying drawings in conjunction with embodiments. It should be noted that the embodiments and features of the embodiments in the present application may be combined with each other without conflict.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

Example 1

The method provided by the embodiment 1 of the present application can be executed in a terminal or a similar computing device. Taking the operation on a terminal as an example, fig. 1 is a hardware structure block diagram of a terminal of an abnormality detection method of a computer cluster system according to an embodiment of the present invention. As shown in fig. 1, the terminal 10 may include one or more (only one shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data, and optionally may also include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration and is not intended to limit the structure of the terminal. For example, the terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration with equivalent functionality to that shown in FIG. 1 or with more functionality than that shown in FIG. 1.

The memory 104 may be used to store a computer program, for example, a software program and a module of application software, such as a computer program corresponding to the anomaly detection method of the computer cluster system in the embodiment of the present invention. The processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, thereby implementing the method described above. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the terminal 10. In one example, the transmission device 106 includes a Network Interface Controller (NIC), which can be connected to other network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.

In this embodiment, a method for detecting an abnormality of a computer cluster system running on a terminal is provided, and fig. 2 is a flowchart of the method for detecting an abnormality of a computer cluster system according to the embodiment of the present invention, as shown in fig. 2, the flowchart includes the following steps:

Step S202, collecting a first log generated by a computer cluster system;

Step S204, analyzing error logs in the first log through an error log model, determining whether the computer cluster system is abnormal or not, obtaining a first determination result, obtaining a frequency component of the first log, determining whether the computer cluster system is abnormal or not according to whether the frequency component is within a preset range or not, and obtaining a second determination result, wherein the error log model is trained through machine learning by using multiple groups of data, and each group of data in the multiple groups of data comprises: a log of error levels;

Step S206, determining whether the computer cluster system is abnormal according to the first determination result and the second determination result.

In an optional embodiment of the present invention, after a first log generated by a computer cluster system is collected, the first log is classified to obtain a plurality of log categories; determining the occurrence probability of each log category in a specified statistical period; adding a log category having a probability of occurrence greater than a first threshold to the error log model.

Optionally, obtaining a frequency component of the first log includes: determining a frequency component of each reporting point of the first log according to a preset time interval, wherein n reporting points exist and n is a positive integer; and forming the frequency component $x_t = [x_1, x_2, \ldots, x_n]$ of the first log from the frequency components of the reporting points, wherein $x_1, x_2, \ldots, x_n$ correspond to the frequency components of the different reporting points.

Specifically, determining whether the computer cluster system is abnormal according to whether the frequency component is within a preset range, and obtaining a second determination result, includes: obtaining a ring ratio distance $D_{ring}$ (the distance relative to the previous period) from the frequency component $x_t = [x_1, x_2, \ldots, x_n]$ at the current time and the frequency component $x_{t-1}$ at the previous time; and obtaining a center ratio distance $D_{center}$ from the frequency component $x_t = [x_1, x_2, \ldots, x_n]$ and its central frequency component $x_{center}$, wherein the central frequency component $x_{center}$ is determined by:

$$x_{center} = [\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n]$$

wherein $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n$ are the mean values of $x_1, x_2, \ldots, x_n$, respectively. When the ring ratio distance $D_{ring}$ is greater than a second threshold and the center ratio distance $D_{center}$ is greater than a third threshold, it is determined that the computer cluster system is abnormal.
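As an illustration of the check described above, the following minimal sketch computes the two distances and applies both thresholds. The source does not name a distance metric, so Euclidean distance (`math.dist`) is an assumption here, as are the function name, the threshold values, and the sample numbers.

```python
import math

def frequency_abnormal(x_t, x_prev, x_center, t_ring, t_center):
    """Second determination result for one time step.

    Assumption: both distances are Euclidean; the source only says "distance".
    """
    d_ring = math.dist(x_t, x_prev)        # change versus the previous time
    d_center = math.dist(x_t, x_center)    # deviation from the historical mean
    # Abnormal only when BOTH distances exceed their thresholds.
    return d_ring > t_ring and d_center > t_center

# Hypothetical values for three reporting points.
x_center = [10.0, 12.0, 10.0]   # mean frequency of each reporting point
x_prev = [9, 11, 11]            # frequency vector at the previous time
burst = [40, 12, 10]            # first reporting point suddenly logs far more

print(frequency_abnormal(burst, x_prev, x_center, t_ring=5.0, t_center=5.0))
print(frequency_abnormal([10, 12, 10], x_prev, x_center, t_ring=5.0, t_center=5.0))
```

Requiring both conditions means that a sudden jump which still lies close to the historical mean, or a slow drift away from it, does not on its own mark the start of an abnormality.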

Step S206 can be implemented in various ways. In an alternative embodiment, determining whether the computer cluster system is abnormal according to the first determination result and the second determination result includes:

when the first determination result indicates that the computer cluster system is abnormal and the second determination result indicates that the computer cluster system is abnormal, determining that the computer cluster system is abnormal;

when the first determination result indicates that the computer cluster system is not abnormal and the second determination result indicates that the computer cluster system is abnormal, determining that the computer cluster system is abnormal;

when the first determination result indicates that the computer cluster system is abnormal and the second determination result indicates that the computer cluster system is not abnormal, determining that the computer cluster system is abnormal;

and when the first determination result indicates that the computer cluster system is not abnormal and the second determination result indicates that the computer cluster system is not abnormal, determining that the computer cluster system is not abnormal.

Further, when the first determination result indicates that the computer cluster system is abnormal, the detection result of whether the computer cluster system is abnormal, which is determined according to the first determination result and the second determination result, is set to an important level;

and when the first determination result indicates that the computer cluster system is not abnormal and the second determination result indicates that the computer cluster system is abnormal, setting a detection result of whether the computer cluster system is abnormal or not determined according to the first determination result and the second determination result as an alarm level.
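The combination rules above amount to a small decision table. The sketch below is one way to express it; the function name and the level strings `"important"`, `"alarm"`, and `"normal"` are illustrative assumptions, not identifiers from the source.

```python
def combine_results(error_log_abnormal: bool, frequency_abnormal: bool):
    """Merge the first and second determination results.

    Returns (is_abnormal, level). Level names are hypothetical labels.
    """
    if error_log_abnormal:
        # The error log model found an abnormality: important level,
        # regardless of the frequency-based result.
        return True, "important"
    if frequency_abnormal:
        # Only the frequency check found an abnormality: alarm level.
        return True, "alarm"
    return False, "normal"

for first in (True, False):
    for second in (True, False):
        print(first, second, combine_results(first, second))
```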

In summary, the embodiment of the present invention provides a method for detecting anomalies from system logs that combines a detection algorithm based on error-level logs with a detection algorithm based on log frequency components. It thereby takes into account both the importance of error-level logs and the regular pattern of log generation during normal system operation, and it does not require manually labeled training data.

Through the above description of the embodiments, those skilled in the art can clearly understand that the method according to the above embodiments can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware, but the former is a better implementation mode in many cases. Based on such understanding, the technical solutions of the present invention may be embodied in the form of a software product, which is stored in a storage medium (e.g., ROM/RAM, magnetic disk, optical disk) and includes instructions for enabling a terminal device (e.g., a mobile phone, a computer, a server, or a network device) to execute the method according to the embodiments of the present invention.

In this embodiment, an anomaly detection apparatus for a computer cluster system is further provided. The apparatus is used to implement the foregoing embodiments and preferred embodiments, and descriptions already given are not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the apparatus described in the embodiments below is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.

Fig. 3 is a block diagram showing a configuration of an abnormality detection apparatus of a computer cluster system according to an embodiment of the present invention, as shown in fig. 3, the apparatus including:

the acquisition module 30 is used for acquiring a first log generated by the computer cluster system;

a first determining module 32, configured to analyze logs of error categories in the first log through an error log model, determine whether the computer cluster system is abnormal, obtain a first determination result, obtain a frequency component of the first log, determine whether the computer cluster system is abnormal according to whether the frequency component is within a preset range, and obtain a second determination result, where the error log model is trained through machine learning by using multiple groups of data, and each group of data in the multiple groups of data includes: a log of error levels;

a second determining module 34, configured to determine whether the computer cluster system is abnormal according to the first determining result and the second determining result.

Fig. 4 is another block diagram of an abnormality detection apparatus of a computer cluster system according to an embodiment of the present invention, as shown in fig. 4, the apparatus including:

a classification module 36, configured to classify the first log to obtain a plurality of log categories;

a third determining module 38, configured to determine an occurrence probability of each log category within a specified statistical period;

an adding module 40, configured to add the log category with the occurrence probability greater than the first threshold to the error log model.

Optionally, the first determining module 32 is further configured to determine, according to a predetermined time interval, a frequency component of each reporting point of the first log, where n reporting points exist and n is a positive integer; and to form the frequency component $x_t = [x_1, x_2, \ldots, x_n]$ of the first log from the frequency components of the reporting points, wherein $x_1, x_2, \ldots, x_n$ correspond to the frequency components of the different reporting points.

Specifically, the first determining module 32 is further configured to: obtain a ring ratio distance $D_{ring}$ from the frequency component $x_t = [x_1, x_2, \ldots, x_n]$ at the current time and the frequency component $x_{t-1}$ at the previous time; and obtain a center ratio distance $D_{center}$ from the frequency component $x_t = [x_1, x_2, \ldots, x_n]$ and its central frequency component $x_{center}$, wherein the central frequency component $x_{center}$ is determined by:

$$x_{center} = [\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n]$$

wherein $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n$ are the mean values of $x_1, x_2, \ldots, x_n$, respectively.

A second determining module 34, configured to determine that the computer cluster system is abnormal when the first determination result indicates that the computer cluster system is abnormal and the second determination result indicates that the computer cluster system is abnormal;

It should be noted that the above modules may be implemented by software or hardware. For the latter, this may be achieved in, but is not limited to, the following ways: the modules are all located in the same processor; or the modules are located in different processors in any combination.

The technical solution of the following preferred embodiment detects anomalies in the logs generated by a system. Generally, the logs generated by a system are divided into several levels, such as ERROR, INFO, and DEBUG, which represent the importance of each log. Considering that logs of different levels differ in importance, the preferred embodiment combines a detection algorithm based on ERROR-level logs with a detection algorithm based on log frequency and uses both to detect system anomalies. The specific combination process is described in the following embodiments.

The detection algorithm based on error-level logs removes unimportant logs and retains important logs for detecting system anomalies. The detection algorithm based on log frequency learns the log frequency pattern of normal system operation, evaluates in real time the deviation between the system log and the historical statistics, and determines that the system is abnormal if the deviation exceeds a set threshold. The two detection methods complement each other and can detect most system anomalies. The following preferred embodiment describes a flow in which the two methods complement each other, but it is not intended to limit the technical solutions of the embodiments of the present invention.

As shown in fig. 5, the technical solution adopted by the preferred embodiment of the present invention includes the following steps:

Step 1: Collect historical logs and structure them to obtain the timestamp, log level, reporting service, reporting instance, and reporting host of each log.
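A minimal sketch of the structuring in Step 1 follows. The source does not specify a log line layout, so the space-separated format, the regular expression, and the field names below are assumptions made only for illustration.

```python
import re
from datetime import datetime

# Hypothetical layout: "<date> <time> <LEVEL> <service> <instance> <host> <message>"
LOG_PATTERN = re.compile(
    r"^(?P<timestamp>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) "
    r"(?P<level>[A-Z]+) "
    r"(?P<service>\S+) (?P<instance>\S+) (?P<host>\S+) "
    r"(?P<message>.*)$"
)

def parse_log_line(line):
    """Return a dict of structured fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    record = match.groupdict()
    record["timestamp"] = datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S")
    return record

record = parse_log_line(
    "2019-05-22 10:15:03 ERROR service1 instance-1 host-a connection refused to 10.0.0.5"
)
print(record["level"], record["service"], record["host"])
```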

Step 2: Extract the log entries whose log level is higher than or equal to ERROR from the historical logs and process them according to flow A below; process the complete set of logs according to flow B.

The specific steps of flow A are as follows.

It should be noted that the following steps only represent the operations that flow A needs to perform and do not represent the order in which flow A performs them.

Step A1: Extract templates for log classification from the log original text to obtain a plurality of log classifications (C1, C2, …, Cn).
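The source does not say how the classification templates are extracted. One common and simple approach, shown here purely as an assumption, is to mask variable tokens (IP addresses, hexadecimal values, numbers) so that log lines differing only in those values collapse into one template. The mask patterns and helper names are hypothetical.

```python
import re

# Order matters: mask IP addresses before plain numbers.
MASKS = [
    (re.compile(r"\b\d{1,3}(?:\.\d{1,3}){3}\b"), "<IP>"),
    (re.compile(r"\b0x[0-9a-fA-F]+\b"), "<HEX>"),
    (re.compile(r"\b\d+\b"), "<NUM>"),
]

def to_template(message):
    """Replace variable tokens with placeholders to obtain a template."""
    for pattern, token in MASKS:
        message = pattern.sub(token, message)
    return message

def classify(messages):
    """Label each message C1, C2, ... by template, in order of first appearance."""
    classes, labels = {}, []
    for message in messages:
        template = to_template(message)
        if template not in classes:
            classes[template] = f"C{len(classes) + 1}"
        labels.append(classes[template])
    return classes, labels

messages = [
    "connection refused to 10.0.0.5 port 8080",
    "disk /dev/sda1 usage 91 percent",
    "connection refused to 10.0.0.9 port 9090",
]
classes, labels = classify(messages)
print(labels)
```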

Step A2: Divide the historical logs into multiple log sets at an hourly time granularity, and classify the error-level logs in each log set using the log classification templates. Under each service, perform one-hot encoding for each classification, for example (C1: 1, C2: 0, …, Cn: 1).

Step A3: Calculate the occurrence probability of each log classification over the whole statistical period. Assuming that the statistical period is N hours and that the classification Ci is coded as 1 in the first hour and in the second hour and as 0 in all other hours, the occurrence probability of the classification Ci is $P_i = 2/N$, that is, the number of hours in which Ci is coded as 1 divided by N.

If Pi is higher than the occurrence probability threshold, the classification is regarded as one that appears over the long term and has no characterizing effect on system anomalies; it is recorded into the error log model and marked as normal. All other classifications are marked as abnormal.
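Steps A2 and A3 can be sketched as follows, assuming the hourly one-hot codes are already available as dictionaries. The function names and sample data are hypothetical; the default threshold of 0.8 is the value the source mentions as a possible setting.

```python
def occurrence_probabilities(hourly_codes):
    """hourly_codes: one dict per hour, mapping class label -> 0 or 1."""
    n_hours = len(hourly_codes)
    labels = set().union(*hourly_codes)
    return {
        label: sum(codes.get(label, 0) for codes in hourly_codes) / n_hours
        for label in sorted(labels)
    }

def build_error_log_model(hourly_codes, threshold=0.8):
    """Mark classes that appear in most hours as 'normal', the rest as 'abnormal'."""
    probabilities = occurrence_probabilities(hourly_codes)
    return {
        label: "normal" if p > threshold else "abnormal"
        for label, p in probabilities.items()
    }

# Hypothetical 5-hour statistical period: C1 appears every hour, C2 in two hours.
hourly_codes = [
    {"C1": 1, "C2": 1},
    {"C1": 1, "C2": 1},
    {"C1": 1, "C2": 0},
    {"C1": 1, "C2": 0},
    {"C1": 1, "C2": 0},
]
probabilities = occurrence_probabilities(hourly_codes)
model = build_error_log_model(hourly_codes)
print(probabilities, model)
```

In this example C2 has probability 2/N with N = 5, matching the worked case in Step A3.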

Step A4: The content of the model can be modified by manual annotation. If a log classification is manually marked as abnormal, the classification is recorded into the error log model and marked as abnormal. Similarly, a log classification may also be changed to normal.

Step A5: Periodically collect the logs generated by the system in real time. Extract structured information from the logs of the most recent 5 minutes (or another period) and remove the logs that are not at an error level. Under each service, classify the logs using the log classification templates and perform one-hot encoding for each classification, for example (C1: 1, C2: 0, …, Cn: 1).

Step A6: Query the database for the error log model corresponding to the service, and eliminate the log classifications marked as normal to obtain the effective log classifications. Retrieve the log original text for fault analysis according to the effective log classifications; the service, instance, and host corresponding to that log original text are the fault points.
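Steps A5 and A6 can be sketched as a filter over the current window. The record layout (a `class` field alongside `service`, `instance`, and `host`) and the treatment of classes missing from the model as abnormal are assumptions for illustration.

```python
def effective_classes(current_codes, error_log_model):
    """Classes seen in the current window that are not marked normal.

    Assumption: a class absent from the model is treated as abnormal.
    """
    return sorted(
        label
        for label, code in current_codes.items()
        if code == 1 and error_log_model.get(label, "abnormal") != "normal"
    )

def fault_points(records, effective):
    """(service, instance, host) triples of logs in an effective class."""
    return {
        (r["service"], r["instance"], r["host"])
        for r in records
        if r["class"] in effective
    }

error_log_model = {"C1": "normal", "C2": "abnormal"}
current_codes = {"C1": 1, "C2": 1, "C3": 0}
records = [
    {"class": "C1", "service": "service1", "instance": "i-1", "host": "host-a"},
    {"class": "C2", "service": "service1", "instance": "i-2", "host": "host-b"},
]
effective = effective_classes(current_codes, error_log_model)
print(effective, fault_points(records, effective))
```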

The specific steps of flow B are as follows.

Step B1: For each service, count the log frequency x reported by each reporting point (host) at a certain time interval. Assuming that n reporting points belong to service1, the dimension of the frequency vector equals n, and the frequency vector is $x = [x_1, x_2, \ldots, x_n]$.

Calculate the mean vector of the frequency vectors of all time intervals to obtain the central frequency vector.
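The counting in Step B1 might look like the sketch below, which buckets log records into fixed intervals and counts them per host. The 300-second interval, the `(timestamp, host)` record shape, and the choice to emit zero vectors for empty intervals are assumptions; the input is assumed to be non-empty.

```python
from collections import Counter, defaultdict
from datetime import datetime

EPOCH = datetime(1970, 1, 1)

def frequency_vectors(records, hosts, interval_seconds=300):
    """records: iterable of (timestamp, host). Returns one vector per interval."""
    buckets = defaultdict(Counter)
    for timestamp, host in records:
        bucket = int((timestamp - EPOCH).total_seconds()) // interval_seconds
        buckets[bucket][host] += 1
    first, last = min(buckets), max(buckets)
    # Intervals without any log yield an all-zero vector.
    return [[buckets[b][h] for h in hosts] for b in range(first, last + 1)]

def center_vector(vectors):
    """Component-wise mean of the frequency vectors of all intervals."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]

hosts = ["host-a", "host-b"]
records = [
    (datetime(2019, 5, 22, 10, 0, 10), "host-a"),
    (datetime(2019, 5, 22, 10, 1, 0), "host-a"),
    (datetime(2019, 5, 22, 10, 2, 0), "host-b"),
    (datetime(2019, 5, 22, 10, 6, 0), "host-a"),
    (datetime(2019, 5, 22, 10, 7, 0), "host-b"),
    (datetime(2019, 5, 22, 10, 8, 0), "host-b"),
]
vectors = frequency_vectors(records, hosts)
print(vectors, center_vector(vectors))
```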

Step B2: Calculate the distance between each frequency vector $X_t$ and the frequency vector $X_{t-1}$ of the previous time to obtain the ring ratio distance statistics. The larger the distance, the larger the change in the frequency distribution of the current point compared with the previous time.

Step B3: Calculate the distance between each frequency vector $X_t$ and the central frequency vector $X_{center}$ to obtain the center ratio distance statistics. The larger the distance, the larger the deviation of the frequency distribution of the current point from the historical mean.

Step B4: For the ring ratio distances at all times, use a statistical method to obtain a normal distance range, namely the ring ratio distance threshold $T_{ring}$. A distance exceeding this threshold is considered too large a deviation from the previous time.

Step B5: For the center ratio distances at all times, use a statistical method to obtain a normal distance range, namely the center ratio distance threshold $T_{center}$. A distance exceeding this threshold is considered too large a deviation from the center.

Step B6: Save the central frequency vector $X_{center}$ and the thresholds $T_{ring}$ and $T_{center}$ of each service to the statistical model.
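Steps B2 to B6 can be sketched as below. The source says only that a "statistical method" yields the normal range, so the mean-plus-k-standard-deviations rule used here is an assumption, as are the Euclidean metric (`math.dist`), the function name, and the default `k=3.0`.

```python
import math
import statistics

def train_thresholds(vectors, center, k=3.0):
    """Derive T_ring and T_center from historical frequency vectors.

    Assumption: threshold = mean + k * population standard deviation.
    Requires at least two historical vectors.
    """
    ring = [math.dist(vectors[t], vectors[t - 1]) for t in range(1, len(vectors))]
    cent = [math.dist(v, center) for v in vectors]

    def upper_bound(distances):
        return statistics.fmean(distances) + k * statistics.pstdev(distances)

    return {
        "center": center,
        "t_ring": upper_bound(ring),
        "t_center": upper_bound(cent),
    }

# Hypothetical history that alternates between two states.
vectors = [[1, 0], [3, 0], [1, 0], [3, 0]]
model = train_thresholds(vectors, center=[2.0, 0.0])
print(model)
```

A percentile of the historical distances would be an equally plausible choice of statistical method; which one fits better depends on how the distances are distributed.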

Step B7: Periodically collect the logs generated by the system in real time and extract the structured log information. Obtain the log frequency vector $X_t$ under each service by counting.

Step B8: Calculate the distance between the frequency vector $X_t$ and the frequency vector $X_{t-1}$ of the previous time to obtain the ring ratio distance $D_{ring}$.

Calculate the distance between the frequency vector $X_t$ and the central frequency vector $X_{center}$ to obtain the center ratio distance $D_{center}$.

Step B9: If the ring ratio distance $D_{ring}$ is greater than the threshold $T_{ring}$ and the center ratio distance $D_{center}$ is greater than the threshold $T_{center}$, the current point deviates too far from the historical center and has changed abruptly compared with the previous time, so the current time is recorded as the start time of the abnormality. Once an abnormality has started, only the deviation of the current frequency from the historical center needs to be checked: if the current center ratio distance $D_{center}$ is greater than the threshold $T_{center}$, the abnormality continues; otherwise the abnormality ends.
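The start, continue, and end logic of Step B9 behaves like a two-state machine. The sketch below is one possible rendering; the class name and the returned state strings are hypothetical.

```python
class FrequencyAnomalyTracker:
    """Tracks whether a frequency abnormality is currently in progress."""

    def __init__(self, t_ring, t_center):
        self.t_ring = t_ring
        self.t_center = t_center
        self.active = False

    def update(self, d_ring, d_center):
        """Return 'start', 'continue', 'end', or 'normal' for this time step."""
        if not self.active:
            # An abnormality starts only when both distances exceed their thresholds.
            if d_ring > self.t_ring and d_center > self.t_center:
                self.active = True
                return "start"
            return "normal"
        # Once started, only the deviation from the historical center matters.
        if d_center > self.t_center:
            return "continue"
        self.active = False
        return "end"

tracker = FrequencyAnomalyTracker(t_ring=5.0, t_center=5.0)
distances = [(1, 1), (9, 9), (1, 8), (7, 2), (1, 1)]
states = [tracker.update(d_ring, d_center) for d_ring, d_center in distances]
print(states)
```

Dropping the ring ratio condition after the start keeps a sustained burst flagged even when consecutive intervals look alike.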

Step B10: If an abnormality is detected, score the abnormality of each dimension based on the distance calculation formula to obtain a contribution score for each dimension. Sort the contributions from largest to smallest; the dimensions corresponding to the first x contributions, where the sum of the first x contributions is not less than the contribution threshold, are regarded as abnormal dimensions, that is, abnormal reporting points (hosts).
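A sketch of Step B10 follows. It assumes that each dimension's contribution is its share of the squared Euclidean distance to the central frequency vector, and that x is the smallest count whose cumulative contribution reaches the threshold; the source fixes neither choice. The function name and the default threshold of 0.8 are hypothetical.

```python
def abnormal_dimensions(x_t, x_center, hosts, contribution_threshold=0.8):
    """Return the hosts that account for most of the deviation from the center."""
    squared = [(x - c) ** 2 for x, c in zip(x_t, x_center)]
    total = sum(squared)
    if total == 0:
        return []
    # Rank dimensions by their share of the squared distance, largest first.
    ranked = sorted(zip(hosts, squared), key=lambda item: item[1], reverse=True)
    selected, cumulative = [], 0.0
    for host, value in ranked:
        selected.append(host)
        cumulative += value / total
        if cumulative >= contribution_threshold:
            break
    return selected

hosts = ["host-a", "host-b", "host-c"]
print(abnormal_dimensions([40, 12, 10], [10, 12, 10], hosts))
print(abnormal_dimensions([16, 18, 10], [10, 10, 10], hosts))
```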

Step 3: Combine the detection results of flow A and flow B. If flow A detects an abnormality, the detection result is abnormal regardless of whether flow B detects one; the detection result is set to the important level, and the effective log original text is displayed to the user. If flow A detects no abnormality and flow B detects an abnormality, the detection result is set to the alarm level, and a frequency curve of the log is displayed to the user. If neither flow detects an abnormality, the detection result is normal.

The technical solution of the preferred embodiment classifies the data sources according to the different usage scenarios and characteristics of error logs and log frequency, and applies a different detection mode to each, which makes the detection result more accurate. The method is computationally efficient and does not consume excessive computing resources. It places low demands on training data, requires no manual labeling, and is widely applicable in practice.

In summary, the above preferred embodiment provides an error log detection algorithm in which one-hot coding is used to eliminate background noise, that is, the insignificant error-level logs output by the system. See step A3: if Pi is higher than the threshold (which may be set to 0.8 in this embodiment), the log classification is considered to appear over the long term and to have no characterizing effect on system anomalies, and it is recorded into the error log model and marked as normal; all other classifications are marked as abnormal. When detecting ordinary logs, multiple related frequencies are combined into a vector, and system anomalies are detected using the center ratio distance and the ring ratio distance of the vector, which measure the deviation of the current point from history and from the previous time, respectively.

Fig. 6 is a schematic structural diagram of an abnormality detection apparatus of a computer cluster system according to a preferred embodiment of the present invention, and as shown in fig. 6, the entire apparatus is divided into an acquisition unit 60, a log storage unit 62, a model training unit 64, a real-time detection unit 66, and a display unit 68. The functions of the respective units are as follows:

The acquisition unit continuously acquires original logs from the system to be detected, extracts the structured information of each log, and classifies each log according to the log classification template to obtain a classification number. It then writes the log original text, the classification number, and the structured information into the log storage unit. The structured information comprises the timestamp, log level, reporting service, reporting instance, and reporting host of each log.

The log storage unit stores the information output by the acquisition unit for subsequent reading and use. According to actual needs, logs can be backed up proportionally, or expired logs can be cleared periodically to guarantee availability.

The model training unit is divided into historical error level log modeling and historical log frequency modeling. The historical error level log modeling part reads logs in a certain historical period from the log storage unit, identifies whether each log classification represents a system abnormality, and records the result to the model storage unit; the historical log frequency modeling part reads logs in a certain historical period from the log storage unit, converts the log text information into frequencies, models the frequencies, and records the model to the model storage unit.

The model storage unit stores the modeling result of the model training unit for use in the detection.

The real-time detection unit is divided into a real-time error log detection part and a real-time log frequency detection part. The real-time error log detection part reads the logs in the current time interval from the log storage unit and judges, according to the error log model, whether any abnormal log classification is currently present; the real-time log frequency detection part reads the logs in the current time interval from the log storage unit, converts the logs into frequencies, and calculates, according to the log frequency model, whether the log frequency deviates from the normal model.

The display unit presents the detection result to the user. The user can view the detection result and retrieve the corresponding log information. The user can also mark whether the detection result is correct, and the display unit adjusts the corresponding model accordingly.

The following describes the implementation of the technical solution in further detail with reference to fig. 1-2. Fig. 8-9 show the onehot encoding results corresponding to each log classification in fig. 6, where the horizontal axis of fig. 8-9 is time and the vertical axis is the log classification. Fig. 8 shows the encoding result of step A5. Fig. 9 shows the onehot encoding result after the removal in step A6.

Fig. 10-11 are schematic diagrams of results of the log frequency detection part in fig. 6, in which the JSD distance is used to measure the distance between frequency vectors; 10 reporting points form a 10-dimensional frequency vector, and a ring ratio JSD statistic and a center ratio JSD statistic are obtained from the frequency vectors.

The historical log frequency modeling part needs to calculate the distance between two frequency vectors, which requires a proper measure. Kullback-Leibler (KL) divergence is a common distance measure; it measures the difference between two probability distributions in the same event space from the angle of information entropy, i.e. relative entropy. However, because KL divergence is asymmetric, among other reasons, the embodiment of the present invention employs a variation of KL divergence: the Jensen-Shannon divergence (JSD) distance.

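Assuming two probability distributions P and Q, the JSD distance is defined as:

$$\mathrm{JSD}(P\|Q) = \frac{1}{2} D(P\|M) + \frac{1}{2} D(Q\|M)$$

wherein

$$M = \frac{1}{2}(P + Q)$$

and D(P‖M) is the KL distance. For discrete probability distributions, let P = [p₁, p₂, …, p_n], Q = [q₁, q₂, …, q_n];

Then

$$\mathrm{JSD}(P\|Q) = \frac{1}{2}\sum_{i=1}^{n} p_i \log\frac{2p_i}{p_i + q_i} + \frac{1}{2}\sum_{i=1}^{n} q_i \log\frac{2q_i}{p_i + q_i}$$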
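The JSD distance between two discrete distributions can be computed directly from its definition as the average of two KL distances to the midpoint distribution M. The following is a minimal sketch using only the Python standard library; the function names are hypothetical, natural logarithms are an assumption (the embodiment does not state the base), and terms with zero probability are taken to contribute 0.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], m: Sequence[float]) -> float:
    """D(P||M) = sum_i p_i * log(p_i / m_i); terms with p_i == 0 contribute 0."""
    return sum(pi * math.log(pi / mi) for pi, mi in zip(p, m) if pi > 0)


def jsd(p: Sequence[float], q: Sequence[float]) -> float:
    """JSD(P||Q) = 1/2 * D(P||M) + 1/2 * D(Q||M), with M = (P + Q) / 2."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    # m_i > 0 whenever p_i > 0 or q_i > 0, so no division by zero occurs.
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)
```

Unlike the KL distance, this quantity is symmetric in P and Q, and with natural logarithms it is bounded above by ln 2, which is reached when the two distributions do not overlap.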

Further, based on the JSD distance, the specific steps of the historical log frequency modeling part are as follows:

Step C1: at a certain time interval (5 minutes in this embodiment, or another suitable time), count the log frequency x reported by each reporting point (host, instance). Assuming that n reporting points are assigned to service1, the frequency vector dimension is equal to n, and the frequency vector is x = [x₁, x₂, …, x_n].

Step C2: convert the frequency vector into a frequency scale vector P, i.e. a vector of proportions (this conversion may also be omitted), the ith element of P being:

$$p_i = \frac{x_i}{\sum_{j=1}^{n} x_j}$$

Similarly, the center frequency vector is converted into a center frequency scale vector P_center.

Step C3: for each time t, calculate the JSD distance between the frequency scale vector P_t and the frequency scale vector P_{t-1} of the previous time, obtaining the ring ratio JSD distance statistic.

Step C4: for each time t, calculate the JSD distance between the frequency scale vector P_t and the center frequency scale vector P_center, obtaining the center ratio JSD distance statistic.

Step C5: for the ring ratio distances at all times, calculate the ring ratio distance threshold T_ring by using a KDE (kernel density estimation) algorithm. In this embodiment, the probability threshold is set to 95%.

Step C6: for the center ratio distances at all times, calculate the center ratio distance threshold T_center by using a KDE (kernel density estimation) algorithm.

Step C7: save the center frequency scale vector P_center and the thresholds T_ring and T_center of each service to the statistical model.
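Steps C1 to C7 can be sketched for a single service as follows. This is an illustrative sketch under several assumptions that the embodiment does not fix: the center frequency vector is taken as the per-component mean over the history, the KDE uses a Gaussian kernel with a rule-of-thumb bandwidth, the threshold is the point at which the estimated cumulative probability reaches 95%, and an interval with no logs is treated as a uniform vector. All function names are hypothetical.

```python
import math
import statistics


def to_proportions(x):
    """Step C2: convert a frequency vector into a frequency scale vector."""
    total = sum(x)
    if total == 0:
        # Assumption: an interval with no logs is treated as uniform.
        return [1.0 / len(x)] * len(x)
    return [xi / total for xi in x]


def jsd(p, q):
    """JSD distance with natural logarithms; zero-probability terms contribute 0."""
    dist = 0.0
    for pi, qi in zip(p, q):
        mi = (pi + qi) / 2
        if pi > 0:
            dist += 0.5 * pi * math.log(pi / mi)
        if qi > 0:
            dist += 0.5 * qi * math.log(qi / mi)
    return dist


def kde_threshold(samples, prob=0.95):
    """Steps C5/C6: smallest d whose Gaussian-KDE cumulative probability reaches prob."""
    n = len(samples)
    sd = statistics.pstdev(samples)
    # Assumption: normal-reference rule-of-thumb bandwidth, with a tiny fallback.
    h = 1.06 * sd * n ** -0.2 if sd > 0 else 1e-6

    def cdf(d):
        return sum(0.5 * (1 + math.erf((d - s) / (h * math.sqrt(2)))) for s in samples) / n

    lo, hi = min(samples) - 10 * h, max(samples) + 10 * h
    for _ in range(60):  # bisection on the monotone KDE cumulative distribution
        mid = (lo + hi) / 2
        if cdf(mid) < prob:
            lo = mid
        else:
            hi = mid
    return hi


def build_frequency_model(history, prob=0.95):
    """history: frequency vectors x_t of one service, one per time interval (step C1)."""
    props = [to_proportions(x) for x in history]
    dims = len(history[0])
    center = [sum(x[i] for x in history) / len(history) for i in range(dims)]
    p_center = to_proportions(center)
    ring = [jsd(props[t], props[t - 1]) for t in range(1, len(props))]  # step C3
    cent = [jsd(p, p_center) for p in props]                            # step C4
    return {                                                            # step C7
        "p_center": p_center,
        "t_ring": kde_threshold(ring, prob),
        "t_center": kde_threshold(cent, prob),
    }
```

Calling `build_frequency_model` on a list of per-interval frequency vectors returns the three values that step C7 stores. In practice the history would need enough intervals for the density estimate to be meaningful.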

Further, based on the JSD distance, the specific steps of the real-time log frequency detection part are as follows:

Step D1: logs generated by the system in real time are collected periodically, once every 5 minutes in this example. Structured information is extracted from the logs within the current 5 minutes. The log frequency vector X_t under each service is obtained by counting and is converted into a frequency scale vector P_t.

Step D2: calculate the JSD distance between the frequency scale vector P_t and the frequency scale vector P_{t-1} of the previous time, i.e. the ring ratio JSD distance D_ring.

Calculate the JSD distance between the frequency scale vector P_t and the center frequency scale vector P_center, i.e. the center ratio JSD distance D_center.

Step D3: if the ring ratio JSD distance D_ring is greater than the threshold T_ring and the center ratio JSD distance D_center is greater than the threshold T_center, the current time is the anomaly start time; if an anomaly has already started and the current center ratio JSD distance D_center is greater than the threshold T_center, the anomaly persists; otherwise the anomaly terminates.

Step D4: if an anomaly is detected, calculate an anomaly score for each dimension based on the center ratio JSD distance.

According to the calculation formula of the JSD distance, the contribution degree of the ith component to JSD(P‖Q) is as follows:

wherein P = P_t, Q = P_center.

Step D5: sort the contribution degrees from large to small; the dimensions of the first x contribution degrees that are not less than the contribution degree threshold (set to 80% in this embodiment) are considered anomaly dimensions, i.e. anomalous reporting points (hosts).
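Steps D3 to D5 can be sketched as a two-state rule plus a ranking of per-dimension contributions. The sketch rests on assumptions that should be read as one possible interpretation rather than the embodiment's exact formula: the contribution of dimension i is taken to be the ith summand of JSD(P‖Q) divided by the total, and the anomaly dimensions are the top-ranked dimensions whose cumulative share first reaches 80%. The function names are hypothetical.

```python
import math


def jsd_terms(p, q):
    """Per-dimension summands of JSD(P||Q); their sum is the JSD distance."""
    terms = []
    for pi, qi in zip(p, q):
        mi = (pi + qi) / 2
        term = 0.0
        if pi > 0:
            term += 0.5 * pi * math.log(pi / mi)
        if qi > 0:
            term += 0.5 * qi * math.log(qi / mi)
        terms.append(term)
    return terms


def update_state(in_anomaly, d_ring, d_center, t_ring, t_center):
    """Step D3: return True if an anomaly is active after the current interval."""
    if not in_anomaly:
        # An anomaly starts only when both distances exceed their thresholds.
        return d_ring > t_ring and d_center > t_center
    # An ongoing anomaly persists while the center ratio distance stays high.
    return d_center > t_center


def anomalous_dimensions(p_t, p_center, share=0.8):
    """Steps D4/D5 (assumed reading): top dimensions covering `share` of the JSD."""
    terms = jsd_terms(p_t, p_center)
    total = sum(terms)
    if total <= 0:
        return []
    order = sorted(range(len(terms)), key=lambda i: terms[i], reverse=True)
    picked, covered = [], 0.0
    for i in order:
        picked.append(i)
        covered += terms[i] / total
        if covered >= share:
            break
    return picked
```

With P_t = [0.05, 0.35, 0.3, 0.3] and a uniform P_center over four reporting points, the first dimension alone accounts for roughly 85% of the distance, so only index 0 is returned as the anomalous reporting point.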

The historical error log modeling part comprises the following specific steps:

Step E1: divide the historical logs into a plurality of log sets at an hourly time granularity, and classify the error-level logs in each log set by using the log classification template. Under each service, onehot encoding is performed for each classification, for example (C1: 1, C2: 0, …, Cn: 1);

Step E2: for the whole statistical period, calculate the occurrence probability of each log classification. Assuming that the statistical period is N hours, and that the classification Ci is coded as 1 in the first hour and in the second hour and as 0 in all other time periods, the occurrence probability of the classification Ci is

$$P_i = \frac{2}{N}$$

If Pi is higher than the threshold (set to 0.8 in the embodiment), the log classification is considered to appear over a long term and to have no characterization effect on system abnormality, and it is recorded into the error log model and marked as normal; the other classifications are labeled abnormal.

Step E3: manual annotation can modify content in the model. If a log is marked as abnormal, the classification is recorded into an error log model and marked as abnormal. Similarly, the log classification may also be modified to normal.
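Steps E1 to E3 can be sketched as follows, assuming the logs of one service have already been classified and grouped by hour. The input shape (one set of classification identifiers per hour, standing for the classifications whose onehot code is 1 in that hour) and the function name are hypothetical.

```python
from collections import defaultdict


def build_error_log_model(hourly_classes, threshold=0.8, manual_labels=None):
    """Build the error log model of one service.

    hourly_classes: one collection of classification IDs per hour, listing
    the error-level classifications that appeared in that hour.
    """
    n_hours = len(hourly_classes)
    hours_seen = defaultdict(int)
    for classes in hourly_classes:
        for class_id in set(classes):  # onehot: count each class once per hour
            hours_seen[class_id] += 1

    model = {}
    for class_id, count in hours_seen.items():
        probability = count / n_hours  # step E2: occurrence probability Pi
        # Long-term recurring classifications are treated as noise floor.
        model[class_id] = "normal" if probability > threshold else "abnormal"

    # Step E3: manual annotation overrides the learned labels.
    model.update(manual_labels or {})
    return model
```

For instance, a classification that appears in every hour of a five-hour period has Pi = 1.0 and is marked normal, while one that appears in a single hour has Pi = 0.2 and is marked abnormal unless a manual label says otherwise.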

The real-time error log detection part comprises the following specific steps:

Step F1: logs generated by the system in real time are collected periodically, once every 5 minutes in this example. Structured information is extracted from the logs within the current 5 minutes, and logs of non-error levels are removed. Under each service, the logs are classified by using the log classification template, and onehot coding is performed for each classification, for example (C1: 1, C2: 0, …, Cn: 1).

Step F2: query the error log model corresponding to the service in the database, and eliminate the log classifications marked as normal to obtain the effective log classifications. Retrieve the log original text for fault analysis according to the effective log classifications; the service, instance, and host corresponding to the log original text are the fault points.
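Steps F1 and F2 amount to a filter over the structured log records of the current interval. The sketch below assumes each record is a dictionary with the hypothetical keys `level`, `class_id`, `service`, `instance`, `host`, and `text`, that the error level is spelled `"ERROR"`, and that a classification absent from the model is treated as abnormal; none of these details are specified by the embodiment.

```python
def detect_error_logs(records, model):
    """Return the effective error logs and the fault points they indicate.

    records: structured logs of the current interval for one service.
    model: mapping from classification ID to "normal" or "abnormal".
    """
    effective = [
        record
        for record in records
        if record["level"] == "ERROR"                               # step F1
        and model.get(record["class_id"], "abnormal") != "normal"  # step F2
    ]
    # The (service, instance, host) of each effective log is a fault point.
    fault_points = sorted(
        {(r["service"], r["instance"], r["host"]) for r in effective}
    )
    return effective, fault_points
```

An empty `effective` list corresponds to a normal result for the error log flow in the current interval.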

For reasons of efficiency and limited resources, many commercial products output only error-level logs. Because the volume of error-level logs is relatively small, the frequency data has no statistical significance, and anomalies are easily missed during detection. In this case, only the detection method based on error-level logs may be used. Fig. 7 shows a schematic flow chart of this embodiment, which includes the following specific steps:

Step G1: collect historical error-level logs and perform structuring processing to obtain the timestamp, log level, reporting service, reporting instance, and reporting host of each log.

Step G2: extract a template for log classification from the log original text, and obtain a plurality of log classifications (C1, C2, …, Cn).

Step G3: divide the historical logs into a plurality of log sets at an hourly time granularity, and classify the error-level logs in each log set by using the log classification template. Under each service, onehot encoding is performed for each classification, for example (C1: 1, C2: 0, …, Cn: 1).

Step G4: for the whole statistical period, calculate the occurrence probability of each log classification. Assuming that the statistical period is N hours, and that the classification Ci is coded as 1 in the first hour and in the second hour and as 0 in all other time periods, the occurrence probability of the classification Ci is

$$P_i = \frac{2}{N}$$

Step G5: manual annotation can modify content in the model. If a log is marked as abnormal, the classification is recorded into the error log model and marked as abnormal. Similarly, a log classification may also be modified to normal.

Step G6: logs generated by the system in real time are collected periodically, once every 5 minutes in this example. Structured information is extracted from the logs within the current 5 minutes, and logs of non-error levels are removed. Under each service, the logs are classified by using the log classification template, and onehot coding is performed for each classification, for example (C1: 1, C2: 0, …, Cn: 1).

Step G7: query the error log model corresponding to the service in the database, and eliminate the log classifications marked as normal to obtain the effective log classifications. Retrieve the log original text for fault analysis according to the effective log classifications; the service, instance, and host corresponding to the log original text are the fault points.

As the above steps show, the preferred implementation of the method does not need a large amount of computing resources or manual data labeling, can identify system abnormalities, can discover hidden dangers of the system in advance, and can notify system maintenance personnel.

Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.

Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the steps of:

S1, collecting a first log generated by the computer cluster system;

S2, analyzing the error log in the first log by using an error log model, determining whether the computer cluster system is abnormal, obtaining a first determination result, obtaining a frequency component of the first log, determining whether the computer cluster system is abnormal according to whether the frequency component is within a preset range, and obtaining a second determination result, wherein the error log model is trained by machine learning using a plurality of groups of data, and each group of data in the plurality of groups of data includes: a log of error levels;

S3, determining whether the computer cluster system is abnormal according to the first determination result and the second determination result.

Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing computer programs, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.

Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.

Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.

Optionally, in this embodiment, the processor may be configured to execute the following steps by a computer program:

S1, collecting a first log generated by the computer cluster system;

Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.

It will be apparent to those skilled in the art that the modules or steps of the present invention described above may be implemented by a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and alternatively, they may be implemented by program code executable by a computing device, such that they may be stored in a storage device and executed by a computing device, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into individual integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present invention is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the principle of the present invention should be included in the protection scope of the present invention.

Claims

1. An anomaly detection method for a computer cluster system, comprising:

collecting a first log generated by a computer cluster system;

analyzing error logs in the first log through an error log model, determining whether the computer cluster system is abnormal or not to obtain a first determination result, obtaining a frequency component of the first log, determining whether the computer cluster system is abnormal or not according to whether the frequency component is within a preset range or not to obtain a second determination result, wherein the error log model is trained through machine learning by using multiple groups of data, and each group of data in the multiple groups of data comprises: a log of error levels;

and determining whether the computer cluster system is abnormal according to the first determination result and the second determination result.

2. The method of claim 1, wherein after collecting a first log generated by a computer cluster system, the method further comprises:

classifying the first log to obtain a plurality of log categories;

determining the occurrence probability of each log category in a specified statistical period;

adding a log category having a probability of occurrence greater than a first threshold to the error log model.

3. The method of claim 1, wherein obtaining a frequency component of the first log comprises:

determining a frequency component of each reporting point of the first log according to a preset time interval, wherein n reporting points exist, and n is a positive integer;

forming the frequency component x_t = [x₁, x₂, …, x_n] of the first log from the frequency components of said each reporting point, wherein x₁, x₂, …, x_n correspond to the frequency components of the different reporting points.

4. The method of claim 3, wherein determining whether the computer cluster system is abnormal according to whether the frequency component is within a preset range, and obtaining a second determination result comprises:

obtaining the ring ratio distance D_ring according to the frequency component x_t = [x₁, x₂, …, x_n] and the frequency component x_{t-1} of the time previous to the current time;

obtaining the center ratio distance D_center according to the frequency component x_t = [x₁, x₂, …, x_n] and the center frequency component x_center of the frequency component x_t = [x₁, x₂, …, x_n], wherein the center frequency component x_center is determined by:

$$x_{center} = [\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n]$$

wherein $\bar{x}_1, \bar{x}_2, \ldots, \bar{x}_n$ are the mean values corresponding to x₁, x₂, …, x_n respectively;

determining that the computer cluster system is abnormal when the ring ratio distance D_ring is greater than a second threshold and the center ratio distance D_center is greater than a third threshold.

5. The method according to any one of claims 1 to 4, wherein determining whether a system is abnormal according to the first determination result and the second determination result includes:

6. The method of claim 5, further comprising:

when the first determination result indicates that the computer cluster system is abnormal, setting a detection result of whether the computer cluster system is abnormal or not, which is determined according to the first determination result and the second determination result, as an importance level;

7. An abnormality detection apparatus for a computer cluster system, comprising:

the acquisition module is used for acquiring a first log generated by the computer cluster system;

the first determining module is configured to analyze logs of error categories in the first log through an error log model, determine whether the computer cluster system is abnormal, obtain a first determining result, obtain a frequency component of the first log, determine whether the computer cluster system is abnormal according to whether the frequency component is within a preset range, and obtain a second determining result, where the error log model is trained through machine learning by using multiple groups of data, and each group of data in the multiple groups of data includes: a log of error levels;

and the second determining module is used for determining whether the computer cluster system is abnormal or not according to the first determining result and the second determining result.

8. The apparatus of claim 7, further comprising:

the classification module is used for classifying the first log to obtain a plurality of log categories;

the third determining module is used for determining the occurrence probability of each log category in a specified statistical period;

and the adding module is used for adding the log categories with the occurrence probability larger than a first threshold value into the error log model.

9. A storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the method of any of claims 1 to 6 when executed.

10. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 6.