CN117312098B

CN117312098B - Log abnormity alarm method and device

Info

Publication number: CN117312098B
Application number: CN202311560934.6A
Authority: CN
Inventors: 申志伟; 时文丰; 朱肖曼
Original assignee: 6th Research Institute of China Electronics Corp
Current assignee: 6th Research Institute of China Electronics Corp
Priority date: 2023-11-22
Filing date: 2023-11-22
Publication date: 2024-03-01
Anticipated expiration: 2043-11-22
Also published as: CN117312098A

Abstract

The application provides a log abnormality alarm method and device, and relates to the technical field of data processing. The method comprises: acquiring log data to be detected from different cloud platforms; determining whether an alarm rule matching the service type, the server identifier and the cloud platform identifier exists in an alarm rule database, wherein the alarm rule comprises an alarm keyword; if a matching alarm rule exists, determining whether a data item matching the alarm keyword exists in the log data to be detected; if a matching data item exists, performing a log abnormality alarm according to the log alarm record corresponding to the alarm rule in an alarm state database; if no matching data item exists, performing log abnormality detection on the log data to be detected by using a log abnormality detection model, and performing a log abnormality alarm when a log abnormality is detected. The log abnormality alarm method and device address the low processing efficiency and low accuracy of log alarming in the prior art.

Description

Log abnormity alarm method and device

Technical Field

The application relates to the technical field of data processing, in particular to a log abnormality alarming method and device.

Background

Cloud computing has become increasingly popular as a new infrastructure model, and enterprises and institutions can greatly reduce the cost of building their own information technology infrastructure by adopting cloud services. To meet the requirements of users in different regions, such as low latency, a cloud service provider usually builds a plurality of data centers and provides users with nearby computing, storage and other services. As a key information infrastructure, the cloud platform runs the business application services of users, so stable operation and timely fault warning are very important for both users and cloud service operators. A cloud data center comprises many hardware devices, such as network equipment, servers and storage, and many software services, such as a cloud management platform. Various faults inevitably occur during operation, and logs are an important basis for diagnosing cloud platform faults. Deploying an efficient log monitoring system in the cloud platform helps operation and maintenance personnel keep track of the operating condition of the cloud platform and find and troubleshoot faults in time, thereby ensuring reliable operation of the cloud platform.

However, current log alarm services usually target only a single data center. When log alarm services are provided for a plurality of data centers, the volume of log data is large, and if log alarms are handled by having operation and maintenance personnel analyze and inspect the log data, processing efficiency and accuracy are low.

Disclosure of Invention

Accordingly, the present application is directed to a log abnormality alarm method and apparatus, so as to solve the problems of low processing efficiency and low accuracy of log alarm in the prior art.

In a first aspect, an embodiment of the present application provides a log anomaly alarm method, including:

obtaining log data to be detected from different cloud platforms, wherein the log data to be detected comprises a service type, a server identifier and a cloud platform identifier;

determining whether an alarm rule matched with the service type, the server identifier and the cloud platform identifier exists in an alarm rule database, wherein the alarm rule comprises an alarm keyword;

if the matched alarm rule exists, determining whether a data item matched with the alarm keyword exists in the log data to be detected;

if a matching data item exists, performing a log abnormality alarm according to the log alarm record corresponding to the alarm rule in an alarm state database;

if no matching data item exists, performing log abnormality detection on the log data to be detected by using a log abnormality detection model, and performing a log abnormality alarm when a log abnormality is detected.

Optionally, performing the log abnormality alarm according to the log alarm record corresponding to the alarm rule in the alarm state database includes: querying whether a log alarm record matching the alarm rule exists in the alarm state database; if a matching log alarm record exists, incrementing the number of alarms in the log alarm record by one, recording the latest alarm time, and performing alarm processing according to the alarm level in the log alarm record; if no matching log alarm record exists, generating a new log alarm record in the alarm state database, generating log alarm information corresponding to the log alarm record, and sending the log alarm information to operation and maintenance personnel.

Optionally, after determining whether the alert rule database has the alert rule matching the service type, the server identifier and the cloud platform identifier, the method further includes: if the matched alarm rules do not exist, the log abnormality detection model is utilized to carry out log abnormality detection on the log data to be detected, and when the log abnormality is detected, log abnormality alarm is carried out.

Optionally, performing alarm processing according to the alarm level in the log alarm record includes: determining whether the alarm level is urgent; if so, acquiring the number of alarms, comparing the number of alarms within a preset time range with a set count threshold, and performing alarm processing according to the comparison result; if not, determining the interval between the last alarm and the current alarm in the log alarm record, and performing alarm processing according to the interval.

Optionally, performing alarm processing according to the comparison result includes: if the number of alarms within the preset time range is greater than or equal to the set count threshold, determining that the fault is a blocking type fault and sending a blocking type fault alarm; if the number of alarms within the preset time range is less than the set count threshold, determining that the fault is a transient fault.

Optionally, the method further comprises: if the log abnormality is determined to be a transient fault, changing the service running state of the target service in which the log abnormality occurred to potentially stopped in the alarm state database; cyclically detecting the service running state of the target service at a first interval duration; if the service running state of the target service is detected to be stopped, or if the service running state is detected to be normal but is then detected to be stopped during cyclic detection at a second interval duration within a preset time range, determining that the target service has a downtime fault.

Optionally, the log abnormality detection model includes a word embedding layer, a convolution layer, a max pooling layer, a concatenation layer, a bidirectional recurrent neural network layer, an attention layer and a classification layer, wherein the convolution layer includes a plurality of convolution kernels.

Optionally, obtaining log data to be detected from different cloud platforms includes: for each cloud platform, collecting platform log data corresponding to different services in the cloud platform, and adding a log label to the platform log data, wherein the log label comprises a service type and a server identifier; putting the platform log data with the log label added into a message queue of the cloud platform; extracting the platform log data from the message queues of different cloud platforms, and performing label conversion processing on the platform log data to obtain log data to be detected carrying the cloud platform identifier.

Optionally, performing label conversion processing on the platform log data to obtain log data to be detected carrying the cloud platform identifier includes: adding a cloud platform identifier to the platform log data to obtain the log data to be detected.

In a second aspect, an embodiment of the present application further provides a log anomaly alarm device, where the device includes:

the data acquisition module is used for acquiring log data to be detected from different cloud platforms, wherein the log data to be detected comprises a service type, a server identifier and a cloud platform identifier;

the rule matching module is used for determining whether an alarm rule matched with the service type, the server identifier and the cloud platform identifier exists in the alarm rule database or not, wherein the alarm rule comprises an alarm keyword;

the keyword matching module is used for determining whether a data item matched with the alarm keyword exists in the log data to be detected if a matched alarm rule exists;

the rule alarm module is used for performing a log abnormality alarm according to the log alarm record corresponding to the alarm rule in the alarm state database if a matching data item exists;

and the model alarm module is used for carrying out log abnormality detection on log data to be detected by using the log abnormality detection model if no matched data item exists, and carrying out log abnormality alarm when the log abnormality is detected.

The embodiment of the application brings the following beneficial effects:

According to the log abnormality alarm method and device provided by the embodiments of the application, log data can be obtained from different cloud platforms, processed in a unified manner and converted into log data to be detected, and log abnormalities are detected and alarmed from multiple dimensions by using the alarm rules and the log abnormality detection model.

In order to make the above objects, features and advantages of the present application more comprehensible, preferred embodiments accompanied with figures are described in detail below.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings that are needed in the embodiments will be briefly described below, it being understood that the following drawings only illustrate some embodiments of the present application and therefore should not be considered limiting the scope, and that other related drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 shows a flowchart of a log anomaly alert method provided by an embodiment of the present application;

FIG. 2 shows a schematic structural diagram of a log anomaly detection model provided in an embodiment of the present application;

FIG. 3 shows a schematic structural diagram of a log abnormality alarm device provided in an embodiment of the present application.

Detailed Description

For the purposes of making the objects, technical solutions and advantages of the embodiments of the present application more clear, the technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the drawings in the embodiments of the present application, and it is apparent that the described embodiments are only some embodiments of the present application, but not all embodiments. The components of the embodiments of the present application, which are generally described and illustrated in the figures herein, may be arranged and designed in a wide variety of different configurations. Thus, the following detailed description of the embodiments of the present application, as provided in the accompanying drawings, is not intended to limit the scope of the application, as claimed, but is merely representative of selected embodiments of the application. Based on the embodiments of the present application, every other embodiment that a person skilled in the art would obtain without making any inventive effort is within the scope of protection of the present application.

It is worth noting that, before the present application was made, cloud computing had become increasingly popular as a new infrastructure model, and enterprises and institutions could greatly reduce the cost of building their own information technology infrastructure by adopting cloud services. To meet the application requirements of users, cloud service providers offer a wide variety of infrastructure resources, including heterogeneous computing resources such as X86, ARM and GPU, and middleware such as databases and message queues. To meet the requirements of users in different regions, such as low latency, a cloud service provider usually builds a plurality of data centers and provides users with nearby computing, storage and other services. As a key information infrastructure, the cloud platform runs the business application services of users, so stable operation and timely fault warning are very important for both users and cloud service operators. A cloud data center comprises many hardware devices, such as network equipment, servers and storage, and many software services, such as a cloud management platform. Various faults inevitably occur during operation, and logs are an important basis for diagnosing cloud platform faults. Deploying an efficient log monitoring system in the cloud platform helps operation and maintenance personnel keep track of the operating condition of the cloud platform and find and troubleshoot faults in time, thereby ensuring reliable operation of the cloud platform.

However, current log alarm services usually target only a single data center. When log alarm services are provided for a plurality of data centers, the volume of log data is large, and if log alarms are handled by having operation and maintenance personnel analyze and inspect the log data, processing efficiency and accuracy are low.

Based on the above, the embodiment of the application provides a log abnormality alarm method, so as to improve the processing efficiency and accuracy of log alarm.

Referring to fig. 1, fig. 1 is a flowchart of a log abnormality alarm method according to an embodiment of the present application. As shown in fig. 1, the log abnormality alarm method provided in the embodiment of the present application includes:

Step S101, acquiring log data to be detected from different cloud platforms.

In this step, a cloud platform refers to a platform capable of providing various services. A service may be, for example, the nova service, which provides computing services for virtual machines, or the cinder service, which provides storage services for virtual machines.

Each cloud platform corresponds to a data center, each cloud platform comprises a plurality of servers, at least one service is arranged in each server, and the services exchange and process data through the data centers of the cloud platforms.

The log data to be detected is log data collected for different services, and each service belongs to at least one cloud platform.

The log data to be detected comprises log information, service types, server identifiers and cloud platform identifiers.

In the embodiment of the application, the log abnormality alarm method is realized by a log abnormality alarm system, and the log abnormality alarm system comprises a log data acquisition module, a log data aggregation module and a log data alarm module.

The log data acquisition module is used for collecting log data for the different services in each cloud platform. A unified log collection component can be deployed on each server of the data center so that the platform log data of different cloud platforms is collected according to unified indexes.

In an alternative embodiment, step S101 includes: step a1, step a2 and step a3.

Step a1, for each cloud platform, collecting platform log data corresponding to different services in the cloud platform, and adding a log label to the platform log data.

Here, a log label is a tag attached to log data; it marks information such as the service type and the server identifier corresponding to the platform log data. The log label includes a service type and a server identifier. For each piece of collected platform log data, two labels are added: a service type and a server identifier.

The service type label identifies which service the collected platform log data comes from; the service type may be, for example, nova or cinder. The server identifier label distinguishes which server the platform log data originates from; the server identifier may be, for example, the universally unique identifier (UUID) of the server.
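As a non-limiting illustration, the labelling in step a1 may be sketched in Python as follows. The names PlatformLog, tag_log, message, service_type and server_id are hypothetical and do not appear in the application, and the UUID is generated only so that the example is self-contained.

```python
import uuid
from dataclasses import dataclass


@dataclass(frozen=True)
class PlatformLog:
    """One platform log line plus the two log labels added in step a1."""
    message: str
    service_type: str  # e.g. "nova" or "cinder"
    server_id: str     # e.g. the UUID of the server


def tag_log(message: str, service_type: str, server_id: str) -> PlatformLog:
    """Attach the service type and server identifier labels to a raw log line."""
    return PlatformLog(message=message, service_type=service_type, server_id=server_id)


# Hypothetical server UUID, generated here purely for illustration.
SERVER_ID = str(uuid.uuid4())
record = tag_log("ERROR nova.compute.manager instance failed to spawn", "nova", SERVER_ID)
```

In a real deployment the server identifier would be read from the server rather than generated per run; that choice is outside what the application specifies.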

Step a2, putting the platform log data with the log label added into a message queue of the cloud platform.

After the log label is added, the platform log data added with the log label is pushed to a log message queue in a data center of the cloud platform, so that the data of different cloud platforms can be managed uniformly.

Step a3, extracting platform log data from the message queues of different cloud platforms, and performing label conversion processing on the platform log data to obtain log data to be detected carrying the cloud platform identifier.

Here, each cloud platform corresponds to one log message queue. The log data aggregation module is used for extracting, cleaning and aggregating the platform log data in the data center of each cloud platform and adding log labels to it. The log data aggregation module extracts the platform log data of each platform from the log message queues of the different cloud platforms, puts it into a unified log message queue, and then performs data cleaning and label conversion processing on each piece of platform log data, so as to remove useless data and identify which cloud platform the platform log data comes from.

In addition, the log data to be detected can be classified by service type and divided into different topics, so that the log data to be detected under the same topic can be extracted and detected.
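The division into topics by service type may, for example, be sketched as a simple grouping. This is only an illustration: the function name group_by_topic and the dictionary field names are assumptions, and the application does not specify which message queue product or topic mechanism is used.

```python
from collections import defaultdict
from typing import Dict, List


def group_by_topic(logs: List[Dict[str, str]]) -> Dict[str, List[Dict[str, str]]]:
    """Group log records into topics keyed by their service type."""
    topics = defaultdict(list)
    for log in logs:
        topics[log["service_type"]].append(log)
    return dict(topics)


logs = [
    {"service_type": "nova", "message": "instance spawned"},
    {"service_type": "cinder", "message": "volume attached"},
    {"service_type": "nova", "message": "instance deleted"},
]
topics = group_by_topic(logs)
```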

In an alternative embodiment, step a3 comprises: step a31.

Step a31, adding a cloud platform identifier to the platform log data to obtain the log data to be detected.

The label conversion processing adds a cloud platform identifier on top of the log label to form log data to be detected carrying the cloud platform identifier. Alternatively, the label conversion processing may add a data center identifier on top of the log label; because data center identifiers correspond one-to-one with cloud platforms, adding the data center identifier likewise identifies the cloud platform to which the platform log data belongs.

The log labels are hierarchical so that log data from different cloud platforms, different servers and different service types can be distinguished. For example, after label conversion processing, the label of the log data to be detected has three levels: cloud platform identifier, server identifier and service type.
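The label conversion of step a31 may be illustrated by the following minimal sketch, which assumes that a labelled log record is held as a dictionary. The function name convert_label and the field name cloud_platform_id are hypothetical.

```python
from typing import Dict


def convert_label(platform_log: Dict[str, str], cloud_platform_id: str) -> Dict[str, str]:
    """Return a copy of the labelled platform log with the cloud platform identifier added."""
    to_detect = dict(platform_log)  # copy, so the input record is left untouched
    to_detect["cloud_platform_id"] = cloud_platform_id
    return to_detect


tagged = {
    "message": "ERROR volume attach failed",
    "service_type": "cinder",
    "server_id": "server-01",
}
log_to_detect = convert_label(tagged, "cloud-a")
```

After conversion the record carries all three label levels: cloud platform identifier, server identifier and service type.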

While the log label conversion is performed, data cleaning is also performed on the platform log data: the keywords in the platform log data are retained and all other data is discarded, so as to reduce network bandwidth pressure.

Step S102, determining whether an alarm rule matched with the service type, the server identifier and the cloud platform identifier exists in an alarm rule database.

In this step, the alarm rule database is used for storing alarm rules, including user-defined alarm rules entered by the user in advance. Different services have different alarm rules. An alarm rule is a log alarm rule built from data such as alarm keywords and alarm levels, and a user can set alarm rules for a specified service through the front end.

Alert rules may refer to specific rules for characterizing log anomaly detection.

An alarm rule comprises an alarm rule key, an alarm keyword and an alarm level. The alarm rule key is determined from the service type, the server identifier and the cloud platform identifier; for example, the service type, the server identifier and the cloud platform identifier are joined with a special character to obtain the alarm rule key.
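One possible way to build and look up the alarm rule key is sketched below. The separator "|" is an assumption, since the application does not name the special character, and the in-memory dictionary ALARM_RULES merely stands in for the alarm rule database. All function and field names are hypothetical.

```python
from typing import Dict, Optional

SEPARATOR = "|"  # assumed special character; the application does not name one


def make_rule_key(service_type: str, server_id: str, cloud_platform_id: str) -> str:
    """Join the three identifiers into a single alarm rule key."""
    return SEPARATOR.join((service_type, server_id, cloud_platform_id))


# In-memory stand-in for the alarm rule database, indexed by alarm rule key.
ALARM_RULES: Dict[str, Dict[str, str]] = {
    make_rule_key("nova", "server-01", "cloud-a"): {"keyword": "error", "level": "urgent"},
}


def find_rule(log: Dict[str, str]) -> Optional[Dict[str, str]]:
    """Return the alarm rule whose key matches the log's identifiers, or None."""
    key = make_rule_key(log["service_type"], log["server_id"], log["cloud_platform_id"])
    return ALARM_RULES.get(key)
```

A separator that cannot occur inside any of the three identifiers avoids two different identifier triples producing the same key.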

In the embodiment of the application, log alarming is performed by the log data alarm module. A unified multi-cloud-platform log monitoring and alarm system that combines alarm rules with deep learning is established. It supports unified alarming on the log data of multiple cloud platforms, realizes centralized monitoring and alarm management of cloud platform logs, provides multi-dimensional and highly accurate log alarms, and improves operation and maintenance efficiency.

Log alarming based on alarm rule matching enables exact matching against specific fields in the log. The log data to be detected of each cloud platform in the unified log message queue is processed in real time using a big-data real-time (streaming) computing mode.

Log alarming based on a deep learning algorithm extracts deep short-distance log features and long-distance log features, determines important feature information according to the contribution of each sequence to the final log abnormality result, thereby completing effective feature screening, and detects log abnormalities according to the short-distance log features, the long-distance log features and the important features.

It should be noted that the target log data includes multiple entries, and the alarm rules and the log abnormality detection model can be used to perform log abnormality detection on the entries one by one in a loop. That is, after the first entry of target log data has been checked and any log abnormality alarm has been handled, the second entry is checked and any log abnormality alarm is handled, and so on, until all entries of the target log data have been checked and their log abnormality alarms handled. The processes of log abnormality detection and log abnormality alarm handling are described below by taking one piece of log data to be detected as an example.

When log data A to be detected is checked, the service type, the server identifier and the cloud platform identifier are extracted from log data A, an alarm rule key is determined from the extracted service type, server identifier and cloud platform identifier, and it is then determined whether an index matching the alarm rule key exists in the alarm rule database.

In an alternative embodiment, after step S102, the method further includes: step b1.

Step b1, if no matching alarm rule exists, performing log abnormality detection on the log data to be detected by using a log abnormality detection model, and performing a log abnormality alarm when a log abnormality is detected.

Here, the log abnormality detection model may refer to a model for log abnormality detection, which is a supplement to log alarms based on alarm rules. The log anomaly detection model is a model constructed based on a deep learning algorithm.

Specifically, if no matching index exists, log abnormality detection is performed on the log data to be detected by the log abnormality detection model. The log data to be detected is templated to obtain target log data, and the target log data is then input into the log abnormality detection model for log abnormality detection to obtain a log abnormality detection result.

Step S103, if the matched alarm rule exists, determining whether a data item matched with the alarm keyword exists in the log data to be detected.

In the step, if a matched index exists in the alarm rule database, the alarm rule under the index is used as the alarm rule matched with the log data A to be detected.

For the alarm keyword in the alarm rule, a regular expression corresponding to the alarm keyword is used to perform pattern matching on the log data to be detected. If the match succeeds, step S104 is performed; if the match fails, step S105 is performed.

In the embodiment of the application, assuming that the alarm keyword is error, a regular expression corresponding to error is generated, the log data to be detected is scanned with the regular expression, and it is determined whether a data item matching error exists in the log data to be detected.
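The keyword matching of step S103 may be sketched with Python's standard re module as follows. Treating the keyword as a literal string and matching case-insensitively are both assumptions made for this illustration; the application does not state how the regular expression is derived from the keyword. The function names are hypothetical.

```python
import re


def keyword_pattern(keyword: str) -> "re.Pattern":
    """Build a regular expression for an alarm keyword.

    re.escape treats the keyword as literal text; IGNORECASE is an assumption.
    """
    return re.compile(re.escape(keyword), re.IGNORECASE)


def has_matching_item(log_message: str, keyword: str) -> bool:
    """Return True if the log message contains a data item matching the keyword."""
    return keyword_pattern(keyword).search(log_message) is not None
```

With the keyword error, a message containing "ERROR" leads to step S104, while a message with no match leads to step S105.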

Step S104, if a matching data item exists, performing a log abnormality alarm according to the log alarm record corresponding to the alarm rule in the alarm state database.

In this step, the alarm state database may refer to a database for storing log alarm records, and each time a data item matching with an alarm keyword is detected in log data to be detected, the data in the alarm state database is updated.

In the embodiment of the application, if the matched data item (keyword) exists, it is indicated that the service corresponding to the log data to be detected has an error, that is, the log abnormality is detected, and the specific situation of the error can be further determined through the log alarm information. Therefore, the log alarm records corresponding to the alarm rules can be obtained through the alarm state database, so that log abnormal alarms can be carried out according to the log alarm records.

In an alternative embodiment, step S104 includes: step c1, step c2, step c3.

Step c1, querying whether a log alarm record matching the alarm rule exists in the alarm state database.

The log alarm record comprises a cloud platform identifier, a server identifier, a service type, an alarm state, an alarm time, an alarm level, a number of alarms and a service running state. The alarm state is one of alarmed, alarming and recovered.

Because the log alarm record of the log data to be detected may already exist in the alarm state database, it is required to determine whether the log alarm record matched with the alarm rule exists in the alarm state database, if the matched log alarm record exists, processing according to the log alarm record is required, and if the matched log alarm record does not exist, a new log alarm record is required to be generated.

When determining whether the matched log alarm records exist, because each log alarm record also comprises a service type, a server identifier and a cloud platform identifier, whether the target log alarm records matched with the alarm rules exist can be determined by matching the service type, the server identifier and the cloud platform identifier.

Step c2, if a matching log alarm record exists, incrementing the number of alarms in the log alarm record by one, recording the latest alarm time, and performing alarm processing according to the alarm level in the log alarm record.

If a matching log alarm record exists in the alarm state database, the number of alarms and the alarm time need to be updated. The number of alarms in the target log alarm record is read and incremented by one to give the current number of alarms, the current alarm time is recorded, and the target log alarm record is updated with the current number of alarms and the current alarm time. In addition, the target log alarm record includes an alarm level, which is set in the alarm rule. Because different alarm levels call for different alarm processing, the corresponding alarm processing is performed according to the alarm level.

Step c3, if no matching log alarm record exists, generating a new log alarm record in the alarm state database, generating log alarm information corresponding to the log alarm record, and sending the log alarm information to operation and maintenance personnel.

If no matching log alarm record exists in the alarm state database, the abnormality is newly discovered and a new log alarm record needs to be created, so a new log alarm record is generated in the alarm state database. Log alarm information corresponding to the log alarm record is generated, stored in a log alarm information database, and sent to operation and maintenance personnel by SMS, email, DingTalk or the like. The log alarm information comprises the cloud platform identifier, server identifier, service type, alarm time, alarm level, alarm keyword, fault type and alarm state corresponding to the detected log abnormality. The log alarm information database is a database for storing log alarm information.
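Steps c1 to c3 may be illustrated by the following sketch, in which a dictionary stands in for the alarm state database and a list stands in for the notification channel. The names AlarmRecord, record_alarm, state_db and outbox are hypothetical, and starting the count of a new record at 1 is an assumption not stated in the application.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List, Tuple

Key = Tuple[str, str, str]  # (cloud_platform_id, server_id, service_type)


@dataclass
class AlarmRecord:
    level: str
    count: int
    last_alarm_time: datetime


def record_alarm(state_db: Dict[Key, AlarmRecord], outbox: List[str],
                 key: Key, level: str, now: datetime) -> AlarmRecord:
    """Update an existing log alarm record, or create one and queue a notification."""
    record = state_db.get(key)              # step c1: look for a matching record
    if record is not None:                  # step c2: increment count, record time
        record.count += 1
        record.last_alarm_time = now
        return record
    # step c3: newly discovered abnormality; the initial count of 1 is an assumption
    record = AlarmRecord(level=level, count=1, last_alarm_time=now)
    state_db[key] = record
    outbox.append(f"new log alarm for {key} at {now.isoformat()} (level={level})")
    return record
```

Level-dependent alarm processing for an existing record is left to the caller in this sketch.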

Step S105, if no matched data item exists, log abnormality detection is carried out on log data to be detected by using a log abnormality detection model, and log abnormality warning is carried out when log abnormality is detected.

In the step, if no matched data item exists, the log data to be detected is subjected to templated processing to obtain templated target log data, the target log data is input into a trained log abnormality detection model, log abnormality online detection is carried out, and the detection result is used as supplement of log alarm based on alarm rules.

If the log abnormality detection model detects a log abnormality in the target log data, an alarm notification is sent, the alarm level defaults to medium, the alarm type is marked as detected by the log abnormality detection model, and operation and maintenance personnel are notified to investigate in time. After the target log data has been processed, processing of the next piece of target log data begins.
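The overall branching of steps S102 to S105, including step b1, may be summarized in the sketch below. The trained log abnormality detection model is replaced by a stub callable, stub_model, purely for illustration, and templating is omitted; the key separator, the field names and the returned strings are likewise assumptions.

```python
import re
from typing import Callable, Dict

RuleDb = Dict[str, Dict[str, str]]


def process_log(log: Dict[str, str], rules: RuleDb,
                model_is_abnormal: Callable[[str], bool]) -> str:
    """Decide which alarm path, if any, applies to one piece of log data."""
    key = "|".join((log["service_type"], log["server_id"], log["cloud_platform_id"]))
    rule = rules.get(key)                                   # step S102
    if rule is not None:
        pattern = re.compile(re.escape(rule["keyword"]), re.IGNORECASE)
        if pattern.search(log["message"]):                  # step S103
            return "rule_alarm"                             # step S104
    # Step b1 (no matching rule) and step S105 (no matching data item)
    # both fall back to the detection model.
    if model_is_abnormal(log["message"]):
        return "model_alarm"
    return "no_alarm"


rules = {"nova|server-01|cloud-a": {"keyword": "error"}}


def stub_model(message: str) -> bool:
    """Stand-in for the trained model; flags messages mentioning a timeout."""
    return "timeout" in message
```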

In an alternative embodiment, the alarm processing in step c2 according to the alarm level in the log alarm record includes: step c21 to step c23.

Step c21, determining whether the alarm level is urgent.

Specifically, the alarm levels are urgent, medium and common.

Step c22, if so, acquiring the number of alarms, comparing the number of alarms within a preset time range with a set count threshold, and performing alarm processing according to the comparison result.

If the alarm level is urgent, the number of alarms in the log alarm record is queried, the number of alarms within a preset time range of 5 minutes is compared with a set count threshold of 5, and alarm processing is performed according to the comparison result. A person skilled in the art can select the specific values of the preset time range and the set count threshold according to the actual situation, and the application is not limited in this respect.

Step c23, if the alarm level is not urgent, determining the interval time between the last alarm in the log alarm record and the current alarm, and performing log alarm processing according to the interval time.

If the alarm level is not urgent, that is, the alarm level is medium or common, the last alarm time is obtained, and the interval time is obtained by subtracting the last alarm time from the current alarm time.

If the interval time is less than 5 minutes, no log alarm processing is performed, and the log alarm information is not sent to operation and maintenance personnel again. If the interval time is greater than or equal to 5 minutes, the log alarm information is sent to operation and maintenance personnel again, and the alarm time of the log alarm information is updated to the current sending time. The next piece of log data is then processed.
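The interval check for medium and common alarms can be illustrated with a short sketch. The function name `should_resend` and the constant name are hypothetical; the 5-minute value is the example value from the text.

```python
from datetime import datetime, timedelta

# Example value from the description; the application leaves it configurable.
RESEND_INTERVAL = timedelta(minutes=5)


def should_resend(
    last_alarm_time: datetime,
    now: datetime,
    min_interval: timedelta = RESEND_INTERVAL,
) -> bool:
    """Return True when a non-urgent alarm should be sent again.

    The interval is the current alarm time minus the last alarm time;
    alarms arriving sooner than `min_interval` are suppressed.
    """
    return now - last_alarm_time >= min_interval
```

When the function returns true, the caller would resend the log alarm information and update its alarm time to the current sending time.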

In an alternative embodiment, in step c22, the alarm processing is performed according to the comparison result, including: step c221 to step c222.

Step c221, if the number of alarms within the preset time range is greater than or equal to the set number threshold, determining that the alarm is a blocking type fault, and sending a blocking type fault alarm.

If the number of alarms within the preset time range of 5 minutes is greater than or equal to 5, the fault type is determined to be a blocking type fault that must be handled immediately. A log alarm indicating that the service has a blocking type fault is sent at once, notifying operation and maintenance personnel to investigate immediately and preventing severe consequences, such as the service stopping, caused by a prolonged abnormality. At the same time, the alarm state in the log alarm record is set to alarming.

Step c222, if the number of alarms within the preset time range is less than the set number threshold, determining that the fault is a transient fault.

If the number of alarms within the preset time range of 5 minutes is less than 5, the fault type is tentatively determined to be a transient fault. Transient faults are typically caused by events such as a user entering a wrong login password. No blocking type fault is reported in this case, and only an alarm notification is sent to operation and maintenance personnel.
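The comparison in steps c221 and c222 can be sketched as a count over a time window. The application does not say how the windowed count is stored; keeping a list of alarm timestamps is an assumption made here for illustration, as is the function name.

```python
from datetime import datetime, timedelta
from typing import Iterable


def classify_urgent_alarm(
    alarm_times: Iterable[datetime],
    now: datetime,
    window: timedelta = timedelta(minutes=5),  # example preset time range
    threshold: int = 5,                        # example set count threshold
) -> str:
    """Classify an urgent alarm as a "blocking" or "transient" fault.

    Counts the alarms whose timestamps fall within `window` before `now`
    and compares that count with `threshold`.
    """
    recent = sum(1 for t in alarm_times if timedelta(0) <= now - t <= window)
    return "blocking" if recent >= threshold else "transient"
```

A "blocking" result would trigger the immediate blocking type fault alarm, while "transient" would only send an ordinary alarm notification.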

In an alternative embodiment, the method further comprises: steps d1 to d3.

Step d1, if the log abnormality is determined to be a transient fault, changing the service running state of the target service in which the log abnormality occurred to potentially stopped in the alarm state database.

If the fault abnormality is a transient fault, a service survival detection mechanism is triggered, a service survival detection request is sent to a service corresponding to the service type, the server identifier and the cloud platform identifier, the service running state of the service is checked, and the service downtime caused by the fault is found in time, so that the risk of fault spreading is avoided.

After the service survival detection mechanism is triggered, the service running state of the target service is changed to potentially stopped in the alarm state database.

Step d2, cyclically detecting the service running state of the target service at a first interval duration.

Here, the running state of the target service is detected once per first interval duration, for example every 5 minutes, and the number of detections is recorded. A maximum number of detections may be set, for example 500.

Step d3, if the service running state of the target service is detected as stopped, or if it is detected as running normally but is then detected as stopped during cyclic detection at a second interval duration within a preset time range, determining that the target service has a downtime fault.

If the detection result for the target service is normal operation, the detection interval is increased by 5 minutes, so that detection is performed once every 10 minutes (the second interval duration) within a preset time range of 30 minutes. If no error occurs within the 30 minutes, the alarm is considered not to have affected service operation, the service survival detection mechanism is stopped, the running state of the target service is re-marked as normal operation, and the next piece of log data is processed.

If the detection result for the target service is stopped during detection at the first interval duration, a service downtime fault alarm is sent immediately to operation and maintenance personnel for timely handling. Alarm processing of the next piece of log data is then performed.
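One possible reading of steps d1 to d3 is the two-phase probe loop sketched below. It is a simplification under stated assumptions: `probe` and `wait` are hypothetical callables (a liveness check and a sleep function), the repeated first-phase cycling with a maximum detection count is reduced to a single probe, and the interval values are the example values from the text.

```python
from typing import Callable


def survival_check(
    probe: Callable[[], bool],        # hypothetical: True if the service is running
    wait: Callable[[float], None],    # hypothetical: e.g. time.sleep
    first_interval: float = 300,      # 5 minutes, in seconds
    second_interval: float = 600,     # 10 minutes, in seconds
    observe_window: float = 1800,     # 30 minutes, in seconds
) -> str:
    """Return "downtime" or "normal" for a potentially stopped service."""
    # Phase 1: detect at the first interval duration.
    wait(first_interval)
    if not probe():
        return "downtime"
    # Phase 2: the service looks healthy, so keep watching at the longer
    # second interval until the observation window has elapsed.
    elapsed = 0.0
    while elapsed < observe_window:
        wait(second_interval)
        elapsed += second_interval
        if not probe():
            return "downtime"
    return "normal"
```

Passing `wait` in as a parameter keeps the sketch testable without real delays; in a deployment the caller would update the service running state in the alarm state database according to the returned value.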

In an alternative embodiment, the log anomaly detection model comprises a word embedding layer, a convolution layer, a max pooling layer, a stitching layer, a bi-directional recurrent neural network layer, an attention layer, a classification layer, the convolution layer comprising a plurality of convolution kernels.

The log anomaly detection model is described below with reference to fig. 2.

Fig. 2 shows a schematic structural diagram of a log anomaly detection model provided in an embodiment of the present application.

As shown in fig. 2, the log anomaly detection model includes a word embedding layer, a convolution layer, a max pooling layer, a stitching layer, a bi-directional recurrent neural network layer, an attention layer, and a classification layer, and the convolution layer includes three convolution kernels. The word embedding layer is used for converting the log data to be detected into a vector matrix; the convolution layer is used for obtaining short-distance features, namely first feature vectors; the bidirectional recurrent neural network layer is used for acquiring long-distance characteristics, namely second characteristic vectors.

The log abnormality detection model, built on deep learning, combines a convolutional neural network (CNN), a bi-directional recurrent neural network, and an attention mechanism. The convolutional neural network extracts deep short-distance key log features; the bi-directional recurrent neural network performs sequential feature learning and extracts long-distance log features; and the attention mechanism computes the contribution of each sequence to the final log abnormality result, highlighting important feature information and completing effective feature screening, so that log abnormality detection is finally achieved.

Specifically, for each of the 5 pieces of log data to be detected collected within a period of time (for example, within 5 seconds), the log data is templated to obtain target log data, the target log data is input into the word embedding layer, and the log keys are mapped through word2vec to obtain a vector matrix.

The vector matrix corresponding to each piece of log data to be detected is then input into the CNN convolution layer for convolution. The convolution layer uses three groups of convolutional neural networks with different convolution kernel sizes: the convolution windows are set to 2, 3, and 4 respectively, the convolution stride is set to 2, and adjacent key features are extracted by the convolution kernels. The extracted features are then compressed by max pooling, and the compressed features are concatenated in the stitching layer to obtain the first feature vector corresponding to the log data to be detected.
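The convolution, max pooling, and stitching steps can be illustrated with a dependency-free toy sketch. It uses one filter per window size, fixed example weights, and no bias or activation; these are simplifying assumptions, and a real model would learn many filters in a deep learning framework.

```python
from typing import List

Matrix = List[List[float]]  # rows = tokens, columns = embedding dimensions


def conv1d_max(tokens: Matrix, kernel: Matrix, stride: int = 2) -> float:
    """Slide `kernel` over the token axis, then max-pool the responses."""
    window = len(kernel)
    if len(tokens) < window:
        raise ValueError("sequence is shorter than the convolution window")
    responses = []
    for start in range(0, len(tokens) - window + 1, stride):
        total = 0.0
        for row, weights in zip(tokens[start:start + window], kernel):
            total += sum(x * w for x, w in zip(row, weights))
        responses.append(total)
    return max(responses)  # max pooling over all window positions


def first_feature_vector(tokens: Matrix, kernels: List[Matrix], stride: int = 2) -> List[float]:
    """Concatenate the pooled output of every kernel (the stitching layer)."""
    return [conv1d_max(tokens, kernel, stride) for kernel in kernels]


# Toy input: 6 tokens with 2-dimensional embeddings, all-ones kernels
# with convolution windows 2, 3 and 4.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0], [0.0, 0.0], [1.0, 2.0]]
kernels = [[[1.0, 1.0]] * w for w in (2, 3, 4)]
features = first_feature_vector(tokens, kernels)
```

With a stride of 2, the window of size 2 is evaluated at token positions 0, 2, and 4, while the windows of size 3 and 4 are evaluated at positions 0 and 2 only.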

The 5 first feature vectors corresponding to the 5 pieces of log data to be detected are then input into the bi-directional recurrent neural network layer for long-distance feature extraction, yielding 5 second feature vectors, which are hidden vectors.

The 5 second feature vectors are input together into the attention layer, which computes a weighted average of the feature representations. The weighted-average feature vector is input into the classification layer, which determines whether the abnormality level of the log is urgent, medium, or common. The log abnormality detection model is first trained offline from an initial log abnormality detection model, and the trained model is then deployed in the system.
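The weighted averaging performed by the attention layer can be sketched as follows. The application does not specify the scoring function; scoring each hidden vector by its dot product with a query vector, followed by a softmax, is a common choice assumed here purely for illustration.

```python
import math
from typing import List, Tuple


def attention_pool(
    hidden: List[List[float]],   # the second feature vectors (hidden vectors)
    query: List[float],          # assumed learned query vector
) -> Tuple[List[float], List[float]]:
    """Return (attention weights, weighted-average feature vector)."""
    scores = [sum(h * q for h, q in zip(vec, query)) for vec in hidden]
    # Numerically stable softmax over the scores.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(hidden[0])
    pooled = [sum(w * vec[i] for w, vec in zip(weights, hidden)) for i in range(dim)]
    return weights, pooled
```

The weights sum to 1, so the pooled vector is a convex combination of the hidden vectors; vectors with higher scores contribute more to the input of the classification layer.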

When training the initial log abnormality detection model, historical log data from the cloud platform within 24 hours is first collected as training data, covering the nova service, the cinder service, and so on. Operation and maintenance personnel then manually label the data as abnormal or normal, and the labeled training data is used to train the initial log abnormality detection model.

For the labeled log data, the log parsing tool Drain is used to extract templates: log keys and log parameters are parsed with regular expressions based on delimiters such as punctuation marks and spaces, fields with no distinguishing value such as INFO, DEBUG, and numbers are deleted, and the results are merged into a structured log set. This preprocesses the original log data set and generates a training data set and an abnormality detection data set.

Log1 is: DEBUG nova.objects.instance [ req-bd848-ca65hedf2-default ] Lazy-loading 'flavor' on Instance uuid b da7ba-288d-40de-4b916e1a3.

Log2 is: INFO oslo.messaging._drivers.impl_rabbit [ req-bd854i8-8d57-ca62-default ] [52d2d-7806-61c6799d82f] Reconnected to AMQP server on 192.168.1.11:5672 via [ amqp ] client with port 23914.

Template extracted from Log1: nova objects instance <*> Lazy-loading flavor on Instance uuid <*>. Template extracted from Log2: oslo messaging drivers impl rabbit <*> Reconnected to AMQP server on <*> via <*> with port <*>. The extracted content is the templated target log data.
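The effect of templating can be approximated with a few regular expressions. The sketch below is a simplified stand-in and is not the Drain algorithm, which builds a parse tree over log tokens; the masking rules and the `to_template` name are assumptions chosen for illustration, and the sketch keeps punctuation that the templates above drop.

```python
import re

_LEVEL = re.compile(r"^(?:DEBUG|INFO|WARNING|ERROR)\s+")       # undistinctive level field
_BRACKETED = re.compile(r"\[[^\]]*\]")                          # request or connection ids
_IP_PORT = re.compile(r"\d{1,3}(?:\.\d{1,3}){3}(?::\d+)?")      # IPv4 with optional port
_NUMBER = re.compile(r"\b\d+\b")                                # standalone integers
_SPACES = re.compile(r"\s+")


def to_template(line: str) -> str:
    """Replace variable parts of a log line with the wildcard <*>."""
    text = _LEVEL.sub("", line.strip())
    text = _BRACKETED.sub("<*>", text)
    text = _IP_PORT.sub("<*>", text)   # mask addresses before plain numbers
    text = _NUMBER.sub("<*>", text)
    return _SPACES.sub(" ", text).strip()
```

Applying the address rule before the number rule matters: otherwise each octet of an IP address would be masked separately and the address would turn into several wildcards.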

The following describes a log alarm method by taking a log alarm of a data center monitoring component telegraf as an example.

Step one: configure log alarm collection for the telegraf monitoring component. Specifically, when the deployment function is initialized, the log directory of telegraf is configured in the log collection plug-in agent, the server identifier is configured as a log label of the log data to be detected so as to distinguish which server the virtual machine originates from, and the collected log information is pushed to the message queue of the data center.

Step two: the data convergence node of the unified monitoring center extracts the log data to be detected from the log message queues of all data centers, adds a data center label to each piece of log data to distinguish the data center from which it originates, and stores the processed log data in the log storage component.
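The label conversion in step two amounts to merging per-data-center queues while stamping each entry with its origin. A minimal sketch follows, with plain lists standing in for the message queues and a hypothetical `cloud_platform_id` field name.

```python
from typing import Dict, List


def label_with_data_center(queues: Dict[str, List[dict]]) -> List[dict]:
    """Merge log entries from every data center queue.

    `queues` maps a data center identifier to the entries pulled from that
    data center's message queue. Each entry already carries the service
    type and server identifier added at collection time.
    """
    merged = []
    for data_center_id, entries in queues.items():
        for entry in entries:
            # Copy the entry and add the data center (cloud platform) label.
            merged.append({**entry, "cloud_platform_id": data_center_id})
    return merged
```

After this step every piece of log data carries the three fields used for rule lookup: service type, server identifier, and cloud platform identifier.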

Step three: taking the error state as an example of the alarm process, an alarm rule is generated from the log alarm keyword and the alarm type that the user sets for the telegraf service in the front-end system, and the alarm rule is stored in the alarm rule database. When log data to be detected collected by telegraf is received, the alarm rule database is searched for a matching alarm rule according to the service type, the server identifier, and the cloud platform identifier (data center identifier).

Step four: if the matched alarm rule does not exist, calling a log abnormality detection model, carrying out templated processing on log data to be detected, generating target log data which accords with the log abnormality detection model, inputting the target log data into the log abnormality detection model which is trained and deployed, and carrying out on-line recognition of abnormal conditions to be used as supplement of log alarm based on rules.

Step five: assume that a matching alarm rule exists and that the matching alarm keyword error appears in the log data to be detected. Because the alarm level in the corresponding alarm rule is urgent, an alarm notification is sent immediately. At the same time, the service survival detection mechanism is triggered: the service running state is changed to potentially stopped in the alarm state database, and the running state of the target service kafka corresponding to the alarm is detected every 5 minutes, with the number of detections recorded. If the detection result is normal operation, the detection interval is increased by 5 minutes; if the detection interval reaches 30 minutes in this way, the alarm is considered not to have affected service operation, detection is cancelled, and the service state is re-marked as normal operation. The next piece of log data is then processed.

Step six: when error appears again in the log data to be detected, check whether the number of times it has been received within five minutes exceeds five; if it does not, no blocking type fault alarm is sent.

Step seven: if the kafka service is not recovered, after the error log is received for the sixth time, the fault is judged to be a blocking type fault, and a blocking type error is immediately reported to inform operation and maintenance personnel to check immediately. If the kafka service is restored, the existence of the alarm keyword error in the log data to be detected is no longer detected, and the alarm state is set to be restored.

Compared with the log abnormality alarming method in the prior art, the method and the device can acquire the platform log data from different cloud platforms, uniformly process the log data of different cloud platforms, convert the log data into log data to be detected, detect log abnormality and alarm the log abnormality from multiple dimensions by using the alarm rule and the log abnormality detection model, and solve the problems of low processing efficiency and low accuracy of the log alarm.

Based on the same inventive concept, the embodiment of the present application further provides a log abnormality alarm device corresponding to the log abnormality alarm method, and since the principle of solving the problem by the device in the embodiment of the present application is similar to that of the log abnormality alarm method in the embodiment of the present application, the implementation of the device may refer to the implementation of the method, and the repetition is not repeated.

Referring to fig. 3, fig. 3 is a schematic structural diagram of a log abnormality alarm device according to an embodiment of the present application. As shown in fig. 3, the log abnormality warning apparatus 200 includes:

the data acquisition module 201 is configured to acquire log data to be detected from different cloud platforms, where the log data to be detected includes a service type, a server identifier, and a cloud platform identifier;

the rule matching module 202 is configured to determine whether an alarm rule matching the service type, the server identifier, and the cloud platform identifier exists in the alarm rule database, where the alarm rule includes an alarm keyword;

the keyword matching module 203 is configured to determine whether a data item matching the alert keyword exists in the log data to be detected if there is a matching alert rule;

the rule alarm module 204 is configured to perform log exception alarm according to the log alarm record corresponding to the alarm rule in the alarm state database if there is a matched data item;

and the model alarm module 205 is configured to perform log abnormality detection on log data to be detected by using a log abnormality detection model if no matched data item exists, and perform log abnormality alarm when log abnormality is detected.

It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.

In the several embodiments provided in this application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be other manners of division in actual implementation, and for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be through some communication interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a non-volatile computer readable storage medium executable by a processor. Based on such understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

Finally, it should be noted that the foregoing examples are merely specific embodiments of the present application, intended to illustrate rather than limit its technical solutions, and the protection scope of the present application is not limited thereto. Although the present application has been described in detail with reference to the foregoing examples, those skilled in the art will appreciate that any person skilled in the art may, within the technical scope disclosed in the present application, modify or readily conceive of changes to the technical solutions described in the foregoing embodiments, or make equivalent substitutions for some of the technical features; such modifications, changes, or substitutions do not depart from the spirit and scope of the technical solutions of the embodiments of the present application, and are intended to be included in the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A log anomaly alarm method, comprising:

determining whether an alarm rule matched with the service type, the server identifier and the cloud platform identifier exists in an alarm rule database or not, wherein the alarm rule comprises an alarm keyword;

if a matched alarm rule exists, determining whether a data item matched with the alarm keyword exists in the log data to be detected;

if the matched data items exist, carrying out log abnormal alarm according to log alarm records corresponding to the alarm rules in the alarm state database;

if no matched data item exists, carrying out log abnormality detection on the log data to be detected by using a log abnormality detection model, and carrying out log abnormality alarm when detecting that the log is abnormal;

the step of carrying out log abnormal alarm according to the log alarm record corresponding to the alarm rule in the alarm state database comprises the following steps:

inquiring whether a log alarm record matched with the alarm rule exists in the alarm state database;

if the matched log alarm records exist, adding one to the alarm times in the log alarm records, recording the latest alarm time, and carrying out alarm processing according to the alarm grade in the log alarm records;

if no matched log alarm record exists, generating a new log alarm record in the alarm state database, generating log alarm information corresponding to the log alarm record, and sending the log alarm information to an operation and maintenance personnel.

2. The method of claim 1, further comprising, after said determining whether an alert rule exists in the alert rule database that matches the service type, the server identification, and the cloud platform identification:

and if the matched alarm rule does not exist, carrying out log abnormality detection on the log data to be detected by using a log abnormality detection model, and carrying out log abnormality alarm when detecting that the log is abnormal.

3. The method of claim 1, wherein the performing alarm processing according to the alarm level in the log alarm record comprises:

determining whether the alert level is urgent;

if the alarm is urgent, acquiring alarm times, comparing the alarm times in a preset time range with a set time threshold, and carrying out alarm processing according to a comparison result;

if not, determining the interval time between the last alarm and the current alarm in the log alarm record, and carrying out alarm processing according to the interval time.

4. A method according to claim 3, wherein said alarm processing based on the comparison result comprises:

if the number of alarms within the preset time range is greater than or equal to a set number threshold, determining that the alarm is a blocking type fault, and sending a blocking type fault alarm;

if the number of alarms within the preset time range is less than the set number threshold, determining that the alarm is a transient fault.

5. The method according to claim 4, wherein the method further comprises:

if the log abnormality is determined to be a transient fault, changing the service running state of the target service with the log abnormality into a potential stop in the alarm state database;

circularly detecting the service running state of the target service according to the first interval duration;

and if the service running state of the target service is stopped, or the service running state is normal but the target service is detected as stopped during cyclic detection at a second interval duration within a preset time range, determining that the target service has a downtime fault.

6. The method of claim 1, wherein the log anomaly detection model comprises a word embedding layer, a convolution layer, a max pooling layer, a stitching layer, a bi-directional recurrent neural network layer, an attention layer, a classification layer, the convolution layer comprising a plurality of convolution kernels.

7. The method of claim 1, wherein the obtaining log data to be detected from different cloud platforms comprises:

for each cloud platform, collecting platform log data corresponding to different services in the cloud platform, and adding a log label to the platform log data, wherein the log label comprises a service type and a server identifier;

the platform log data added with the log label is put into a message queue of the cloud platform;

extracting platform log data from the message queues of different cloud platforms, and performing label conversion processing on the platform log data to obtain log data to be detected carrying cloud platform identification.

8. The method of claim 7, wherein the performing label conversion processing on the platform log data to obtain log data to be detected carrying a cloud platform identifier comprises:

and adding a cloud platform identifier to the platform log data to obtain log data to be detected.

9. A log abnormality warning apparatus, comprising:

the rule matching module is used for determining whether an alarm rule matched with the service type, the server identifier and the cloud platform identifier exists in an alarm rule database or not, and the alarm rule comprises an alarm keyword;

the rule alarm module is used for carrying out log abnormal alarm according to log alarm records corresponding to the alarm rule in the alarm state database if matched data items exist;

the model alarm module is used for carrying out log abnormality detection on the log data to be detected by using a log abnormality detection model if no matched data item exists, and carrying out log abnormality alarm when log abnormality is detected;

the rule alarm module is specifically configured to: