CN116820884A

CN116820884A - Method and device for monitoring abnormal state of IT system by utilizing intelligent operation and maintenance

Info

Publication number: CN116820884A
Application number: CN202310785759.4A
Authority: CN
Inventors: 杨华为
Original assignee: Shandong Tiandi Yitong Technology Co ltd
Current assignee: Shandong Tiandi Yitong Technology Co ltd
Priority date: 2023-06-30
Filing date: 2023-06-30
Publication date: 2023-09-29

Abstract

The invention relates to a method for monitoring the state of an IT system by means of intelligent operation and maintenance, comprising the following steps: first, performing anomaly detection on KPIs with a deep learning architecture, which involves preprocessing the data, balancing the data set, training the model, and judging whether the input KPI data is abnormal; and second, performing log anomaly detection with a deep learning LSTM network. The invention uses a deep learning framework to solve the problem of KPI anomaly detection in data center operation and maintenance, uses the VAE algorithm to address data set imbalance, and combines a CNN network with a GRU network for model training, so that data features are learned automatically for anomaly detection. Compared with the prior art, it has better adaptability and generality and can adapt to the detection of diverse, newly added KPI indicators.

Description

Method and device for monitoring abnormal state of IT system by utilizing intelligent operation and maintenance

Technical Field

The invention belongs to the technical field of artificial intelligence, and particularly relates to a method and a device for monitoring an abnormal state of an IT system by utilizing intelligent operation and maintenance.

Background

Intelligent operation and maintenance, or AIOps (Artificial Intelligence for IT Operations), applies artificial intelligence to the operation and maintenance field: based on existing operation and maintenance data (logs, monitoring information, application information, and the like), it uses machine learning to solve problems that automated operation and maintenance cannot solve.

Traditional monitoring and detection methods require thresholds to be set manually, based on experience, for each KPI indicator; they suffer from high maintenance cost, poor adaptability to different scenarios, and low detection accuracy. In many complex business scenarios there are hundreds or even thousands of KPIs, which a manual approach cannot handle. Threshold-free intelligent monitoring implemented with AI algorithms can monitor indicators such as system CPU, memory, network, and system service KPIs without manually configured thresholds, and adapts automatically to system changes.

Logging is an important means of monitoring the state of IT systems; much of a system's monitoring, alerting, and problem analysis relies on logs. In daily operation and maintenance, log analysis places high demands on operation and maintenance personnel, who need to be highly familiar with the business data and the system architecture, and the sheer volume of machine logs makes troubleshooting and log tracing difficult and laborious. By learning from the log data of historical faults with machine learning methods, the learned model can raise fault alarms automatically and analyze fault causes, which improves the efficiency of alerting and fault handling and frees up manpower; it can also reveal regularities in the faults and thereby support fault prediction.

Early operation and maintenance work was mostly done by hand by operation and maintenance personnel, which is called manual operation and maintenance. This outdated way of working became hard to sustain in an era of rapidly expanding internet business and high labor costs. Automated operation and maintenance then emerged: common, repetitive operation and maintenance tasks are executed by automatically triggered scripts with predefined rules, which reduces labor cost and improves operation and maintenance efficiency. Automated operation and maintenance can be regarded as an expert system based on industry domain knowledge and operation and maintenance scenario knowledge. As informatization develops, data center business is expanding rapidly and service types are becoming complex and diverse, so expert systems based on manually specified rules are increasingly inadequate. The shortcomings of automated operation and maintenance are increasingly apparent: it can only act according to rules preset from human experience, any business change requires manual intervention to adjust those rules, and it cannot adapt by itself.

Disclosure of Invention

The invention aims to solve the problems of low efficiency and poor adaptability in traditional methods, which rely on manual judgment of logs, and accordingly provides a method and a device for monitoring the state of an IT system by means of intelligent operation and maintenance.

The invention discloses a method for monitoring an IT system state by utilizing intelligent operation and maintenance, which comprises the following steps:

first, performing anomaly detection on KPIs with a deep learning architecture, including data preprocessing, data balancing, and model training, and judging whether the input KPI data is abnormal;

and second, performing log anomaly detection with a deep learning LSTM network.

Further, in step one, the data preprocessing cuts the data points into sliding windows before prediction, converting the time series data into a supervised learning problem; the expected output for each sliding window is the time step immediately after the window ends.

Further, in step one, anomalous data is generated with a VAE algorithm so that the data set is balanced.

Further, in step one, the data training method uses a deep learning architecture that combines a CNN with an LSTM: a standard CNN network at the front end extracts the important data features, and the LSTM at the back end extracts the features of the temporal pattern; a GRU network is selected for training.

Further, in step one, judging whether the input KPI data is abnormal is handled as a classification problem: n output nodes are set, where n is the number of categories; for each sample, the neural network produces an n-dimensional array as its output, and each output node corresponds to one category. Cross entropy is used as the loss function, and the Softmax algorithm converts the output of the neural network into a probability distribution.

Further, in step two, the logs generated in the normal running state are first processed, and a log keyword sequence and log variables are extracted from them; these are used to train a log keyword anomaly detection model and a log variable anomaly detection model, respectively. The log keyword sequence is also used to train a workflow model for the subsequent understanding and diagnosis of anomalies.

For log template anomaly detection, an LSTM is used for classification; the number of categories is the number of templates, and cross entropy is the loss function. The keyword sequence of the log is extracted. Because logs are output in a certain order and record the state of the system step by step, the problem of log keyword anomaly detection is converted into a multi-class classification problem. Supposing the log has n keywords in total, the input of the model is the keyword sequence within a time window, and the output is a vector giving the probability of each keyword occurring next after that sequence. If the keyword corresponding to a new log entry is not among the keywords with a high probability of occurring next, it is regarded as abnormal.
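By way of illustration only, the keyword decision rule described above can be sketched as follows. The sketch assumes that a trained sequence model has already produced the probability vector for the next keyword; the cutoff `top_g`, which fixes how many of the most probable keywords count as normal, is a hypothetical parameter not specified in this disclosure.

```python
from typing import List, Sequence


def keyword_windows(key_ids: Sequence[int], window: int) -> List[tuple]:
    """Pair each window of keyword ids with the keyword that follows it."""
    return [(list(key_ids[i:i + window]), key_ids[i + window])
            for i in range(len(key_ids) - window)]


def is_keyword_anomalous(next_probs: Sequence[float],
                         observed_key: int,
                         top_g: int = 2) -> bool:
    """Flag the observed keyword if it is not among the top_g most probable.

    next_probs[k] is the model's probability that keyword k occurs next.
    top_g is an assumed cutoff chosen for illustration.
    """
    ranked = sorted(range(len(next_probs)),
                    key=lambda k: next_probs[k], reverse=True)
    return observed_key not in ranked[:top_g]


# Example with n = 3 keywords: keyword 0 is unlikely to occur next.
probs = [0.1, 0.6, 0.3]
print(is_keyword_anomalous(probs, observed_key=0))  # True
print(is_keyword_anomalous(probs, observed_key=2))  # False
```

A larger `top_g` tolerates more variation in normal execution paths at the cost of missing some anomalies; the value would have to be tuned on validation logs.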

The invention also relates to a device for monitoring the state of an IT system by means of intelligent operation and maintenance, which comprises a computing module for the above method for monitoring the state of an IT system by means of intelligent operation and maintenance.

Advantageous effects

The invention uses a deep learning framework to solve the problem of KPI anomaly detection in data center operation and maintenance, uses the VAE algorithm to address data set imbalance, and combines a CNN network with a GRU network for model training, so that data features are learned automatically for anomaly detection. Compared with the prior art, it has better adaptability and generality and can adapt to the detection of diverse, newly added KPI indicators.

The invention uses a GRU network to analyze and detect logs, dividing log detection into two categories: keyword detection and variable parameter detection. Log keyword detection is treated as a multi-class classification problem, and anomalies are judged according to the occurrence probability of the next keyword. Variable parameter anomalies are judged by whether the error falls within a high-confidence region of a Gaussian distribution. This solves the problem of low efficiency in the traditional approach, which relies on manual judgment of logs.
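As a non-limiting sketch of the variable parameter check, the prediction errors observed on normal validation data can be modeled as Gaussian, and a new error can be flagged when it lies outside a high-confidence interval. The multiplier `z = 2.58` (roughly a 99% two-sided interval) and the function names are assumptions made for illustration.

```python
import statistics
from typing import Sequence, Tuple


def fit_error_model(errors: Sequence[float]) -> Tuple[float, float]:
    """Estimate mean and standard deviation of errors seen on normal data."""
    return statistics.fmean(errors), statistics.stdev(errors)


def is_variable_anomalous(error: float, mean: float, std: float,
                          z: float = 2.58) -> bool:
    """True if the error falls outside mean +/- z * std.

    z = 2.58 is an assumed confidence multiplier, not a value from the text.
    """
    return abs(error - mean) > z * std


# Errors between predicted and observed parameter values on normal logs.
mean, std = fit_error_model([-1.0, 0.0, 1.0])
print(is_variable_anomalous(3.0, mean, std))  # True: outside the interval
print(is_variable_anomalous(2.0, mean, std))  # False: inside the interval
```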

Drawings

FIG. 1 is a flowchart of KPI anomaly detection according to the present invention.

FIG. 2 is a data training flow chart for KPI anomaly detection.

Detailed Description

The present embodiment is described in detail with reference to FIGS. 1 and 2.

As shown in FIG. 1, the method of the present invention for monitoring the state of an IT system by means of intelligent operation and maintenance includes the following steps:

First, anomaly detection is performed on KPIs with a deep learning architecture. The data preprocessing cuts the data points into sliding windows before prediction, converting the time series data into a supervised learning problem; the expected output for each sliding window is the time step immediately after the window ends.
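The sliding-window preprocessing can be sketched as below. This is a minimal illustration, assuming a univariate KPI series and a hypothetical window length; the target paired with each window is the value at the next time step, and an anomaly label for that step could be paired in the same way.

```python
from typing import List, Sequence, Tuple


def make_windows(series: Sequence[float],
                 window: int) -> Tuple[List[List[float]], List[float]]:
    """Turn a time series into (window, next-step target) training pairs."""
    if window <= 0:
        raise ValueError("window must be positive")
    inputs: List[List[float]] = []
    targets: List[float] = []
    for start in range(len(series) - window):
        inputs.append(list(series[start:start + window]))
        targets.append(series[start + window])  # step right after the window
    return inputs, targets


kpi = [0.2, 0.3, 0.5, 0.4, 0.9]
X, y = make_windows(kpi, window=3)
# X == [[0.2, 0.3, 0.5], [0.3, 0.5, 0.4]] and y == [0.4, 0.9]
```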

Anomalous data is generated with a VAE algorithm so that the data set is balanced.
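The balancing step might be organized as in the following sketch, which assumes that a VAE has already been trained on the anomalous windows. The `encode` and `decode` callables are hypothetical stand-ins for that trained model; only the reparameterized sampling and the balancing loop are shown.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def reparameterize(mu: Sequence[float], log_var: Sequence[float],
                   rng: random.Random) -> Vector:
    """Sample z = mu + sigma * eps with eps ~ N(0, 1), sigma = exp(log_var / 2)."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]


def synthesize_anomalies(n_normal: int,
                         anomalies: List[Vector],
                         encode: Callable[[Vector], Tuple[Vector, Vector]],
                         decode: Callable[[Vector], Vector],
                         rng: random.Random) -> List[Vector]:
    """Generate synthetic anomalous samples until both classes are equal in size.

    encode returns (mu, log_var) for a sample; decode maps a latent vector
    back to data space. Both are assumed to come from a trained VAE.
    """
    synthetic: List[Vector] = []
    while len(anomalies) + len(synthetic) < n_normal:
        seed = rng.choice(anomalies)       # pick a real anomaly to perturb
        mu, log_var = encode(seed)
        synthetic.append(decode(reparameterize(mu, log_var, rng)))
    return synthetic
```

In use, the returned samples would be appended to the real anomalous windows before training the classifier.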

The data training method uses a deep learning architecture that combines a CNN with an LSTM: a standard CNN network at the front end extracts the important data features, and the LSTM at the back end extracts the features of the temporal pattern; a GRU network is selected for training.

Judging whether the input KPI data is abnormal is handled as a classification problem: n output nodes are set, where n is the number of classes; for each sample, the neural network produces an n-dimensional array as its output, and each output node corresponds to one class. Cross entropy is used as the loss function, and the Softmax algorithm converts the output of the neural network into a probability distribution.
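The Softmax and cross-entropy computations referred to above can be written out as follows. This is a minimal sketch for a single sample; the two-class ordering (index 0 normal, index 1 abnormal) is an assumption made for the example.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert raw output-node values into a probability distribution."""
    peak = max(logits)                      # subtract max for numerical stability
    exps = [math.exp(v - peak) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: Sequence[float], label: int) -> float:
    """Cross-entropy loss against a one-hot target at index `label`."""
    return -math.log(probs[label])


logits = [2.0, 0.5]                         # n = 2 output nodes
probs = softmax(logits)
predicted = max(range(len(probs)), key=lambda k: probs[k])  # index 0 here
loss = cross_entropy(probs, label=0)
```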

Second, log anomaly detection is performed with a deep learning LSTM network.

The logs generated in the normal running state are first processed, and a log keyword sequence and log variables are extracted from them; these are used to train a log keyword anomaly detection model and a log variable anomaly detection model, respectively. The log keyword sequence is also used to train a workflow model for the subsequent understanding and diagnosis of anomalies.
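One simple way to separate log keywords from log variables is sketched below. It assumes, purely for illustration, that every numeric token in a log line is a variable and that the remaining text is the keyword (template); real log parsers use more elaborate rules, and the sample log lines are invented.

```python
import re
from typing import Dict, List, Sequence, Tuple

# Assumption: standalone integers and decimals are the variable parts.
_VARIABLE = re.compile(r"\b\d+(?:\.\d+)?\b")


def parse_line(line: str) -> Tuple[str, List[float]]:
    """Split a log line into its template and its numeric variables."""
    variables = [float(v) for v in _VARIABLE.findall(line)]
    template = _VARIABLE.sub("<*>", line)
    return template, variables


def to_key_sequence(lines: Sequence[str]
                    ) -> Tuple[List[int], List[List[float]], Dict[str, int]]:
    """Map each line to a keyword id and collect its variables."""
    key_ids: Dict[str, int] = {}
    sequence: List[int] = []
    variables: List[List[float]] = []
    for line in lines:
        template, values = parse_line(line)
        # Assign the next free id the first time a template is seen.
        sequence.append(key_ids.setdefault(template, len(key_ids)))
        variables.append(values)
    return sequence, variables, key_ids


logs = ["Took 12 ms to send block 7",
        "Took 15 ms to send block 9",
        "Deleting block 7"]
sequence, variables, key_ids = to_key_sequence(logs)
# sequence == [0, 0, 1]; variables == [[12.0, 7.0], [15.0, 9.0], [7.0]]
```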

As shown in FIG. 2, the input data set is first passed through a CNN neural network, in which "conv" denotes the multiple convolutional layers and "Max-pooling" denotes the max pooling layer. A GRU neural network then performs N-way classification on the output. A neural network handles a multi-class problem by setting N output nodes, where N is the number of classes; for each sample it produces an N-dimensional array as the output, with each output node corresponding to one class. Finally, cross entropy is used as the loss function. Cross entropy characterizes the distance between two probability distributions; because the raw output of the neural network is not a probability distribution, the Softmax algorithm is introduced to convert it into one.
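To make the data flow of FIG. 2 concrete, the following sketch runs one forward pass through a convolution, a max pooling layer, a GRU, and N output nodes. It is an untrained illustration only: the weights are random, biases are omitted, a single convolutional layer stands in for the multiple layers, and all sizes and names are assumptions. A practical implementation would use a deep learning framework with trained parameters.

```python
import math
import random
from typing import Dict, List

Vector = List[float]
Matrix = List[List[float]]


def conv1d_relu(x: Vector, kernel: Vector) -> Vector:
    """Valid 1-D convolution (cross-correlation form) followed by ReLU."""
    k = len(kernel)
    return [max(0.0, sum(x[i + j] * kernel[j] for j in range(k)))
            for i in range(len(x) - k + 1)]


def max_pool(x: Vector, size: int) -> Vector:
    """Non-overlapping max pooling; a trailing partial block is dropped."""
    return [max(x[i:i + size]) for i in range(0, len(x) - size + 1, size)]


def affine(weights: Matrix, v: Vector) -> Vector:
    """Matrix-vector product (biases omitted to keep the sketch short)."""
    return [sum(w * a for w, a in zip(row, v)) for row in weights]


def sigmoid(v: float) -> float:
    return 1.0 / (1.0 + math.exp(-v))


def gru_step(x_t: Vector, h: Vector, p: Dict[str, Matrix]) -> Vector:
    """One GRU update with update gate z, reset gate r and candidate state."""
    z = [sigmoid(a) for a in affine(p["Wz"], x_t + h)]
    r = [sigmoid(a) for a in affine(p["Wr"], x_t + h)]
    gated = [ri * hi for ri, hi in zip(r, h)]
    cand = [math.tanh(a) for a in affine(p["Wh"], x_t + gated)]
    # Convention used here: z = 1 replaces the old state with the candidate.
    return [(1.0 - zi) * hi + zi * ci for zi, hi, ci in zip(z, h, cand)]


def forward(window: Vector, kernels: Matrix,
            gru: Dict[str, Matrix], w_out: Matrix) -> Vector:
    """CNN feature maps -> max pooling -> GRU over time -> N logits."""
    maps = [max_pool(conv1d_relu(window, k), 2) for k in kernels]
    h = [0.0] * len(w_out[0])
    for t in range(len(maps[0])):
        h = gru_step([m[t] for m in maps], h, gru)
    return affine(w_out, h)


rng = random.Random(0)


def rand_matrix(rows: int, cols: int) -> Matrix:
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


kernels = [[0.5, 0.0, -0.5], [0.25, 0.5, 0.25]]   # two assumed filters
hidden, n_classes = 3, 2
gru = {name: rand_matrix(hidden, len(kernels) + hidden)
       for name in ("Wz", "Wr", "Wh")}
w_out = rand_matrix(n_classes, hidden)
logits = forward([0.1 * t for t in range(10)], kernels, gru, w_out)
```

The resulting `logits` list has one entry per class and would be passed through Softmax before the cross-entropy loss is computed.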

Effect verification

For KPI anomaly detection, the neural network model was trained and validated on the anomaly detection data set published by Yahoo. For log anomaly detection, logs collected from a real Amazon cloud data center environment were used, in which normal and erroneous log entries are labeled. The detection results were measured by accuracy, precision, recall, and F-measure, and a side-by-side comparison with traditional algorithms of the same type showed better results.
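For reference, the four evaluation measures can be computed from labeled predictions as sketched below. The sketch assumes binary labels with 1 for abnormal and 0 for normal, and takes the F-measure to be the balanced F1 score; the sample labels are invented and are not results of the invention.

```python
from typing import Dict, Sequence


def detection_metrics(y_true: Sequence[int],
                      y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F-measure with 1 = abnormal."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f_measure = (2 * precision * recall / (precision + recall)
                 if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


scores = detection_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
```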

The foregoing is merely illustrative of the present invention and is not intended to limit its embodiments; those skilled in the art can readily make corresponding variations or modifications in accordance with the main concept and spirit of the invention, and the scope of protection shall therefore be defined by the claims.

Claims

1. The method for monitoring the abnormal state of the IT system by utilizing the intelligent operation and maintenance is characterized by comprising the following steps of:

2. The method of claim 1, wherein in step one, the data preprocessing cuts the data points into sliding windows before prediction, converting the time series data into a supervised learning problem, and the expected output of each sliding window is the time step after the window ends.

3. The method of claim 1, wherein in step one, anomalous data is generated with a VAE algorithm so that the data set is balanced.

4. The method for monitoring the state of an IT system by using intelligent operation and maintenance according to claim 1, wherein in step one, the data training method uses a deep learning architecture that combines a CNN with an LSTM; a standard CNN network at the front end extracts the important data features, and the LSTM at the back end extracts the features of the temporal pattern; and a GRU network is selected for training.

5. The method for monitoring the state of an IT system by using intelligent operation and maintenance according to claim 1, wherein in step one, it is judged by classification whether the input KPI data is abnormal: n output nodes are set, where n is the number of classes; for each sample, the neural network produces an n-dimensional array as the output result, and each output node corresponds to one class; cross entropy is used as the loss function, and the output of the neural network is converted into a probability distribution by the Softmax algorithm.

6. The method for monitoring the state of an IT system by using intelligent operation and maintenance according to claim 1, wherein in step two, logs generated in the normal operation state are first processed, and a log keyword sequence and log variables are extracted therefrom and are respectively used for training a log keyword anomaly detection model and a log variable anomaly detection model, wherein the log keyword sequence is also used for training a workflow model for the subsequent understanding and diagnosis of anomalies;

for log template anomaly detection, an LSTM is used for classification, the number of categories is the number of templates, and cross entropy is the loss function; the keyword sequence of the log is extracted; because logs are output in sequence and record the state of the system step by step, the problem of log keyword anomaly detection is converted into a multi-class classification problem; supposing that the log has n keywords in total, the input of the model is the keyword sequence within a time window, and the output is a vector giving the probability of each keyword occurring next after that keyword sequence; and if the keyword corresponding to a new log entry is not among the keywords with a high probability of occurring next, it is regarded as abnormal.

7. An apparatus for monitoring an IT system status using intelligent operation and maintenance, the apparatus comprising a computing module of the method for monitoring an IT system status using intelligent operation and maintenance according to any one of claims 1 to 6.