CN116820884A - Method and device for monitoring abnormal state of IT system by utilizing intelligent operation and maintenance

Info

Publication number: CN116820884A
Application number: CN202310785759.4A
Authority: CN (China)
Legal status: Pending
Other languages: Chinese (zh)
Inventor: 杨华为
Current assignee: Shandong Tiandi Yitong Technology Co ltd
Original assignee: Shandong Tiandi Yitong Technology Co ltd
Prior art keywords: data, log, keyword, maintenance, anomaly detection

Classifications

    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3072 Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting
    • G06F18/241 Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
    • G06N3/0442 Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G06N3/0463 Neocognitrons
    • G06N3/08 Learning methods

Abstract

The invention relates to a method for monitoring the state of an IT system using intelligent operation and maintenance, comprising the following steps: first, performing anomaly detection on KPIs with a deep learning architecture, which involves preprocessing the data, balancing the data, training on the data, and judging whether the input KPI data is abnormal; second, performing log anomaly detection with a deep learning LSTM network. The invention uses a deep learning framework to solve the problem of KPI anomaly detection in data center operation and maintenance, uses the VAE algorithm to address the imbalance of the data set, and combines a CNN with a GRU network for model training, so that data features are learned automatically to detect anomalies. Compared with the prior art, it offers better adaptability and generality and can accommodate the detection of varied and newly added KPI indicators.

Description

Method and device for monitoring abnormal state of IT system by utilizing intelligent operation and maintenance
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a method and a device for monitoring an abnormal state of an IT system by utilizing intelligent operation and maintenance.
Background
Intelligent operation and maintenance, or AIOps (Artificial Intelligence for IT Operations), applies artificial intelligence to the operation and maintenance field: building on existing operation and maintenance data (logs, monitoring information, application information, and so on), it uses machine learning to solve problems that automated operation and maintenance cannot.
Conventional monitoring and detection methods require thresholds to be set manually, based on experience, for each KPI indicator, which results in high maintenance cost, poor adaptability across scenarios, and low detection accuracy. In many complex business scenarios with hundreds or even thousands of KPIs, a manual approach is entirely inadequate. An AI algorithm enables threshold-free intelligent monitoring of indicators such as system CPU, memory, network, and system service KPIs: no thresholds need to be configured manually, and the monitoring adapts automatically to system changes.
Logging is an important means of monitoring the state of IT systems, and much of a system's monitoring, alerting, and problem analysis relies on logs. In daily operation and maintenance, log analysis places high demands on operation and maintenance personnel, who must be highly familiar with the business data and the system architecture, and the sheer volume of machine logs makes troubleshooting and log tracing difficult and laborious. With a machine learning method, a model learns from the log data of historical faults and is then used to raise fault alarms automatically and to analyze fault causes. This improves the efficiency of alerting and fault handling and frees up manpower; it also makes it possible to identify patterns in the faults and thus to predict them.
Early operation and maintenance work was mostly performed manually by operation and maintenance personnel, which is called manual operation and maintenance. This outdated mode of working became difficult to sustain in an era of rapidly expanding internet business and high labor costs. Automated operation and maintenance then emerged: common, repetitive tasks are executed by automatically triggered scripts with predefined rules, which reduces labor costs and improves efficiency. Automated operation and maintenance can be regarded as an expert system based on industry domain knowledge and operation and maintenance scenario knowledge. As informatization advances, data center business expands rapidly, and service types become complex and diverse, expert systems based on manually specified rules are increasingly inadequate. The shortcomings of automated operation and maintenance are increasingly apparent: it can only act on rules preset from manual experience, service changes require manual intervention to adjust those rules, and it cannot adapt on its own.
Disclosure of Invention
The invention aims to solve the problems of traditional methods, which rely on manual inspection of logs and therefore suffer from low efficiency and poor adaptability, and accordingly provides a method and a device for monitoring the state of an IT system using intelligent operation and maintenance.
The invention discloses a method for monitoring the state of an IT system using intelligent operation and maintenance, comprising the following steps:
Step one: perform anomaly detection on KPIs using a deep learning architecture, which involves preprocessing the data, balancing the data, training on the data, and judging whether the input KPI data is abnormal;
Step two: perform log anomaly detection using a deep learning LSTM network.
Further, in step one, the data preprocessing segments the data points with sliding windows before prediction, converting the time series data into a supervised learning problem; the expected output of each sliding window is the time step after the window ends.
Further, in step one, anomalous data is generated using the VAE algorithm so that the data is balanced.
Further, in step one, the data training method uses a deep learning architecture that combines a CNN with an LSTM: a standard CNN at the front end extracts the important data features, and an LSTM at the back end extracts the temporal patterns; the GRU network is selected for training.
Further, in step one, judging whether the input KPI data is abnormal is treated as a classification problem: n output nodes are set, where n is the number of classes; for each sample the neural network produces an n-dimensional array as its output, and each output node corresponds to one class. Cross entropy is used as the loss function, and the Softmax algorithm converts the output of the neural network into a probability distribution.
Further, in step two, logs generated in the normal running state are processed first, and a log keyword sequence and log variables are extracted from them; these are used to train a log keyword anomaly detection model and a log variable anomaly detection model, respectively. The log keyword sequence is also used to train a workflow model for subsequent understanding and diagnosis of anomalies;
for log template anomaly detection, an LSTM is used for classification, where the number of classes is the number of templates and cross entropy is the loss function. The keyword sequence of the log is extracted; because logs are output in a certain order and record the state of the system step by step, log keyword anomaly detection is converted into a multi-class classification problem. Suppose the logs contain n keywords in total: the model input is the keyword sequence within a time window, and the output is a vector giving the probability of each keyword occurring next after that sequence. If the keyword corresponding to a new log entry is not among the keywords with a high probability of occurring next, it is regarded as abnormal.
The invention also relates to a device for monitoring the state of an IT system using intelligent operation and maintenance, which comprises a computing module for the above method for monitoring the state of an IT system using intelligent operation and maintenance.
Advantageous effects
The invention uses a deep learning framework to solve the problem of KPI anomaly detection in data center operation and maintenance, uses the VAE algorithm to address the imbalance of the data set, and combines a CNN with a GRU network for model training, so that data features are learned automatically to detect anomalies. Compared with the prior art, it offers better adaptability and generality and can accommodate the detection of varied and newly added KPI indicators.
The invention uses a GRU network to analyze and detect logs and divides log detection into two categories: keyword detection and variable parameter detection. Log keyword detection is treated as a multi-class classification problem, and anomalies are judged from the probability of occurrence of the next keyword. Variable parameter anomalies are judged by whether the error lies within a high-confidence region of a Gaussian distribution. This solves the low efficiency of the traditional reliance on manual inspection of logs.
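The Gaussian check on variable parameters can be illustrated with a short sketch. It assumes that the "error" is the difference between a model's predicted parameter value and the observed one, that errors collected on normal logs are roughly Gaussian, and that a 99% two-sided interval (z = 2.576) counts as the high-confidence region. The function names, the sample errors, and the choice of z are illustrative assumptions, not details given in the patent.

```python
import statistics
from typing import Sequence, Tuple


def fit_error_model(errors: Sequence[float]) -> Tuple[float, float]:
    """Estimate the mean and sample standard deviation of errors seen on normal logs."""
    return statistics.fmean(errors), statistics.stdev(errors)


def is_variable_anomalous(error: float, mean: float, std: float, z: float = 2.576) -> bool:
    """Flag an error that falls outside mean +/- z * std (z = 2.576 is about 99% two-sided)."""
    return abs(error - mean) > z * std


# Hypothetical prediction errors measured on logs from the normal running state.
normal_errors = [0.1, -0.2, 0.05, 0.0, -0.1, 0.15, -0.05, 0.05]
mean, std = fit_error_model(normal_errors)

small_error_flag = is_variable_anomalous(0.2, mean, std)   # inside the interval
large_error_flag = is_variable_anomalous(1.5, mean, std)   # far outside the interval
```

A wider interval (larger z) reduces false alarms at the cost of missing milder deviations; the patent does not state which confidence level it uses.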
Drawings
FIG. 1 is a flowchart of KPI anomaly detection according to the present invention.
FIG. 2 is a data training flow chart for KPI anomaly detection.
Detailed Description
This embodiment is described in detail with reference to FIGS. 1 and 2.
As shown in FIG. 1, the method of the present invention for monitoring the state of an IT system using intelligent operation and maintenance includes the following steps:
Step one: perform anomaly detection on KPIs using a deep learning architecture, which involves preprocessing the data, balancing the data, training on the data, and judging whether the input KPI data is abnormal;
the data preprocessing segments the data points with sliding windows before prediction, converting the time series data into a supervised learning problem; the expected output of each sliding window is the time step after the window ends.
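The sliding-window conversion described above can be sketched as follows. This is a minimal illustration for a univariate KPI series; the function name `make_windows`, the window length, and the sample values are assumptions made for the example and do not come from the patent.

```python
from typing import List, Sequence, Tuple


def make_windows(series: Sequence[float], window: int) -> Tuple[List[List[float]], List[float]]:
    """Slice a series into (window, next value) pairs for supervised learning."""
    if window < 1 or window >= len(series):
        raise ValueError("window must be between 1 and len(series) - 1")
    inputs: List[List[float]] = []
    targets: List[float] = []
    for start in range(len(series) - window):
        # Input: `window` consecutive points; target: the time step right after them.
        inputs.append(list(series[start:start + window]))
        targets.append(series[start + window])
    return inputs, targets


# Hypothetical KPI readings (for example, CPU load sampled at fixed intervals).
kpi = [10.0, 11.0, 12.0, 13.0, 14.0, 15.0]
X, y = make_windows(kpi, window=3)
# X == [[10.0, 11.0, 12.0], [11.0, 12.0, 13.0], [12.0, 13.0, 14.0]]
# y == [13.0, 14.0, 15.0]
```

Each row of `X` is one training input and the matching entry of `y` is its expected output, which is what turns the raw series into a supervised learning problem.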
Anomalous data is generated using the VAE algorithm so that the data is balanced.
The data training method uses a deep learning architecture that combines a CNN with an LSTM: a standard CNN at the front end extracts the important data features, and an LSTM at the back end extracts the temporal patterns; the GRU network is selected for training.
Judging whether the input KPI data is abnormal is treated as a classification problem: n output nodes are set, where n is the number of classes; for each sample the neural network produces an n-dimensional array as its output, and each output node corresponds to one class. Cross entropy is used as the loss function, and the Softmax algorithm converts the output of the neural network into a probability distribution.
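The output stage described above (n output nodes, Softmax, cross-entropy loss) can be sketched in a few lines. The two-class setup and the raw output values below are hypothetical; the patent only states that n equals the number of classes.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Turn raw network outputs into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs: Sequence[float], true_class: int) -> float:
    """Cross-entropy loss against a one-hot label for `true_class`."""
    return -math.log(max(probs[true_class], 1e-12))


# Hypothetical raw outputs of the n = 2 output nodes: class 0 = normal, class 1 = abnormal.
logits = [2.0, 0.5]
probs = softmax(logits)
predicted_class = probs.index(max(probs))
loss = cross_entropy(probs, true_class=0)
```

Because the label is one-hot, the cross entropy reduces to the negative log of the probability assigned to the correct class, so the loss shrinks as the network grows more confident in the right answer.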
Step two: perform log anomaly detection using a deep learning LSTM network;
logs generated in the normal running state are processed first, and a log keyword sequence and log variables are extracted from them; these are used to train a log keyword anomaly detection model and a log variable anomaly detection model, respectively. The log keyword sequence is also used to train a workflow model for subsequent understanding and diagnosis of anomalies;
for log template anomaly detection, an LSTM is used for classification, where the number of classes is the number of templates and cross entropy is the loss function. The keyword sequence of the log is extracted; because logs are output in a certain order and record the state of the system step by step, log keyword anomaly detection is converted into a multi-class classification problem. Suppose the logs contain n keywords in total: the model input is the keyword sequence within a time window, and the output is a vector giving the probability of each keyword occurring next after that sequence. If the keyword corresponding to a new log entry is not among the keywords with a high probability of occurring next, it is regarded as abnormal.
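The decision rule for keyword anomalies can be illustrated without a neural network. In the sketch below a simple frequency table stands in for the LSTM: it estimates which keywords most often follow a given window of keywords in normal logs, and a new keyword is flagged when it is not among the `top_g` most likely candidates. The frequency table, the parameter `top_g`, the function names, and the sample keywords are assumptions made for the example; the patent itself uses an LSTM to produce the next-keyword probabilities.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Sequence, Tuple

History = Tuple[str, ...]


def train_next_key_counts(sequences: Sequence[Sequence[str]], window: int) -> Dict[History, Counter]:
    """Count which keyword follows each window of keywords in normal logs."""
    counts: Dict[History, Counter] = defaultdict(Counter)
    for seq in sequences:
        for i in range(len(seq) - window):
            counts[tuple(seq[i:i + window])][seq[i + window]] += 1
    return counts


def is_keyword_anomalous(counts: Dict[History, Counter], history: Sequence[str],
                         next_key: str, top_g: int) -> bool:
    """Flag `next_key` if it is not among the top_g most likely next keywords."""
    ranked = counts.get(tuple(history), Counter()).most_common(top_g)
    candidates: List[str] = [key for key, _ in ranked]
    return next_key not in candidates


# Hypothetical keyword sequences extracted from logs of normal runs.
normal_runs = [["open", "read", "write", "close"]] * 3 + [["open", "read", "read", "close"]]
model = train_next_key_counts(normal_runs, window=2)

common_next = is_keyword_anomalous(model, ["open", "read"], "write", top_g=1)
rare_next = is_keyword_anomalous(model, ["open", "read"], "read", top_g=1)
unseen_next = is_keyword_anomalous(model, ["open", "read"], "delete", top_g=2)
```

Raising `top_g` makes the detector more tolerant of rare but legitimate execution paths, while lowering it makes the detector stricter.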
As shown in FIG. 2, the input data set is first passed through a CNN, in which "conv" denotes the multiple convolutional layers and "Max-pooling" denotes the max pooling layer. A GRU neural network then performs N-class classification on the result. To solve a multi-class problem, the neural network is given N output nodes, where N is the number of classes; each sample yields an N-dimensional array as its output, that is, each output node corresponds to one class. Finally, cross entropy is used as the loss function. Cross entropy characterizes the distance between two probability distributions; because the raw output of the neural network is not a probability distribution, the Softmax algorithm is introduced to convert it into one.
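The front end of the pipeline in FIG. 2 can be illustrated with a toy forward pass through one convolution and one max pooling step. The kernel weights and the input window below are invented for the example (a trained network would learn its kernels), and the GRU and classification stages are omitted. As in most deep learning frameworks, the "convolution" here is computed as a sliding dot product without flipping the kernel.

```python
from typing import List, Sequence


def conv1d(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Valid 1-D convolution (sliding dot product) of a signal with a kernel."""
    k = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(k))
        for i in range(len(signal) - k + 1)
    ]


def max_pool1d(features: Sequence[float], size: int) -> List[float]:
    """Non-overlapping max pooling with pool width and stride equal to `size`."""
    return [
        max(features[i:i + size])
        for i in range(0, len(features) - size + 1, size)
    ]


# Hypothetical KPI window containing a spike, and a hand-picked difference kernel.
kpi_window = [1.0, 1.0, 1.0, 5.0, 1.0, 1.0, 1.0]
difference_kernel = [-1.0, 1.0]

features = conv1d(kpi_window, difference_kernel)  # [0.0, 0.0, 4.0, -4.0, 0.0, 0.0]
pooled = max_pool1d(features, size=2)             # [0.0, 4.0, 0.0]
```

The pooled sequence keeps the strong response around the spike while halving the length, and a sequence of this kind is what the recurrent stage would then consume.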
Effect verification
For KPI anomaly detection, the neural network model is trained and validated on the anomaly detection data set published by Yahoo. Log anomaly detection uses logs collected from a real Amazon cloud data center environment, in which normal and erroneous logs are labeled. The detection results are measured by accuracy, precision, recall, and F-measure, and a side-by-side comparison with the results of conventional algorithms of the same type shows that the results are better.
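The four evaluation indexes named above can be computed from binary labels as sketched below. The labels are made-up illustrative data, not results from the patent, and the F-measure is assumed to be the balanced F1 score, which the patent does not specify.

```python
from typing import Dict, Sequence


def classification_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall, and F1 for binary labels (1 = abnormal, 0 = normal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical ground-truth labels and detector outputs for eight samples.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0]
metrics = classification_metrics(y_true, y_pred)
```

On imbalanced anomaly data, accuracy alone can look high even when most anomalies are missed, which is why precision, recall, and the F-measure are reported alongside it.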
The foregoing merely illustrates the present invention and does not limit its embodiments. Those skilled in the art can readily make corresponding variations or modifications in accordance with the main concept and spirit of the invention, so the scope of protection of the invention shall be defined by the claims.

Claims (7)

1. A method for monitoring the abnormal state of an IT system using intelligent operation and maintenance, characterized by comprising the following steps:
step one: performing anomaly detection on KPIs using a deep learning architecture, which involves preprocessing the data, balancing the data, training on the data, and judging whether the input KPI data is abnormal;
step two: performing log anomaly detection using a deep learning LSTM network.
2. The method of claim 1, wherein in step one, the data preprocessing segments the data points with sliding windows before prediction, converting the time series data into a supervised learning problem, and the expected output of each sliding window is the time step after the window ends.
3. The method of claim 1, wherein in step one, anomalous data is generated using the VAE algorithm so that the data is balanced.
4. The method for monitoring the state of an IT system using intelligent operation and maintenance according to claim 1, wherein in step one, the data training method uses a deep learning architecture that combines a CNN with an LSTM; a standard CNN at the front end extracts the important data features, and an LSTM at the back end extracts the temporal patterns; and the GRU network is selected for training.
5. The method for monitoring the state of an IT system using intelligent operation and maintenance according to claim 1, wherein in step one, judging whether the input KPI data is abnormal is treated as a classification problem: n output nodes are set, where n is the number of classes; for each sample the neural network produces an n-dimensional array as its output, and each output node corresponds to one class; cross entropy is used as the loss function, and the Softmax algorithm converts the output of the neural network into a probability distribution.
6. The method for monitoring the state of an IT system using intelligent operation and maintenance according to claim 1, wherein in step two, logs generated in the normal running state are processed first, and a log keyword sequence and log variables are extracted from them and used to train a log keyword anomaly detection model and a log variable anomaly detection model, respectively, the log keyword sequence also being used to train a workflow model for subsequent understanding and diagnosis of anomalies;
for log template anomaly detection, an LSTM is used for classification, where the number of classes is the number of templates and cross entropy is the loss function; the keyword sequence of the log is extracted; because logs are output in a certain order and record the state of the system step by step, log keyword anomaly detection is converted into a multi-class classification problem; supposing the logs contain n keywords in total, the model input is the keyword sequence within a time window, and the output is a vector giving the probability of each keyword occurring next after that sequence; and if the keyword corresponding to a new log entry is not among the keywords with a high probability of occurring next, it is regarded as abnormal.
7. A device for monitoring the state of an IT system using intelligent operation and maintenance, the device comprising a computing module for the method for monitoring the state of an IT system using intelligent operation and maintenance according to any one of claims 1 to 6.

Priority Applications (1), Applications Claiming Priority (1), Family Applications (1)

Application number: CN202310785759.4A
Priority date: 2023-06-30
Filing date: 2023-06-30
Title: Method and device for monitoring abnormal state of IT system by utilizing intelligent operation and maintenance
Status: Pending

Publications (1)

Publication number: CN116820884A (en)
Publication date: 2023-09-29

Family

ID: 88114171

Country Status (1)

Country: CN (1)
Link: CN116820884A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination