CN116820884A - Method and device for monitoring abnormal state of IT system by utilizing intelligent operation and maintenance
- Publication number
- CN116820884A (application CN202310785759.4A)
- Authority
- CN
- China
- Prior art keywords
- data
- log
- keyword
- maintenance
- anomaly detection
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3065—Monitoring arrangements determined by the means or processing involved in reporting the monitored data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3065—Monitoring arrangements determined by the means or processing involved in reporting the monitored data
- G06F11/3072—Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/044—Recurrent networks, e.g. Hopfield networks
- G06N3/0442—Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/04—Architecture, e.g. interconnection topology
- G06N3/0463—Neocognitrons
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- Artificial Intelligence (AREA)
- Evolutionary Computation (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Biophysics (AREA)
- Health & Medical Sciences (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Software Systems (AREA)
- General Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Quality & Reliability (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Biology (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention relates to a method for monitoring the state of an IT system using intelligent operation and maintenance, comprising the following steps: step one, performing anomaly detection on KPIs with a deep learning architecture, which involves preprocessing the data, balancing the data, training on the data, and judging whether the input KPI data is abnormal; and step two, performing log anomaly detection with a deep learning LSTM network. The invention uses a deep learning framework to address KPI anomaly detection in data center operation and maintenance, uses the VAE algorithm to address data set imbalance, and combines a CNN with a GRU network for model training, so that data features are learned automatically to detect anomalies. Compared with the prior art it offers better adaptability and generality, and it can accommodate detection for KPI metrics that change or are newly added.
Description
Technical Field
The invention belongs to the technical field of artificial intelligence, and particularly relates to a method and a device for monitoring an abnormal state of an IT system by utilizing intelligent operation and maintenance.
Background
Intelligent operation and maintenance, namely AIOps (Artificial Intelligence for IT Operations), applies artificial intelligence to the operation and maintenance field: building on existing operation and maintenance data (logs, monitoring information, application information, and the like), it uses machine learning to address problems that automated operation and maintenance cannot solve.
Conventional monitoring requires thresholds to be set manually, based on experience, for each KPI, which results in high maintenance cost, poor adaptability across scenarios, and low detection accuracy. In many complex business scenarios with hundreds or even thousands of KPIs, a manual approach is entirely inadequate. With AI algorithms, threshold-free intelligent monitoring can be applied to metrics such as system CPU, memory, network, and business KPIs, with no manually configured thresholds and with automatic adaptation to system changes.
Logging is an important means of monitoring the state of IT systems, and much of a system's monitoring, alarming, and problem analysis relies on logs. In daily operation and maintenance, log analysis places high demands on personnel, who need a thorough command of the business data and the system architecture, and the sheer volume of machine logs makes troubleshooting and log tracing difficult and laborious. By learning from the log data of historical faults with machine learning methods, a trained model can raise fault alarms and analyze fault causes automatically. This improves the efficiency of alarming and fault handling and frees up manpower; it also allows regularities in the faults to be analyzed, which in turn enables fault prediction and similar functions.
Early operation and maintenance work was done mostly by hand by operation and maintenance personnel, which is called manual operation and maintenance. This outdated way of working became difficult to sustain in an age of rapidly expanding internet business and high labor cost. Automated operation and maintenance then emerged: common, repetitive tasks are executed by automatically triggered scripts with predefined rules, which reduces labor cost and improves efficiency. Automated operation and maintenance can be regarded as an expert system based on industry domain knowledge and operation and maintenance scenario knowledge. As informatization advances, data center business expands rapidly, and service types become complex and diverse, expert systems based on manually specified rules are increasingly inadequate. The shortcomings of automated operation and maintenance are becoming more pronounced: it can only act on rules preset from human experience, any change in the business requires manual intervention to adjust those rules, and it cannot adapt on its own.
Disclosure of Invention
The invention aims to solve the problem that traditional methods rely on manual inspection of logs, which is inefficient and adapts poorly, and accordingly provides a method and a device for monitoring the state of an IT system using intelligent operation and maintenance.
The invention discloses a method for monitoring an IT system state by utilizing intelligent operation and maintenance, which comprises the following steps:
step one, performing anomaly detection on KPIs with a deep learning architecture: preprocessing the data, balancing the data, training on the data, and judging whether the input KPI data is abnormal;
step two, performing log anomaly detection with a deep learning LSTM network.
Further, in step one, the data preprocessing uses sliding windows to cut out the data points that precede each prediction, converting the time series data into a supervised learning problem; the expected output of each sliding window is the time step immediately after the window ends.
Further, in step one, anomalous data is generated using a VAE algorithm so that the data set is balanced.
Further, in step one, the data training method is as follows: the deep learning architecture combines a CNN with an LSTM; a standard CNN extracts the important data features, and the back end is an LSTM that extracts the features of temporal patterns; a GRU network is selected for training.
Further, in step one, to judge whether the input KPI data is abnormal, n output nodes are set, where n is the number of classes; for each sample the neural network produces an n-dimensional array as its output, with each output node corresponding to one class. Cross entropy is used as the loss function, and the Softmax algorithm converts the output of the neural network into a probability distribution.
Further, in step two, logs generated in the normal running state are processed first, and a log keyword sequence and log variables are extracted from them; these are used to train a log keyword anomaly detection model and a log variable anomaly detection model, respectively. The log keyword sequence is also used to train a workflow model for subsequently understanding and diagnosing anomalies.
For log template anomaly detection, an LSTM is used for classification, with the number of classes equal to the number of templates and cross entropy as the loss function. A keyword sequence is extracted from the logs; because logs are output in a certain order and record the state of the system step by step, log keyword anomaly detection can be converted into a multi-class classification problem. Suppose the logs contain n keywords in total: the model input is the keyword sequence within a time window, and the output is a vector giving the probability of each keyword occurring next after that sequence. If the keyword of a new log entry is not among the keywords with a higher probability of occurring next, it is regarded as abnormal.
The invention also relates to a device for monitoring the state of an IT system using intelligent operation and maintenance, which comprises a computing module that carries out the above method for monitoring the state of an IT system using intelligent operation and maintenance.
Advantageous effects
The invention uses a deep learning framework to address KPI anomaly detection in data center operation and maintenance, uses the VAE algorithm to address data set imbalance, and combines a CNN with a GRU network for model training, so that data features are learned automatically to detect anomalies. Compared with the prior art it offers better adaptability and generality, and it can accommodate detection for KPI metrics that change or are newly added.
The invention uses a GRU network to analyze logs and detect anomalies in them, dividing log detection into two categories: keyword detection and variable parameter detection. Log keyword detection is treated as a multi-class classification problem, and an anomaly is judged from the predicted probability of the next keyword. A variable parameter anomaly is judged by whether the error falls within a high-confidence interval of a Gaussian distribution. This solves the low efficiency of the traditional reliance on manual inspection of logs.
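The Gaussian confidence test for variable parameters can be illustrated with a minimal sketch. It assumes that prediction errors measured on normal logs are roughly Gaussian and that a value is flagged when its error leaves the interval mean ± z·σ. The function names, the sample errors, and the choice z = 2.58 (about a 99% two-sided interval) are illustrative assumptions, not values given in the text.

```python
import statistics
from typing import Sequence, Tuple


def fit_error_interval(errors: Sequence[float], z: float = 2.58) -> Tuple[float, float]:
    """Fit a Gaussian to errors seen on normal data and return mean +/- z * std."""
    mu = statistics.fmean(errors)
    sigma = statistics.stdev(errors)  # sample standard deviation
    return mu - z * sigma, mu + z * sigma


def is_variable_anomalous(error: float, interval: Tuple[float, float]) -> bool:
    """A log variable is flagged when its prediction error leaves the interval."""
    low, high = interval
    return not (low <= error <= high)


# Hypothetical prediction errors collected on logs from normal operation.
validation_errors = [0.1, -0.2, 0.05, 0.0, -0.1, 0.15, -0.05, 0.05]
interval = fit_error_interval(validation_errors)

normal_flag = is_variable_anomalous(0.1, interval)     # inside the interval
abnormal_flag = is_variable_anomalous(1.5, interval)   # far outside the interval
```

A wider interval (larger z) trades missed anomalies for fewer false alarms; the appropriate value would have to be tuned on real data.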
Drawings
FIG. 1 is a flowchart of KPI anomaly detection according to the present invention.
FIG. 2 is a data training flow chart for KPI anomaly detection.
Detailed Description
This embodiment is described in detail with reference to FIGS. 1 and 2.
As shown in FIG. 1, the method of the present invention for monitoring the state of an IT system using intelligent operation and maintenance comprises the following steps:
Step one, performing anomaly detection on KPIs with a deep learning architecture: preprocessing the data, balancing the data, training on the data, and judging whether the input KPI data is abnormal.
The data preprocessing uses sliding windows to cut out the data points that precede each prediction, converting the time series data into a supervised learning problem; the expected output of each sliding window is the time step immediately after the window ends.
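The sliding-window conversion can be sketched as follows. This is a minimal illustration that assumes a univariate KPI series and a fixed window length; the function name `make_windows` and the sample values are hypothetical.

```python
from typing import List, Sequence, Tuple


def make_windows(series: Sequence[float], window: int) -> Tuple[List[List[float]], List[float]]:
    """Turn a KPI series into (input window, value at the next time step) pairs."""
    if window < 1 or window >= len(series):
        raise ValueError("window must be >= 1 and shorter than the series")
    inputs: List[List[float]] = []
    targets: List[float] = []
    for start in range(len(series) - window):
        inputs.append(list(series[start:start + window]))
        # The expected output is the time step right after the window ends.
        targets.append(series[start + window])
    return inputs, targets


kpi = [0.2, 0.3, 0.25, 0.9, 0.3, 0.28]  # hypothetical KPI samples
X, y = make_windows(kpi, window=3)
```

Each row of `X` is one training input and the matching entry of `y` is its supervised target, so a series of length 6 with a window of 3 yields three pairs.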
Anomalous data is generated using a VAE algorithm so that the data set is balanced.
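For reference, a VAE is commonly trained by maximizing the evidence lower bound written below. The notation (encoder q_φ, decoder p_θ, latent variable z, standard normal prior) is the conventional one and is an assumption here, since the text does not state the formula. One plausible reading is that the VAE is fitted to the scarce anomalous samples and additional samples are then decoded from z drawn from the prior.

```latex
% Evidence lower bound (training objective, to be maximized)
\mathcal{L}(\theta,\phi;x)
  = \mathbb{E}_{q_\phi(z\mid x)}\big[\log p_\theta(x\mid z)\big]
  - D_{\mathrm{KL}}\big(q_\phi(z\mid x)\,\|\,p(z)\big),
  \qquad p(z)=\mathcal{N}(0,I)

% Reparameterized sampling used during training
z = \mu_\phi(x) + \sigma_\phi(x)\odot\varepsilon,
  \qquad \varepsilon\sim\mathcal{N}(0,I)

% Generating a synthetic sample after training
\hat{x}\sim p_\theta(x\mid z), \qquad z\sim\mathcal{N}(0,I)
```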
The data training method is as follows: the deep learning architecture combines a CNN with an LSTM; a standard CNN extracts the important data features, and the back end is an LSTM that extracts the features of temporal patterns; a GRU network is selected for training.
To judge whether the input KPI data is abnormal, n output nodes are set, where n is the number of classes; for each sample the neural network produces an n-dimensional array as its output, with each output node corresponding to one class. Cross entropy is used as the loss function, and the Softmax algorithm converts the output of the neural network into a probability distribution.
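The Softmax and cross-entropy step can be sketched in a few lines. The example assumes two classes (index 0 normal, index 1 abnormal) and made-up raw outputs; both are illustrative assumptions.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Convert the n raw network outputs into a probability distribution."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [value / total for value in exps]


def cross_entropy(probs: Sequence[float], true_class: int) -> float:
    """Cross-entropy loss against a one-hot label for the true class."""
    return -math.log(max(probs[true_class], 1e-12))


logits = [2.0, 0.5]  # hypothetical outputs: index 0 = normal, index 1 = abnormal
probs = softmax(logits)
loss = cross_entropy(probs, true_class=0)
predicted = max(range(len(probs)), key=probs.__getitem__)
```

The predicted class is the output node with the highest probability, and the loss shrinks toward zero as the probability assigned to the true class approaches one.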
Step two, performing log anomaly detection with a deep learning LSTM network.
Logs generated in the normal running state are processed first, and a log keyword sequence and log variables are extracted from them; these are used to train a log keyword anomaly detection model and a log variable anomaly detection model, respectively. The log keyword sequence is also used to train a workflow model for subsequently understanding and diagnosing anomalies.
For log template anomaly detection, an LSTM is used for classification, with the number of classes equal to the number of templates and cross entropy as the loss function. A keyword sequence is extracted from the logs; because logs are output in a certain order and record the state of the system step by step, log keyword anomaly detection can be converted into a multi-class classification problem. Suppose the logs contain n keywords in total: the model input is the keyword sequence within a time window, and the output is a vector giving the probability of each keyword occurring next after that sequence. If the keyword of a new log entry is not among the keywords with a higher probability of occurring next, it is regarded as abnormal.
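The decision rule on the model's output vector can be sketched as below. The text does not say how many keywords count as having "a higher probability", so the cutoff `top_g` is an assumption, as are the keyword names and probabilities; in practice the probability vector would come from the trained LSTM rather than being written by hand.

```python
from typing import Dict


def is_keyword_anomalous(next_key_probs: Dict[str, float],
                         observed_key: str,
                         top_g: int = 2) -> bool:
    """Flag the observed keyword unless it is among the top_g most probable next keywords."""
    ranked = sorted(next_key_probs, key=next_key_probs.get, reverse=True)
    return observed_key not in ranked[:top_g]


# Hypothetical model output for one window of preceding keywords.
probs = {"k1": 0.05, "k2": 0.70, "k3": 0.20, "k4": 0.05}

expected_entry = is_keyword_anomalous(probs, "k3")    # k3 is in the top 2
unexpected_entry = is_keyword_anomalous(probs, "k4")  # k4 is not
```

A keyword that never appeared in training is absent from the ranking and is therefore flagged as abnormal as well.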
As shown in FIG. 2, the input data set is first passed through a CNN, in which conv denotes the multiple convolutional layers and Max-pooling denotes the max pooling layer. A GRU neural network then performs N-class classification on the output. A neural network solves a multi-class problem by setting N output nodes, where N is the number of classes; for each sample, the network produces an N-dimensional array as its output, with each output node corresponding to one class. Finally, cross entropy is used as the loss function. Cross entropy characterizes the distance between two probability distributions; when the output of the neural network is not itself a probability distribution, the Softmax algorithm is introduced to convert it into one.
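The convolution and max pooling stages can be illustrated on a single KPI window with plain Python. This is a sketch only: a real CNN learns its kernels during training, whereas the difference kernel and the sample window here are hand-picked assumptions, and the GRU stage that would consume the pooled features is omitted.

```python
from typing import List, Sequence


def conv1d(signal: Sequence[float], kernel: Sequence[float]) -> List[float]:
    """Valid 1-D convolution as used in CNN layers (no kernel flip, stride 1)."""
    k = len(kernel)
    return [sum(signal[i + j] * kernel[j] for j in range(k))
            for i in range(len(signal) - k + 1)]


def max_pool1d(values: Sequence[float], size: int) -> List[float]:
    """Non-overlapping max pooling; a trailing partial block is dropped."""
    return [max(values[i:i + size])
            for i in range(0, len(values) - size + 1, size)]


window = [1.0, 1.0, 1.0, 5.0, 1.0, 1.0, 1.0]  # hypothetical KPI window with a spike
diff_kernel = [-1.0, 1.0]                     # hand-picked kernel that responds to jumps
features = conv1d(window, diff_kernel)
pooled = max_pool1d(features, size=2)
```

The spike in the window produces a strong response in the feature map, and pooling keeps that response while halving the sequence length passed to the recurrent stage.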
Effect verification
For KPI anomaly detection, the neural network model is trained and validated on the anomaly detection data set published by Yahoo. For log anomaly detection, logs collected from a real Amazon cloud data center environment are used, in which normal and erroneous logs are labeled. Detection results are measured by accuracy, precision, recall, and F-measure; in a side-by-side comparison with conventional algorithms of the same type, the results are better.
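The four metrics can be computed from binary labels as in the sketch below. It assumes label 1 means abnormal and takes the F-measure to be the balanced F1 score; the label vectors are invented for illustration and are not results from the experiments described.

```python
from typing import Dict, Sequence


def detection_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, recall and F1 for binary labels (1 = abnormal)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


y_true = [0, 0, 1, 1, 0, 1, 0, 0]  # hypothetical ground-truth labels
y_pred = [0, 0, 1, 0, 0, 1, 1, 0]  # hypothetical detector output
metrics = detection_metrics(y_true, y_pred)
```

Because anomalies are rare, accuracy alone can look high for a detector that flags nothing, which is why precision, recall, and the F-measure are reported alongside it.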
The foregoing is merely illustrative of the present invention and is not intended to limit the embodiments of the present invention, and those skilled in the art can easily make corresponding variations or modifications according to the main concept and spirit of the present invention, so that the protection scope of the present invention shall be defined by the claims.
Claims (7)
1. A method for monitoring the abnormal state of an IT system using intelligent operation and maintenance, characterized by comprising the following steps:
step one, performing anomaly detection on KPIs with a deep learning architecture: preprocessing the data, balancing the data, training on the data, and judging whether the input KPI data is abnormal;
step two, performing log anomaly detection with a deep learning LSTM network.
2. The method of claim 1, wherein in step one, the data preprocessing uses sliding windows to cut out the data points that precede each prediction, converting the time series data into a supervised learning problem, and the expected output of each sliding window is the time step immediately after the window ends.
3. The method of claim 1, wherein in step one, anomalous data is generated using a VAE algorithm so that the data set is balanced.
4. The method for monitoring the state of an IT system using intelligent operation and maintenance according to claim 1, wherein in step one, the data training method is as follows: the deep learning architecture combines a CNN with an LSTM; a standard CNN extracts the important data features, and the back end is an LSTM that extracts the features of temporal patterns; a GRU network is selected for training.
5. The method for monitoring the state of an IT system using intelligent operation and maintenance according to claim 1, wherein in step one, to judge whether the input KPI data is abnormal, n output nodes are set, where n is the number of classes; for each sample the neural network produces an n-dimensional array as its output, and each output node corresponds to one class; cross entropy is used as the loss function, and the Softmax algorithm converts the output of the neural network into a probability distribution.
6. The method for monitoring the state of an IT system using intelligent operation and maintenance according to claim 1, wherein in step two, logs generated in the normal running state are processed first, and a log keyword sequence and log variables are extracted from them and used to train a log keyword anomaly detection model and a log variable anomaly detection model, respectively, the log keyword sequence also being used to train a workflow model for subsequently understanding and diagnosing anomalies;
for log template anomaly detection, an LSTM is used for classification, the number of classes is the number of templates, and cross entropy is the loss function; a keyword sequence is extracted from the logs, and because logs are output in a certain order and record the state of the system step by step, log keyword anomaly detection is converted into a multi-class classification problem; supposing the logs contain n keywords in total, the model input is the keyword sequence within a time window and the output is a vector giving the probability of each keyword occurring next after that sequence; and if the keyword of a new log entry is not among the keywords with a higher probability of occurring next, it is regarded as abnormal.
7. A device for monitoring the state of an IT system using intelligent operation and maintenance, comprising a computing module that carries out the method for monitoring the state of an IT system using intelligent operation and maintenance according to any one of claims 1 to 6.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310785759.4A CN116820884A (en) | 2023-06-30 | 2023-06-30 | Method and device for monitoring abnormal state of IT system by utilizing intelligent operation and maintenance |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310785759.4A CN116820884A (en) | 2023-06-30 | 2023-06-30 | Method and device for monitoring abnormal state of IT system by utilizing intelligent operation and maintenance |
Publications (1)
Publication Number | Publication Date |
---|---|
CN116820884A true CN116820884A (en) | 2023-09-29 |
Family
ID=88114171
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310785759.4A Pending CN116820884A (en) | 2023-06-30 | 2023-06-30 | Method and device for monitoring abnormal state of IT system by utilizing intelligent operation and maintenance |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN116820884A (en) |
-
2023
- 2023-06-30 CN CN202310785759.4A patent/CN116820884A/en active Pending
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN111007799B (en) | Numerical control equipment remote diagnosis system based on neural network | |
CN111885059B (en) | Method for detecting and positioning abnormal industrial network flow | |
CN110335168B (en) | Method and system for optimizing power utilization information acquisition terminal fault prediction model based on GRU | |
CN111507376A (en) | Single index abnormality detection method based on fusion of multiple unsupervised methods | |
CN109871002B (en) | Concurrent abnormal state identification and positioning system based on tensor label learning | |
CN111879349A (en) | Sensor data deviation self-adaptive correction method | |
CN117421684B (en) | Abnormal data monitoring and analyzing method based on data mining and neural network | |
CN115237717A (en) | Micro-service abnormity detection method and system | |
CN117041029A (en) | Network equipment fault processing method and device, electronic equipment and storage medium | |
CN111666978B (en) | Intelligent fault early warning system for IT system operation and maintenance big data | |
CN113687972A (en) | Method, device and equipment for processing abnormal data of business system and storage medium | |
CN111275136B (en) | Fault prediction system based on small sample and early warning method thereof | |
CN117411703A (en) | Modbus protocol-oriented industrial control network abnormal flow detection method | |
CN114962390A (en) | Hydraulic system fault diagnosis method and system and working machine | |
CN113093695A (en) | Data-driven SDN controller fault diagnosis system | |
CN110727669B (en) | Electric power system sensor data cleaning device and cleaning method | |
CN111352820A (en) | Method, equipment and device for predicting and monitoring running state of high-performance application | |
CN116720095A (en) | Electrical characteristic signal clustering method for optimizing fuzzy C-means based on genetic algorithm | |
CN116820884A (en) | Method and device for monitoring abnormal state of IT system by utilizing intelligent operation and maintenance | |
CN113656323B (en) | Method for automatically testing, positioning and repairing faults and storage medium | |
CN115659135A (en) | Anomaly detection method for multi-source heterogeneous industrial sensor data | |
CN115456092A (en) | Real-time monitoring method for abnormal data of power system | |
CN114565051A (en) | Test method of product classification model based on neuron influence degree | |
CN115080286A (en) | Method and device for discovering log exception of network equipment | |
CN118014373B (en) | Risk identification model based on data quality monitoring and construction method thereof |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||