CN116225760A - Real-time root cause analysis method based on operation and maintenance knowledge graph

Publication number
CN116225760A
Authority
CN
China
Prior art keywords
knowledge graph
data
root cause
alarm
constructing
Legal status
Pending
Application number
CN202310069681.6A
Other languages
Chinese (zh)
Inventor
刘柏嵩
胡测
金建国
黄瑞强
Current Assignee
Ningbo University
Original Assignee
Ningbo University
Application filed by Ningbo University
Priority to CN202310069681.6A
Publication of CN116225760A

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y04 INFORMATION OR COMMUNICATION TECHNOLOGIES HAVING AN IMPACT ON OTHER TECHNOLOGY AREAS
    • Y04S SYSTEMS INTEGRATING TECHNOLOGIES RELATED TO POWER NETWORK OPERATION, COMMUNICATION OR INFORMATION TECHNOLOGIES FOR IMPROVING THE ELECTRICAL POWER GENERATION, TRANSMISSION, DISTRIBUTION, MANAGEMENT OR USAGE, i.e. SMART GRIDS
    • Y04S10/00 Systems supporting electrical power generation, transmission or distribution
    • Y04S10/50 Systems or methods supporting the power network operation or management, involving a certain degree of interaction with the load-side end user applications

Abstract

The invention discloses a real-time root cause analysis method based on an operation and maintenance knowledge graph. The method comprises: constructing an operation and maintenance knowledge graph ontology, in which the entities, relations and other content of concepts are extracted using natural language processing and machine learning techniques to build the basic framework of the knowledge graph; constructing an equipment knowledge graph that is updatable and maintainable and is driven by both knowledge and data, using the topology of the digital equipment and the calling relations among system applications; constructing a fault knowledge graph to support subsequent fault root cause analysis; and performing real-time fault convergence and root cause analysis, in which causal relationships are obtained from the categories of the alarm information and the indicated sequence patterns to yield a root cause path. The invention offers an approach to real-time fault root cause localization for digital infrastructure operation and maintenance, and features strong scalability and highly usable root cause localization results.

Description

Real-time root cause analysis method based on operation and maintenance knowledge graph
Technical Field
The invention relates to the technical field of fault analysis for data service support, and in particular to a real-time root cause analysis method based on an operation and maintenance knowledge graph.
Background
With the rapid development of new digital technologies such as artificial intelligence, cloud computing and 5G mobile communication, the scale of digital infrastructure keeps growing, and its role in the national economy and social development is becoming increasingly important. However, digital infrastructure often suffers from geographically dispersed data acquisition, poor remote operation and maintenance, heavy reliance on manual inspection and non-uniform monitoring and management, which exposes the responsible organizations to higher risk and cost pressure. In the daily operation, management and maintenance of digital facilities, once infrastructure or network links fail, operation and maintenance personnel must spend a great deal of time and effort checking and repairing components one by one, and slow manual maintenance and response disrupt normal service and production. The problem is especially critical for digital infrastructure in key fields such as healthcare, government affairs and finance, where diagnosing and analyzing facility faults efficiently remains a major challenge. Building an intelligent fault diagnosis and operation and maintenance system for new digital infrastructure has therefore become both a pressing need for enterprises and a research hotspot in China and abroad.
At present, research on fault diagnosis and root cause analysis focuses mainly on the current running state of the system, and the resulting analysis methods or systems each target a narrow application. When a system fails or behaves abnormally, their ability to detect the affected object and analyze the root cause is insufficient. Traditional data-driven network fault diagnosis methods also suffer from poor interpretability and low practical performance.
Disclosure of Invention
In view of the shortcomings of the prior art, the invention provides a real-time root cause analysis method based on an operation and maintenance knowledge graph.
In order to achieve the above object, the present invention provides the following technical solutions.
A real-time root cause analysis method based on an operation and maintenance knowledge graph, characterized in that the method comprises the following steps:
step S1: establishing a knowledge graph application framework for the digital infrastructure field, constructing an operation and maintenance model structure, and organically combining a plurality of operation and maintenance objects;
step S2: constructing an operation and maintenance knowledge graph ontology, completing extraction using natural language processing and machine learning techniques, and building the basic framework of the knowledge graph;
step S3: knowledge extraction, in which a joint extraction algorithm performs entity extraction and relation extraction simultaneously on the existing data, avoiding the potential problems of error propagation, information redundancy and neglect of the connection between subtasks;
step S4: knowledge fusion, in which knowledge redundancy is eliminated through entity alignment and attribute alignment, association relationships are established, entities and relations are learned iteratively, and deep knowledge features are captured automatically;
step S5: constructing an equipment knowledge graph that is updatable and maintainable and is driven by both knowledge and data;
step S6: constructing a fault knowledge graph to support subsequent fault root cause analysis;
step S7: performing real-time root cause analysis based on the operation and maintenance knowledge graph, covering convergence of the alarm data and real-time root cause localization.
Compared with the prior art, the invention achieves rapid fault localization and root cause analysis for real-time alarm information by constructing an operation and maintenance knowledge graph composed of an equipment knowledge graph and a fault knowledge graph. By combining knowledge graph technology to build an operation and maintenance knowledge graph oriented to fault diagnosis and root cause analysis, the invention improves the reasoning capability of the model, increases fault localization efficiency and improves applicability in real engineering settings.
Further, in step S1, the application architecture of the knowledge graph in the digital infrastructure field includes a data layer, a core layer and an application layer. The data layer gathers data from different data sources and performs in-depth analysis and fusion. The core layer constructs an ontology according to expert knowledge in the digital infrastructure diagnosis field and the requirements of the knowledge graph application, and determines the entity and relation types contained in the equipment knowledge graph and the fault knowledge graph.
The knowledge graph representation can express not only the related entities, relations and attributes but also real-time fault time series, which underpins the subsequent real-time root cause analysis.
Further, in step S2, the method for constructing the operation and maintenance knowledge graph ontology includes:
step S21: inputting sentences of various types and parsing them to obtain valid syntactic information;
step S22: extracting the entity, relation and attribute content from the syntactic information using natural language processing and machine learning techniques;
step S23: manually screening and reviewing the processed ontology, relation and attribute content to control the quality of the data in the knowledge base;
step S24: after the content has been selected, audited and checked, storing the information in a designated database.
This method greatly reduces labor cost and markedly improves overall efficiency when building the basic framework data of the knowledge graph.
Further, in step S3, entity extraction specifically means that the entity extraction module uses a self-attention mechanism to represent the connections between entities within a sentence, obtains the encoded feature vectors, extracts entities using a fully connected layer and a convolutional neural network, and collects the extracted entities into a candidate entity set.
Further, in step S3, relation extraction means that, after the entity module has extracted features, the extracted entity feature vectors are used as input and the relations between entities are predicted through a self-attention layer and a fully connected layer; model training obtains its training data by random sampling and uses the Adam algorithm to optimize the model parameters.
A pre-trained model for text sentences in the digital infrastructure fault field is used to extract entities and the relations between them jointly, so the target triples are obtained directly and the problems of the pipeline method are avoided.
Further, in step S5, the data of the equipment knowledge graph include configuration management database data, call chain data and physical equipment network connection data; construction of the equipment knowledge graph starts from the data and specifically comprises the following steps:
step S51: building a relation graph from the configuration management database data, extracting the key variables of the logs, and then applying remote labeling and manual screening to generate the configuration management data semi-automatically, so as to obtain a software knowledge graph;
step S52: building a knowledge graph from the call chain data or the physical equipment network connection data, using a method similar to step S51: the log information is clustered by topic, variables in the text are then identified with the Smith-Waterman algorithm, and the high-confidence variables are extracted to generate a hardware knowledge graph;
step S53: merging the software knowledge graph obtained in step S51 and the hardware knowledge graph generated in step S52 with NetworkX, and storing the merged knowledge graph in a graph database to obtain the final equipment knowledge graph.
Further, in step S6, the specific steps of constructing the fault knowledge graph are as follows:
step S61: segmenting the alarm information into words using a convolutional-neural-network-based method, computing word vectors, using the word vectors as input to train the model, and classifying the alarm data;
step S62: taking all alarm categories as causal nodes and each virtual machine alarm record as a center, setting an alarm time slice, and collecting the set of related alarm records within each virtual machine alarm time slice as a causal discovery sample;
step S63: calculating the causal edge weight, which is the ratio of the number of alarms generated by the effect node, given that the cause node has generated an alarm, to the total number of alarms generated by the cause node:
W_i = \frac{m_i}{M}
where m_i is the number of alarms generated by the effect node of causal edge i given that its cause node has generated an alarm, M is the total number of alarms generated by the cause node, and W_i is the weight of causal edge i.
Further, in step S61, the training model for the alarm data includes an input layer, a convolution layer, a pooling layer and a fully connected layer;
each input data vector of the input layer can either be pre-trained or be learned while training the current neural network model; the convolution layers are the core of the alarm data classification model, the sizes of the three convolution layers are set to 2, 3 and 4 in turn so that progressively richer feature information is extracted, and the mathematical expression is as follows:
Figure SMS_2
where w_i(i,j) denotes the weight of convolution kernel input node (i,j) for the i-th node in the output matrix, y_{i,j} denotes the value of node (i,j) in the convolution kernel, and h_i is the result of the final convolution layer.
The pooling layer lets the model concentrate on the features it needs and reduces the size of the feature vectors and the number of parameters, achieving dimensionality reduction; all output vectors of the fully connected layer are fed into a Softmax classifier, which completes the final classification task; 0 indicates that the signal is not disturbed and 1 indicates that the signal is disturbed.
The convolutional-neural-network-based alarm classification method classifies alarm data effectively and with high accuracy.
Further, in step S7, the specific steps of alarm data convergence and real-time root cause localization are as follows:
step S71: setting the granularity of the time slice and acquiring the alarm data within the time slice in real time;
step S72: for the original alarm data, combining the specific alarm information with the monitoring items and classifying the original alarm data into the HOST, VM and SOFTWARE categories using the trained classification model;
step S73: querying the software and hardware knowledge graphs to converge the alarms system by system;
step S74: based on the alarm convergence result, querying the graph database at system level for the connected subgraph among all nodes under each system, to obtain the alarm causal graph among all nodes under a given system;
step S75: calculating the suspected paths from the alarm causal graph generated in step S74 and the edge weights, and ranking them to give the root cause path.
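As an illustration of step S75, the following is a minimal sketch rather than the patented procedure. The text does not state how a suspected path is scored, so the sketch assumes that a path's score is the product of its causal edge weights; the function name `rank_root_cause_paths` and the alarm category names are hypothetical.

```python
def rank_root_cause_paths(edges, symptom):
    """Rank cause chains that end at `symptom`.

    edges: dict mapping (cause, effect) -> causal edge weight.
    Assumption: a path is scored by the product of its edge weights.
    """
    # Index the graph by effect so we can walk backwards from the symptom.
    parents = {}
    for (cause, effect), weight in edges.items():
        parents.setdefault(effect, []).append((cause, weight))

    results = []

    def walk(node, path, score):
        # Skip predecessors already on the path to avoid cycles.
        preds = [(c, w) for c, w in parents.get(node, []) if c not in path]
        if not preds:
            if len(path) > 1:
                # Reverse so the path reads root cause -> symptom.
                results.append((score, path[::-1]))
            return
        for cause, weight in preds:
            walk(cause, path + [cause], score * weight)

    walk(symptom, [symptom], 1.0)
    return sorted(results, key=lambda item: item[0], reverse=True)


# Hypothetical alarm causal graph for one system.
causal_edges = {
    ("HOST_CPU_HIGH", "VM_SLOW"): 0.8,
    ("VM_SLOW", "APP_TIMEOUT"): 0.9,
    ("DISK_FULL", "APP_TIMEOUT"): 0.5,
}
ranked = rank_root_cause_paths(causal_edges, "APP_TIMEOUT")
```

The first entry of `ranked` is the highest-scoring chain; other scoring rules (for example, a sum of log-weights) would fit the same structure.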
Drawings
FIG. 1 is an architecture diagram of the operation and maintenance knowledge graph for the digital infrastructure fault field;
FIG. 2 is a flow chart of architecture setup;
FIG. 3 is a flowchart of the construction of the operation and maintenance knowledge graph ontology;
FIG. 4 is a knowledge extraction flow chart.
Detailed Description
To enable those skilled in the art to better understand the technical solution of the invention, it is further described below with reference to the accompanying drawings and the present embodiment.
FIG. 1 is an architecture diagram of the operation and maintenance knowledge graph for the digital infrastructure fault field, and FIG. 2 is a flow chart for building that architecture. The method specifically comprises the following steps:
Step S1: Establish a knowledge graph application framework for the digital infrastructure field.
The data layer gathers data from different data sources and performs in-depth analysis and fusion. The core layer constructs an ontology according to expert knowledge in the digital infrastructure diagnosis field and the requirements of the knowledge graph application, and determines the entity and relation types contained in the equipment knowledge graph and the fault knowledge graph; it is therefore a key component of the three-layer architecture.
Step S2: Construct the operation and maintenance knowledge graph ontology, completing extraction using natural language processing and machine learning techniques, and build the basic framework of the knowledge graph.
The invention adopts a semi-automatic construction mode; the method for constructing the operation and maintenance knowledge graph ontology is shown in FIG. 3:
step S21: inputting sentences of various types and parsing them to obtain valid syntactic information;
step S22: extracting the entity, relation and attribute content from the syntactic information using natural language processing and machine learning techniques;
step S23: manually screening and reviewing the processed ontology, relation and attribute content to control the quality of the data in the knowledge base;
step S24: after the content has been selected, audited and checked, storing the information in a designated database.
This method greatly reduces labor cost and markedly improves overall efficiency when building the basic framework data of the knowledge graph.
Step S3: Knowledge extraction. After the ontology has been built, a joint extraction algorithm performs entity extraction and relation extraction simultaneously on the existing data, avoiding the potential problems of error propagation, information redundancy and neglect of the connection between subtasks.
As shown in FIG. 1, knowledge extraction draws on a wide range of data sources, mainly structured, semi-structured and unstructured data, and the distributed data packets are small in content. Structured data can be extracted directly by data mapping according to the metadata and the constructed ontology. For semi-structured or unstructured data, a dedicated extraction algorithm is designed according to the content and structure of the corpus. After the ontology has been constructed, the invention applies the joint extraction algorithm, which performs entity extraction and relation extraction simultaneously on the existing data, obtains the target triples directly and avoids the problems of the pipeline method.
The entity extraction module uses a self-attention mechanism to represent the connections between entities within a sentence, obtains the encoded feature vectors, extracts entities using a fully connected layer and a convolutional neural network, and collects the extracted entities into a candidate entity set.
The model uses a QANet-style pointer network to extract entities: the output sequence h = (h_1, h_2, …, h_n) of the attention layer is fed into two fully connected layers, and a sigmoid activation function yields the final predicted sequences for the entity heads and tails, P_s = (s_1, s_2, …, s_n) and P_e = (e_1, e_2, …, e_n).
The module uses binary cross-entropy as its loss function:
Loss = L(P_s) + L(P_e)
where L(P_s) is the binary cross-entropy between the predicted entity heads and the ground truth, L(P_e) is the binary cross-entropy between the predicted entity tails and the ground truth, and their sum is the total loss.
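The loss above can be sketched in a few lines of plain Python. This is an illustrative sketch only: the helper names, the averaging over tokens and the example probabilities are assumptions, not details given in the text.

```python
import math


def binary_cross_entropy(probs, labels, eps=1e-12):
    """Mean binary cross-entropy between probabilities and 0/1 labels."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(probs)


def pointer_loss(p_start, p_end, y_start, y_end):
    """Loss = L(P_s) + L(P_e): head loss plus tail loss."""
    return binary_cross_entropy(p_start, y_start) + binary_cross_entropy(p_end, y_end)


# Hypothetical 4-token sentence with one entity spanning tokens 1..2.
p_s = [0.1, 0.9, 0.2, 0.1]  # predicted probability that each token starts an entity
p_e = [0.1, 0.2, 0.8, 0.1]  # predicted probability that each token ends an entity
y_s = [0, 1, 0, 0]          # true entity head positions
y_e = [0, 0, 1, 0]          # true entity tail positions
loss = pointer_loss(p_s, p_e, y_s, y_e)
```

The loss approaches zero as the predicted head and tail probabilities approach the true labels.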
After the entity module has extracted features, the extracted entity feature vectors are used as input, and the relations between entities are predicted through a self-attention layer and a fully connected layer. The prediction logic is:
P(s, r, o) = P(s) P(r | s) P(o | s, r)
where r denotes a relation, s denotes an entity extracted by the previous module, and o denotes the entity linked to s under relation r. The likelihood that the relation exists is then computed with the activation function:
a = sigmoid(w^T H)
Each relation r is checked in turn for its presence in the input text, and the computed value decides whether that relation holds between the corresponding entities. Model training obtains its training data by random sampling and uses the Adam algorithm to optimize the model parameters.
Step S4: Knowledge fusion, in which knowledge redundancy is eliminated through entity alignment and attribute alignment.
The invention provides a knowledge fusion framework for the digital infrastructure field that iteratively learns entities and relations, automatically captures deep knowledge features and improves the accuracy of knowledge fusion. As shown in FIG. 4, knowledge fusion proceeds as follows:
step S41: the knowledge database is screened into a set of entities to be aligned and a small number of aligned entity pairs;
step S42: the screened knowledge data are learned jointly with the graph; one part forms a training set under the graph representation and the other part forms a set to be aligned under the graph representation;
step S43: the training set under the graph representation is divided into an attribute-level training set and a relation-level training set; the attribute-level training set is aligned repeatedly to form an attribute-level model, and learning then continues to form the set to be aligned under the graph representation;
step S44: deep knowledge features are captured automatically from the set to be aligned under the graph representation, forming a new attribute-level training set and a new relation-level training set;
step S45: once the relation-level training set has been formed, learning continues and a new relation-level training set is formed.
Step S5: Construct an equipment knowledge graph that is updatable and maintainable and is driven by both knowledge and data.
The invention builds the equipment knowledge graph starting from the data, with the following specific steps:
step S51: a relation graph is built from the configuration management database data. The original equipment is first accessed and syslog data are collected from the network equipment; the log information is then clustered by topic type, where the joint probability of the topic-based clustering model is expressed as:
Figure SMS_3
where d denotes a log file, w denotes a keyword, K denotes the total number of topics, and z_k denotes the k-th topic.
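The formula image is not reproduced in this text. As a hedged reading only, the symbols defined here match a probabilistic latent semantic analysis (pLSA) style topic model, whose joint probability is conventionally written as below; whether the patent uses exactly this form is an assumption.

```latex
P(d, w) = P(d) \sum_{k=1}^{K} P(w \mid z_k)\, P(z_k \mid d)
```

Under this reading, each log file d is assigned to the topic z_k with the largest P(z_k | d), which gives the topic-based clusters.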
Clustering of the configuration management database log information is completed by the topic-based clustering algorithm. The Smith-Waterman algorithm is then used to identify variables and find the high-confidence variables that are highly similar across multiple sequences. Assume the sequences to be aligned are X = x_1 x_2 … x_n and Y = y_1 y_2 … y_m, where n and m are the lengths of X and Y, respectively.
A scoring matrix H with n+1 rows and m+1 columns is created, and its first row and first column are initialized to 0:
H_{k,0} = H_{0,l} = 0 (0 ≤ k ≤ n, 0 ≤ l ≤ m)
The remainder of H is filled from left to right and top to bottom:
H_{i,j} = \max\{ H_{i-1,j-1} + s(x_i, y_j),\ \max_{k \ge 1}(H_{i-k,j} - W_k),\ \max_{l \ge 1}(H_{i,j-l} - W_l),\ 0 \}
where H_{i-1,j-1} + s(x_i, y_j) is the similarity score for aligning x_i with y_j, H_{i-k,j} - W_k is the score when x_i ends a deletion of length k, H_{i,j-l} - W_l is the score when y_j ends a deletion of length l, and 0 means that x_i and y_j have no similarity. Traceback then starts from the highest-scoring element and is repeated until an element with a score of 0 is encountered.
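The alignment can be sketched as follows. This is a minimal illustration under stated assumptions: a linear gap penalty W_k = k · gap (so only the immediate neighbors need checking), match and mismatch scores of 2 and -1, log lines tokenized by whitespace, and tokens that differ between two aligned log lines treated as candidate variables. None of these parameter choices comes from the text.

```python
def smith_waterman(x, y, match=2, mismatch=-1, gap=1):
    """Local alignment of token sequences x and y (linear gap penalty assumed)."""
    n, m = len(x), len(y)
    # (n+1) x (m+1) scoring matrix; first row and column stay 0.
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best, best_pos = 0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            H[i][j] = max(H[i - 1][j - 1] + s,  # align x_i with y_j
                          H[i - 1][j] - gap,    # gap in y
                          H[i][j - 1] - gap,    # gap in x
                          0)                    # no similarity
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)
    # Trace back from the highest score until a 0 cell is reached.
    i, j = best_pos
    ax, ay = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        s = match if x[i - 1] == y[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + s:
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif H[i][j] == H[i - 1][j] - gap:
            ax.append(x[i - 1]); ay.append(None); i -= 1
        else:
            ax.append(None); ay.append(y[j - 1]); j -= 1
    return best, ax[::-1], ay[::-1]


# Two hypothetical syslog lines from the same topic cluster.
log_a = "Interface eth0 changed state to down".split()
log_b = "Interface eth1 changed state to up".split()
score, aligned_a, aligned_b = smith_waterman(log_a, log_b)
# Positions where the aligned tokens differ are candidate variables.
variables = [(a, b) for a, b in zip(aligned_a, aligned_b) if a != b]
```

Here the aligned region covers the shared template "Interface … changed state to", and the differing interface names surface as the variable slot.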
After the key variables of the logs have been extracted, they are remotely labeled and manually screened, the configuration management data are generated semi-automatically, and the software knowledge graph is produced.
Step S52: A knowledge graph is built from the call chain data or the physical equipment network connection data. The call chain data are mainly used to obtain the distribution units, the calling relations among systems, the mapping between distribution units and IP addresses, and the logical relations of the middleware; the physical equipment mainly comprises physical machines, switches and routers. This step uses a method similar to step S51 to extract high-confidence variables and generate the hardware knowledge graph.
Step S53: The knowledge graphs obtained from the configuration management database, the call chains and the physical equipment are merged with NetworkX and stored in the graph database Neo4j to obtain the final equipment knowledge graph, which is mainly divided into single-system and inter-system knowledge graphs.
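The merge in step S53 amounts to a union of nodes and edges in which attributes of shared nodes are combined. The sketch below shows that idea with plain dictionaries so that it runs without third-party packages; the graph layout, node names and relation names are hypothetical. With NetworkX installed, `networkx.compose` performs a comparable union on graph objects.

```python
def merge_graphs(*graphs):
    """Union several graphs given as {"nodes": {id: attrs}, "edges": {(src, rel, dst)}}."""
    merged = {"nodes": {}, "edges": set()}
    for graph in graphs:
        for node, attrs in graph["nodes"].items():
            # Shared nodes keep the attributes from every source graph.
            merged["nodes"].setdefault(node, {}).update(attrs)
        merged["edges"] |= graph["edges"]
    return merged


# Hypothetical software graph (from CMDB data) and hardware graph (from network data).
software_kg = {
    "nodes": {"order-service": {"type": "SOFTWARE"}, "vm-01": {"type": "VM"}},
    "edges": {("order-service", "RUNS_ON", "vm-01")},
}
hardware_kg = {
    "nodes": {"vm-01": {"ip": "10.0.0.5"}, "host-01": {"type": "HOST"}},
    "edges": {("vm-01", "HOSTED_ON", "host-01")},
}
equipment_kg = merge_graphs(software_kg, hardware_kg)
```

The shared node `vm-01` links the two graphs, which is what later allows an alarm on a host to be traced to the software running on it.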
Step S6: The specific steps of constructing the fault knowledge graph are as follows:
step S61: the alarm information is first segmented into words using a convolutional-neural-network-based method, word vectors are then computed and used as input to train the model, and the alarm data are classified; the model comprises an input layer, a convolution layer, a pooling layer and a fully connected layer;
each input data vector in the input layer can either be pre-trained or be learned while training the current neural network model;
the convolution layers are the core of the alarm data classification model; the sizes of the three convolution layers are set to 2, 3 and 4 in turn so that progressively richer feature information is extracted, and the mathematical expression is as follows:
Figure SMS_5
where w_i(i,j) denotes the weight of convolution kernel input node (i,j) for the i-th node in the output matrix, y_{i,j} denotes the value of node (i,j) in the convolution kernel, and h_i is the result of the final convolution layer.
The main function of the pooling layer is to let the model concentrate on the features it needs and to reduce the size of the feature vectors and the number of parameters, achieving dimensionality reduction.
All output vectors of the fully connected layer are fed into the Softmax classifier, which completes the final classification task; 0 indicates that the signal is not disturbed and 1 indicates that the signal is disturbed.
Experimental results show that the convolutional-neural-network-based alarm classification method classifies alarm data effectively and with high accuracy.
Step S62: taking all alarm classifications as causal nodes, taking each virtual machine alarm record as a center, giving an alarm time slice, and searching a relevant alarm record set in each virtual machine alarm time slice as a causal discovery sample;
the invention adopts a causal algorithm based on scores, and the algorithm carries out condition test on variables and variable sets so as to obtain possible causal edges; the first step of the score-based causal algorithm is to perform a traversing edge-pruning operation from a completely connected undirected graph and finally obtain a graph skeleton G, and the algorithm principle depends on the feature that the conditional independence among variables under the faithful assumption is equivalent to the directed separation of corresponding vertexes in the graph model. All condition independence possibility between any two variables is calculated, and the directional separability property is judged according to the condition independence possibility, so that the purpose of deleting the false edge is achieved, and the steps are as follows:
and connecting all variables in the variable set to construct a complete undirected graph.
Step S621: starting from the undirected graph, arbitrarily selecting a pair of adjacent variables X and Y, and deleting the edge X-Y from the graph if X ⊥ Y | Q, that is, if X and Y are d-separated by Q and are therefore conditionally independent given Q; wherein Q is a node set with an adjacency relationship;
step S622: given the node set Q, regardless of whether directed edges connect its nodes, all other nodes are conditionally independent given Q, namely X ⊥ Y | Q; step S621 may therefore be repeated to remove all edges not connected to the nodes of Q;
step S623: in the undirected graph from which all edges not connected to the nodes of Q have been removed, orienting the edges of every connecting path X-Z-Y among variable nodes X, Y and Z; if X and Y are not adjacent in the graph and Z is not in the separating set of (X, Y), the path X-Z-Y is oriented as X→Z←Y;
step S624: X and Z are adjacent variables, while X and Y are not adjacent; after step S623, if the direction of X-Z has been determined as X→Z, the directions of the remaining edges can be determined by orientation propagation: the direction of Z-Y is Z→Y;
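The edge-pruning and orientation steps S621 to S624 resemble the skeleton and orientation phases of the PC algorithm, and the following sketch assumes that reading. The conditional-independence test is supplied as a hypothetical `independent(x, y, cond)` callback, here backed by a hand-written table of independence facts for a toy graph A→C←B, C→D; none of these names or facts come from the patent.

```python
from itertools import combinations

def pc_sketch(variables, independent):
    """Return (skeleton, directed edges) using a conditional-independence callback."""
    # Start from the complete undirected graph.
    adjacency = {v: set(variables) - {v} for v in variables}
    sepset = {}
    size = 0
    # Skeleton phase: delete X-Y when some subset of X's neighbours separates them.
    while any(len(adjacency[x]) - 1 >= size for x in variables):
        for x in variables:
            for y in sorted(adjacency[x]):
                candidates = sorted(adjacency[x] - {y})
                for cond in combinations(candidates, size):
                    if independent(x, y, frozenset(cond)):
                        adjacency[x].discard(y)
                        adjacency[y].discard(x)
                        sepset[frozenset((x, y))] = set(cond)
                        break
        size += 1
    # Collider phase: X-Z-Y with X, Y non-adjacent and Z outside their separating set.
    directed = set()  # (a, b) means a -> b
    for z in variables:
        for x, y in combinations(sorted(adjacency[z]), 2):
            if y not in adjacency[x] and z not in sepset.get(frozenset((x, y)), set()):
                directed.add((x, z))
                directed.add((y, z))
    # Propagation phase: X -> Z, Z - Y undirected, X and Y non-adjacent gives Z -> Y.
    changed = True
    while changed:
        changed = False
        for x, z in list(directed):
            for y in adjacency[z] - {x}:
                undirected = (z, y) not in directed and (y, z) not in directed
                if undirected and y not in adjacency[x]:
                    directed.add((z, y))
                    changed = True
    skeleton = {frozenset((a, b)) for a in variables for b in adjacency[a]}
    return skeleton, directed

# Hypothetical independence facts for the toy graph A -> C <- B, C -> D.
FACTS = {
    (frozenset("AB"), frozenset()),
    (frozenset("AD"), frozenset("C")),
    (frozenset("AD"), frozenset("BC")),
    (frozenset("BD"), frozenset("C")),
    (frozenset("BD"), frozenset("AC")),
}

def oracle(x, y, cond):
    """Look up whether x and y are independent given cond in the toy fact table."""
    return (frozenset((x, y)), frozenset(cond)) in FACTS

skeleton, directed = pc_sketch(["A", "B", "C", "D"], oracle)
```

In practice the callback would be a statistical test on the causal discovery samples rather than a lookup table; the table is used here only so that the control flow of the three phases is easy to follow.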
step S63: calculating the weights of the causal edges using conditional probability. That is, based on the causal discovery samples and a causal edge (comprising a cause node and an effect node) given by the causal discovery algorithm, the ratio of the number of alarms generated by the effect node when the cause node alarms to the total number of alarms generated by the cause node is taken as the weight of the causal edge.
$$W_i = \frac{m_i}{M}$$
where $m_i$ is the number of alarms generated by the effect node of causal edge $i$ when its cause node alarms, $M$ is the total number of alarms generated by the cause node, and $W_i$ is the weight of causal edge $i$.
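As a rough illustration of steps S62 and S63, the sketch below builds one causal discovery sample per alarm record and then computes an edge weight as a ratio of counts. Several points are assumptions made for the example: alarms are `(timestamp, vm, alarm_class)` tuples, "related" records are taken to be those on the same virtual machine within a symmetric time window, counting is done per sample, and the alarm class names are invented.

```python
def build_samples(alarms, window):
    """One sample per alarm record: alarm classes on the same VM within +/- window seconds."""
    samples = []
    for t0, vm0, _ in alarms:
        related = {cls for t, vm, cls in alarms
                   if vm == vm0 and abs(t - t0) <= window}
        samples.append(related)
    return samples

def edge_weight(samples, cause, effect):
    """W = m / M: samples with both cause and effect, over samples with the cause."""
    total = sum(1 for s in samples if cause in s)                   # M
    joint = sum(1 for s in samples if cause in s and effect in s)   # m_i
    return joint / total if total else 0.0

# Hypothetical alarm records: (timestamp in seconds, VM name, alarm class).
alarms = [
    (0, "vm1", "CPU_HIGH"),
    (20, "vm1", "APP_TIMEOUT"),
    (300, "vm1", "CPU_HIGH"),
    (0, "vm2", "CPU_HIGH"),
    (30, "vm2", "APP_TIMEOUT"),
    (900, "vm2", "CPU_HIGH"),
]
samples = build_samples(alarms, window=60)
weight = edge_weight(samples, "CPU_HIGH", "APP_TIMEOUT")   # 4 of 6 samples
```

Here four of the six samples that contain the cause class also contain the effect class, so the weight of the edge is 4/6.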
Step S7: real-time root cause analysis based on the operation and maintenance knowledge graph, which targets alarm data convergence and real-time root cause localization; the specific steps are as follows:
step S71: setting the granularity of the time slice, and acquiring alarm data in the time slice in real time;
step S72: alarm classification: for the raw alarm data, combining the specific alarm information, monitoring items and other information, classifying the raw alarm data along the HOST, VM and SOFTWARE dimensions using the trained classification model;
step S73: alarm convergence: querying the software and hardware knowledge graph to converge the alarms with the system as the unit;
step S74: constructing the alarm causal graph: based on the alarm convergence result, querying the graph database, system by system, for the connected subgraphs among all nodes under each system, and inputting the results into a network to obtain the final connection relationships among all nodes under a given system, namely the alarm causal graph;
step S75: root cause path characterization: computing the suspected paths based on the alarm causal graph generated in step S74 and the edge weights, and ranking them to give the root cause paths.
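Steps S73 and S75 can be sketched as follows. The text does not specify how suspected paths are scored, so scoring a path by the product of its causal edge weights is an assumption, as are the VM-to-system mapping, the alarm class names and both function names; a real deployment would obtain the mapping and the subgraph from the knowledge graph rather than from dictionaries.

```python
from collections import defaultdict

def converge_by_system(alarms, vm_to_system):
    """Group alarm classes by the system that owns each VM (alarm convergence)."""
    grouped = defaultdict(set)
    for vm, alarm_class in alarms:
        grouped[vm_to_system[vm]].add(alarm_class)
    return dict(grouped)

def rank_root_cause_paths(edges, alarmed):
    """Enumerate cause-to-effect paths among alarmed nodes, ranked by weight product."""
    children = defaultdict(list)
    has_parent = set()
    for (cause, effect), weight in edges.items():
        if cause in alarmed and effect in alarmed:
            children[cause].append((effect, weight))
            has_parent.add(effect)
    paths = []

    def walk(node, path, score):
        # Skip nodes already on the path so a cyclic input cannot loop forever.
        successors = [(n, w) for n, w in children.get(node, []) if n not in path]
        if not successors:
            paths.append((score, path))
            return
        for nxt, weight in successors:
            walk(nxt, path + [nxt], score * weight)

    # Candidate roots are alarmed nodes with no alarmed cause of their own.
    for root in sorted(alarmed):
        if root not in has_parent and root in children:
            walk(root, [root], 1.0)
    return sorted(paths, key=lambda item: item[0], reverse=True)

# Hypothetical inputs for one time slice.
alarms = [("vm1", "DISK_FULL"), ("vm2", "DB_SLOW"), ("vm2", "APP_TIMEOUT"),
          ("vm3", "CPU_HIGH"), ("vm4", "CPU_HIGH")]
vm_to_system = {"vm1": "billing", "vm2": "billing", "vm3": "billing", "vm4": "portal"}
edges = {("DISK_FULL", "DB_SLOW"): 0.9, ("DB_SLOW", "APP_TIMEOUT"): 0.8,
         ("CPU_HIGH", "APP_TIMEOUT"): 0.5}

systems = converge_by_system(alarms, vm_to_system)
ranked = rank_root_cause_paths(edges, systems["billing"])
```

With a product score, longer paths are penalized unless their edges are strong; other aggregations such as the minimum edge weight along the path would be equally consistent with the description.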
After steps S1 to S7 are completed, the root cause of the application system anomaly is determined from the located abnormal instance or service; the fault solution knowledge base is then called, and the solution corresponding to the root cause is output.
The above is only a preferred embodiment of the present invention, the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention; it should be noted that modifications and adaptations to the present invention may occur to one skilled in the art without departing from the principles of the present invention and are intended to be within the scope of the present invention.

Claims (10)

1. A real-time root cause analysis method based on an operation and maintenance knowledge graph is characterized in that: the method comprises the following steps:
step S1: establishing a knowledge graph application framework in the field of digital infrastructure, constructing an operation and maintenance model structure body, and organically combining a plurality of operation and maintenance objects;
step S2: constructing an operation and maintenance knowledge graph body, completing extraction by using a natural language processing and machine learning technology, and constructing a knowledge graph basic frame;
step S3: knowledge extraction: applying to the existing data a joint extraction algorithm that performs entity extraction and relation extraction simultaneously, so as to address the potential problems of error propagation, information redundancy and neglected connections between subtasks;
step S4: knowledge fusion, namely eliminating knowledge redundancy through entity alignment and attribute alignment, establishing an association relationship, performing iterative learning on the entities and the relationships, and automatically capturing deep knowledge features;
step S5: constructing an equipment knowledge graph, and constructing an updatable and maintainable equipment knowledge graph based on knowledge and data, so as to strengthen the relation processing capacity and root cause positioning capacity;
step S6: constructing a fault knowledge graph, and providing a backup support for subsequent fault root cause analysis;
step S7: real-time root cause analysis based on the operation and maintenance knowledge graph, targeting alarm data convergence and real-time root cause localization.
2. The method for analyzing the root cause in real time based on the operation and maintenance knowledge graph according to claim 1, wherein the method comprises the following steps: in the step S1, an application architecture of the knowledge graph in the digital infrastructure field includes a data layer, a core layer and an application layer; the data layer is responsible for gathering data from different data sources and performing deep analysis fusion, and the core layer constructs an ontology according to expert knowledge in the digital infrastructure diagnosis field and the requirements of knowledge graph application and determines entities and relationship types contained in the equipment knowledge graph and the fault knowledge graph.
3. The method for analyzing the root cause in real time based on the operation and maintenance knowledge graph according to claim 1, wherein the method comprises the following steps: in the step S2, the method for constructing the operation and maintenance knowledge graph body includes:
step S21: inputting various types of sentences, and analyzing the input sentences to obtain effective syntax information data;
step S22: extracting entity, relation and attribute content of the syntactic information data through natural language processing and machine learning technology;
step S23: manually screening and manually supervising the processed ontology, relationship and attribute contents, and adding control to the quality of data in a knowledge base;
step S24: after content selection and auditing and checking, the information is stored in a designated database.
4. The method for analyzing the root cause in real time based on the operation and maintenance knowledge graph according to claim 1, wherein the method comprises the following steps: in step S3, the entity extraction means specifically that the entity extraction module adopts a self-attention mechanism to represent the entity connection in the sentence, obtains the encoded feature vector, extracts the entity by using the full connection layer and the convolutional neural network, and forms the candidate entity set from the extracted entity.
5. The method for analyzing the root cause in real time based on the operation and maintenance knowledge graph according to claim 1, wherein the method comprises the following steps: in step S3, the relation extraction refers to that after the feature is extracted by the entity module, the extracted entity feature vector is used as input, the relation between the entities is predicted through the self-attention layer and the full-connection layer, the model training adopts a random sampling method to obtain training data, and Adam algorithm is used to optimize model parameters.
6. The method for analyzing the root cause in real time based on the operation and maintenance knowledge graph according to claim 1, wherein the method comprises the following steps: in the step S5, the data of the device knowledge graph includes configuration management database data, call chain data and physical device network connection data; the construction of the equipment knowledge graph starts with data, and specifically comprises the following steps:
step S51: constructing a relation map according to the configuration management database data, extracting key variables of the log, and then carrying out remote labeling and manual screening to semi-automatically generate assignment management data so as to obtain a software knowledge map;
step S52: constructing a knowledge graph according to call chains or physical equipment network connection data, clustering log information based on a theme by adopting a method similar to the step S51, then recognizing variables by using a Smith-Waterman algorithm on a text, extracting high-confidence variables, and generating a hardware knowledge graph;
step S53: merging the software knowledge graph obtained in step S51 and the hardware knowledge graph generated in step S52 using NetworkX, and storing the merged knowledge graph in a graph database to obtain the final equipment knowledge graph.
7. The method for analyzing the root cause in real time based on the operation and maintenance knowledge graph according to claim 1, wherein the method comprises the following steps: in the step S6, the specific steps of constructing the fault knowledge graph are as follows:
step S61: first segmenting the alarm information into words, then computing word vectors, and using the word vectors as input to train a model based on a convolutional neural network that classifies the alarm data;
step S62: taking all alarm classifications as causal nodes, taking each virtual machine alarm record as a center, giving an alarm time slice, and searching a relevant alarm record set in each virtual machine alarm time slice as a causal discovery sample;
step S63: calculating the causal edge weight, namely taking the ratio of the number of alarms generated by the effect node when the cause node alarms to the total number of alarms generated by the cause node as the weight of the causal edge:
$$W_i = \frac{m_i}{M}$$
where $m_i$ is the number of alarms generated by the effect node of causal edge $i$ when its cause node alarms, $M$ is the total number of alarms generated by the cause node, and $W_i$ is the weight of causal edge $i$.
8. The method for analyzing the root cause in real time based on the operation and maintenance knowledge graph according to claim 7, wherein the method comprises the following steps: in the step S61, the alarm data classification model comprises an input layer, convolution layers, a pooling layer and a fully connected layer;
each input data vector of the input layer can be pre-trained or obtained by training the current neural network model; the convolution layers are the core of the alarm data classification model, and the sizes of the three convolution layers are set to 2, 3 and 4 in turn to extract progressively richer feature information; the mathematical expression is as follows:
[Equation image QLYQS_5: convolution layer expression, not recoverable from the extracted text]
where the symbols, given only as images in the source (QLYQS_6 to QLYQS_8), denote respectively the weight connecting a convolution kernel input node to the i-th node of the output matrix, the value of a node within the convolution kernel, and the final result of the convolution layer.
9. The pooling layer lets the model focus on the most relevant features, reduces the size of the feature vectors and the number of parameters, and achieves dimensionality reduction; all output vectors of the fully connected layer are fed into a Softmax classifier, which completes the final classification task; 0 indicates that the signal is not disturbed and 1 indicates that the signal is disturbed.
10. The method for analyzing the root cause in real time based on the operation and maintenance knowledge graph according to claim 1, wherein the method comprises the following steps: in the step S7, the specific steps of the convergence and real-time root cause positioning of the alarm data are as follows:
step S71: setting the granularity of the time slice, and acquiring alarm data in the time slice in real time;
step S72: for the raw alarm data, combining the specific alarm information and monitoring items, classifying the raw alarm data along the HOST, VM and SOFTWARE dimensions using the trained classification model;
step S73: querying the software and hardware knowledge graph to converge the alarms with the system as the unit;
step S74: based on the alarm convergence result, querying the graph database, system by system, for the connected subgraphs among all nodes under each system to obtain the alarm causal graph among all nodes under a given system;
step S75: computing the suspected paths based on the alarm causal graph generated in step S74 and the weights, and ranking them to give the root cause paths.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310069681.6A CN116225760A (en) 2023-02-07 2023-02-07 Real-time root cause analysis method based on operation and maintenance knowledge graph

Publications (1)

Publication Number Publication Date
CN116225760A true CN116225760A (en) 2023-06-06

Family

ID=86572463

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310069681.6A Pending CN116225760A (en) 2023-02-07 2023-02-07 Real-time root cause analysis method based on operation and maintenance knowledge graph

Country Status (1)

Country Link
CN (1) CN116225760A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116611593A (en) * 2023-07-21 2023-08-18 蘑菇物联技术(深圳)有限公司 Method, device and medium for predicting failure of air compressor
CN116719665A (en) * 2023-08-11 2023-09-08 国家气象信息中心(中国气象局气象数据中心) Intelligent judging and identifying method for abnormal state of meteorological numerical mode
CN116719665B (en) * 2023-08-11 2023-11-28 国家气象信息中心(中国气象局气象数据中心) Intelligent judging and identifying method for abnormal state of meteorological numerical mode
CN117557244A (en) * 2023-09-27 2024-02-13 国网江苏省电力有限公司信息通信分公司 Electric power operation and maintenance warning system based on knowledge graph

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination