CN113254254A

CN113254254A - Root cause positioning method and device of system fault, storage medium and electronic device

Info

Publication number: CN113254254A
Application number: CN202110792399.1A
Authority: CN
Inventors: 弄庆鹏; 李忠良; 屠要峰; 周祥生; 高洪
Original assignee: Nanjing ZTE New Software Co Ltd
Current assignee: Nanjing ZTE New Software Co Ltd
Priority date: 2021-07-14
Filing date: 2021-07-14
Publication date: 2021-08-13
Anticipated expiration: 2041-07-14
Also published as: CN113254254B

Abstract

The embodiments of the application provide a root cause positioning method and device for system faults, a storage medium and an electronic device. The method comprises: constructing a fault root cause node positioning sample and a fault root cause alarm positioning sample; training the constructed network system fault root cause positioning model by using the fault root cause node positioning sample and the fault root cause alarm positioning sample to obtain a trained network system fault root cause positioning model; and predicting the root cause positioning of the current fault according to the trained network system fault root cause positioning model. This addresses the problems in the related technology that screening a large amount of alarm information to locate the alarm that caused a fault is time-consuming and labor-intensive for operation and maintenance personnel, that the network system therefore cannot be recovered quickly after a service interruption, and that the operation and maintenance pressure grows as the system becomes more complex. The time for system fault location and system recovery is greatly shortened, the operation and maintenance efficiency of the system is improved, the consumption of operation and maintenance resources is reduced, and the maintenance difficulty of the model is reduced.

Description

Root cause positioning method and device of system fault, storage medium and electronic device

Technical Field

The embodiment of the application relates to the field of communication, in particular to a root cause positioning method and device for system faults, a storage medium and an electronic device.

Background

In a complex network system, sites, systems, servers and application components usually call one another to deliver services, the system often contains thousands of module nodes, and a large number of logs is usually generated while the system runs. When a service node in the system fails, the failure propagates along the call links between the system nodes and produces a large amount of alarm log information, commonly called an alarm storm. The root cause alarm is therefore buried in massive alarm information, and screening that information to locate the alarm that caused the fault is very time-consuming and labor-intensive for operation and maintenance personnel. As a result, the network system cannot be recovered quickly after a service interruption, and the more complex the system, the higher its operation and maintenance pressure.

For the problems in the related technology that screening a large amount of alarm information to locate the alarm that caused a fault is time-consuming and labor-intensive for operation and maintenance personnel, that the service of a network system therefore cannot be recovered quickly after being interrupted, and that the more complex the system, the higher its operation and maintenance pressure, no solution has yet been proposed.

Disclosure of Invention

The embodiments of the application provide a root cause positioning method and device for a system fault, a storage medium and an electronic device, so as to solve the problems in the related technology that screening a large amount of alarm information to locate the alarm that caused a fault is time-consuming and labor-intensive for operation and maintenance personnel, that the network system therefore cannot be recovered quickly after a service interruption, and that the more complex the system, the higher its operation and maintenance pressure.

According to an embodiment of the present application, a method for root cause location of a system fault is provided, including:

constructing a fault root cause node positioning sample and a fault root cause alarm positioning sample;

training the constructed network system fault root cause positioning model by using the fault root cause node positioning sample and the fault root cause alarm positioning sample to obtain a trained network system fault root cause positioning model;

and predicting the root cause positioning of the current fault according to the trained network system fault root cause positioning model.

In an exemplary embodiment, predicting the root cause location of the current fault according to the trained network system fault root cause location model includes:

constructing a fault root cause node graph sample for the fault data of the currently faulty network system;

inputting the constructed fault root cause node graph sample into the trained network system fault root cause positioning model to obtain a fault root cause node prediction result of the current fault output by the trained network system fault root cause positioning model;

and determining a fault root cause alarm prediction result of the current fault according to the fault root cause node prediction result.

In an exemplary embodiment, determining the fault root cause alarm prediction result of the current fault according to the fault root cause node prediction result includes:

if the fault root cause node prediction result is that the root cause node only contains one alarm log, determining the current alarm log as the fault root cause alarm prediction result;

if the fault root cause node prediction result is that the root cause node comprises a plurality of alarm logs, splitting the plurality of alarm logs in the root cause node into single alarm logs, constructing a fault root cause alarm graph sample for each single alarm log respectively, and inputting the constructed fault root cause alarm graph samples into the system fault root cause positioning model to obtain the fault root cause alarm prediction result output by the system fault root cause positioning model.

In an exemplary embodiment, after predicting the root cause location of the current fault according to the trained network system fault root cause location model, the method further includes:

and sending the fault root cause node prediction result and the fault root cause alarm prediction result to a network system, wherein the network system is used for displaying the fault root cause node prediction result and the fault root cause alarm prediction result through a system fault interactive interface.

In an exemplary embodiment, constructing the fault root cause node location sample and the fault root cause alarm location sample comprises:

vectorizing the collected alarm log and Key Performance Indicator (KPI) to obtain an alarm log state vector matrix and a KPI state vector matrix;

fusing the alarm log state vector matrix and the KPI state vector matrix to obtain a system fault state hybrid representation vector matrix, and using the system fault state hybrid representation vector matrix as a fault state representation of a network system node;

constructing a fault graph sample according to the fault state characterization and the collected topology data;

and constructing a fault root cause node positioning sample and a fault root cause alarm positioning sample according to the fault graph sample.

In an exemplary embodiment, constructing the fault root cause node positioning sample and the fault root cause alarm positioning sample according to the fault graph sample comprises:

performing noise node cleaning and sample graph convergence on the fault graph sample to obtain a cleaned and converged graph sample;

constructing a fault root cause node positioning sample and a fault root cause alarm positioning sample based on the cleaned and converged graph sample;

and storing the fault root cause node positioning sample and the fault root cause alarm positioning sample into a sample pool.

In an exemplary embodiment, constructing the fault root cause node positioning sample and the fault root cause alarm positioning sample based on the cleaned and converged graph sample includes:

fusing the plurality of alarm log vectors of each node in the N-order cleaned and converged graph sample; assigning the system fault state hybrid characterization vector of each node to the corresponding node feature value in the N-order fault graph sample, and marking a root cause node label to form the fault root cause node positioning sample;

if the root cause node comprises a plurality of alarm logs, splitting the plurality of alarm logs in the root cause node into single alarm logs; and assigning the system fault state hybrid characterization vector of each node to the corresponding node feature value in the N-order fault graph sample, and marking a root cause alarm label on the root cause alarm to form the fault root cause alarm positioning sample.
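The two sample types described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the element-wise mean used as the fusion operation, the dictionary layout of a sample, and the function names `fuse_alarm_vectors`, `build_node_sample` and `build_alarm_samples` are all assumptions made for this example.

```python
from typing import Dict, List

Vector = List[float]


def fuse_alarm_vectors(alarm_vectors: List[Vector]) -> Vector:
    """Fuse the alarm vectors of one node (element-wise mean, an assumed choice)."""
    count = len(alarm_vectors)
    return [sum(column) / count for column in zip(*alarm_vectors)]


def build_node_sample(node_alarms: Dict[str, List[Vector]], root_node: str) -> dict:
    """Node positioning sample: one fused feature per node plus a 0/1 root cause node label."""
    features = {node: fuse_alarm_vectors(vecs) for node, vecs in node_alarms.items()}
    labels = {node: int(node == root_node) for node in node_alarms}
    return {"features": features, "labels": labels}


def build_alarm_samples(
    node_alarms: Dict[str, List[Vector]], root_node: str, root_alarm_index: int
) -> List[dict]:
    """Alarm positioning samples: the root cause node's alarms are split, one sample per alarm."""
    samples = []
    for index, alarm_vector in enumerate(node_alarms[root_node]):
        features = {node: fuse_alarm_vectors(vecs) for node, vecs in node_alarms.items()}
        # In this sample the root cause node carries only the single alarm under test.
        features[root_node] = list(alarm_vector)
        samples.append(
            {"features": features, "alarm_index": index, "label": int(index == root_alarm_index)}
        )
    return samples
```

Both kinds of sample share the same per-node feature layout, which is what allows them to be stored in one sample pool and fed to a single model.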

In an exemplary embodiment, performing noise node cleaning and sample graph convergence on the fault graph sample to obtain a cleaned and converged graph sample includes:

performing attribute labeling on whether each node in the fault graph sample has alarm logs according to the system fault state hybrid characterization vector;

cleaning the nodes without alarm logs in the fault graph sample;

and performing sample convergence on the cleaned fault graph sample to obtain the cleaned and converged graph sample.
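A minimal sketch of the labeling and cleaning steps above, assuming the fault graph sample is given as a list of directed edges and a per-node alarm count. The function name `clean_nodes_without_alarms` and both data shapes are hypothetical; the convergence step is not shown here.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def clean_nodes_without_alarms(
    edges: List[Edge], alarm_counts: Dict[str, int]
) -> Tuple[Set[str], List[Edge]]:
    """Label each node by whether it has alarm logs, then drop the nodes that have none."""
    # Attribute labeling: True when the node carries at least one alarm log.
    has_alarm = {node: count > 0 for node, count in alarm_counts.items()}
    kept_nodes = {node for node, flag in has_alarm.items() if flag}
    # An edge survives only if both of its endpoints survive.
    kept_edges = [(src, dst) for src, dst in edges if src in kept_nodes and dst in kept_nodes]
    return kept_nodes, kept_edges
```

For example, with edges `a -> b`, `b -> c`, `a -> c` and no alarm logs on `b`, only nodes `a` and `c` and the edge `a -> c` remain.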

In an exemplary embodiment, before vectorizing the collected alarm logs and KPIs to obtain an alarm log state vector matrix and a KPI state vector matrix, the method further includes:

and after receiving the topology data, the alarm log and the KPI data collected by the network system, carrying out root cause node and root cause alarm labeling to obtain a training sample.

According to another embodiment of the present application, there is also provided a root cause locating device for a system fault, including:

the construction module is used for constructing a fault root cause node positioning sample and a fault root cause alarm positioning sample;

the training module is used for training the constructed network system fault root cause positioning model by utilizing the fault root cause node positioning sample and the fault root cause alarm positioning sample to obtain a trained network system fault root cause positioning model;

and the prediction module is used for predicting the root cause positioning of the current fault according to the trained network system fault root cause positioning model.

In an exemplary embodiment, the prediction module comprises:

the first construction submodule is used for constructing a fault root cause node graph sample for the fault data of the currently faulty network system;

the input submodule is used for inputting the constructed fault root cause node graph sample into the trained network system fault root cause positioning model to obtain a fault root cause node prediction result of the current fault output by the trained network system fault root cause positioning model;

and the determining submodule is used for determining a fault root cause alarm prediction result of the current fault according to the fault root cause node prediction result.

In an exemplary embodiment, the determining sub-module includes:

the determining unit is used for determining the current alarm log as the fault root cause alarm prediction result if the fault root cause node prediction result is that the root cause node only comprises one alarm log;

and the input unit is used for, if the fault root cause node prediction result is that the root cause node comprises a plurality of alarm logs, splitting the plurality of alarm logs in the root cause node into single alarm logs, constructing a fault root cause alarm graph sample for each single alarm log respectively, and inputting the constructed fault root cause alarm graph samples into the system fault root cause positioning model to obtain the fault root cause alarm prediction result output by the system fault root cause positioning model.

In an exemplary embodiment, the apparatus further comprises:

and the sending module is used for sending the fault root cause node prediction result and the fault root cause alarm prediction result to a network system, wherein the network system is used for displaying the fault root cause node prediction result and the fault root cause alarm prediction result through a system fault interactive interface.

In an exemplary embodiment, the building module includes:

the vectorization submodule is used for vectorizing the collected alarm logs and KPIs to obtain an alarm log state vector matrix and a KPI state vector matrix;

the fusion submodule is used for fusing the alarm log state vector matrix and the KPI state vector matrix to obtain a system fault state hybrid characterization vector matrix, and taking the system fault state hybrid characterization vector matrix as the fault state characterization of a network system node;

the second construction submodule is used for constructing a fault graph sample according to the fault state characterization and the collected topology data;

and the third construction submodule is used for constructing a fault root cause node positioning sample and a fault root cause alarm positioning sample according to the fault graph sample.

In an exemplary embodiment, the third building submodule includes:

the cleaning convergence unit is used for performing noise node cleaning and sample graph convergence on the fault graph sample to obtain a cleaned and converged graph sample;

the construction unit is used for constructing a fault root cause node positioning sample and a fault root cause alarm positioning sample based on the cleaned and converged graph sample;

and the storage unit is used for storing the fault root cause node positioning sample and the fault root cause alarm positioning sample into a sample pool.

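In an exemplary embodiment, the building unit is further configured to:

fuse the plurality of alarm log vectors of each node in the N-order cleaned and converged graph sample; assign the system fault state hybrid characterization vector of each node to the corresponding node feature value in the N-order fault graph sample, and mark a root cause node label to form the fault root cause node positioning sample;

if the root cause node comprises a plurality of alarm logs, split the plurality of alarm logs in the root cause node into single alarm logs; and assign the system fault state hybrid characterization vector of each node to the corresponding node feature value in the N-order fault graph sample, and mark a root cause alarm label on the root cause alarm to form the fault root cause alarm positioning sample.

In an exemplary embodiment, the cleaning convergence unit is further configured to:

perform attribute labeling on whether each node in the fault graph sample has alarm logs according to the system fault state hybrid characterization vector;

clean the nodes without alarm logs in the fault graph sample;

and perform sample convergence on the cleaned fault graph sample to obtain the cleaned and converged graph sample.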

In an exemplary embodiment, the apparatus further comprises:

and the receiving module is used for receiving a training sample obtained by root cause node and root cause alarm labeling after the network system collects topology data, alarm logs and KPI data.

According to a further embodiment of the application, there is also provided a computer-readable storage medium, in which a computer program is stored, wherein the computer program is arranged to perform the steps of any of the above-mentioned method embodiments when executed.

According to yet another embodiment of the present application, there is also provided an electronic device, comprising a memory in which a computer program is stored and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.

According to the embodiments of the application, a fault root cause node positioning sample and a fault root cause alarm positioning sample are constructed; the constructed network system fault root cause positioning model is trained by using the fault root cause node positioning sample and the fault root cause alarm positioning sample to obtain a trained network system fault root cause positioning model; and the root cause positioning of the current fault is predicted according to the trained network system fault root cause positioning model. This solves the problems in the related technology that screening a large amount of alarm information to locate the alarm that caused a fault is time-consuming and labor-intensive for operation and maintenance personnel, that the network system therefore cannot be recovered quickly after a service interruption, and that the more complex the system, the higher its operation and maintenance pressure.

Drawings

FIG. 1 is a block diagram of the hardware structure of a mobile terminal for the root cause positioning method of a system fault according to an embodiment of the present application;

FIG. 2 is a flow chart of a method for root cause location of a system fault according to an embodiment of the present application;

FIG. 3 is a schematic diagram of an IT network system fault root cause location system according to the present embodiment;

FIG. 4 is a schematic diagram of system fault data collection module information collection according to the present embodiment;

FIG. 5 is a schematic diagram of system fault status quantification according to the present embodiment;

FIG. 6 is a schematic diagram of a system fault map sample node cleaning convergence module flow according to the present embodiment;

FIG. 7 is a schematic diagram of a fault root cause node location sample construction according to the present embodiment;

FIG. 8 is a schematic diagram of a fault root cause alarm location sample construction according to the present embodiment;

FIG. 9 is a flow diagram of a system fault root cause location model module according to the present embodiment;

FIG. 10 is a first system block diagram of IT network system fault root cause localization according to the present embodiment;

FIG. 11 is a second system block diagram of IT network system fault root cause localization according to the present embodiment;

FIG. 12 is a block diagram of a root cause locating device for a system fault according to the present embodiment.

Detailed Description

Embodiments of the present application will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.

It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

The method embodiments provided in the embodiments of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking a mobile terminal as an example, fig. 1 is a hardware structure block diagram of the mobile terminal of the root cause locating method of the system failure according to the embodiment of the present application, and as shown in fig. 1, the mobile terminal may include one or more processors 102 (only one is shown in fig. 1) (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA), and a memory 104 for storing data, where the mobile terminal may further include a transmission device 106 for a communication function and an input/output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration, and does not limit the structure of the mobile terminal. For example, the mobile terminal may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.

The memory 104 may be used to store a computer program, for example, a software program and a module of an application software, such as a computer program corresponding to the root cause location method of the system failure in the embodiment of the present application, and the processor 102 executes various functional applications and the service chain address pool slicing process by running the computer program stored in the memory 104, thereby implementing the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal over a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal. In one example, the transmission device 106 includes a network interface controller (NIC), which can be connected to other network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.

This embodiment provides a root cause positioning method for a system fault that runs on the mobile terminal or the network architecture described above and is applied to a terminal, where the terminal accesses a current master node (MN) cell and a current secondary node (SN) cell of a source area through Dual Connectivity (DC). Fig. 2 is a flowchart of the root cause positioning method for a system fault according to an embodiment of the present application. As shown in Fig. 2, the flow includes the following steps:

step S202, a fault root cause node positioning sample and a fault root cause alarm positioning sample are constructed;

step S204, training the constructed network system fault root cause positioning model by using the fault root cause node positioning sample and the fault root cause alarm positioning sample to obtain a trained network system fault root cause positioning model;

and step S206, predicting the root cause positioning of the current fault according to the trained network system fault root cause positioning model.

Through steps S202 to S206, the problems in the related technology that screening a large amount of alarm information to locate the alarm that caused a fault is very time-consuming and labor-intensive for operation and maintenance personnel, that the service of a network system therefore cannot be recovered quickly after being interrupted, and that the more complex the system, the higher its operation and maintenance pressure, can be solved.

In this embodiment, the step S206 may specifically include:

S2061, constructing a fault root cause node graph sample for the fault data of the currently faulty network system;

S2062, inputting the constructed fault root cause node graph sample into the trained network system fault root cause positioning model to obtain a fault root cause node prediction result of the current fault output by the trained network system fault root cause positioning model;

S2063, determining the fault root cause alarm prediction result of the current fault according to the fault root cause node prediction result.

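Further, the step S2063 may specifically include: if the fault root cause node prediction result is that the root cause node only contains one alarm log, determining the current alarm log as the fault root cause alarm prediction result; if the fault root cause node prediction result is that the root cause node comprises a plurality of alarm logs, splitting the plurality of alarm logs in the root cause node into single alarm logs, constructing a fault root cause alarm graph sample for each single alarm log respectively, and inputting the constructed fault root cause alarm graph samples into the system fault root cause positioning model to obtain the fault root cause alarm prediction result output by the system fault root cause positioning model.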

In an optional embodiment, after the root cause positioning of the current fault is predicted according to the trained network system fault root cause positioning model, the fault root cause node prediction result and the fault root cause alarm prediction result are sent to a network system, where the network system is configured to display the fault root cause node prediction result and the fault root cause alarm prediction result through a system fault interaction interface.

In this embodiment, the step S202 may specifically include:

S2021, vectorizing the collected alarm logs and KPIs to obtain an alarm log state vector matrix and a KPI state vector matrix;

S2022, fusing the alarm log state vector matrix and the KPI state vector matrix to obtain a system fault state hybrid characterization vector matrix, and using the system fault state hybrid characterization vector matrix as the fault state characterization of a network system node;

S2023, constructing a fault graph sample according to the fault state characterization and the collected topology data;

S2024, constructing a fault root cause node positioning sample and a fault root cause alarm positioning sample according to the fault graph sample.

Further, the S2024 may specifically include:

S1, performing noise node cleaning and sample graph convergence on the fault graph sample to obtain a cleaned and converged graph sample. Further, S1 may specifically include: performing attribute labeling on whether each node in the fault graph sample has alarm logs according to the system fault state hybrid characterization vector; cleaning the nodes without alarm logs in the fault graph sample; and performing sample convergence on the cleaned fault graph sample to obtain the cleaned and converged graph sample;

S2, constructing a fault root cause node positioning sample and a fault root cause alarm positioning sample based on the cleaned and converged graph sample. Further, S2 may specifically include: fusing the plurality of alarm log vectors of each node in the N-order cleaned and converged graph sample; assigning the system fault state hybrid characterization vector of each node to the corresponding node feature value in the N-order fault graph sample, and marking a root cause node label to form the fault root cause node positioning sample; if the root cause node comprises a plurality of alarm logs, splitting the plurality of alarm logs in the root cause node into single alarm logs; and assigning the system fault state hybrid characterization vector of each node to the corresponding node feature value in the N-order fault graph sample, and marking a root cause alarm label on the root cause alarm to form the fault root cause alarm positioning sample;

and S3, storing the fault root cause node positioning sample and the fault root cause alarm positioning sample into a sample pool.

In another optional embodiment, before the collected alarm logs and KPIs are vectorized to obtain an alarm log state vector matrix and a KPI state vector matrix, a training sample is received. The training sample is obtained as follows: the network system collects topology data, alarm logs and KPI data, labels the root cause node and the root cause alarm in the topology data, alarm logs and KPI data to obtain the training sample, and reports the training sample.

In this embodiment, constructing the network system fault state hybrid characterization and the fault alarm graph sample includes:

collecting network fault sample information, including but not limited to system fault logs and KPI information, system topology information, and fault root cause node and root cause alarm label information.

Firstly, the fault state characterization of the network system nodes is constructed. System fault log information (including but not limited to error log information and alarm log information) is obtained. Alarms that occur in non-root-cause nodes of the current sample but also belong to the root cause alarm set are cleaned accordingly. The fault logs in each node of the network system are vectorized to obtain the state vector matrix of the fault log information of the network system nodes. The recorded values within a time slice of the system KPI information are obtained, where the KPI information may be, but is not limited to, index information such as Central Processing Unit (CPU) utilization, memory utilization, input/output rate and network traffic; the feature vector of each KPI of a network system node is then computed to obtain the KPI state vector matrix of the network system nodes. The system fault log state vector matrix and the KPI state vector matrix are fused to obtain the system fault state hybrid characterization vector matrix, and a pooling operation is performed on the fault state hybrid characterization vector matrix along the same dimension to obtain a fault state hybrid characterization vector, which is used as the fault state characterization of the network system node.
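The characterization steps above can be sketched for a single node as follows. The text does not fix the vectorization, fusion or pooling operations, so everything concrete here is an assumption for illustration: a bag-of-words count over a hypothetical vocabulary `VOCAB` for log vectorization, mean, maximum and population standard deviation as KPI features, row-wise concatenation as fusion, and per-dimension max pooling.

```python
import statistics
from typing import Dict, List

# Hypothetical vocabulary; a real system would derive it from its own logs.
VOCAB = ["timeout", "refused", "disk", "memory"]


def vectorize_log(text: str) -> List[float]:
    """Bag-of-words count vector of one fault log over VOCAB."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in VOCAB]


def kpi_features(series: List[float]) -> List[float]:
    """Summarize one KPI time slice as (mean, max, population standard deviation)."""
    return [statistics.fmean(series), max(series), statistics.pstdev(series)]


def node_state_vector(logs: List[str], kpis: Dict[str, List[float]]) -> List[float]:
    """Fuse log vectors with KPI features, then pool the matrix into one vector."""
    kpi_vector = [x for name in sorted(kpis) for x in kpi_features(kpis[name])]
    # Hybrid matrix: one row per fault log, each row = log vector + KPI vector.
    matrix = [vectorize_log(text) + kpi_vector for text in logs]
    # Max pooling along the row axis keeps one value per feature dimension.
    return [max(column) for column in zip(*matrix)]


state = node_state_vector(
    ["connection timeout timeout", "disk full"], {"cpu": [0.5, 0.5]}
)
```

With the inputs shown, `state` has seven dimensions: four log dimensions followed by three KPI dimensions.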

Secondly, the topological graph of the network system is constructed. Constructing topology nodes: a node is the minimum granularity at which the fault root cause of the network system is located, and may be, but is not limited to, a server, an application service, or a component. Constructing topology edges: an edge is a relationship with a directional attribute in the network system, and may represent, but is not limited to, a relationship between nodes such as a service call or a data flow. The topology nodes and edges are abstracted into a specified data structure, such as a dictionary. The basic topological graph 0 of the network system is constructed according to the system topology data.
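As an illustration of abstracting nodes and directed edges into a dictionary, the following sketch stores, for each node, its kind and its outgoing and incoming neighbours. The layout, the function name `build_topology` and the example node names are assumptions, not a structure specified in the text.

```python
from typing import Dict, List, Tuple


def build_topology(
    nodes: List[Tuple[str, str]], calls: List[Tuple[str, str]]
) -> Dict[str, dict]:
    """Build a dictionary-based directed topology graph from nodes and call edges."""
    graph = {name: {"kind": kind, "out": [], "in": []} for name, kind in nodes}
    for src, dst in calls:
        graph[src]["out"].append(dst)  # src calls (or sends data to) dst
        graph[dst]["in"].append(src)
    return graph


topology = build_topology(
    [("gateway", "application service"), ("orders", "application service"), ("db-1", "server")],
    [("gateway", "orders"), ("orders", "db-1")],
)
```

Keeping both directions per node makes it cheap to follow a failure upstream as well as downstream along the call links.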

Finally, the fault topology sample of the network system is constructed. A system fault graph sample is constructed according to the basic topological graph 0 of the network system and the fault state hybrid characterization vector matrix, giving the system fault original graph sample 1. The nodes without fault information in the system fault original graph sample 1 are cleaned to obtain the system fault original graph sample 2. The N-order subgraph of each fault root cause node in the system fault original graph sample 2 is then obtained; this subgraph serves as the fault graph sample, namely the N-order fault graph sample of the corresponding system fault.
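One way to read "N-order subgraph" is the set of nodes within N hops of the root cause node, together with the edges among them. The sketch below follows that reading with a breadth-first search and ignores edge direction when measuring hops; both the interpretation and the function name `n_order_subgraph` are assumptions.

```python
from collections import deque
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]


def n_order_subgraph(edges: List[Edge], root: str, n: int) -> Tuple[Set[str], List[Edge]]:
    """Return the nodes within n hops of root and the original edges among them."""
    neighbours: Dict[str, Set[str]] = {}
    for src, dst in edges:
        neighbours.setdefault(src, set()).add(dst)
        neighbours.setdefault(dst, set()).add(src)

    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if depth[node] == n:
            continue  # do not expand beyond n hops
        for nxt in neighbours.get(node, ()):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)

    kept = set(depth)
    return kept, [(src, dst) for src, dst in edges if src in kept and dst in kept]
```

For the chain `a -> b -> c -> d` with root `b` and `n = 1`, the result keeps nodes `a`, `b`, `c` and the edges `a -> b` and `b -> c`.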

Based on the fault alarm graph sample constructed above, positioning the root cause node and the root cause alarm for multi-alarm faults of the network system (where one node carries multiple alarms) includes:

Training the network system fault root cause positioning model. Specifically, a fault root cause node positioning graph sample is constructed: the alarm log vectors of each node in the N-order graph sample are fused, and a root cause node label is marked to form the fault root cause node positioning sample. A fault root cause alarm positioning graph sample is constructed: if the root cause node contains only one alarm, the fault root cause alarm positioning graph sample of the current sample is constructed directly; otherwise, the plurality of alarms in the root cause node is split into single alarms, each alarm corresponds to one fault root cause alarm positioning graph sample, and a root cause alarm label is marked on the root cause alarm to form the fault root cause alarm positioning sample. The fault root cause node positioning samples and the fault root cause alarm positioning samples are stored together in the same sample pool. A network system fault root cause positioning model based on a graph neural network is established, and the graph neural network model is trained with the root cause node positioning graph samples and the root cause alarm positioning graph samples as model input and with the label indicating whether a node is the root cause node as output.
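The text does not specify the graph neural network architecture. As a rough illustration of how such a model maps a graph sample to a per-node root cause probability, the sketch below performs one mean-aggregation message-passing step followed by a linear scoring layer and a sigmoid. It is a forward pass only: the `weights` values are hypothetical placeholders that training against the root cause node labels would normally learn, and the function names are assumptions.

```python
import math
from typing import Dict, List, Tuple

Vector = List[float]
Edge = Tuple[str, str]


def mean_aggregate(features: Dict[str, Vector], edges: List[Edge]) -> Dict[str, Vector]:
    """One message-passing step: average each node's vector with its neighbours' vectors."""
    groups = {node: [node] for node in features}  # every node includes itself
    for src, dst in edges:
        groups[src].append(dst)
        groups[dst].append(src)
    hidden = {}
    for node, members in groups.items():
        vectors = [features[m] for m in members]
        hidden[node] = [sum(column) / len(vectors) for column in zip(*vectors)]
    return hidden


def root_cause_probabilities(
    features: Dict[str, Vector], edges: List[Edge], weights: Vector, bias: float = 0.0
) -> Dict[str, float]:
    """Score every node as a potential root cause node (sigmoid of a linear layer)."""
    hidden = mean_aggregate(features, edges)
    scores = {
        node: sum(w * x for w, x in zip(weights, vec)) + bias for node, vec in hidden.items()
    }
    return {node: 1.0 / (1.0 + math.exp(-score)) for node, score in scores.items()}


probs = root_cause_probabilities(
    {"a": [2.0, 0.0], "b": [0.0, 0.0], "c": [0.0, 2.0]},
    [("a", "b"), ("b", "c")],
    weights=[1.0, -1.0],  # placeholder values, not trained parameters
)
```

In practice a graph learning library with trainable layers would replace this hand-written pass; the point of the sketch is only the input and output shape, a feature vector per node in and a root cause probability per node out.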

Specifically, when a system fails, the system topology data, fault alarm logs and KPI information within a specified time slice are collected and vectorized to obtain the system fault original graph sample 1, and node cleaning and convergence are then performed on the system fault original graph sample 1. The multi-alarm vectors of each node in the system fault original graph sample 1 are fused to generate a fault root cause node positioning sample, which is input into the fault root cause positioning model to obtain the predicted fault root cause node. After the fault root cause node is located, if the predicted root cause node has multiple alarms, the alarms are split, a fault root cause alarm positioning graph sample is constructed for each alarm, and the samples are input into the fault root cause positioning model to obtain the predicted fault root cause alarms; the alarm with the maximum root cause probability is the fault root cause alarm.
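The two-stage decision described above can be sketched as follows, assuming the node probabilities have already been produced by the model. The callable `alarm_scorer` is a hypothetical stand-in for running the model on the alarm positioning graph sample built for one split alarm; the function name `locate_root_cause` is likewise an assumption.

```python
from typing import Callable, Dict, List, Tuple


def locate_root_cause(
    node_probs: Dict[str, float],
    node_alarms: Dict[str, List[str]],
    alarm_scorer: Callable[[str, str], float],
) -> Tuple[str, str]:
    """Pick the most probable root cause node, then the most probable alarm inside it."""
    root_node = max(node_probs, key=node_probs.get)
    alarms = node_alarms[root_node]
    if len(alarms) == 1:
        # A single alarm in the root cause node is the root cause alarm directly.
        return root_node, alarms[0]
    # Otherwise score each split alarm and keep the one with the highest probability.
    root_alarm = max(alarms, key=lambda alarm: alarm_scorer(root_node, alarm))
    return root_node, root_alarm
```

Skipping the second model call when the node holds only one alarm mirrors the single-alarm branch of the method.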

Fig. 3 is a schematic diagram of an IT network system fault root cause positioning system according to the present embodiment. As shown in Fig. 3, the system includes:

a system fault data acquisition module, used for collecting the topology data, alarm logs and KPI data of a network system and for labeling the root cause nodes and root cause alarms of training samples;

a system fault state vectorization module, used for vectorizing the fault alarm log text and the KPIs;

a system fault graph sample node cleaning and convergence module, used for constructing a fault graph sample, cleaning noise nodes and converging the sample graph;

a system fault root cause positioning sample construction module, used for constructing fault root cause node positioning samples and fault root cause alarm positioning samples;

a system fault root cause positioning model module, used for constructing and optimizing a network system fault root cause positioning model based on a graph neural network and for predicting the fault root cause location.

After a network system fault occurs, the system fault data acquisition module is triggered automatically or manually to collect the topology data, alarm logs and KPI data of the network system, label the root cause nodes and root cause alarms of training samples, and upload the data to the network system fault root cause positioning system. The system fault state vectorization module vectorizes the uploaded network system fault alarm logs and KPIs, fuses the system fault log state vector matrix with the KPI state vector matrix, and obtains a system fault state hybrid characterization vector matrix as the fault state characterization of the network system nodes. Using the fault state characterization and the network topology data, the system fault graph sample node cleaning and convergence module constructs a fault graph sample, cleans noise nodes and converges the sample graph. Based on the cleaned and converged graph sample, the system fault root cause positioning sample construction module constructs the fault root cause node positioning sample and the fault root cause alarm positioning sample and stores them in the sample pool. The system fault root cause positioning model module constructs the network system fault root cause positioning model and either trains it with the samples in the sample pool, or predicts the root cause location of the current fault and feeds back the result.

This embodiment combines the topology information among modules or nodes in an IT system with the alarm information the IT system produces during operation, such as Error or Warning logs and KPIs (Key Performance Indicators), to form a multi-modal characterization of the network system fault state. Using a graph neural network and labeled system fault samples, it builds a regularized end-to-end IT system fault root cause positioning model with knowledge fusion and reasoning. The same model automatically, quickly and accurately locates multi-level root causes (at two levels: root cause nodes and root cause alarms) in a fault state during operation of the IT system. This greatly shortens the time needed for IT system fault location and system recovery, improves the operation and maintenance efficiency of the IT system, reduces the consumption of operation and maintenance resources, and greatly improves user experience; end-to-end training and prediction also reduce the maintenance difficulty of the model.

Fig. 4 is a schematic diagram of information acquisition by the system fault data acquisition module according to this embodiment. As shown in Fig. 4, after being triggered automatically or manually by a system fault, the system fault data acquisition module collects the topology data, alarm logs and KPI data of the network system and labels the root cause nodes and root cause alarms of the sample. The collected alarm log information includes, but is not limited to: the alarm log text, the alarm time, the node to which the alarm belongs and, for a training sample, a label indicating whether the alarm is a root cause alarm. The KPI data information includes, but is not limited to: CPU utilization, memory utilization, node data packet throughput, and the node to which each KPI belongs. The topology data includes the node IDs and the calling relations between nodes. Root cause nodes and root cause alarms are labeled 1; non-root-cause nodes and non-root-cause alarms are labeled 0.

The system fault state vectorization module is responsible for vectorizing the uploaded network system fault alarm logs and KPIs, and for fusing the system fault log state vector matrix with the KPI state vector matrix to obtain a system fault state hybrid characterization vector matrix as the fault state characterization of the network system nodes. Fig. 5 is a schematic diagram of system fault state vectorization according to this embodiment. As shown in Fig. 5, the module first performs word segmentation on the alarm logs to vectorize each alarm log, and vectorizes the KPI information of each node; it then fuses the log vectors and the KPI vectors (the fusion operation may be vector splicing, that is, concatenation, or fusion through a machine learning model) to obtain the node fault state hybrid characterization vector.
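The vectorization and fusion steps can be sketched as follows. This is a minimal illustration rather than the embodiment's implementation: the toy word-vector table `WORD_VECTORS`, the function names and the KPI values are all assumptions made for the example. Summing word vectors dimension by dimension follows the detailed embodiment described later, and fusion is done here by simple concatenation (vector splicing), one of the two options the text allows.

```python
from typing import Dict, List

# Hypothetical toy word-vector table; a real system would train a
# word-vector model on the cleaned alarm logs instead.
WORD_VECTORS: Dict[str, List[float]] = {
    "connection": [1.0, 0.0, 0.0],
    "timeout": [0.0, 1.0, 0.0],
    "error": [0.0, 0.0, 1.0],
}
DIM = 3


def vectorize_log(text: str) -> List[float]:
    """Split a log line into words and sum their vectors per dimension."""
    vec = [0.0] * DIM
    for token in text.lower().split():
        word_vec = WORD_VECTORS.get(token)
        if word_vec is None:
            continue  # unknown words contribute nothing in this sketch
        vec = [a + b for a, b in zip(vec, word_vec)]
    return vec


def fuse(log_vec: List[float], kpi_vec: List[float]) -> List[float]:
    """Fuse the log vector and the KPI vector by concatenation."""
    return list(log_vec) + list(kpi_vec)


log_vec = vectorize_log("ERROR connection timeout")
kpi_vec = [0.93, 0.71]  # made-up CPU and memory utilization values
node_state = fuse(log_vec, kpi_vec)
```

Concatenation keeps the log and KPI dimensions separate, which is the simplest choice; a learned fusion model would replace `fuse` without changing the rest of the pipeline.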

The system fault graph sample node cleaning and convergence module is used for constructing a fault graph sample, cleaning noise nodes and converging the sample graph. When a network system fault occurs, a node without an alarm log is a normal node and therefore is not a root cause node, so such nodes need to be cleaned. This alleviates the imbalance between the numbers of fault nodes and normal nodes in the network system and improves the root cause positioning accuracy of the model. Fig. 6 is a schematic flow diagram of the system fault graph sample node cleaning and convergence module according to this embodiment. As shown in Fig. 6, the module first constructs the system topology graph from the network topology data, and then marks each node with an attribute indicating whether it has alarm logs, according to the node fault state hybrid characterization information. In the figure, green nodes have no alarm log, yellow nodes have alarm logs, and red nodes are fault root cause nodes. The green nodes without alarm logs are then cleaned; the graph after cleaning is shown on the right of the figure.
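A minimal sketch of the cleaning step is given below, assuming the topology is held as a dictionary that maps each node to the nodes it calls. The function name `clean_graph` and the sample data are hypothetical. The sketch also drops nodes that are left isolated after cleaning, which the detailed embodiment mentions as a separate cleaning target.

```python
from typing import Dict, List, Set


def clean_graph(topology: Dict[str, List[str]],
                alarm_nodes: Set[str]) -> Dict[str, List[str]]:
    """Remove nodes without alarm logs, then remove isolated nodes."""
    # Step 1: keep only alarm nodes and the edges between alarm nodes.
    kept = {
        src: [dst for dst in dsts if dst in alarm_nodes]
        for src, dsts in topology.items()
        if src in alarm_nodes
    }
    # Step 2: a node survives only if it still touches at least one edge.
    connected: Set[str] = set()
    for src, dsts in kept.items():
        if dsts:
            connected.add(src)
            connected.update(dsts)
    return {node: kept.get(node, []) for node in sorted(connected)}


topology = {"n1": ["n2", "n3"], "n2": ["n4"], "n3": [], "n5": ["n1"]}
alarm_nodes = {"n1", "n2", "n4", "n6"}  # n6 has alarms but no edges
cleaned = clean_graph(topology, alarm_nodes)
```

Here `n3` and `n5` are removed because they have no alarm logs, and `n6` is removed because it is isolated.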

Fig. 7 is a schematic diagram of fault root cause node positioning sample construction according to this embodiment. As shown in Fig. 7, the fault root cause node positioning graph sample is constructed first: the alarm log vectors of each node in the N-order graph sample are fused (if a node has multiple alarms they are fused; otherwise no operation is required), the fault state hybrid characterization vector of each node is then assigned as the feature value of the corresponding node in the N-order fault graph sample, and root cause node labels are applied, forming the fault root cause node positioning sample.

Fig. 8 is a schematic diagram of fault root cause alarm positioning sample construction according to the present embodiment. As shown in Fig. 8, the fault root cause alarm positioning graph samples are constructed next. If the root cause node contains only one alarm, no fault root cause alarm positioning graph sample needs to be constructed for the current sample. Otherwise, the multiple alarms in the root cause node are split into single alarms, each corresponding to one fault root cause alarm positioning graph sample; the fault state hybrid characterization vector of each node is then assigned as the feature value of the corresponding node in the N-order fault graph sample, and a root cause alarm label is applied to the root cause alarm, forming the fault root cause alarm positioning sample.
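The splitting rule can be sketched as follows, under one reading of the text: each alarm of the root cause node yields a copy of the graph's node features in which the root cause node carries only that alarm's vector, with label 1 if that alarm is the root cause alarm and 0 otherwise. The function name `build_alarm_samples` and its parameters are hypothetical, introduced only for illustration.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Sample = Tuple[Dict[str, Vector], int]


def build_alarm_samples(node_features: Dict[str, Vector],
                        root_node: str,
                        root_alarms: List[Vector],
                        root_alarm_index: int) -> List[Sample]:
    """Return one (features, label) sample per alarm of the root cause node."""
    if len(root_alarms) <= 1:
        return []  # a single alarm needs no alarm positioning sample
    samples: List[Sample] = []
    for i, alarm_vec in enumerate(root_alarms):
        features = dict(node_features)   # copy, other nodes unchanged
        features[root_node] = alarm_vec  # root node carries one alarm only
        label = 1 if i == root_alarm_index else 0
        samples.append((features, label))
    return samples


node_features = {"n1": [1.0, 2.0], "n2": [0.5, 0.5]}
alarms_of_n1 = [[1.0, 0.0], [0.0, 2.0]]
samples = build_alarm_samples(node_features, "n1", alarms_of_n1, 1)
```

The second alarm of `n1` is assumed to be the root cause alarm, so only the second sample is labeled 1.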

Finally, the fault root cause node positioning samples and the fault root cause alarm positioning samples are stored together in the same sample pool.

The system fault root cause positioning model module is used for constructing, optimizing and training a network system fault root cause positioning model based on a graph neural network, and for predicting the fault root cause location. Fig. 9 is a flowchart of the system fault root cause positioning model module according to the present embodiment. As shown in Fig. 9, the training process is as follows: first, a system fault root cause positioning model is established; then the samples in the sample pool are loaded to train the model; finally, the learned model is exported and stored. The prediction process is as follows: the trained network system fault root cause positioning model is loaded; a fault root cause node graph sample is constructed from the network system fault data and input into the model to predict the fault root cause node; if the predicted root cause node has only one alarm log, that alarm log is the fault root cause alarm; otherwise, the multiple alarm logs in the root cause node are split, a fault root cause alarm graph sample is constructed for each, and the constructed samples are input into the model to predict the fault root cause alarm; finally, the root cause node and root cause alarm prediction results are fed back.

The present embodiment will be described in detail below with reference to the accompanying drawings.

Fig. 10 is a first system block diagram of IT network system fault root cause positioning according to the present embodiment. As shown in Fig. 10, the system framework includes an IT network system 1001 and a network system fault root cause positioning server 1004, where the IT network system 1001 includes a system fault data acquisition module 1002 and a system fault interactive interface 1003; the network system fault root cause positioning server 1004 is responsible for running the network system fault root cause positioning device 1005.

The system fault data acquisition module 1002 is responsible for acquiring and uploading fault alarm logs and network topology data of the IT network system 1001.

The system fault interactive interface 1003 is responsible for triggering system fault location and analyzing and displaying a fault location result.

The network system fault root cause positioning device 1005 is responsible for analyzing and cleaning the uploaded system fault information, constructing a sample, training a model, positioning and predicting the fault root cause, and feeding back a fault positioning result.

System fault data acquisition includes the following. The system fault interactive interface 1003 triggers the system fault data acquisition module 1002 to collect the alarm logs and system topology data within a specified time segment (for example, 10 minutes before and after the fault occurs) and to complete the root cause node and root cause alarm labeling. The collected alarm log information includes, but is not limited to, alarm logs at the ERROR, WARNING, FATAL and other levels; the alarm logs of each node are exported to a single text file named nodeID_log (the node ID followed by "_log"). The system topology data includes the node IDs and the service calling relations between system nodes, and is stored in a dictionary data structure. For example, if node 1 calls services of node 2 and node 3, the topology contains edges pointing from node 1 to node 2 and to node 3, and the dictionary represents this as {node 1: [node 2, node 3]}, where the key is the calling node and the values are the called nodes. The fault labeling information includes the root cause node and root cause alarm information and is stored in a text file with a designated name. The alarm log files of all nodes of the system, the system topology data file and the fault labeling file are then packaged and uploaded to the network system fault root cause positioning device 1005.
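The dictionary representation of the topology can be illustrated with a short sketch. The node names and the helper functions `to_edges` and `all_nodes` are assumptions made for this example; only the key-to-list-of-values layout comes from the description.

```python
from typing import Dict, List, Tuple

# Node 1 calls services of node 2 and node 3:
# the key is the calling node, the values are the called nodes.
topology: Dict[str, List[str]] = {"node1": ["node2", "node3"]}


def to_edges(topo: Dict[str, List[str]]) -> List[Tuple[str, str]]:
    """Expand the dictionary into directed (caller, callee) edges."""
    return [(src, dst) for src, dsts in topo.items() for dst in dsts]


def all_nodes(topo: Dict[str, List[str]]) -> List[str]:
    """Collect every node that appears as a caller or a callee."""
    nodes = set(topo)
    for dsts in topo.values():
        nodes.update(dsts)
    return sorted(nodes)


edges = to_edges(topology)
nodes = all_nodes(topology)
```

Nodes that only receive calls never appear as keys, so `all_nodes` also scans the value lists.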

System fault graph sample node cleaning and convergence includes the following. The network system fault root cause positioning device 1005 parses the uploaded system fault information data packet and extracts the alarm logs of each node, the system topology data and the fault labeling information. For the node alarm logs, useless fields are first cleaned from the alarm logs of each node to converge them; word segmentation and word vectorization are then performed on the converged alarm logs to generate a word vector model, and the IDs of the nodes that have alarm logs are extracted. Each alarm log of a node is word-vectorized with the generated word vector model to obtain an alarm log embedding matrix, and the embedding matrix is then summed along each dimension to obtain a characterization vector for each log, which is the node fault state characterization of the network system. For the fault labeling data, the IDs of the fault root cause node and the root cause alarm are extracted. A base system topology graph 0 is created from the system topology data using a graph tool. The generated node fault state characterization is assigned to each node of the base system topology graph 0 to generate an original fault graph sample. Root cause node and root cause alarm labels are applied to the generated original graph sample according to the obtained fault root cause node ID and root cause alarm label information, marking which node is the root cause node and which of its alarms is the root cause alarm; alarm labels are applied to the corresponding nodes of the original graph sample according to the alarm node IDs, marking which nodes are alarm nodes, thereby generating original fault graph sample 1. The non-alarm nodes and isolated nodes in original fault graph sample 1 are cleaned to generate original fault graph sample 2. The 3-order subgraph of the fault root cause node in original fault graph sample 2 is then extracted for further graph convergence, and the resulting subgraph is taken as the fault graph sample of the corresponding fault, namely the 3-order fault graph sample.
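Extracting an N-order subgraph around a node can be sketched with a breadth-first search. The description does not state whether edge direction is followed when counting hops, so this sketch assumes hops are counted over undirected adjacency; the function name `k_order_subgraph` and the sample chain are hypothetical.

```python
from collections import deque
from typing import Dict, List, Set


def k_order_subgraph(topology: Dict[str, List[str]],
                     center: str, k: int) -> Dict[str, List[str]]:
    """Keep the nodes within k hops of `center` and the edges among them."""
    # Build undirected adjacency so callers and callees both count as neighbors.
    adj: Dict[str, Set[str]] = {}
    for src, dsts in topology.items():
        adj.setdefault(src, set())
        for dst in dsts:
            adj[src].add(dst)
            adj.setdefault(dst, set()).add(src)
    # Breadth-first search, stopping expansion at depth k.
    depth = {center: 0}
    queue = deque([center])
    while queue:
        node = queue.popleft()
        if depth[node] == k:
            continue
        for neighbor in adj.get(node, ()):
            if neighbor not in depth:
                depth[neighbor] = depth[node] + 1
                queue.append(neighbor)
    keep = set(depth)
    return {
        node: [dst for dst in topology.get(node, []) if dst in keep]
        for node in sorted(keep)
    }


chain = {"a": ["b"], "b": ["c"], "c": ["d"], "d": ["e"], "e": ["f"]}
sub3 = k_order_subgraph(chain, "a", 3)
```

With `k=3` and center `a`, nodes `e` and `f` lie more than three hops away and are dropped together with the edge from `d` to `e`.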

Constructing the system fault root cause positioning training graph samples includes the following. Constructing the fault root cause node positioning training graph sample: for each node in the 3-order fault graph sample in turn, all alarm log vectors of the node are summed and the result is assigned as the feature value of that node; root cause node labels are then applied to generate the fault root cause node positioning graph sample.
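The per-node summation and labeling can be sketched as follows. The function names and sample vectors are assumptions made for illustration; labels follow the 1 and 0 convention stated earlier in the description.

```python
from typing import Dict, List, Tuple

Vector = List[float]


def sum_vectors(vectors: List[Vector]) -> Vector:
    """Sum a list of equal-length vectors dimension by dimension."""
    return [sum(column) for column in zip(*vectors)]


def build_node_sample(alarms_by_node: Dict[str, List[Vector]],
                      root_node: str) -> Tuple[Dict[str, Vector], Dict[str, int]]:
    """Return node features (summed alarm vectors) and root cause labels."""
    features = {node: sum_vectors(vecs) for node, vecs in alarms_by_node.items()}
    labels = {node: 1 if node == root_node else 0 for node in features}
    return features, labels


alarms_by_node = {
    "n1": [[1.0, 0.0], [0.5, 2.0]],  # two alarms on n1
    "n2": [[0.0, 1.0]],              # one alarm on n2
}
features, labels = build_node_sample(alarms_by_node, "n1")
```

A prediction sample would be built the same way but without the `labels` dictionary.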

Constructing the fault root cause alarm positioning training samples: if the root cause node contains only one alarm, no additional root cause alarm positioning graph sample needs to be constructed for the current fault. If the root cause node has multiple alarms, they are split into single alarms, each corresponding to one root cause alarm positioning graph sample of the current fault, and the node is labeled according to whether that alarm is the root cause alarm, generating the fault root cause alarm positioning graph samples. The fault root cause node positioning graph samples and the fault root cause alarm positioning graph samples are stored in the same sample pool.

System fault root cause positioning model training includes the following. A network system fault root cause positioning model based on a graph neural network is established, the graph samples in the sample pool are loaded to complete training, and the final model is exported and stored. The system fault root cause positioning model is then brought online to provide the system fault root cause positioning service.

Prediction with the system fault root cause positioning model includes the following. Alarm log parsing, vectorization, node cleaning and sample graph convergence are performed on the uploaded system fault data to generate a 3-order fault graph sample. All alarm log vectors of each node in the generated 3-order fault graph sample are summed and assigned as the feature value of that node, giving the fault root cause node positioning graph sample (a prediction sample carries no label). The stored system fault root cause positioning model is loaded, the generated fault root cause node positioning graph sample is used as model input to predict the fault root cause node, and the fault root cause node positioning result is output.

If the predicted root cause node has only one alarm log, that alarm log is the fault root cause alarm. If the predicted root cause node has multiple alarms, they are split into single alarms, each corresponding to one root cause alarm positioning graph sample of the current fault, generating the fault root cause alarm positioning graph samples (prediction samples carry no label). The split samples are then input into the loaded system fault root cause positioning model to predict the fault root cause alarm; the alarm with the highest probability among the multiple alarms is the fault root cause alarm. The fault root cause node positioning result and the fault root cause alarm positioning result are then fed back.
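The two-stage prediction flow can be sketched as follows. The trained graph neural network is replaced here by a stand-in scoring function `toy_score`, which is purely hypothetical and exists only so the control flow can run; `locate_root_cause` and the sample data are likewise assumptions. What the sketch does take from the description is the order of the steps and the use of the same model for both stages.

```python
from typing import Callable, Dict, List, Tuple

Vector = List[float]
Scorer = Callable[[Dict[str, Vector], str], float]


def sum_vectors(vectors: List[Vector]) -> Vector:
    """Sum equal-length vectors dimension by dimension."""
    return [sum(column) for column in zip(*vectors)]


def locate_root_cause(alarms_by_node: Dict[str, List[Vector]],
                      score: Scorer) -> Tuple[str, int]:
    """Return (root cause node, index of the root cause alarm on that node)."""
    # Stage 1: node features are summed alarm vectors; pick the best-scoring node.
    features = {n: sum_vectors(v) for n, v in alarms_by_node.items()}
    root_node = max(features, key=lambda n: score(features, n))
    alarms = alarms_by_node[root_node]
    if len(alarms) == 1:
        return root_node, 0  # the only alarm is the root cause alarm
    # Stage 2: one sample per alarm; pick the alarm with the highest score.
    best_index, best_score = 0, float("-inf")
    for i, alarm_vec in enumerate(alarms):
        sample = dict(features)
        sample[root_node] = alarm_vec
        s = score(sample, root_node)
        if s > best_score:
            best_index, best_score = i, s
    return root_node, best_index


def toy_score(features: Dict[str, Vector], node: str) -> float:
    """Stand-in for the trained model: first feature component as the score."""
    return features[node][0]


alarms_by_node = {
    "n1": [[0.25, 0.0], [0.75, 0.5]],
    "n2": [[0.5, 0.5]],
}
result = locate_root_cause(alarms_by_node, toy_score)
```

In a real deployment `score` would wrap the loaded model's predicted root cause probability for the given node in the given graph sample.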

Fig. 11 is a second system block diagram of IT network system fault root cause positioning according to the present embodiment. As shown in Fig. 11, the system framework includes an IT network system 1001 and a network system fault root cause positioning server 1004, where the IT network system 1001 includes a system fault data acquisition module 1002 and a system fault interactive interface 1003; the network system fault root cause positioning server 1004 is responsible for running the network system fault root cause positioning device 1005.

The system fault data acquisition module 1002 is responsible for acquiring and uploading fault alarm logs, KPI indexes and network topology data of the IT network system 1001.

System fault data acquisition includes the following. The system fault interactive interface 1003 triggers the system fault data acquisition module 1002 to collect the alarm logs, system node KPIs and system topology data within a specified time segment (for example, 20 minutes before and after the fault occurs) and to complete the root cause node and root cause alarm labeling. The collected alarm log information includes, but is not limited to, alarm logs at the ERROR, WARNING, FATAL and other levels; the alarm logs of each node are exported to a single text file named nodeID_log (the node ID followed by "_log"). The collected system KPI information includes, but is not limited to, CPU utilization, memory utilization and node data packet throughput; all KPIs of each node are exported to a single text file named nodeID_KPI. The system topology data includes the node IDs and the data flow relations between system nodes, and is stored in a dictionary data structure. For example, if data flows from node 1 to node 2 and node 3, the topology contains edges pointing from node 1 to node 2 and to node 3, and the dictionary represents this as {node 1: [node 2, node 3]}, where the key is the data source node and the values are the data target nodes. The fault labeling information includes the root cause node and root cause alarm information and is stored in a text file with a designated name. The alarm log files of all nodes of the system, the system node KPI data files, the system topology data file and the fault labeling file are then packaged and uploaded to the network system fault root cause positioning device 1005.

System fault graph sample node cleaning and convergence includes the following. The network system fault root cause positioning device 1005 parses the uploaded system fault information data packet and extracts the alarm logs and KPIs of each node, the system topology data and the fault labeling information. For the node alarm logs, useless fields are first cleaned from the alarm logs of each node to converge them; word segmentation and word vectorization are then performed on the converged alarm logs to generate a word vector model, and the IDs of the nodes that have alarm logs are extracted. For the node KPIs, the KPIs of each node are vectorized to obtain the KPI vector of the node. Each alarm log of a node is word-vectorized with the generated word vector model to obtain an alarm log embedding matrix, and the embedding matrix is then summed along each dimension to obtain a characterization vector for each log, which is the node fault state characterization of the network system. For the fault labeling data, the IDs of the fault root cause node and the root cause alarm are extracted. A base system topology graph 0 is created from the system topology data using a graph tool. The generated node fault state characterization is spliced with the generated node KPI vector, and the result is assigned to each node of the base system topology graph 0 to generate an original fault graph sample. Root cause node and root cause alarm labels are applied to the generated original graph sample according to the obtained fault root cause node ID and root cause alarm label information, marking which node is the root cause node and which of its alarms is the root cause alarm; alarm labels are applied to the corresponding nodes of the original graph sample according to the alarm node IDs, marking which nodes are alarm nodes, thereby generating original fault graph sample 1. The non-alarm nodes and isolated nodes in original fault graph sample 1 are cleaned to generate original fault graph sample 2. The 4-order subgraph of the fault root cause node in original fault graph sample 2 is then extracted for further graph convergence, and the resulting subgraph is taken as the fault graph sample of the corresponding fault, namely the 4-order fault graph sample.

Constructing the system fault root cause positioning training graph samples includes the following steps:

S1, constructing the fault root cause node positioning training graph sample: for each node in the 4-order fault graph sample in turn, all alarm log vectors of the node are summed and the result is assigned as the feature value of that node; root cause node labels are then applied to generate the fault root cause node positioning graph sample.

S2, constructing the fault root cause alarm positioning training samples: if the root cause node contains only one alarm, no additional root cause alarm positioning graph sample needs to be constructed for the current fault. If the root cause node has multiple alarms, they are split into single alarms, each corresponding to one root cause alarm positioning graph sample of the current fault, and the node is labeled according to whether that alarm is the root cause alarm, generating the fault root cause alarm positioning graph samples. The fault root cause node positioning graph samples and the fault root cause alarm positioning graph samples are stored in the same sample pool.

Prediction with the system fault root cause positioning model includes the following. Alarm log parsing, vectorization, node cleaning and sample graph convergence are performed on the uploaded system fault data to generate a 4-order fault graph sample. All alarm log vectors of each node in the generated 4-order fault graph sample are summed and assigned as the feature value of that node, giving the fault root cause node positioning graph sample (a prediction sample carries no label). The stored system fault root cause positioning model is loaded, the generated fault root cause node positioning graph sample is used as model input to predict the fault root cause node, and the fault root cause node positioning result is output.

According to another embodiment of the present application, there is also provided a root cause locating device for a system fault, and fig. 12 is a block diagram of the root cause locating device for a system fault according to the present embodiment, as shown in fig. 12, including:

a constructing module 122, configured to construct a fault root cause node positioning sample and a fault root cause alarm positioning sample;

the training module 124 is configured to train the constructed network system fault root cause positioning model by using the fault root cause node positioning sample and the fault root cause alarm positioning sample, so as to obtain a trained network system fault root cause positioning model;

and the prediction module 126 is configured to predict the root cause location of the current fault according to the trained network system fault root cause location model.

In an exemplary embodiment, the prediction module 126 includes:

In an exemplary embodiment, the determining sub-module includes:

In an exemplary embodiment, the apparatus further comprises:

In an exemplary embodiment, the building module 122 includes:

In an exemplary embodiment, the third building submodule includes:

In an exemplary embodiment, the building unit is further configured to

cleaning nodes without alarm logs in the fault map sample;

In an exemplary embodiment, the apparatus further comprises:

Embodiments of the present application further provide a computer-readable storage medium having a computer program stored therein, wherein the computer program is configured to perform the steps in any of the above method embodiments when executed.

In an exemplary embodiment, the computer-readable storage medium may include, but is not limited to: various media capable of storing computer programs, such as a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.

Embodiments of the present application further provide an electronic device comprising a memory having a computer program stored therein and a processor configured to execute the computer program to perform the steps of any of the above method embodiments.

In an exemplary embodiment, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.

For specific examples in this embodiment, reference may be made to the examples described in the above embodiments and exemplary embodiments, and details of this embodiment are not repeated herein.

It will be apparent to those skilled in the art that the various modules or steps of the present application described above may be implemented using a general purpose computing device, they may be centralized on a single computing device or distributed across a network of multiple computing devices, and they may be implemented using program code executable by the computing devices, such that they may be stored in a memory device and executed by the computing devices, and in some cases, the steps shown or described may be performed in an order different than that described herein, or they may be separately fabricated into separate integrated circuit modules, or multiple ones of them may be fabricated into a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.

The above description is only a preferred embodiment of the present application and is not intended to limit the present application, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, improvement and the like made within the principle of the present application shall be included in the protection scope of the present application.

Claims

1. A method for locating a root cause of a system fault is characterized by comprising the following steps:

2. The method of claim 1, wherein predicting the root cause location of the current fault according to the trained network system fault root cause location model comprises:

3. The method of claim 2, wherein determining the fault root cause alarm prediction result of the current fault according to the fault root cause node prediction result comprises:

4. The method for root cause positioning of system faults according to claim 2 or 3, wherein after predicting the root cause positioning of the current fault according to the trained network system fault root cause positioning model, the method further comprises:

5. The method of claim 1, wherein constructing the fault root cause node location sample and the fault root cause alarm location sample comprises:

vectorizing the collected alarm logs and Key Performance Indicators (KPIs) to obtain an alarm log state vector matrix and a KPI state vector matrix;

6. The method of claim 5, wherein constructing the fault root cause node location sample and the fault root cause alarm location sample according to the fault map sample comprises:

7. The method of claim 6, wherein constructing the fault root cause node location sample and the fault root cause alarm location sample based on the pattern book after cleaning convergence comprises:

8. The method of claim 6, wherein the step of performing noise node cleaning and sample graph convergence on the fault graph sample to obtain a cleaned and converged pattern book comprises:

cleaning nodes without alarm logs in the fault map sample;

9. The method according to any one of claims 5 to 8, wherein before vectorizing the collected alarm logs and KPIs to obtain an alarm log state vector matrix and a KPI state vector matrix, the method further comprises:

10. A root cause locating device for a system fault, comprising:

11. A computer-readable storage medium, in which a computer program is stored, wherein the computer program is configured to carry out the method of any one of claims 1 to 9 when executed.

12. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any of claims 1 to 9.