CN116909782A

CN116909782A - Root cause analysis method, root cause analysis device, electronic equipment and readable storage medium

Info

Publication number: CN116909782A
Application number: CN202211696249.1A
Authority: CN
Inventors: 罗维
Original assignee: China Mobile Communications Group Co Ltd; China Mobile IoT Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile IoT Co Ltd
Priority date: 2022-12-28
Filing date: 2022-12-28
Publication date: 2023-10-20

Abstract

The disclosure provides a root cause analysis method, a root cause analysis device and related equipment, and relates to the technical field of artificial intelligence, wherein the method comprises the following steps: when a first fault event of the target service is detected, at least two root cause analysis links of the target service are obtained, wherein the root cause analysis links are used for representing a fault propagation path of the target service; determining a target analysis link in which a first fault event exists from at least two root cause analysis links based on the first fault event; and carrying out root cause analysis on the first fault event according to the target analysis link to obtain root cause information indicating a second fault event, wherein the second fault event is a fault event which causes the first fault event to occur. The fault propagation path corresponding to the first fault event is obtained, and root cause analysis is carried out on the first fault event based on the fault propagation path, so that a second fault event which causes the first fault event to occur is obtained, and the accuracy of fault analysis can be improved.

Description

Root cause analysis method, root cause analysis device, electronic equipment and readable storage medium

Technical Field

The disclosure relates to the technical field of artificial intelligence, in particular to a root cause analysis method, a root cause analysis device, electronic equipment and a readable storage medium.

Background

With the rapid expansion and diversification of Internet services, services are becoming increasingly complex. The related art adopts domain-driven design and a micro-service architecture, which split a complex service into a plurality of subtasks according to the service domain. The subtasks obtained by splitting are also called micro-services, and the micro-services communicate with one another using a lightweight communication protocol.

In practice, the calling relationships among the micro-services are found to be complex, which makes troubleshooting the micro-services more difficult. At present, the full-link tracing used in the related art can only track the fault symptoms that appear on the surface of a micro-service, and it is difficult to locate the root cause of the fault. In other words, analysis results obtained with existing fault analysis approaches have poor accuracy.

Disclosure of Invention

An object of an embodiment of the present disclosure is to provide a root cause analysis method, apparatus, electronic device, and readable storage medium, for solving a technical problem of low accuracy of analysis results when analyzing a micro-service fault problem according to an existing fault tracking scheme.

In a first aspect, an embodiment of the present disclosure provides a root cause analysis method, including:

when it is detected that a first fault event occurs in a target service, obtaining at least two root cause analysis links corresponding to the target service, wherein the root cause analysis links are used for representing a fault propagation path of the target service;

determining a target analysis link from the at least two root cause analysis links based on the first fault event, wherein the target analysis link is the root cause analysis link in which the first fault event occurs;

and carrying out root cause analysis processing on the first fault event according to the target analysis link to obtain root cause information for indicating a second fault event, wherein the second fault event is a fault event which causes the first fault event to occur.

Optionally, after determining a target analysis link from the at least two root cause analysis links based on the first failure event, the method further includes:

and predicting the first fault event according to the target analysis link to obtain hidden danger information for indicating a third fault event, wherein the third fault event is a fault event caused by the first fault event.

Optionally, the obtaining at least two root cause analysis links corresponding to the target service includes:

acquiring service architecture information of the target service, wherein the service architecture information comprises a data flow path of the target service and service deployment information of the target service;

and analyzing and processing the service architecture information to obtain the at least two root cause analysis links.

Optionally, the analyzing the service architecture information to obtain the at least two root cause analysis links includes:

analyzing and processing the service architecture information to obtain attribute information of a plurality of entities included in the target service and association information among the entities;

and carrying out knowledge graph conversion according to the attribute information of the plurality of entities and the association information among the plurality of entities to obtain the at least two root cause analysis links.

Optionally, the determining a target analysis link from the at least two root cause analysis links based on the first fault event includes:

performing depth-first search on each of the at least two root cause analysis links based on the first fault event to obtain a search result of each of the at least two root cause analysis links;

and determining the target analysis link in the at least two root cause analysis links according to the search result of each of the at least two root cause analysis links.

Optionally, the performing root cause analysis processing on the first fault event according to the target analysis link to obtain root cause information for indicating a second fault event includes:

and carrying out root cause analysis processing on the target analysis link according to the first fault event and a preset high-order logic model to obtain root cause information.

In a second aspect, embodiments of the present disclosure further provide a root cause analysis apparatus, the apparatus comprising:

an obtaining module, configured to obtain at least two root cause analysis links corresponding to a target service when it is detected that a first fault event occurs in the target service, wherein the root cause analysis links are used for representing a fault propagation path of the target service;

a determining module, configured to determine a target analysis link from the at least two root cause analysis links based on the first failure event, where the target analysis link is the root cause analysis link in which the first failure event occurs;

and the analysis module is used for carrying out root cause analysis processing on the first fault event according to the target analysis link to obtain root cause information for indicating a second fault event, wherein the second fault event is a fault event which leads to the generation of the first fault event.

Optionally, the apparatus further includes:

the prediction module is used for predicting the first fault event according to the target analysis link to obtain hidden danger information for indicating a third fault event, wherein the third fault event is a fault event caused by the first fault event.

In a third aspect, an embodiment of the present disclosure further provides an electronic device, including a processor, a memory, and a computer program stored on the memory and executable on the processor, where the computer program when executed by the processor implements the steps of the root cause analysis method described above.

In a fourth aspect, the disclosed embodiments also provide a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the root cause analysis method described above.

In the embodiment of the disclosure, a fault propagation path corresponding to a first fault event is obtained by obtaining a target analysis link, and root cause analysis is performed on the first fault event based on the fault propagation path to obtain root cause information for indicating a second fault event, that is, to obtain the second fault event which causes the first fault event to occur, which can improve the accuracy of fault analysis.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present disclosure, the drawings that are needed in the description of the embodiments of the present disclosure will be briefly introduced below, and it is obvious that the drawings in the following description are only some embodiments of the present disclosure, and other drawings may be obtained according to these drawings without inventive effort to a person of ordinary skill in the art.

FIG. 1 is a flow chart of a root cause analysis method provided by an embodiment of the present disclosure;

FIG. 2 is a flow chart of another root cause analysis method provided by an embodiment of the present disclosure;

FIG. 3 is a schematic diagram of a static resource class provided by an embodiment of the present disclosure;

FIG. 4 is a schematic diagram of a dynamic class provided by an embodiment of the present disclosure;

FIG. 5 is a schematic diagram of a root cause analysis device according to an embodiment of the disclosure;

FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.

Detailed Description

The following description of the technical solutions in the embodiments of the present disclosure will be made clearly and completely with reference to the accompanying drawings in the embodiments of the present disclosure, and it is apparent that the described embodiments are some embodiments of the present disclosure, but not all embodiments. All other embodiments, which can be made by one of ordinary skill in the art without inventive effort, based on the embodiments in this disclosure are intended to be within the scope of this disclosure.

Referring to fig. 1, fig. 1 is a flowchart of a root cause analysis method provided by an embodiment of the disclosure, as shown in fig. 1, including the following steps:

Step 101, when it is detected that a first fault event occurs in a target service, obtaining at least two root cause analysis links corresponding to the target service.

Wherein the root cause analysis link is used for characterizing a fault propagation path of the target service.

By way of example, the target service may be a complex service, such as the OA service of a company, or a simple service obtained by splitting such a complex service. The simple services are also called micro-services, for example: an authentication service within an OA service, a database interaction service for an OA service, and the like.

In the operation process of the target service, the operation state of the target service can be monitored through a preset monitoring system or an event sampling program corresponding to a knowledge graph, and whether a first fault event occurs in the target service is determined based on the operation state of the target service obtained through monitoring.

The operation state may be an operation index, an operation log, and the like of the target service; the first fault event may be any fault (Error) event or any warning (Warning) event preset in the target service.

Illustratively, the first fault event may be: CPU usage warning of the physical machine with the target service deployed, memory usage warning of the physical machine with the target service deployed, disk damage of the physical machine with the target service deployed, and the like.
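As a minimal sketch of how monitored operation indexes could be turned into fault or warning events, the following Python snippet checks one metrics sample against thresholds. The threshold values, the metric keys (`cpu_usage`, `mem_usage`, `disk_ok`), the event names, and the `FaultEvent` type are illustrative assumptions and are not specified by the disclosure.

```python
from dataclasses import dataclass
from typing import Dict, List, Union


@dataclass(frozen=True)
class FaultEvent:
    host: str   # host on which the target service is deployed
    kind: str   # hypothetical event name, e.g. "cpu_usage_warning"
    level: str  # "warning" or "error"


# Assumed thresholds; real values would come from the monitoring system.
CPU_WARN_THRESHOLD = 0.90
MEM_WARN_THRESHOLD = 0.85


def detect_events(host: str, metrics: Dict[str, Union[float, bool]]) -> List[FaultEvent]:
    """Map one sample of operation indexes to zero or more events."""
    events: List[FaultEvent] = []
    if metrics.get("cpu_usage", 0.0) > CPU_WARN_THRESHOLD:
        events.append(FaultEvent(host, "cpu_usage_warning", "warning"))
    if metrics.get("mem_usage", 0.0) > MEM_WARN_THRESHOLD:
        events.append(FaultEvent(host, "memory_usage_warning", "warning"))
    if not metrics.get("disk_ok", True):
        events.append(FaultEvent(host, "disk_damage", "error"))
    return events


sample = {"cpu_usage": 0.95, "mem_usage": 0.40, "disk_ok": False}
detected = detect_events("host-1", sample)
```

Any event returned by such a check could serve as the first fault event that triggers the analysis.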

For example, the root cause analysis link may be E1-E2-E3-E4, which indicates that fault E1 directly causes fault E2 (for example, disk damage on a host directly causes an access fault of a service deployed on that host), fault E2 directly causes fault E3, and fault E3 directly causes fault E4. Alternatively, the root cause analysis link may be E1-E2 and E1-E3-E4, which indicates that fault E1 directly causes fault E2 and fault E3, and fault E3 directly causes fault E4.
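One possible in-memory form of such links is a directed graph that maps each fault to the faults it directly causes. The sketch below assumes this adjacency-map representation; the type alias `FaultGraph` and the helper names are hypothetical and only illustrate the two example links.

```python
from typing import Dict, List

# Assumed representation: fault -> faults it directly causes.
FaultGraph = Dict[str, List[str]]

# Linear link E1-E2-E3-E4.
chain_link: FaultGraph = {"E1": ["E2"], "E2": ["E3"], "E3": ["E4"], "E4": []}

# Branching link: E1-E2 and E1-E3-E4.
branching_link: FaultGraph = {"E1": ["E2", "E3"], "E2": [], "E3": ["E4"], "E4": []}


def direct_effects(link: FaultGraph, fault: str) -> List[str]:
    """Faults directly caused by `fault` (one step along the propagation direction)."""
    return list(link.get(fault, []))


def direct_causes(link: FaultGraph, fault: str) -> List[str]:
    """Faults that directly cause `fault` (one step against the propagation direction)."""
    return [src for src, dsts in link.items() if fault in dsts]
```

Keeping both directions cheap to query matters because the method walks against the propagation direction to find causes and along it to predict follow-on faults.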

Step 102, determining a target analysis link from the at least two root cause analysis links based on the first fault event.

Wherein the target analysis link is the root cause analysis link in which the first failure event occurs.

For example, if the aforementioned first fault event is fault EO1, and the at least two root cause analysis links include a first root cause analysis link (EO2-EO3-EO4) and a second root cause analysis link (EO5-EO1), then the second root cause analysis link may be determined to be the aforementioned target analysis link.

Step 103, performing root cause analysis processing on the first fault event according to the target analysis link to obtain root cause information for indicating a second fault event.

Wherein the second fault event is a fault event that causes the first fault event to occur.

As described above, once the target analysis link is determined, the node of the first fault event in the target analysis link is used as the starting node, a search is performed in the direction opposite to the fault propagation direction of the target analysis link, and a true-or-false judgment is made on the fault event corresponding to each searched node, so as to determine the second fault event.

A fault event that is determined to be false is understood as a fault event that has not occurred, while a fault event that is determined to be true is understood as a fault event that has occurred. The true-or-false determination may be made according to the foregoing operation index, operation log, and the like.

It should be noted that, in the target analysis link, the previous fault event may cause the generation of the next fault event, and the second fault event may be a fault event that is located before the first fault event in the target analysis link, has the largest number of nodes spaced from the first fault event, and is determined to be true.
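The backward search can be sketched as follows for a linear target analysis link. The function name `find_root_cause` and the `occurred` callback (standing in for the true-or-false judgment based on operation indexes and logs) are assumptions. The sketch takes the statement above literally: it returns the true event farthest from the first fault event and does not stop at an intermediate event judged false, which is one possible reading rather than a confirmed detail of the disclosure.

```python
from typing import Callable, List, Optional


def find_root_cause(
    link: List[str],
    first_fault: str,
    occurred: Callable[[str], bool],
) -> Optional[str]:
    """Walk against the propagation direction and return the farthest event judged true.

    `link` lists events in propagation order, e.g. ["E1", "E2", "E3", "E4"].
    Returns None when no preceding event is judged true.
    """
    start = link.index(first_fault)
    root: Optional[str] = None
    for i in range(start - 1, -1, -1):
        if occurred(link[i]):
            # A farther true event replaces a nearer one.
            root = link[i]
    return root


link = ["E1", "E2", "E3", "E4"]
observed = {"E2", "E3", "E4"}  # events confirmed by metrics or logs (hypothetical)
root = find_root_cause(link, "E4", lambda e: e in observed)
```

With the sample data above, E1 is judged false, so E2 is reported as the second fault event.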

In practice, once the root cause information is obtained, the root cause information and the first fault event may be output to instruct an inspector responsible for the target service to manually handle the first fault event and the second fault event indicated by the root cause information; alternatively, a corresponding maintenance instruction may be triggered according to the root cause information so as to automatically handle the second fault event indicated by the root cause information; or the second fault event indicated by the root cause information may be handled by combining manual and automatic handling. The specific use of the root cause information is not limited in the embodiments of the disclosure.

As described above, after determining the target analysis link, searching is performed along the fault propagation direction of the target analysis link by using the node of the first fault event in the target analysis link as an initial node, and the searched fault event is determined as the third fault event, where it is noted that the number of the third fault events may be one or two or more.

Through the arrangement, prediction analysis is performed based on the first fault event and the target analysis link, so that the fault event directly or indirectly caused by the first fault event is predicted, and the occurrence probability of the fault event of the target service is reduced.

It should be noted that, in the case that the first failure event is the last node of the target analysis link, the prediction processing of the first failure event is not performed according to the target analysis link, so as to reduce the overhead.
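A forward search of this kind might look like the sketch below, which assumes the target analysis link is stored as an adjacency map from each fault to the faults it directly causes. The name `predict_third_faults` is hypothetical. When the first fault event has no successors, the function returns immediately, mirroring the note about skipping prediction for the last node.

```python
from typing import Dict, List

FaultGraph = Dict[str, List[str]]  # assumed: fault -> faults it directly causes


def predict_third_faults(link: FaultGraph, first_fault: str) -> List[str]:
    """Collect every fault reachable from `first_fault` along the propagation direction."""
    if not link.get(first_fault):
        return []  # last node of the link: nothing to predict
    found: List[str] = []
    visited = set()
    stack = list(reversed(link[first_fault]))
    while stack:
        node = stack.pop()
        if node in visited:
            continue
        visited.add(node)
        found.append(node)
        # Push successors in reverse so they are visited in listed order.
        stack.extend(reversed(link.get(node, [])))
    return found


branching_link: FaultGraph = {"E1": ["E2", "E3"], "E2": [], "E3": ["E4"], "E4": []}
hidden_dangers = predict_third_faults(branching_link, "E1")
```

The result may contain one or several events, which matches the statement that there may be one, two, or more third fault events.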

The service architecture information may be a system development manual of the target service, and the data flow path may be used to characterize an internal data flow and an external data flow of the target service, where the internal data flow may indicate a calling relationship or a dependency relationship of a plurality of services included in the target service; the external data flow may indicate a calling relationship or a dependency relationship between the target service and other services; the service deployment information may be used to characterize identity information of the virtual machine/physical machine/container in which the target service is deployed.

The service architecture information is analyzed by a preset program to automatically generate the at least two root cause analysis links corresponding to the target service, which improves the processing efficiency of the root cause analysis method of the disclosure.

As described above, the association information between the plurality of entities included in the target service may be determined based on the aforementioned data flow path, and the attribute information of the plurality of entities included in the target service may be determined based on the aforementioned service deployment information.

The attribute information of the plurality of entities and the association information among the plurality of entities are processed by means of knowledge graph conversion, so that the obtained at least two root cause analysis links can accurately represent the fault propagation path of the target service, and the accuracy of fault analysis can be further improved.
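The disclosure does not spell out the conversion step itself. As one hedged illustration, suppose the knowledge graph already yields a list of direct causal edges between events; root cause analysis links can then be enumerated as the maximal paths through those edges. The function name `build_links` and the edge-list input are assumptions made for this sketch.

```python
from typing import Dict, List, Tuple


def build_links(cause_edges: List[Tuple[str, str]]) -> List[List[str]]:
    """Enumerate maximal propagation paths from (cause, effect) edges."""
    children: Dict[str, List[str]] = {}
    has_cause = set()
    for src, dst in cause_edges:
        children.setdefault(src, []).append(dst)
        children.setdefault(dst, [])
        has_cause.add(dst)

    links: List[List[str]] = []

    def walk(node: str, path: List[str]) -> None:
        path = path + [node]
        # Skip nodes already on the path so that cycles cannot loop forever.
        nexts = [n for n in children[node] if n not in path]
        if not nexts:
            links.append(path)
            return
        for nxt in nexts:
            walk(nxt, path)

    # Start from events that nothing else causes.
    for root in (n for n in children if n not in has_cause):
        walk(root, [])
    return links


edges = [("EO2", "EO3"), ("EO3", "EO4"), ("EO5", "EO1")]
links = build_links(edges)
```

Here the three edges produce two links, EO2-EO3-EO4 and EO5-EO1.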

Illustratively, in this disclosure, entities can be divided into two categories: a static resource class and a dynamic event class. The static resource class comprises a host class (such as a virtual machine, a physical machine, a container and the like) and a service class, where the service class can be an application service program and/or a middleware service program. The dynamic event class comprises events related to the static resource class, such as: computing, memory, networking, storage, business, etc.

The association information between the plurality of entities may be: a reciprocal relationship between calls (differences) and called (differences) between hosts (Host) and services (Service), a reciprocal relationship between dependencies (dependent) and dependences (dependent) between different services, and the like.

The attribute information of the plurality of entities may be: IP information, CPU information, memory information, IO stream information, disk information (disk), operating system information (OS) of the host; name information (name) of the service, port number information (port) corresponding to the service, and type information of the service; type information of the event and true and false judging information of whether the event occurs or not; type information of the action, execution program information corresponding to the action, and the like.

In order to facilitate the distinction between different entities, the target attribute for distinguishing different entities may also be determined in the foregoing attribute information, for example: a certain service may be uniquely identified based on a host to which the service corresponds and name information, port number information, and type information of the service.
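The unique identification of a service can be sketched with an immutable record whose identity is the tuple of host, name, port, and type. The class name `Service`, its field names, and the sample IP addresses are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class Service:
    host_ip: str       # host on which the service is deployed
    name: str          # name information of the service
    port: int          # port number information of the service
    service_type: str  # e.g. "application" or "middleware"

    def identity(self) -> Tuple[str, str, int, str]:
        """Target attributes that together distinguish one service instance."""
        return (self.host_ip, self.name, self.port, self.service_type)


# Two instances of the same service on different hosts stay distinguishable.
auth_a = Service("10.0.0.1", "auth", 8080, "application")
auth_b = Service("10.0.0.2", "auth", 8080, "application")
registry: Dict[Tuple[str, str, int, str], Service] = {
    s.identity(): s for s in (auth_a, auth_b)
}
```

Using the combined tuple as a key means an analysis result can point at one concrete instance rather than at the service name alone.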

Optionally, the determining a target analysis link from the at least two root cause analysis links based on the first fault event includes:

As described above, the target analysis link is determined in at least two root cause analysis links based on the depth-first search mode, and the determination of the target analysis link is conveniently completed with lower memory overhead.

It should be noted that, the value of the search result includes a first value and a second value, where when the search result is the first value, it indicates that the first fault event does not occur in the root cause analysis link corresponding to the searched result; and when the search result is the second value, indicating that the first fault event is already occurred in the root cause analysis link corresponding to the searched result.
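A depth-first search over each link, followed by selection of the link whose search result has the second value, could be sketched as below. The concrete values `0` and `1`, the adjacency-map representation, and the function names are assumptions. The sample data reuses the EO1 example given earlier.

```python
from typing import Dict, List, Optional

FaultGraph = Dict[str, List[str]]  # assumed: fault -> faults it directly causes

NOT_FOUND = 0  # assumed encoding of the "first value"
FOUND = 1      # assumed encoding of the "second value"


def dfs_search(link: FaultGraph, first_fault: str) -> int:
    """Depth-first search of one link for the node of the first fault event."""
    caused = {dst for dsts in link.values() for dst in dsts}
    # Start from events nothing else causes; fall back to all nodes if none exist.
    stack = [n for n in link if n not in caused] or list(link)
    visited = set()
    while stack:
        node = stack.pop()
        if node == first_fault:
            return FOUND
        if node in visited:
            continue
        visited.add(node)
        stack.extend(link.get(node, []))
    return NOT_FOUND


def select_target_link(links: List[FaultGraph], first_fault: str) -> Optional[FaultGraph]:
    """Return the first link whose search result is FOUND, or None."""
    for link in links:
        if dfs_search(link, first_fault) == FOUND:
            return link
    return None


first_link: FaultGraph = {"EO2": ["EO3"], "EO3": ["EO4"], "EO4": []}
second_link: FaultGraph = {"EO5": ["EO1"], "EO1": []}
target = select_target_link([first_link, second_link], "EO1")
```

The explicit stack holds only the frontier of the current link, which is in keeping with the low memory overhead attributed to depth-first search.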

As described above, in the case of determining the target analysis link, root cause analysis is performed on the target analysis link based on the preset high-order logic model and the first fault event to be analyzed, and the root cause event that causes the first fault event to occur may be determined, that is, the foregoing root cause information may be determined.

In the process of performing knowledge graph conversion according to the attribute information of the entities and the association information among the entities to obtain the at least two root cause analysis links, higher-order intuitionistic logic can be used to complete the formal definition of the causal relationships among all operation and maintenance events and the definition of the root cause analysis rules among all fault events; the causal relationships between the operation and maintenance events are implemented using general class axioms (General Class Axioms) in the knowledge graph.

The conversion processing based on higher-order intuitionistic logic and the root cause analysis processing based on the preset high-order logic model enable the obtained root cause analysis links (that is, root cause analysis rules) to accurately indicate the fault propagation path of the target service, so that the accuracy of fault analysis can be further improved.

For ease of understanding, examples are illustrated below:

As shown in FIG. 2, the terminological axiom set (Terminology Component, TBox) layer of the operation and maintenance knowledge graph is constructed using OWL 2. As shown in FIG. 3, entities can be divided into two categories: a static resource class and a dynamic event class. The static resource class includes a host class (e.g., virtual machine, physical machine, container, etc.) and a service class, where the service class may be an application service program and/or a middleware service program. As shown in FIG. 4, the dynamic event class includes related events and actions of the static resource class, for example: computing, memory, networking, storage, business, etc.

Illustratively, the foregoing related events may be: CPU usage alarm, network queue backlog, high memory usage, swap partition usage, excessive disk I/O, service response time, service request count, and discrete events caused by machine hardware faults, such as disk damage.

Relationship attributes (ObjectProperties) between entities are then defined: such as the reciprocal relationship between calls (reply) and called (reply) between hosts (Host) and services (Service), the reciprocal relationship between dependencies (dependent) and dependences (dependent) between different services, the reciprocal relationship between events (Event), the reciprocal relationship between hosts and services (belong/has_event), the sampling (sampling) and recovery (recovery) relationship between events and actions (Action), etc.

Then define the data attributes (DataProperties) of the entity: such as IP information, CPU information, memory information, IO stream information, disk information (disk), operating system information (OS) of the host; name information (name) of the service, port number information (port) corresponding to the service, and type information of the service; type information of the event and true and false judging information of whether the event occurs or not; type information of the action, execution program information corresponding to the action, and the like.

It should be noted that, the definition of the entity, the definition of the relationship attribute and the definition of the data attribute are all obtained based on abstract processing of a system development manual for a plurality of micro services, so in practical application, in order to distinguish different instantiation objects corresponding to different micro services, the relationship attribute and the data attribute can be selected to uniquely determine one instantiation object, so as to ensure that the corresponding micro service can be accurately indicated according to the analysis result.

After these definitions are completed, higher-order intuitionistic logic is used to complete the formal definition of the causal relationships among all operation and maintenance events and the definition of the root cause analysis rules among all fault events. The causal relationships between the operation and maintenance events are implemented using general class axioms (General Class Axioms) in the knowledge graph, which yields the constructed knowledge graph ontology. The knowledge graph ontology is then encoded in Lambda Prolog to obtain a root cause analysis service or a root cause analysis program. The specific analysis flow of the root cause analysis service or program is the root cause analysis method described in the foregoing embodiments of the disclosure, which is not repeated here.

For example, the formalized definition of a causal relationship may be:

Example 1: a disk failure of a host causes access failures for the services deployed on that host;

Example 2: when a compute-intensive service receives too many service requests, the CPU usage of the host on which the service is deployed increases;

Example 3: when a first service fails internally, the business execution of a second service that depends on the first service is affected.
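To make these three causal statements concrete, the following sketch instantiates them as direct cause edges over deployment and dependency data. This is a plain Python illustration, not the logic encoding used by the disclosure; the function name `derive_cause_edges`, the event naming scheme, and the sample services are all hypothetical.

```python
from typing import Dict, List, Set, Tuple


def derive_cause_edges(
    deployed_on: Dict[str, str],         # service -> host it is deployed on
    depends_on: List[Tuple[str, str]],   # (dependent service, service it depends on)
    compute_intensive: Set[str],         # services considered compute-intensive
) -> List[Tuple[str, str]]:
    """Instantiate Examples 1 to 3 as (cause event, effect event) pairs."""
    edges: List[Tuple[str, str]] = []
    for service, host in deployed_on.items():
        # Example 1: host disk failure -> access failure of the deployed service.
        edges.append((f"disk_damage({host})", f"access_failure({service})"))
        # Example 2: too many requests to a compute-intensive service -> higher host CPU usage.
        if service in compute_intensive:
            edges.append((f"excessive_requests({service})", f"cpu_usage_high({host})"))
    for dependent, dependency in depends_on:
        # Example 3: internal failure of a service -> business of its dependents affected.
        edges.append((f"internal_failure({dependency})", f"business_affected({dependent})"))
    return edges


edges = derive_cause_edges(
    deployed_on={"auth": "host-1"},
    depends_on=[("portal", "auth")],
    compute_intensive={"auth"},
)
```

Each resulting pair is one direct causal step that a root cause analysis link can be assembled from.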

For example, the root cause analysis rule definition may be:

Example 4: cause e₁ e₂:

Wherein Example 4 indicates that fault e₁ directly causes the occurrence of fault e₂;

Example 5: path e₁ e₃:

Wherein Example 5 indicates that fault e₁ indirectly causes fault e₃, and a fault e₂ may exist in the fault propagation path between fault e₁ and fault e₃.
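One way to read the relationship between the two rules is that `path` is the transitive closure of `cause`: a path from one fault to another exists if there is a chain of one or more direct causes between them. The sketch below assumes that reading and is written in Python rather than in the logic language used by the disclosure; the function name and the sample facts are hypothetical.

```python
from typing import Set, Tuple

CauseFacts = Set[Tuple[str, str]]  # assumed: (e_a, e_b) means "cause e_a e_b"


def path(cause: CauseFacts, src: str, dst: str) -> bool:
    """True if `src` leads to `dst` through one or more direct cause steps."""
    stack = [src]
    visited = set()
    while stack:
        node = stack.pop()
        for a, b in cause:
            if a != node:
                continue
            if b == dst:
                return True
            if b not in visited:
                visited.add(b)
                stack.append(b)
    return False


facts: CauseFacts = {("e1", "e2"), ("e2", "e3")}
indirect = path(facts, "e1", "e3")  # e1 reaches e3 through e2
```

With these facts, `path` holds from e1 to e3 even though no direct `cause` fact links them, while the reverse direction does not hold.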

Referring to fig. 5, fig. 5 is a block diagram of a root cause analysis device 500 according to an embodiment of the present disclosure. As shown in fig. 5, root cause analysis device 500 includes:

an obtaining module 501, configured to obtain at least two root cause analysis links corresponding to a target service when a first failure event occurs in the target service is detected, where the root cause analysis links are used to characterize a failure propagation path of the target service;

a determining module 502, configured to determine a target analysis link from the at least two root cause analysis links based on the first failure event, where the target analysis link is the root cause analysis link in which the first failure event occurs;

and an analysis module 503, configured to perform root cause analysis processing on the first fault event according to the target analysis link, to obtain root cause information for indicating a second fault event, where the second fault event is a fault event that causes the first fault event to occur.

Optionally, the apparatus 500 further includes:

Optionally, the obtaining module 501 includes:

an obtaining unit, configured to obtain service architecture information of the target service, wherein the service architecture information comprises a data flow path of the target service and service deployment information of the target service;

and the link analysis unit is used for analyzing and processing the service architecture information to obtain the at least two root cause analysis links.

Optionally, the link analysis unit is specifically configured to:

Optionally, the determining module 502 is specifically configured to:

Optionally, the analysis module 503 is specifically configured to:

The root cause analysis device 500 provided in the embodiments of the present disclosure can implement each process in the above method embodiments, and in order to avoid repetition, a description thereof will be omitted.

Referring to FIG. 6, FIG. 6 is a schematic structural diagram of an electronic device according to an embodiment of the disclosure. As shown in FIG. 6, the electronic device may include a processor 601, a memory 602, and a program 6021 stored on the memory 602 and executable on the processor 601.

The program 6021, when executed by the processor 601, may implement any steps and achieve the same advantageous effects in the method embodiment corresponding to fig. 1, and will not be described herein.

Those of ordinary skill in the art will appreciate that all or a portion of the steps of implementing the methods of the embodiments described above may be implemented by hardware associated with program instructions, where the program may be stored on a readable medium.

The embodiment of the present disclosure further provides a readable storage medium, where a computer program is stored, where the computer program when executed by a processor may implement any step in the method embodiment corresponding to fig. 1, and may achieve the same technical effect, so that repetition is avoided, and no further description is provided herein.

The computer-readable storage media of the embodiments of the present disclosure may employ any combination of one or more computer-readable media. The computer readable medium may be a computer readable signal medium or a computer readable storage medium. The computer readable storage medium may be, for example, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples (a non-exhaustive list) of the computer-readable storage medium would include the following: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing. In this document, a computer readable storage medium may be any tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The computer readable signal medium may include a propagated data signal with computer readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take any of a variety of forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A computer readable signal medium may also be any computer readable medium that is not a computer readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Computer program code for carrying out operations of the present disclosure may be written in one or more programming languages, including the higher-order logic programming language Lambda Prolog, or programming languages such as Scala, Rust, and the like. The program code may execute entirely on the user's computer, partly on the user's computer, as a stand-alone software package, partly on the user's computer and partly on a remote computer, or entirely on the remote computer or terminal. In the case of a remote computer, the remote computer may be connected to the user's computer through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computer (for example, through the Internet using an Internet service provider).

While the foregoing is directed to the preferred implementation of the disclosed embodiments, it should be noted that numerous modifications and adaptations to those skilled in the art may be made without departing from the principles of the disclosure, and such modifications and adaptations are intended to be within the scope of the disclosure.

Claims

1. A root cause analysis method, the method comprising:

2. The method of claim 1, wherein after determining a target analysis link among the at least two root cause analysis links based on the first failure event, the method further comprises:

3. The method of claim 1, wherein the obtaining at least two root cause analysis links corresponding to the target service comprises:

4. The method of claim 3, wherein analyzing the service architecture information to obtain the at least two root cause analysis links comprises:

5. The method of claim 1, wherein the determining a target analysis link among the at least two root cause analysis links based on the first failure event comprises:

6. The method of claim 1, wherein performing root cause analysis processing on the first fault event according to the target analysis link to obtain root cause information for indicating a second fault event comprises:

7. A root cause analysis device, the device comprising:

8. The apparatus of claim 7, wherein the apparatus further comprises:

9. An electronic device comprising a processor, a memory and a computer program stored on the memory and executable on the processor, which when executed by the processor performs the steps of the root cause analysis method according to any one of claims 1 to 6.

10. A readable storage medium, characterized in that it has stored thereon a computer program which, when executed by a processor, implements the steps of the root cause analysis method according to any of claims 1 to 6.