CN116225769B - Method, device, equipment and medium for determining root cause of system fault

Info

Publication number: CN116225769B
Application number: CN202310483704.8A
Authority: CN
Inventors: 胡盛; 黎吾平; 刘淦
Original assignee: Beijing Youtejie Information Technology Co ltd
Current assignee: Beijing Youtejie Information Technology Co ltd
Priority date: 2023-05-04
Filing date: 2023-05-04
Publication date: 2023-07-11
Anticipated expiration: 2043-05-04
Also published as: CN116225769A

Abstract

The invention discloses a method, a device, equipment and a medium for determining the root cause of a system fault. The method comprises the following steps: acquiring a target alarm event, and determining corresponding target call chain data according to key fields of the target alarm event; determining the root cause entities contained in the target call chain data according to preset association fields, and constructing a root cause entity diagram according to preset entity relationships; determining index data corresponding to each root cause entity according to matching fields, and filling the root cause entity diagram according to each index data and the corresponding root cause event rule to generate a basic root cause event diagram; generating a target root cause event diagram according to the type of each root cause event and the relationships and weights among the root cause entities where the root cause events occur; and ranking the target root cause events matched with the target alarm event in the target root cause event diagram to obtain a target fault root cause. With this technical scheme, the fault root cause of the system can be accurately extracted, the fault time is reduced, and the stability and efficiency of the distributed service system are improved.

Description

Method, device, equipment and medium for determining root cause of system fault

Technical Field

The present invention relates to the field of fault detection technologies, and in particular, to a method, an apparatus, a device, and a medium for determining a root cause of a system fault.

Background

With the rapid development of internet technology, more and more applications begin to run on distributed systems. However, due to the complexity and high integration of distributed systems, various failures and problems are often faced, resulting in reduced quality of service and reduced productivity. Therefore, it is important to analyze the root cause of the distributed system, discover the system fault source in time, reduce the fault time and improve the service stability and efficiency.

In the prior art, either the GROOT algorithm or the FRL-MFPG algorithm is typically used as the root cause analysis algorithm. The GROOT algorithm is a root cause analysis method based on a service event graph and is suitable for large-scale distributed systems. It summarizes the various types of metrics, logs, and activities in the system by building a real-time causal relationship graph from events. It also supports user-defined events and domain-specific rules, so that the domain knowledge of site reliability engineers (SREs) can be incorporated. The FRL-MFPG algorithm calculates a transition probability matrix using an anomaly scoring algorithm. During fault root cause localization, an optimized random walk algorithm flexibly visits fault nodes in the MFPG to find potential root causes. An iterative traversal search method then counts and sorts the visited fault nodes, and the top-ranked nodes are taken as the fault root cause.

However, the GROOT algorithm only associates events at the service level and cannot guarantee that an event on a physical computing resource is unique in time. For example, when a host generates an event such as excessive CPU utilization, multiple services may be deployed on that host, so an event that should occur only once for the host may be recorded multiple times, which affects the analysis result. The FRL-MFPG algorithm is a purely automatic method that cannot use expert knowledge, so its accuracy is somewhat lower. Therefore, how to accurately extract the fault root cause of the system, reduce the fault time, and improve the stability and efficiency of the distributed service system is a problem to be solved.

Disclosure of Invention

The invention provides a method, a device, equipment and a medium for determining a system fault root cause, which can solve the problem of low extraction rate and accuracy of the fault root cause.

According to one aspect of the invention, a method for determining a root cause of a system fault is provided, comprising the following steps:

acquiring a target alarm event, and determining corresponding target call chain data according to key fields of the target alarm event;

determining root cause entities contained in the target call chain data according to preset associated fields, and constructing a root cause entity diagram containing each root cause entity according to preset entity relations;

determining index data corresponding to each root cause entity according to the matching fields, and filling the root cause entity graph according to each index data and the corresponding root cause event rule to generate a basic root cause event graph; wherein the basic root cause event graph contains the types of root cause events;

generating a target root cause event graph according to the type of each root cause event and the relationships and weights among the root cause entities where the root cause events occur;

and ranking the target root cause events matched with the target alarm event in the target root cause event map to obtain a target fault root cause.

According to another aspect of the present invention, there is provided a device for determining a root cause of a system fault, including:

the data acquisition module is used for acquiring a target alarm event and determining corresponding target call chain data according to a key field of the target alarm event;

the entity diagram construction module is used for determining root cause entities contained in the target call chain data according to preset association fields and constructing a root cause entity diagram containing all the root cause entities according to preset entity relations;

the first event diagram generation module is used for determining index data corresponding to each root cause entity according to the matching field, filling the root cause entity diagram according to each index data and the corresponding root cause event rule, and generating a basic root cause event diagram; wherein the base root cause event map contains the types of root cause events;

the second event diagram generation module is used for generating a target root cause event diagram according to the type of each root cause event and the relationships and weights among the root cause entities where the root cause events occur;

and the fault root cause determining module is used for ranking the target root cause events matched with the target alarm event in the target root cause event diagram to obtain a target fault root cause.

According to another aspect of the present invention, there is provided an electronic apparatus including:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of determining a root cause of a system fault according to any one of the embodiments of the present invention.

According to another aspect of the present invention, there is provided a computer readable storage medium storing computer instructions for causing a processor to implement the method for determining a root cause of a system failure according to any embodiment of the present invention when executed.

According to the technical scheme of the invention, the corresponding target call chain data are determined according to the key fields of the acquired target alarm event; the root cause entities contained in the target call chain data are determined according to the preset association fields, and a root cause entity diagram containing each root cause entity is constructed according to the preset entity relationships; index data corresponding to each root cause entity are then determined according to the matching fields, and the root cause entity diagram is filled according to each index data and the corresponding root cause event rule to generate a basic root cause event diagram; a target root cause event diagram is generated according to the type of each root cause event and the relationships and weights among the root cause entities where the root cause events occur; finally, the target root cause events matched with the target alarm event in the target root cause event diagram are ranked to obtain a target fault root cause. This solves the problem of low extraction rate and accuracy for fault root causes: the fault root cause of the system can be accurately extracted, the fault time is reduced, and the stability and efficiency of the distributed service system are improved.

It should be understood that the description in this section is not intended to identify key or critical features of the embodiments of the invention or to delineate the scope of the invention. Other features of the present invention will become apparent from the description that follows.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

FIG. 1 is a flow chart of a method for determining a root cause of a system failure according to a first embodiment of the present invention;

FIG. 2 is a flowchart of a method for determining a root cause of a system failure according to a second embodiment of the present invention;

FIG. 3 is a schematic diagram of a root cause entity diagram according to a second embodiment of the present invention;

FIG. 4 is a schematic diagram of a basic root cause event diagram according to a second embodiment of the present invention;

FIG. 5 is a schematic diagram of a target root cause event graph according to a second embodiment of the present invention;

FIG. 6 is a flow chart of an alternative method for determining the root cause of a system failure according to a second embodiment of the present invention;

FIG. 7 is a schematic structural diagram of a system fault root determining device according to a third embodiment of the present invention;

FIG. 8 is a schematic structural diagram of an electronic device implementing a method for determining a root cause of a system failure according to an embodiment of the present invention.

Detailed Description

In order that those skilled in the art will better understand the present invention, a technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in which it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.

It should be noted that the terms "first," "second," "target," "base," and the like in the description and claims of the invention and in the foregoing figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the invention described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

Example 1

Fig. 1 is a flowchart of a method for determining a root cause of a system fault according to a first embodiment of the present invention. This embodiment is applicable to determining the root cause of a system fault from a system alarm event. The method may be performed by a device for determining a root cause of a system fault, which may be implemented in hardware and/or software and may be configured in an electronic device. As shown in fig. 1, the method includes:

S110, acquiring a target alarm event, and determining corresponding target call chain data according to key fields of the target alarm event.

The alarm event may refer to alarm information recorded when an abnormal situation occurs in the distributed system. The target alarm event may refer to an alarm event whose fault root cause needs to be analyzed. It may be an alarm event generated by the distributed system in real time, or a historical alarm event selected according to current requirements.

The key field may refer to a field in the alarm event that identifies the service currently invoked. By way of example, it may be a service field. The call chain data may contain the detailed execution of each component on the query request path. The target call chain data may refer to call chain data corresponding to the target alarm event.

S120, determining root cause entities contained in the target call chain data according to a preset association field, and constructing a root cause entity diagram containing all the root cause entities according to a preset entity relationship.

Wherein a root cause entity may refer to an entity in the distributed system that appears in an alarm event. Illustratively, it may be a resource-type entity inside the distributed system, such as a host or a container, or a service-type entity inside the distributed system, such as a service or a business.

The preset association field may refer to a preset field for identifying a root cause entity included in the target call chain data. The preset association field may be a service name, that is, a service_name, for determining a root cause entity corresponding to a service in the service type; the preset association field may also be a host address, that is, host.ip, used to determine a root cause entity corresponding to the host in the resource type.

The preset entity relationship may refer to an information transfer relationship between preset root cause entities in the distributed system. By way of example, there may be a topology relationship between hosts; can be a call relationship between services; or may be a deployment relationship between the service and the host.

The root cause entity graph may refer to a graph including each root cause entity in the target call chain data and the entity relationships between the root cause entities.
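The entity identification described for S120 can be sketched as follows. This is only an illustration: it assumes the call chain data arrive as a list of span dictionaries keyed by the association fields service_name and host.ip, and the function name and the mapping table are hypothetical.

```python
from typing import Dict, List, Set, Tuple

# Assumed mapping from preset association field to root cause entity type.
ASSOCIATION_FIELDS: Dict[str, str] = {
    "service_name": "service",  # service-type entity
    "host.ip": "host",          # resource-type entity
}

Entity = Tuple[str, str]  # (entity type, identifier)


def extract_root_cause_entities(spans: List[dict]) -> Set[Entity]:
    """Collect every distinct entity named by a preset association field."""
    entities: Set[Entity] = set()
    for span in spans:
        for field, entity_type in ASSOCIATION_FIELDS.items():
            value = span.get(field)
            if value is not None:
                entities.add((entity_type, value))
    return entities


# Hypothetical spans: service a appears twice on the same host.
spans = [
    {"service_name": "a", "host.ip": "192.168.1.1"},
    {"service_name": "b", "host.ip": "192.168.1.2"},
    {"service_name": "a", "host.ip": "192.168.1.1"},
]
entities = extract_root_cause_entities(spans)
```

Because the entities are collected into a set, a host that appears in several spans is represented by a single entity, which is one simple way to keep host-level events from being duplicated per service.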

S130, determining index data corresponding to each root cause entity according to the matching field, and filling the root cause entity graph according to each index data and the corresponding root cause event rule to generate a basic root cause event graph; wherein the base root cause event map contains the types of root cause events.

Wherein, the matching field may refer to a base field corresponding to the root cause entity. For example, if the root cause entity is a host, the corresponding matching field may be a host address. Typically, each root cause entity has a matching field corresponding to it.

The index data may include statistical information of the entire distributed system, such as central processing unit (Central Processing Unit, CPU) usage, disk space usage, and network bandwidth.

Wherein a root cause event may describe a logical structure in which an event occurs on one root cause entity within a certain period of time. Root cause event rules may refer to rules that generate root cause events. In general, one type of index of one root cause entity may correspond to one root cause event rule. To construct a root cause event rule, a detection method must be specified for the given type of root cause event; common methods include static threshold detection, intelligent anomaly detection algorithms, and analysis-of-variance hypothesis tests. For example, if the index data is the CPU utilization of a host, the root cause event is excessive CPU utilization and the detection method may be static threshold detection. Specifically, the static threshold method takes two parameters, a threshold and a detection direction. With the threshold set to 0.95 and the detection direction set to upward, a root cause event of excessive CPU utilization is generated on the corresponding root cause entity whenever the input CPU utilization index data exceeds 0.95.
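A minimal sketch of such a static threshold rule is given below. The class name, the event name string, and the choice to trigger as soon as any single sample crosses the threshold are assumptions made for illustration; the text only fixes the two parameters (threshold and detection direction).

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class StaticThresholdRule:
    """Root cause event rule with a threshold and a detection direction."""
    event_name: str
    threshold: float
    direction: str  # "up": fire above the threshold; "down": fire below it

    def detect(self, values: List[float]) -> Optional[str]:
        # Assumption: one sample beyond the threshold is enough to fire.
        if self.direction == "up":
            triggered = any(v > self.threshold for v in values)
        elif self.direction == "down":
            triggered = any(v < self.threshold for v in values)
        else:
            raise ValueError(f"unknown direction: {self.direction}")
        return self.event_name if triggered else None


cpu_rule = StaticThresholdRule("host_cpu_usage_too_high", 0.95, "up")
high = cpu_rule.detect([0.41, 0.96, 0.70])    # one sample exceeds 0.95
normal = cpu_rule.detect([0.41, 0.95, 0.70])  # 0.95 itself does not exceed
```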

The basic root cause event map may refer to a map including root cause entities, the root cause events corresponding to the root cause entities, and the entity relationships between the root cause entities. The type of a root cause event may refer to the name of the root cause event, i.e., event_name, and indicates which type of index data on which type of root cause entity generates which type of root cause event. Specifically, if the event_name is "host CPU utilization too high", the root cause event is that the CPU utilization of a host is too high.

S140, generating a target root cause event diagram according to the type of each root cause event and the relationships and weights among the root cause entities where the root cause events occur.

The relationship between root cause entities may refer to an entity relationship between root cause entities, that is, a preset entity relationship. The weight may refer to a numerical value that evaluates whether there is a causal relationship between two root cause events. By way of example, it may be a measure of the confidence that there is a causal relationship between two root cause events.

The target root cause event map may refer to a map including the root cause events corresponding to the target alarm event and the weights among those root cause events.

In an alternative embodiment, the operation of determining the weights between the root cause entities where the root cause events occur includes: determining the weights according to prior weights, or determining the weights according to a credibility ranking.

Where a priori weights may refer to weights determined from a priori expert knowledge. For example, if expert knowledge indicates that excessive CPU usage on a host is likely to increase the average time consumption of the services deployed on that host, a greater weight is assigned between "excessive CPU usage of the host" and "increased average time consumption of the service".

The credibility ranking may refer to calculating credibility from the occurrence frequency of root cause events and ranking the results.

Therefore, the weight among root cause entities of the root cause event can be rapidly determined through the prior weight or the credibility ranking, and an effective basis is provided for subsequent operations.

In an optional embodiment, before determining the weights between the root cause entities where the root cause events occur according to the credibility ranking, the method further includes: acquiring historical root cause events, and determining a first frequency with which a first root cause event occurs in the historical root cause events, a second frequency with which a second root cause event occurs, and a third frequency with which the first root cause event and the second root cause event occur at the same time; determining the credibility of the first root cause event and the credibility of the second root cause event according to the magnitude relations between the third frequency and the first frequency and between the third frequency and the second frequency; and ranking the credibility of the first root cause event and the credibility of the second root cause event, and determining the weight between the root cause entities where the first root cause event and the second root cause event occur according to the credibility ranking.

Wherein, a historical root cause event may refer to a root cause event generated by the distributed system during a historical period. The first root cause event may refer to one of the historical root cause events. The first frequency may refer to the number of times the first root cause event occurs in the historical root cause events. The second root cause event may refer to another of the historical root cause events, different from the first root cause event. The second frequency may refer to the number of times the second root cause event occurs in the historical root cause events. The third frequency may refer to the number of times the first root cause event and the second root cause event occur simultaneously in the historical root cause events.

Wherein, the credibility may refer to the ratio of the frequency with which two root cause events occur together to the frequency with which one of them occurs. For example, in the historical root cause events, suppose the first frequency of the first root cause event A is a, the second frequency of the second root cause event B is b, and the third frequency with which A and B occur together is c. When

$$\frac{c}{a} > t,$$

where t is a preset threshold that may be set to 0.8, the second root cause event B also occurs with high probability whenever the first root cause event A occurs. If, in addition,

$$\frac{c}{b} < q,$$

where q is a preset threshold that may be set to 0.2, the first root cause event A does not necessarily occur when the second root cause event B occurs. In that case, the first root cause event A may be considered to be the cause of the second root cause event B, and the credibility may be set to c/a.

Therefore, the credibility of the root cause events can be determined through the occurrence frequency of different types of root cause events in the historical root cause events, and the credibility with higher value is used as the measurement of the weight among the root cause events. The method for determining the weights among the root cause events can be expanded, is not limited to expert priori, and provides an effective basis for the subsequent generation of the target root cause event map.
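The frequency-based credibility can be sketched as below. The sketch assumes that the two conditions are c/a > t and c/b < q, a reading inferred from the surrounding explanation, and that the historical root cause events are grouped into time windows represented as sets of event names. The grouping, the function name, and the sample data are all hypothetical.

```python
from typing import List, Optional, Set


def causal_credibility(
    history: List[Set[str]],
    first: str,
    second: str,
    t: float = 0.8,
    q: float = 0.2,
) -> Optional[float]:
    """Return c/a if `first` looks like a cause of `second`, else None."""
    a = sum(1 for window in history if first in window)    # first frequency
    b = sum(1 for window in history if second in window)   # second frequency
    c = sum(1 for window in history
            if first in window and second in window)       # third frequency
    if a == 0 or b == 0:
        return None
    # Assumed conditions: B almost always follows A, but A rarely accompanies B.
    if c / a > t and c / b < q:
        return c / a
    return None


A, B = "host_cpu_usage_too_high", "service_average_time_high"
# Hypothetical history: a = 10, b = 50, c = 9.
history = [{A, B}] * 9 + [{A}] * 1 + [{B}] * 41
forward = causal_credibility(history, A, B)   # c/a = 0.9, c/b = 0.18
backward = causal_credibility(history, B, A)  # c/a = 0.18, not above t
```

With these counts only the direction from A to B passes both thresholds, so a weight of 0.9 would be attached to that pair and none to the reverse direction.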

S150, ranking the target root cause events matched with the target alarm event in the target root cause event map to obtain a target fault root cause.

The target root cause event may refer to a root cause event that gives rise to the target alarm event. The target fault root cause may refer to a root cause event that has a high probability of having produced the target alarm event. In general, several root cause events may have given rise to the target alarm event, so the target root cause events most likely to have produced it are selected and output as recommendations, providing an effective basis for subsequent fault maintenance.

Example two

Fig. 2 is a flowchart of a method for determining a root cause of a system fault provided by a second embodiment of the present invention, where the embodiment is based on the foregoing embodiment, and in this embodiment, the operation of determining corresponding target call chain data according to a key field of the target alarm event is specifically refined, and may specifically include: acquiring key fields of a target alarm event; the key fields comprise a service field, an alarm starting time and an alarm ending time; determining target call chain data meeting a preset time range from a target database according to the service field; the preset time range is determined by alarm starting time and alarm ending time. As shown in fig. 2, the method includes:

S210, acquiring a target alarm event and acquiring the key fields of the target alarm event; the key fields comprise a service field, an alarm starting time and an alarm ending time.

Wherein the service field may refer to a field indicating a service entry. For example, it may be a service_name. The alarm start time may refer to the time at which an alarm event begins to occur. The alarm end time may refer to the time at which the alarm event ends.

S220, determining target call chain data meeting a preset time range from a target database according to the service field; the preset time range is determined by alarm starting time and alarm ending time.

The target database may be a preset database for storing call chain data or index data generated in the running process of the distributed system. The preset time range may refer to a preset time period for evaluating an acquisition range of the target call chain data. For example, a period of time from half an hour before the alarm start time to half an hour after the alarm end time may be taken as the preset time range.

Therefore, the target call chain data meeting the preset time range can be screened out from the target database through the service field, the alarm starting time and the alarm ending time in the target alarm event, and an effective basis is provided for subsequent operation.
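The screening in S220 can be sketched as a filter over call chain records. An in-memory list stands in for the target database here, and the record layout (trace_id, service_name, timestamp), the inclusive bounds, and the function name are assumptions; the half-hour margin follows the example above.

```python
from datetime import datetime, timedelta
from typing import List


def select_call_chains(
    records: List[dict],
    service_name: str,
    alarm_start: datetime,
    alarm_end: datetime,
    margin: timedelta = timedelta(minutes=30),
) -> List[dict]:
    """Keep records of the given service inside the preset time range."""
    lower = alarm_start - margin  # half an hour before the alarm starts
    upper = alarm_end + margin    # half an hour after the alarm ends
    return [
        r for r in records
        if r["service_name"] == service_name and lower <= r["timestamp"] <= upper
    ]


# Hypothetical records; the alarm lasts from 10:00 to 10:10.
records = [
    {"trace_id": "t1", "service_name": "a", "timestamp": datetime(2023, 5, 4, 9, 45)},
    {"trace_id": "t2", "service_name": "a", "timestamp": datetime(2023, 5, 4, 9, 29)},
    {"trace_id": "t3", "service_name": "b", "timestamp": datetime(2023, 5, 4, 10, 5)},
    {"trace_id": "t4", "service_name": "a", "timestamp": datetime(2023, 5, 4, 10, 40)},
]
selected = select_call_chains(
    records, "a", datetime(2023, 5, 4, 10, 0), datetime(2023, 5, 4, 10, 10)
)
```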

S230, determining a root cause entity contained in the target call chain data according to a preset association field as a connecting node.

Wherein, the connection node may refer to a connection entity in the root cause entity graph.

S240, respectively determining the entity relationship between the two root cause entities according to the preset entity relationship, and taking the entity relationship as a connecting edge.

The connecting edge may refer to an edge connecting two connection nodes. It should be noted that the direction of the connecting edge may be set according to the entity relationship between the two root cause entities. For example, if the root cause entities are a service and a host, and the entity relationship between them is that the service is deployed on the host, the direction of the connecting edge may be from the service to the host.

S250, connecting the connection nodes sequentially by using the connection edges, and constructing a root cause entity diagram containing the root cause entities.

FIG. 3 is a schematic diagram of a root cause entity diagram according to an embodiment of the present invention. Specifically, the root cause entities included in the target call chain data are service a, host 192.168.1.1, service b, and host 192.168.1.2, where service a is deployed on host 192.168.1.1, service b is deployed on host 192.168.1.2, and service a invokes service b. In the corresponding root cause entity graph, the connection nodes are service a, host 192.168.1.1, service b, and host 192.168.1.2; the connecting edges run from service a to host 192.168.1.1 and from service b to host 192.168.1.2.
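Steps S230 to S250 can be sketched with a plain adjacency list. The relationship labels deployed_on and calls and the function name are hypothetical. The call edge from service a to service b is included on the assumption that call relationships are also drawn as edges, which the description of FIG. 3 does not list explicitly.

```python
from typing import Dict, List, Tuple

Node = Tuple[str, str]         # (entity type, identifier)
Edge = Tuple[Node, Node, str]  # (source, target, entity relationship)


def build_entity_graph(
    nodes: List[Node], relations: List[Edge]
) -> Dict[Node, List[Tuple[Node, str]]]:
    """Connect the connection nodes with directed connecting edges."""
    graph: Dict[Node, List[Tuple[Node, str]]] = {node: [] for node in nodes}
    for source, target, relation in relations:
        if source not in graph or target not in graph:
            raise KeyError(f"edge refers to an unknown entity: {source} -> {target}")
        graph[source].append((target, relation))
    return graph


svc_a, svc_b = ("service", "a"), ("service", "b")
host_1, host_2 = ("host", "192.168.1.1"), ("host", "192.168.1.2")

entity_graph = build_entity_graph(
    nodes=[svc_a, host_1, svc_b, host_2],
    relations=[
        (svc_a, host_1, "deployed_on"),  # service -> host, as in the text
        (svc_b, host_2, "deployed_on"),
        (svc_a, svc_b, "calls"),         # assumed edge for the call relationship
    ],
)
```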

S260, determining index data meeting the preset time range from the target database according to the matching fields corresponding to the root cause entities.

Specifically, suppose the index data of a host among the root cause entities contained in the target call chain data needs to be determined, and the address of the host is 192.168.1.1. Using the matching field host.ip: 192.168.1.1, all index data for that host within the corresponding time period can be retrieved from the target database.

S270, filling the root entity graph according to each index data and the corresponding root event rule, and generating a basic root event graph.

Wherein the base root cause event map contains the types of root cause events.

It should be noted that all index data corresponding to one root cause entity may include multiple types of index data, and one type of index data typically corresponds to one root cause event rule. The types of index data to be acquired for each root cause entity are set according to prior knowledge, so that the index data that best reflect the performance of the root cause entity are selected from the many available types. Specifically, if the root cause entity is a host, the corresponding index data may be the CPU utilization; if the root cause entity is a service, the corresponding index data may be the average time consumption or the call volume. For example, to obtain index data of the CPU utilization type, the corresponding field meta: cpu_adaptation may be used to filter all the index data corresponding to the host.

Fig. 4 is a schematic diagram of a basic root cause event diagram according to an embodiment of the present invention. Specifically, as can be seen from fig. 3, the root cause entities in the root cause entity diagram are service a, host 192.168.1.1, service b, and host 192.168.1.2. All index data within the preset time range can be determined from the target database according to the matching field of each root cause entity. The index data of the types configured for each root cause entity are then selected; illustratively, index data of the average time consumption and call volume types are obtained for service a, index data of the call volume type for service b, and index data of the CPU utilization type for both host 192.168.1.1 and host 192.168.1.2. Root cause events are then generated from each index data and the corresponding root cause event rule. For example, if the CPU utilization of host 192.168.1.1 is 0.96 and the corresponding root cause event rule is a static threshold detection for excessive CPU utilization with a threshold of 0.95, a root cause event of excessive CPU utilization is generated on the root cause entity host 192.168.1.1. Root cause events for the other root cause entities are generated in the same way from their index data and root cause event rules: for example, root cause events of high average time consumption and reduced call volume are generated on service a, a root cause event of reduced call volume is generated on service b, and no root cause event is generated on host 192.168.1.2. Finally, the generated root cause events are added to the corresponding root cause entities to produce the basic root cause event diagram.
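The filling step of this worked example can be sketched as follows. For brevity every rule is modelled as a static threshold. Only the CPU threshold of 0.95 and the value 0.96 come from the text; the other thresholds, the metric names, the event names, and the sample values are hypothetical.

```python
from typing import Dict, List, Tuple

Entity = Tuple[str, str]       # (entity type, identifier)
Rule = Tuple[str, float, str]  # (event_name, threshold, direction)

# One rule per (entity type, index type).
RULES: Dict[Tuple[str, str], Rule] = {
    ("host", "cpu_usage"): ("cpu_usage_too_high", 0.95, "up"),
    ("service", "avg_time_ms"): ("average_time_high", 500.0, "up"),       # assumed
    ("service", "call_volume"): ("call_volume_reduced", 100.0, "down"),   # assumed
}


def fill_events(metrics: Dict[Entity, Dict[str, float]]) -> Dict[Entity, List[str]]:
    """Attach to each entity the root cause events its index data trigger."""
    events: Dict[Entity, List[str]] = {}
    for entity, values in metrics.items():
        events[entity] = []
        for metric, value in values.items():
            rule = RULES.get((entity[0], metric))
            if rule is None:
                continue  # no rule configured for this index type
            name, threshold, direction = rule
            hit = value > threshold if direction == "up" else value < threshold
            if hit:
                events[entity].append(name)
    return events


svc_a, svc_b = ("service", "a"), ("service", "b")
host_1, host_2 = ("host", "192.168.1.1"), ("host", "192.168.1.2")

events = fill_events({
    svc_a: {"avg_time_ms": 900.0, "call_volume": 20.0},
    svc_b: {"call_volume": 30.0},
    host_1: {"cpu_usage": 0.96},
    host_2: {"cpu_usage": 0.40},
})
```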

S280, generating a target root cause event diagram according to the type of each root cause event and the relationships and weights among the root cause entities where the root cause events occur.

Fig. 5 is a schematic diagram of a target root cause event diagram according to an embodiment of the present invention. Specifically, the basic root cause event diagram shows that a root cause event of the high average time consumption type exists on service a. According to the entity relationship between service a and host 192.168.1.1, this event can be associated with the excessive CPU utilization of host 192.168.1.1. In addition, according to the entity relationship between service a and service b, the reduced call volume on service b can be regarded as a result of the reduced call volume on service a, and the high average time consumption on service a is the cause of the reduced call volume on service a. Finally, the weights between the root cause entities where the root cause events occur are determined according to the prior weights or the credibility ranking: the weight between the reduced call volume on service b and the reduced call volume on service a is 0.8; the weight between the high average time consumption on service a and the reduced call volume on service a is 0.8; and the weight between the high average time consumption on service a and the excessive CPU utilization of host 192.168.1.1 is 0.85. The target root cause event diagram is thereby generated.
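One way to derive these weighted links is to look up a prior weight for each pair of events on related entities, as sketched below. The three weights come from the example; everything else is an assumption: the relation labels called_by and same_entity, the event names, and the choice to point edges from an effect to its candidate cause.

```python
from typing import Dict, List, Tuple

Entity = Tuple[str, str]               # (entity type, identifier)
EventNode = Tuple[Entity, str]         # (entity, event_name)
Relation = Tuple[Entity, Entity, str]  # (effect entity, cause entity, label)

# Prior weights keyed by (effect event, relation label, cause event).
PRIOR_WEIGHTS: Dict[Tuple[str, str, str], float] = {
    ("average_time_high", "deployed_on", "cpu_usage_too_high"): 0.85,
    ("call_volume_reduced", "same_entity", "average_time_high"): 0.8,
    ("call_volume_reduced", "called_by", "call_volume_reduced"): 0.8,
}


def build_event_graph(
    events: Dict[Entity, List[str]], relations: List[Relation]
) -> Dict[EventNode, List[Tuple[EventNode, float]]]:
    """Link each effect event to candidate cause events with a weight."""
    graph: Dict[EventNode, List[Tuple[EventNode, float]]] = {}
    for source, target, relation in relations:
        for effect in events.get(source, []):
            for cause in events.get(target, []):
                weight = PRIOR_WEIGHTS.get((effect, relation, cause))
                if weight is not None:
                    graph.setdefault((source, effect), []).append(
                        ((target, cause), weight)
                    )
    return graph


svc_a, svc_b = ("service", "a"), ("service", "b")
host_1, host_2 = ("host", "192.168.1.1"), ("host", "192.168.1.2")

events = {
    svc_a: ["average_time_high", "call_volume_reduced"],
    svc_b: ["call_volume_reduced"],
    host_1: ["cpu_usage_too_high"],
    host_2: [],
}
relations = [
    (svc_a, host_1, "deployed_on"),
    (svc_b, host_2, "deployed_on"),
    (svc_b, svc_a, "called_by"),
    (svc_a, svc_a, "same_entity"),
    (svc_b, svc_b, "same_entity"),
]
event_graph = build_event_graph(events, relations)
```

The result contains exactly the three weighted links named in the example, and host 192.168.1.2 contributes nothing because it carries no root cause event.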

S290, obtaining a target node corresponding to the target alarm event in the target root cause event diagram and at least one target root cause event connected with the target node.

The target node may refer to the node in the target root cause event diagram that matches the target alarm event. For example, if the target alarm event concerns the call duration of service a, the target node may be the high average time consumption on service a.

S2100, calculating the target weight sum of the target root cause events, and ranking the target root cause events according to the ratio of the target weight of each target root cause event to the target weight sum.

The target weight may refer to the weight between the target node and the corresponding target root cause event. The target weight sum may refer to the sum of the target weights of all target root cause events connected to the target node.

Specifically, the formula may be:

$$P_{ij} = \frac{a_{ij}}{\sum_{k} a_{ik}}$$

With this formula, the proportion of propagation from the target node to each target root cause event is calculated. All the proportions are then sorted to rank the target root cause events.

Here, P_{ij} represents the proportion of propagation from target node i to target root cause event node j, a_{ij} is the weight between target node i and target root cause event node j in the target root cause event diagram, and the denominator on the right side of the equation is the sum of the weights of all target root cause event nodes connected to target node i.

It should be noted that, if the number of target root cause events connected to the target node is one, the target root cause event may be directly used as the target fault root cause.

S2110, obtaining a preset number of target root cause events in the ranking result as target fault root causes.

The preset number may refer to a preset value specifying how many target fault root causes are to be returned. For example, it may be set to 3. This embodiment is not limited thereto.
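Steps S2100 and S2110 can be sketched in Python as follows. The function name rank_root_causes and the event labels are hypothetical; the weights 0.85 and 0.8 are those attached to the high average time consumption on service a in the example of fig. 5. Each weight is divided by the target weight sum, the results are sorted in descending order, and the first preset number of entries is kept.

```python
from typing import Dict, List, Tuple


def rank_root_causes(target_weights: Dict[str, float],
                     preset_number: int = 3) -> List[Tuple[str, float]]:
    """Rank target root cause events by weight / (sum of all target weights)."""
    total = sum(target_weights.values())
    if total <= 0:
        raise ValueError("target weight sum must be positive")
    proportions = [(event, w / total) for event, w in target_weights.items()]
    # Highest proportion first; keep only the preset number of events.
    proportions.sort(key=lambda pair: pair[1], reverse=True)
    return proportions[:preset_number]


# Events connected to the target node "high average time consumption on service a".
weights = {
    "host 192.168.1.1: CPU usage too high": 0.85,
    "service a: call volume reduced": 0.8,
}
ranking = rank_root_causes(weights, preset_number=3)
```

When only one target root cause event is connected to the target node, its proportion is 1, which is consistent with taking that event directly as the target fault root cause.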

According to the technical solution of this embodiment, target call chain data within the preset time range is determined from the target database according to the service field in the acquired target alarm event, and the root cause entities contained in the target call chain data are determined according to the preset association fields and used as connection nodes. The entity relationship between every two root cause entities is determined according to the preset entity relationships and used as a connection edge, and the connection nodes are connected in sequence by the connection edges to construct a root cause entity diagram containing the root cause entities. Index data within the preset time range is then determined from the target database according to the matching fields corresponding to each root cause entity, and the root cause entity diagram is filled according to each piece of index data and the corresponding root cause event rule to generate a basic root cause event diagram. Further, a target root cause event diagram is generated according to the type of each root cause event and the relationships and weights between the root cause entities where the root cause events are located. Finally, the target node corresponding to the target alarm event in the target root cause event diagram and at least one target root cause event connected to the target node are acquired; the target weight sum of the target root cause events is calculated; the target root cause events are ranked according to the ratio of the target weight of each target root cause event to the target weight sum; and a preset number of target root cause events in the ranking result are taken as the target fault root causes. This solves the problem of low extraction rate and accuracy for fault root causes, allows the fault root cause of the system to be extracted accurately, reduces fault time, and improves the stability and efficiency of the distributed service system.

Fig. 6 is a flowchart of an alternative method for determining a root cause of a system fault according to an embodiment of the present invention. Specifically, a target alarm event is first acquired, together with its key fields. Target call chain data within the preset time range is then determined from the target database using the service field among the key fields, the root cause entities contained in the target call chain data are determined according to the preset association fields, and a root cause entity diagram containing each root cause entity is constructed according to the preset entity relationships. Next, index data corresponding to each root cause entity is determined according to the matching fields, and the root cause entity diagram is filled according to each piece of index data and the corresponding root cause event rule to generate a basic root cause event diagram. Further, a target root cause event diagram is generated according to the type of each root cause event and the relationships and weights between the root cause entities where the root cause events are located. Finally, the target root cause events matched with the target alarm event in the target root cause event diagram are ranked to obtain the target fault root cause.

Example III

Fig. 7 is a schematic structural diagram of a device for determining a root cause of a system fault according to a third embodiment of the present invention. As shown in fig. 7, the device includes: a data acquisition module 310, an entity diagram construction module 320, a first event diagram generation module 330, a second event diagram generation module 340 and a fault root cause determination module 350;

The data acquisition module 310 is configured to acquire a target alarm event, and determine corresponding target call chain data according to a key field of the target alarm event;

the entity diagram construction module 320 is configured to determine root cause entities included in the target call chain data according to a preset association field, and construct a root cause entity diagram including each root cause entity according to a preset entity relationship;

the first event map generating module 330 is configured to determine index data corresponding to each root cause entity according to the matching field, and fill the root cause entity map according to each index data and the corresponding root cause event rule, so as to generate a basic root cause event map; wherein the base root cause event map contains the types of root cause events;

the second event diagram generation module 340 is configured to generate a target root cause event diagram according to the type of each root cause event and the relationships and weights between the root cause entities where the root cause events are located;

and the fault root cause determination module 350 is configured to rank the target root cause events in the target root cause event diagram that match the target alarm event, so as to obtain a target fault root cause.

Optionally, the data acquisition module 310 may specifically be configured to:

acquiring key fields of a target alarm event; the key fields comprise a service field, an alarm starting time and an alarm ending time;

determining target call chain data meeting a preset time range from a target database according to the service field; the preset time range is determined by alarm starting time and alarm ending time.
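The selection performed by the data acquisition module can be sketched in Python as a filter over call chain records. The record layout (service, callee, host, timestamp), the function name select_call_chains and all sample values, including service c, are assumptions made for illustration; the embodiment only specifies that the service field and the alarm start and end times are used.

```python
from datetime import datetime
from typing import Any, Dict, List


def select_call_chains(records: List[Dict[str, Any]], service: str,
                       start: datetime, end: datetime) -> List[Dict[str, Any]]:
    """Keep records of `service` whose timestamp lies within [start, end]."""
    return [r for r in records
            if r["service"] == service and start <= r["timestamp"] <= end]


# Key fields of a hypothetical target alarm event.
alarm = {
    "service": "service a",
    "start": datetime(2023, 5, 4, 10, 0),
    "end": datetime(2023, 5, 4, 10, 5),
}

# Hypothetical call chain records standing in for the target database.
records = [
    {"service": "service a", "callee": "service b", "host": "192.168.1.1",
     "timestamp": datetime(2023, 5, 4, 10, 2)},
    {"service": "service a", "callee": "service b", "host": "192.168.1.1",
     "timestamp": datetime(2023, 5, 4, 9, 30)},   # outside the alarm window
    {"service": "service c", "callee": None, "host": "192.168.1.3",
     "timestamp": datetime(2023, 5, 4, 10, 3)},   # different service
]

target_call_chains = select_call_chains(
    records, alarm["service"], alarm["start"], alarm["end"])
```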

Optionally, the entity diagram construction module 320 may specifically be configured to:

determining a root cause entity contained in the target call chain data according to a preset association field as a connecting node;

respectively determining the entity relationship between two root cause entities according to a preset entity relationship, and taking the entity relationship as a connecting edge;

and sequentially connecting the connection nodes by using the connection edges to construct a root cause entity diagram containing the root cause entities.
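A minimal Python sketch of these three operations follows. The record fields (service, host, callee) stand in for the preset association fields, and the relationship names "runs_on" and "calls" stand in for the preset entity relationships; all of them, like the function name build_entity_graph, are assumptions made for illustration.

```python
from typing import Any, Dict, List, Set, Tuple

Node = Tuple[str, str]            # (entity type, entity name)
Edge = Tuple[Node, str, Node]     # (source node, relationship, destination node)


def build_entity_graph(records: List[Dict[str, Any]]) -> Tuple[Set[Node], Set[Edge]]:
    """Derive connection nodes and connection edges from call chain records."""
    nodes: Set[Node] = set()
    edges: Set[Edge] = set()
    for r in records:
        service = ("service", r["service"])
        host = ("host", r["host"])
        nodes.update([service, host])
        edges.add((service, "runs_on", host))        # service-to-host relationship
        if r.get("callee"):
            callee = ("service", r["callee"])
            nodes.add(callee)
            edges.add((service, "calls", callee))    # service-to-service relationship
    return nodes, edges


# Hypothetical records reproducing the entities of the example in fig. 3.
records = [
    {"service": "service a", "host": "192.168.1.1", "callee": "service b"},
    {"service": "service b", "host": "192.168.1.2", "callee": None},
]
nodes, edges = build_entity_graph(records)
```

Using sets means that repeated call chain records for the same pair of entities produce a single node or edge.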

Optionally, the first event diagram generating module 330 may specifically be configured to:

and determining index data meeting a preset time range from the target database according to the matching fields corresponding to the root cause entities.

Optionally, the second event map generating module 340 may specifically include: and the weight determining unit is used for determining the weights among the root cause entities of the root cause events according to the priori weights or determining the weights among the root cause entities of the root cause events according to the credibility ranking.

Optionally, the device for determining the root cause of the system fault may further include: a credibility ranking module, configured to, before the weights between the root cause entities where the root cause events are located are determined according to the credibility ranking, acquire historical root cause events, and determine a first frequency of occurrence of a first root cause event in the historical root cause events, a second frequency of occurrence of a second root cause event, and a third frequency of simultaneous occurrence of the first root cause event and the second root cause event;

determining the credibility of the first root cause event and the credibility of the second root cause event according to the magnitude relation between the third frequency and the first frequency and the second frequency;

and ranking the credibility of the first root cause event and the credibility of the second root cause event, and determining the weight between the root cause entities of the first root cause event and the second root cause event according to the credibility ranking.
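The text does not give the exact expression that relates the third frequency to the first and second frequencies. One plausible reading, shown in the Python sketch below purely as an assumption, takes the credibility of each event as the co-occurrence count divided by that event's own count. The function name credibility and the sample history are hypothetical.

```python
from typing import FrozenSet, Iterable, Tuple


def credibility(history: Iterable[FrozenSet[str]],
                first: str, second: str) -> Tuple[float, float]:
    """Return (credibility of first, credibility of second).

    Assumed definition: co-occurrence count divided by the event's own count.
    Each element of `history` is the set of root cause events seen in one
    historical incident.
    """
    incidents = list(history)
    n_first = sum(1 for events in incidents if first in events)
    n_second = sum(1 for events in incidents if second in events)
    n_both = sum(1 for events in incidents if first in events and second in events)
    cred_first = n_both / n_first if n_first else 0.0
    cred_second = n_both / n_second if n_second else 0.0
    return cred_first, cred_second


# Hypothetical history of five incidents.
history = [
    frozenset({"CPU usage too high", "average time consumption high"}),
    frozenset({"CPU usage too high"}),
    frozenset({"CPU usage too high", "average time consumption high"}),
    frozenset({"average time consumption high"}),
    frozenset({"CPU usage too high"}),
]
cred_cpu, cred_slow = credibility(
    history, "CPU usage too high", "average time consumption high")
```

Under this assumed definition, the two credibilities can then be ranked and mapped to a weight; how the ranking is converted into a weight is left open here, as in the text.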

Optionally, the fault root determining module 350 may specifically be configured to:

acquiring a target node corresponding to the target alarm event in the target root cause event diagram and at least one target root cause event connected with the target node;

calculating the target weight sum of the target root cause events, and ranking the target root cause events according to the ratio of the target weight of each target root cause event to the target weight sum;

and acquiring a preset number of target root cause events in the ranking result as target fault root causes.

The system fault root cause determining device provided by the embodiment of the invention can execute the system fault root cause determining method provided by any embodiment of the invention, and has the corresponding functional modules and beneficial effects of the executing method.

Example IV

Fig. 8 shows a schematic diagram of an electronic device 410 that may be used to implement an embodiment of the invention. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. Electronic devices may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices (e.g., helmets, glasses, watches, etc.), and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the inventions described and/or claimed herein.

As shown in fig. 8, the electronic device 410 includes at least one processor 420, and a memory, such as a Read Only Memory (ROM) 430, a Random Access Memory (RAM) 440, etc., communicatively coupled to the at least one processor 420, wherein the memory stores computer programs executable by the at least one processor, and the processor 420 may perform various suitable actions and processes according to the computer programs stored in the Read Only Memory (ROM) 430 or the computer programs loaded from the storage unit 490 into the Random Access Memory (RAM) 440. In RAM 440, various programs and data required for the operation of electronic device 410 may also be stored. The processor 420, ROM 430, and RAM 440 are connected to each other by a bus 450. An input/output (I/O) interface 460 is also connected to bus 450.

Various components in the electronic device 410 are connected to the I/O interface 460, including: an input unit 470 such as a keyboard, a mouse, etc.; an output unit 480 such as various types of displays, speakers, and the like; a storage unit 490, such as a magnetic disk, an optical disk, or the like; and a communication unit 4100, such as a network card, modem, wireless communication transceiver, etc. The communication unit 4100 allows the electronic device 410 to exchange information/data with other devices via a computer network such as the internet and/or various telecommunications networks.

Processor 420 may be any of a variety of general-purpose and/or special-purpose processing components having processing and computing capabilities. Some examples of processor 420 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various processors running machine learning model algorithms, Digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. Processor 420 performs the various methods and processes described above, such as the method for determining the root cause of a system fault.

The method comprises the following steps:

In some embodiments, the method of determining the root cause of a system failure may be implemented as a computer program tangibly embodied on a computer-readable storage medium, such as storage unit 490. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 410 via the ROM 430 and/or the communication unit 4100. When the computer program is loaded into RAM 440 and executed by processor 420, one or more steps of the method of determining a root cause of a system fault described above may be performed. Alternatively, in other embodiments, processor 420 may be configured to perform the method of determining the root cause of the system fault in any other suitable manner (e.g., by means of firmware).

Various implementations of the systems and techniques described above may be realized in digital electronic circuitry, integrated circuit systems, Field Programmable Gate Arrays (FPGAs), Application Specific Integrated Circuits (ASICs), Application Specific Standard Products (ASSPs), Systems on Chip (SOCs), Complex Programmable Logic Devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various implementations may include implementation in one or more computer programs that can be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special-purpose or general-purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.

A computer program for carrying out methods of the present invention may be written in any combination of one or more programming languages. These computer programs may be provided to a processor of a general purpose computer, special purpose computer, or other programmable data processing apparatus, such that the computer programs, when executed by the processor, cause the functions/acts specified in the flowchart and/or block diagram block or blocks to be implemented. The computer program may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.

In the context of the present invention, a computer-readable storage medium may be a tangible medium that can contain or store a computer program for use by or in connection with an instruction execution system, apparatus, or device. The computer-readable storage medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. Alternatively, the computer-readable storage medium may be a machine-readable signal medium. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a Read-Only Memory (ROM), an Erasable Programmable Read-Only Memory (EPROM or flash memory), an optical fiber, a portable Compact Disc Read-Only Memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

To provide for interaction with a user, the systems and techniques described here can be implemented on an electronic device having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and a pointing device (e.g., a mouse or a trackball) through which a user can provide input to the electronic device. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.

The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: Local Area Networks (LANs), Wide Area Networks (WANs), blockchain networks, and the internet.

The computing system may include clients and servers. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service are overcome.

It should be appreciated that steps may be reordered, added, or deleted using the various forms of flow shown above. For example, the steps described in the present invention may be performed in parallel, sequentially, or in a different order, so long as the desired results of the technical solution of the present invention are achieved, and the present invention is not limited herein.

The above embodiments do not limit the scope of the present invention. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present invention should be included in the scope of the present invention.

Claims

1. A method for determining a root cause of a system fault, comprising:

2. The method of claim 1, wherein the determining the corresponding target call chain data from the key fields of the target alert event comprises:

3. The method according to claim 1, wherein the determining root cause entities included in the target call chain data according to the preset association field, and constructing a root cause entity graph including the root cause entities according to the preset entity relationship, includes:

4. The method of claim 2, wherein determining the index data corresponding to each root cause entity according to the matching field comprises:

5. The method of claim 1, wherein determining weights between root cause entities where root cause events reside comprises:

and determining the weights among the root cause entities of the root cause events according to the prior weights, or determining the weights among the root cause entities of the root cause events according to the credibility ranking.

6. The method of claim 5, further comprising, prior to said determining weights between root cause entities where root cause events reside according to the confidence ranking:

acquiring historical root events, determining a first frequency of occurrence of a first root event in the historical root events, a second frequency of occurrence of a second root event and a third frequency of occurrence of the first root event and the second root event at the same time;

7. The method of claim 1, wherein ranking the target root cause events in the target root cause event map that match the target alarm event to obtain a target fault root cause comprises:

8. A device for determining a root cause of a system fault, comprising:

9. An electronic device, the electronic device comprising:

at least one processor; and

a memory communicatively coupled to the at least one processor; wherein,

the memory stores a computer program executable by the at least one processor to enable the at least one processor to perform the method of determining a root cause of a system fault as claimed in any one of claims 1 to 7.

10. A computer readable storage medium storing computer instructions for causing a processor to perform the method of determining a root cause of a system fault according to any one of claims 1-7.